OpenAI and Broadcom Unveil Jalapeño Chip: Targeting 50% Inference Cost Reduction, Training Still Relies on NVIDIA

OpenAI and Broadcom have jointly announced Jalapeño, OpenAI's first custom ASIC optimized for large language model inference. With help from OpenAI's own AI tools, the chip went from design to tape-out in just 9 months. It aims to cut the cost of a single response by approximately 50% and to reduce dependence on NVIDIA. Deployment is scheduled for the end of 2026, with mass production ramping from 2027 to 2028.

Chip Design and Manufacturing Process

Jalapeño uses a custom ASIC architecture with hardware optimizations for the attention mechanism and feed-forward networks of Transformer models. It pairs matrix multiplication units with dedicated memory controllers to reduce data transfer latency. During the design phase, OpenAI used internal AI tools to automatically generate portions of the RTL code, holding the design and verification cycle to 9 months, the short end of the traditional 9- to 12-month range.

Broadcom is responsible for the tape-out stage and uses an advanced process node to balance power consumption and performance. Test data show that under typical inference workloads, a single chip delivers approximately 1.8 times the performance per watt of general-purpose GPUs.

Principles of Inference Efficiency Improvement

The core of large language model inference is the repeated execution of matrix operations. Jalapeño hard-codes common operators at the hardware level, such as QKV projections in multi-head attention, eliminating software-level scheduling overhead. Combined with model quantization techniques, it replaces floating-point computations with 8-bit integer operations, further reducing power consumption.
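The quantization idea described above can be illustrated with a minimal NumPy sketch. It assumes symmetric per-tensor quantization and a fused Q/K/V weight matrix; the function names, tensor sizes, and scaling scheme are illustrative choices, not details of Jalapeño's actual datapath.

```python
import numpy as np

def quantize_int8(x):
    """Symmetric per-tensor quantization: floats -> int8 values plus one float scale."""
    max_abs = float(np.max(np.abs(x)))
    scale = max_abs / 127.0 if max_abs > 0 else 1.0
    q = np.clip(np.round(x / scale), -127, 127).astype(np.int8)
    return q, scale

def int8_matmul(x, w):
    """Approximate x @ w with int8 operands and int32 accumulation."""
    xq, sx = quantize_int8(x)
    wq, sw = quantize_int8(w)
    acc = xq.astype(np.int32) @ wq.astype(np.int32)  # integer multiply-accumulate
    return acc.astype(np.float32) * (sx * sw)        # rescale the result to float

rng = np.random.default_rng(0)
d_model = 64                                         # assumed hidden size
x = rng.standard_normal((4, d_model)).astype(np.float32)  # 4 token embeddings
# Fused weights for the Q, K and V projections (assumed layout).
w_qkv = (rng.standard_normal((d_model, 3 * d_model)) * 0.1).astype(np.float32)

qkv_ref = x @ w_qkv                # floating-point reference
qkv_int8 = int8_matmul(x, w_qkv)   # quantized approximation
q, k, v = np.split(qkv_int8, 3, axis=-1)

rel_err = float(np.linalg.norm(qkv_int8 - qkv_ref) / np.linalg.norm(qkv_ref))
print(f"relative error of int8 QKV projection: {rel_err:.4f}")
```

The integer path trades a small approximation error for cheaper arithmetic; production systems typically use finer-grained (per-channel or per-block) scales than this sketch does.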

The 50% cost-reduction target is based on internal benchmarks: on the same model, per-token latency falls from a 12-millisecond baseline to 6 milliseconds on Jalapeño. Once electricity costs and server depreciation are factored in, this yields the expected reduction in cost per response.
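The arithmetic behind that claim can be sketched as follows. The response length and the all-in hourly server cost are hypothetical placeholders, and the model assumes a server emits one token at a time at the quoted latency; only the 12 ms and 6 ms figures come from the article.

```python
def cost_per_response(latency_ms_per_token, tokens, hourly_cost):
    """Cost of one response when a server emits one token every latency_ms_per_token.

    hourly_cost bundles electricity and depreciation per server-hour (hypothetical).
    """
    seconds = latency_ms_per_token * tokens / 1000.0
    return hourly_cost * seconds / 3600.0

TOKENS = 500        # assumed response length
HOURLY_COST = 3.00  # assumed all-in cost per server-hour, equal on both platforms

baseline = cost_per_response(12, TOKENS, HOURLY_COST)
jalapeno = cost_per_response(6, TOKENS, HOURLY_COST)
reduction = 1 - jalapeno / baseline
print(f"baseline: ${baseline:.5f}  Jalapeño: ${jalapeno:.5f}  reduction: {reduction:.0%}")
```

Under this simplified model the reduction depends only on the latency ratio and the ratio of hourly costs, so a 50% saving holds only if the two platforms cost roughly the same per server-hour.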

Differences from the NVIDIA Ecosystem

Training still relies entirely on NVIDIA GPU clusters. Jalapeño covers only the inference path and cannot perform the gradient computations required for backpropagation. This means OpenAI must maintain a dual-track hardware system: H100/H200 clusters for training, and custom ASICs for inference, which will migrate gradually.

By the end of 2026, the first batch of Jalapeño servers will be deployed in OpenAI's own data centers, with an initial scale of a few thousand chips. Starting in 2027, Broadcom will begin scaled production, with output expected to exceed 100,000 chips by 2028.

Industry Supply Chain Impact

As custom ASICs enter the inference market, NVIDIA's share of inference may gradually decline from the current 85% to around 70%. Broadcom gains a stable stream of orders, strengthening its position in the custom AI accelerator business.

Other cloud service providers have started evaluating similar solutions. Amazon and Google have previously launched Inferentia and TPU, respectively.

Future Deployment Roadmap

  • Q4 2026: Internal small-scale verification cluster goes online
  • 2027: Partial API traffic switches to Jalapeño
  • 2028: New model inference defaults to ASIC, with GPUs reserved only for high-precision training tasks

Cost reductions will be directly reflected in API pricing. OpenAI plans to reduce GPT-series inference prices by 30% in 2027 to expand its user base.

Technical Risks and Limitations

Because an ASIC's functionality is fixed, model architecture upgrades require a new tape-out cycle. The current design targets existing Transformer models; if novel attention variants emerge, hardware compatibility could become a bottleneck. OpenAI says it will retain 10-15% of GPU capacity as a hot standby.
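One way to picture the hot-standby arrangement is a dispatch rule that sends a model to the ASIC pool only when every operator it needs is supported in hardware. The sketch below is purely illustrative: the operator names, the `ModelSpec` type, and `choose_backend` are hypothetical and do not describe OpenAI's actual serving stack.

```python
from dataclasses import dataclass

# Hypothetical operator families the fixed-function ASIC handles in hardware.
ASIC_SUPPORTED_OPS = frozenset({"multi_head_attention", "feed_forward"})

@dataclass(frozen=True)
class ModelSpec:
    name: str
    ops: frozenset

def choose_backend(model: ModelSpec, asic_healthy: bool = True) -> str:
    """Return "asic" if every operator is supported and the pool is healthy, else "gpu"."""
    if asic_healthy and model.ops <= ASIC_SUPPORTED_OPS:
        return "asic"
    return "gpu"  # the GPU hot standby absorbs everything else

current = ModelSpec("current-transformer",
                    frozenset({"multi_head_attention", "feed_forward"}))
future = ModelSpec("novel-attention",
                   frozenset({"hypothetical_new_attention", "feed_forward"}))

print(choose_backend(current))                      # supported ops, healthy pool
print(choose_backend(future))                       # unsupported operator
print(choose_backend(current, asic_healthy=False))  # ASIC pool unavailable
```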

The power wall remains a long-term challenge. Peak power consumption per chip is held to 300 watts or less, but large-scale clusters will still require redesigned liquid cooling systems.
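A back-of-the-envelope calculation shows why cooling becomes the constraint at scale. The 300-watt ceiling comes from the article; the chip counts are assumptions (3,000 stands in for "a few thousand", and treating the 100,000-chip production figure as a single fleet is a simplification), as is the optional overhead factor.

```python
PEAK_WATTS_PER_CHIP = 300  # per-chip ceiling quoted in the article

def cluster_peak_megawatts(chips, overhead=1.0):
    """Peak draw in MW; overhead > 1.0 models cooling and power-delivery losses (assumed)."""
    return chips * PEAK_WATTS_PER_CHIP * overhead / 1e6

for chips in (3_000, 100_000):
    print(f"{chips:>7,} chips: {cluster_peak_megawatts(chips):.1f} MW at chip level")
```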