OpenAI and Broadcom jointly announced the launch of Jalapeño, the first custom ASIC optimized for large language model inference. Designed with help from OpenAI's own AI tools, the chip went from design to tape-out in just 9 months. It aims to cut per-response cost by approximately 50% and to lessen dependence on NVIDIA. Deployment is scheduled for the end of 2026, with mass production from 2027 to 2028.
Chip Design and Manufacturing Process
Jalapeño adopts a custom ASIC architecture with hardware optimizations for the attention mechanism and feed-forward networks of Transformer models. It integrates matrix multiplication units with dedicated memory controllers to reduce data transfer latency. During the design phase, OpenAI used internal AI tools to automatically generate portions of the RTL code, keeping the full design-to-tape-out schedule to 9 months, compared with the 9 to 12 months traditionally spent on verification alone.
Broadcom is responsible for the tape-out stage, employing advanced process nodes to balance power consumption and performance. Test data shows that under typical inference workloads, a single chip delivers approximately 1.8 times the performance per watt of general-purpose GPUs.
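To put the performance-per-watt figure in energy terms, the short calculation below converts the reported 1.8x ratio into relative energy per unit of inference work. It is a back-of-the-envelope sketch: the only input taken from the article is the 1.8 ratio, and it assumes that ratio holds uniformly across the workload.

```python
# Reported ratio: Jalapeño performance per watt relative to a general-purpose GPU.
PERF_PER_WATT_RATIO = 1.8

# Energy needed for the same amount of work scales with the inverse of perf/watt.
relative_energy = 1 / PERF_PER_WATT_RATIO
energy_saving = 1 - relative_energy

print(f"Relative energy per unit of work: {relative_energy:.3f}")  # 0.556
print(f"Energy saving: {energy_saving:.1%}")                       # 44.4%
```

Under this assumption the energy saving alone is roughly 44%, so the 50% cost target depends on more than electricity.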
Principles of Inference Efficiency Improvement
The core of large language model inference is the repeated execution of matrix operations. Jalapeño hard-codes common operators at the hardware level, such as QKV projections in multi-head attention, eliminating software-level scheduling overhead. Combined with model quantization techniques, it replaces floating-point computations with 8-bit integer operations, further reducing power consumption.
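The idea of replacing floating-point math with 8-bit integer operations can be illustrated in software. The sketch below applies symmetric per-tensor int8 quantization to a fused QKV projection and compares the result against the floating-point version. It is an illustration only, not Jalapeño's actual datapath: the function names, the tiny dimensions, and the choice of symmetric per-tensor scaling are all assumptions made for the example.

```python
import random

def quantize(matrix, num_bits=8):
    """Symmetric per-tensor quantization: floats -> signed integers plus one scale."""
    qmax = 2 ** (num_bits - 1) - 1  # 127 for int8
    max_abs = max(abs(v) for row in matrix for v in row)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    q = [[max(-qmax, min(qmax, round(v / scale))) for v in row] for row in matrix]
    return q, scale

def matmul(a, b):
    """Plain matrix multiply; with integer inputs the accumulation stays in integers."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]

def quantized_projection(x, w):
    """Quantize both operands, multiply in integers, then rescale back to floats."""
    xq, sx = quantize(x)
    wq, sw = quantize(w)
    acc = matmul(xq, wq)  # integer accumulators
    return [[v * sx * sw for v in row] for row in acc]

random.seed(0)
d_model = 16  # hypothetical, far smaller than a real model
x = [[random.uniform(-1, 1) for _ in range(d_model)] for _ in range(4)]  # 4 tokens
# Fused weight matrix holding the Q, K and V projections side by side.
w_qkv = [[random.uniform(-1, 1) for _ in range(3 * d_model)] for _ in range(d_model)]

qkv_int8 = quantized_projection(x, w_qkv)
qkv_fp = matmul(x, w_qkv)
max_err = max(abs(a - b) for ra, rb in zip(qkv_int8, qkv_fp) for a, b in zip(ra, rb))
print(f"max abs error vs floating point: {max_err:.4f}")
```

The integer path reproduces the floating-point projection to within a small error, which is the trade-off quantized inference accepts in exchange for cheaper arithmetic.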
The 50% reduction target for per-response cost is based on internal benchmarks: on the same model, per-token latency drops from 12 milliseconds to 6 milliseconds on Jalapeño, which meets the target once electricity costs and server depreciation are factored in.
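The link between latency and cost can be made explicit with a small calculation. The sketch below assumes a single request stream per server and the same hourly server cost (electricity plus depreciation) on both platforms. The hourly figure is hypothetical and cancels out of the ratio; only the 12 ms and 6 ms latencies come from the article.

```python
def cost_per_million_tokens(latency_ms_per_token, server_cost_per_hour):
    """Cost of generating one million tokens sequentially on one server."""
    tokens_per_hour = 3_600_000 / latency_ms_per_token  # milliseconds in an hour / ms per token
    return server_cost_per_hour / tokens_per_hour * 1_000_000

HOURLY_COST = 10.0  # hypothetical: electricity plus depreciation, in arbitrary currency units

baseline_cost = cost_per_million_tokens(12, HOURLY_COST)  # reported 12 ms per token
asic_cost = cost_per_million_tokens(6, HOURLY_COST)       # reported 6 ms per token
reduction = 1 - asic_cost / baseline_cost

print(f"baseline: {baseline_cost:.2f}, Jalapeño: {asic_cost:.2f}, reduction: {reduction:.0%}")
```

Halving latency at equal hourly cost halves the cost per token. If the real hourly cost of a Jalapeño server differs from that of the baseline, the reduction shifts accordingly, and batching effects are ignored here.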
Differences from the NVIDIA Ecosystem
Training still relies entirely on NVIDIA GPU clusters. Jalapeño only covers the inference path and cannot perform the gradient computations required for backpropagation. This means OpenAI must maintain a dual-track hardware system: H100/H200 clusters for training, and gradual migration of inference to custom ASICs.
By the end of 2026, the first batch of Jalapeño servers will be deployed in OpenAI's own data centers, with an initial scale of a few thousand chips. Starting in 2027, Broadcom will begin scaled production, with output expected to exceed 100,000 chips by 2028.
Industry Supply Chain Impact
With custom ASICs entering the inference market, NVIDIA's share of inference may gradually decline from the current 85% to around 70%. Broadcom gains a stable stream of orders, strengthening its position in the custom AI accelerator business.
Other cloud service providers have started evaluating similar solutions. Amazon and Google have previously launched Inferentia and TPU, respectively.
Future Deployment Roadmap
- Q4 2026: Internal small-scale verification cluster goes online
- 2027: Partial API traffic switches to Jalapeño
- 2028: New model inference defaults to ASIC, with GPUs reserved only for high-precision training tasks
Cost reductions will be directly reflected in API pricing. OpenAI plans to reduce GPT-series inference prices by 30% in 2027 to expand its user base.
Technical Risks and Limitations
Because an ASIC's functionality is fixed, model architecture upgrades require a new tape-out cycle. The current design targets existing Transformer models; if novel attention variants emerge in the future, hardware compatibility could become a bottleneck. OpenAI says it will retain 10-15% of GPU capacity as a hot standby.
The power wall remains a long-term challenge. Peak power consumption per chip is capped at 300 watts, but large-scale clusters will still require redesigned liquid cooling systems.
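A rough calculation shows why cooling becomes a cluster-level problem even with a 300-watt cap per chip. The sketch below multiplies the per-chip peak by two chip counts: 3,000 is an assumed stand-in for "a few thousand chips", and 100,000 borrows the 2028 production figure as if all of those chips ran in one fleet. The optional overhead factor for cooling and power delivery is hypothetical and left at 1.0.

```python
PEAK_WATTS_PER_CHIP = 300  # per-chip peak from the article

def cluster_peak_megawatts(num_chips, overhead_factor=1.0):
    """Peak chip power for a cluster in megawatts; overhead_factor is a hypothetical multiplier."""
    return num_chips * PEAK_WATTS_PER_CHIP * overhead_factor / 1_000_000

initial_mw = cluster_peak_megawatts(3_000)    # assumed size of the first deployment
scaled_mw = cluster_peak_megawatts(100_000)   # assumes the 2028 output is deployed as one fleet

print(f"initial cluster: {initial_mw:.1f} MW, scaled fleet: {scaled_mw:.1f} MW")
```

Under these assumptions the chips alone draw about 0.9 MW at first and about 30 MW at scale, before counting host servers, networking, or the cooling equipment itself.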
© 2026 Winzheng.com 赢政天下 | Please credit the source and include a link to the original article when reposting