NVIDIA Releases Nemotron Two-Tower Diffusion Model: 2.42x Faster Inference at 98.7% Quality Retention

NVIDIA has released the Nemotron-Labs-TwoTower diffusion language model, which splits the original 30B-parameter monolithic structure into a two-tower architecture, enabling parallel token generation with a measured 2.42x speedup while maintaining 98.7% quality retention.

NVIDIA has officially released the Nemotron-Labs-TwoTower diffusion language model, which targets the inference efficiency of large language models. The model splits the original 30B-parameter monolithic network into a two-tower architecture that generates tokens in parallel, with a measured 2.42x speedup and a quality retention rate of 98.7%. The release quickly drew discussion on X, where NVIDIA's official post received thousands of likes.

Technical Core: Two-Tower Parallel Generation Mechanism

Traditional autoregressive models generate tokens one at a time, which creates a serial bottleneck. The Nemotron two-tower model splits the network into two parallel towers, one responsible for context modeling and the other focused on token prediction, and a synchronously advancing diffusion process drives both, reducing overall latency. Experimental data shows that on the same hardware, the inference throughput of the 30B-scale model more than doubles, in line with the measured 2.42x speedup.
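To make the serial bottleneck concrete, the following is a minimal sketch that only counts sequential model calls. It is not NVIDIA's implementation: the function names, the block size, and the number of denoising steps are hypothetical values chosen for illustration, and the count ignores the per-call cost, which is usually higher for a parallel step than for a single autoregressive step.

```python
import math


def autoregressive_steps(num_tokens: int) -> int:
    """One sequential forward pass per generated token."""
    return num_tokens


def block_diffusion_steps(num_tokens: int, block_size: int, denoise_steps: int) -> int:
    """Sequential passes when tokens are produced block by block.

    Assumption (not from the article): each block of `block_size` tokens is
    refined in `denoise_steps` parallel passes before the next block starts.
    """
    num_blocks = math.ceil(num_tokens / block_size)
    return num_blocks * denoise_steps


def step_ratio(num_tokens: int, block_size: int, denoise_steps: int) -> float:
    """Ratio of sequential steps, autoregressive over block-parallel."""
    return autoregressive_steps(num_tokens) / block_diffusion_steps(
        num_tokens, block_size, denoise_steps
    )


if __name__ == "__main__":
    # Hypothetical settings: 256 tokens, blocks of 16, 8 denoising passes each.
    print(block_diffusion_steps(256, 16, 8))
    print(step_ratio(256, 16, 8))
```

With these illustrative settings, 256 tokens need 128 sequential passes instead of 256, a step ratio of 2.0. The wall-clock speedup depends on how expensive each parallel pass is, so this ratio should not be read as a reproduction of the reported 2.42x figure.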

Balancing Quality and Speed

Speed improvements often come at the cost of quality, but Nemotron keeps the quality loss within 1.3% (the complement of the 98.7% retention rate) through alignment training and diffusion scheduling strategies. Benchmarks cover multiple datasets, including MMLU and HumanEval, and the results show that the model stays highly consistent with the original version on tasks such as mathematical reasoning and code generation.
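The two headline figures can be related with simple arithmetic. The sketch below assumes that quality retention is a percentage of the original model's score (the article does not define the metric) and that the 2.42x speedup applies to end-to-end latency; the function names are illustrative.

```python
def quality_loss_pct(retention_pct: float) -> float:
    """Quality loss as the complement of the retention percentage."""
    return round(100.0 - retention_pct, 2)


def latency_reduction_pct(speedup: float) -> float:
    """Percentage of time saved when the same work runs `speedup` times faster."""
    if speedup <= 0:
        raise ValueError("speedup must be positive")
    return (1.0 - 1.0 / speedup) * 100.0


if __name__ == "__main__":
    print(quality_loss_pct(98.7))                 # 1.3
    print(round(latency_reduction_pct(2.42), 1))  # 58.7
```

Under these assumptions, a 2.42x speedup means the same output takes about 41.3% of the original time, a reduction of roughly 58.7%, in exchange for a 1.3% quality loss.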

Industry Impact and Application Prospects

The speedup opens up new possibilities for edge devices and real-time interaction scenarios. Developers can use tools such as NVIDIA TensorRT for rapid deployment and lower cloud computing costs. Analysts point out that the two-tower architecture may become the standard paradigm for next-generation diffusion language models and speed up the rollout of AI products.

Conclusion

NVIDIA's move once again demonstrates its leadership in AI infrastructure. As more parallel optimization techniques emerge, the inference efficiency of large models is expected to keep improving, creating greater value for the industry.