Google DeepMind launched the DiffusionGemma model on June 11, 2026, with a total of 26 billion parameters, making it a new member of the Gemma open-weight family. Instead of using the mainstream autoregressive generation method, this model draws inspiration from image diffusion models: it first generates a block of text using placeholder tokens, then refines it through multiple rounds of correction to produce the final output.
Technical Principle Overview
Conventional autoregressive chatbots predict one token at a time, so text appears gradually. DiffusionGemma, in contrast, drafts up to 256 tokens in parallel and then refines them. When enough compute is available, this approach significantly improves generation speed. Official figures show more than 1000 tokens per second on a single NVIDIA H100 and more than 700 tokens per second on a GeForce RTX 5090, which Google puts at up to 4x the speed of comparable autoregressive models.
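To make the draft-then-refine idea concrete, here is a minimal toy sketch in Python. It is not DiffusionGemma's actual algorithm: `toy_denoiser` is a hypothetical stand-in that simply knows the target text and attaches a random confidence to each guess, and the number of refinement rounds (`steps`) is an assumption, since the announcement only says "multiple rounds". The point is the shape of the loop: the block starts as placeholders, and each round fills in the most confident positions in parallel.

```python
import random

MASK = "<mask>"  # placeholder token for a not-yet-decided position


def toy_denoiser(block, target, rng):
    """Hypothetical stand-in for the model: for every masked slot,
    propose a token together with a confidence score."""
    return {
        i: (target[i], rng.random())
        for i, tok in enumerate(block)
        if tok == MASK
    }


def diffusion_decode(target, steps, seed=0):
    """Fill a fully masked block over at most `steps` refinement rounds.
    Returns the final block and the number of model passes used."""
    rng = random.Random(seed)
    block = [MASK] * len(target)
    passes = 0
    for step in range(steps):
        masked = [i for i, tok in enumerate(block) if tok == MASK]
        if not masked:
            break
        proposals = toy_denoiser(block, target, rng)  # one parallel pass
        passes += 1
        # Commit an even share of the remaining slots, most confident first.
        k = -(-len(masked) // (steps - step))  # ceiling division
        ranked = sorted(proposals, key=lambda i: proposals[i][1], reverse=True)
        for i in ranked[:k]:
            block[i] = proposals[i][0]
    return block, passes


if __name__ == "__main__":
    words = "the quick brown fox jumps over the lazy dog".split()
    out, passes = diffusion_decode(words, steps=4)
    print(" ".join(out), "| model passes:", passes)
```

In this toy, nine tokens are produced in four passes, where a token-by-token decoder would need nine. The real-world gain is not simply that ratio, because each parallel pass does more work than a single autoregressive step.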
The model adopts a mixture-of-experts architecture, activating only about 3.8 billion of its 26 billion parameters per inference. This allows it to run on GPUs with approximately 18GB of VRAM, lowering the barrier for local deployment. The model accepts multimodal input and produces text output, continuing Google's strategy of using locally deployable models to attract developers to its ecosystem.
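A quick back-of-the-envelope calculation puts these numbers in context. The sketch below uses only the parameter counts quoted in the article; the bit widths are assumptions chosen for illustration, since the announcement as reported here does not say which weight precision the 18GB figure refers to. It also ignores activations, caches, and runtime overhead.

```python
TOTAL_PARAMS = 26e9    # total parameters, as quoted in the article
ACTIVE_PARAMS = 3.8e9  # parameters activated per inference, as quoted


def weight_gb(params, bits_per_weight):
    """Approximate weight storage in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9


# Share of the model that does work on any one inference.
active_ratio = ACTIVE_PARAMS / TOTAL_PARAMS

# Weight footprint at a few assumed precisions.
footprints = {bits: weight_gb(TOTAL_PARAMS, bits) for bits in (16, 8, 4)}

if __name__ == "__main__":
    print(f"active share: {active_ratio:.1%}")
    for bits, gb in footprints.items():
        print(f"{bits:>2}-bit weights: {gb:.0f} GB")
```

Under these assumptions, the weights alone take 52 GB at 16 bits, 26 GB at 8 bits, and 13 GB at 4 bits, so an 18GB budget would be consistent with roughly 4-bit weights plus working memory. That is an inference from the arithmetic, not a stated specification. The active share works out to about 14.6%, which reduces compute per step but does not by itself shrink the memory needed to hold all the experts.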
Practical Application Scenarios
For local AI users, this means they can rely more on their own GPU for text generation when privacy matters or the network connection is unreliable. Google has placed DiffusionGemma within the Gemma open-weight ecosystem, so developers can download the weights directly and experiment with them.
Google claims that, in low-latency local inference on a dedicated GPU, DiffusionGemma can generate text up to 4x faster than traditional autoregressive models.
Technical Impact Analysis
Diffusion-based text models have not yet become mainstream, mainly because natural language imposes stricter requirements on grammatical ordering and factual consistency than images do. DiffusionGemma demonstrates that the diffusion approach can achieve a clear speed advantage in open-weight text models.
The industry is paying close attention to its potential impact on mobile and multimodal applications. The low ratio of active to total parameters makes it suitable for consumer-grade hardware, which could drive the migration of AI applications from the cloud to the edge.
- Clear speed advantage: The parallel generation mechanism reduces the sequential dependency between tokens.
- Lower deployment barrier: With about 3.8 billion active parameters, the model is suited to mid-range GPUs.
© 2026 Winzheng.com 赢政天下 | When reposting, please credit the source and include a link to the original article.