Google Open-Sources Gemma 4: KV Cache Compressed to 3 Bits for a 6x Memory Saving; Overall Performance Not Yet Verified by Third Parties

Apr 21, 2026 424 approx.4min News Factory Verified

Gemma 4, Open-Source AI Models, TurboQuant Technology

[Source: Google Official Announcement] Google has officially launched Gemma 4, an open-source multimodal AI model. This version is the first in the series to support video and image processing, and it is released under the Apache 2.0 license, which permits free use, modification, and redistribution by both individuals and commercial users without additional licensing restrictions. Google's proprietary TurboQuant quantization technology, introduced alongside the model, compresses the KV cache, a core component of large-model inference, to 3 bits, for a reported memory saving of more than 6x.

Technical Analysis: Why is 3-Bit KV Compression Important?

For non-specialists, the KV cache can be thought of as a large model's "short-term memory." When generating a response, handling a multi-turn dialogue, or processing a long text, the model stores the contextual features it has already computed as "Key" and "Value" tensors in GPU memory (VRAM), so it does not have to recompute the entire context for each newly generated token. The cache is therefore a core factor in both inference speed and the maximum context length a model can support.
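To make the "short-term memory" idea concrete, here is a minimal single-head sketch in plain Python. It is an illustration only, not Gemma 4 code: the names `KVCache`, `attend`, and `decode_step` are hypothetical, and the query, key, and value vectors are passed in directly, whereas a real model would compute them by projecting each token's hidden state.

```python
import math

def attend(query, keys, values):
    """Single-head scaled dot-product attention of one query over cached keys/values."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key)) for key in keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(values[0])
    return [sum(w * v[i] for w, v in zip(weights, values)) / total for i in range(dim)]

class KVCache:
    """Stores the key and value vector of every token processed so far."""
    def __init__(self):
        self.keys = []
        self.values = []

    def append(self, key, value):
        self.keys.append(key)
        self.values.append(value)

def decode_step(cache, query, key, value):
    """Process one new token: cache its K/V, then attend over the whole history."""
    cache.append(key, value)
    return attend(query, cache.keys, cache.values)

cache = KVCache()
# Toy 2-dimensional (query, key, value) triples stand in for real projections.
tokens = [([1.0, 0.0], [1.0, 0.0], [1.0, 2.0]),
          ([0.0, 1.0], [0.0, 1.0], [3.0, 4.0])]
for query, key, value in tokens:
    output = decode_step(cache, query, key, value)
print(len(cache.keys))  # one cached key per token processed
```

Each step adds one key and one value to the cache and reuses everything stored earlier, which is why the cache grows linearly with context length and why its precision matters for memory.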

Previously, KV caches were typically kept at 8-bit or 16-bit precision. The resulting memory usage was high, making it difficult for consumer-grade GPUs to run inference with contexts of 32K tokens or longer on models with more than 7B parameters. TurboQuant in Gemma 4 is said to compress the KV cache to 3 bits while keeping the loss in inference accuracy below 1%. In practice, this means the same GPU can support a context roughly 6 times longer, or that models which previously required professional server GPUs can run smoothly on ordinary consumer GPUs.
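The memory arithmetic behind this can be sketched in a few lines. The function `kv_cache_bytes` and the model shape below are assumptions made for illustration, not Gemma 4's actual configuration, and the estimate ignores any scale or metadata that a real quantization scheme stores next to the compressed values.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len, bits_per_value):
    """Approximate KV-cache size for one sequence, ignoring quantization metadata."""
    values_per_token = 2 * num_layers * num_kv_heads * head_dim  # 2 = one key + one value
    return values_per_token * seq_len * bits_per_value / 8

# Hypothetical model shape, chosen only for illustration (not Gemma 4's real config).
shape = dict(num_layers=32, num_kv_heads=8, head_dim=128)
seq_len = 32 * 1024  # a 32K-token context

fp16 = kv_cache_bytes(seq_len=seq_len, bits_per_value=16, **shape)
q3 = kv_cache_bytes(seq_len=seq_len, bits_per_value=3, **shape)

print(f"16-bit cache: {fp16 / 2**30:.2f} GiB")  # 4.00 GiB
print(f" 3-bit cache: {q3 / 2**30:.2f} GiB")    # 0.75 GiB
print(f"ratio: {fp16 / q3:.2f}x")               # 5.33x
```

Under these assumptions, bit width alone gives 16/3, or about 5.3x, against a 16-bit baseline. The "more than 6x" figure is the one reported for TurboQuant, and this sketch does not attempt to reproduce it.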

Community Feedback and Initial Evaluation

[Source: GitHub, Hugging Face Public Community Data] The open-source community has reacted positively to the release of Gemma 4. Many developers said that the permissive Apache 2.0 license, combined with the significant gains in memory efficiency, further lowers the barrier to deploying large multimodal models and helps democratize AI technology. As of press time, Gemma 4 related projects had passed 10,000 stars on GitHub, and cumulative downloads on Hugging Face had exceeded 250,000.

Winzheng.com Research Lab conducted an initial evaluation of Gemma 4 based on the YZ Index v6 methodology:

The main list, core_overall_display, which covers the auditable dimensions of execution and grounding (material constraints), is still being tested; a complete evaluation report is expected within 72 hours;
Engineering judgment (side list, AI-assisted evaluation) provisionally ranks in the top 3 among open-source multimodal models of the same parameter scale, and task expression (side list, AI-assisted evaluation) is in line with the officially published figures;
Integrity rating: pass;
Operational signal dimensions: stability and usability data are still being collected.

Uncertainties and Future Outlook

[Source: winzheng.com Research Lab Technical Evaluation Framework] Several aspects of Gemma 4 remain to be verified. These include its overall performance relative to comparable open-source multimodal models such as Llama 3 and Qwen 2, its deployment performance in complex industry scenarios, and the precision loss of 3-bit KV compression at contexts longer than 128K. None of these is yet supported by publicly available third-party test data.
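As a small illustration of what measuring precision loss under 3-bit compression involves, the sketch below round-trips a random vector through a plain min/max uniform quantizer and reports the reconstruction error. This is a generic textbook scheme swapped in for illustration, not TurboQuant, whose algorithm is not described here. The helper names are hypothetical, and the random vector is only a stand-in for a real key or value vector.

```python
import random

def quantize(vector, bits=3):
    """Map floats to integer codes in [0, 2**bits - 1] using a per-vector min/max range."""
    levels = 2 ** bits - 1
    low, high = min(vector), max(vector)
    step = (high - low) / levels or 1.0  # fall back to 1.0 for constant vectors
    codes = [round((x - low) / step) for x in vector]
    return codes, low, step

def dequantize(codes, low, step):
    """Reconstruct approximate floats from the integer codes."""
    return [low + c * step for c in codes]

def relative_error(original, restored):
    """L2 norm of the reconstruction error divided by the L2 norm of the original."""
    num = sum((a - b) ** 2 for a, b in zip(original, restored)) ** 0.5
    den = sum(a * a for a in original) ** 0.5
    return num / den

random.seed(0)
vector = [random.gauss(0.0, 1.0) for _ in range(128)]  # stand-in for one key vector
codes, low, step = quantize(vector)
restored = dequantize(codes, low, step)
print("codes fit in 3 bits:", max(codes) <= 7)
print("relative L2 error:", relative_error(vector, restored))
```

A third-party evaluation would run the real method on real model activations and measure downstream task accuracy at long context lengths. This sketch only shows the shape of a round-trip error check.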

Winzheng.com, as a neutral AI professional portal, consistently adheres to the technical values of "fact verifiability, opinion traceability, and reproducibility of evaluation," with all technical conclusions produced based on standardized testing frameworks.

The release of Gemma 4 gives AI developers and enterprise users a new option among open-source multimodal models and offers winzheng.com readers new material for technical research and evaluation. winzheng.com Research Lab will go on to test Gemma 4's performance and deployment adaptability comprehensively and will publish neutral, professional evaluation results for readers as soon as possible.
