SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Framework Released

TL;DR

The SpecForge team, together with several industry partners including Ant, Meituan, Nex-AGI, and EigenAI, has released SpecBundle (Phase 1), a collection of production-grade EAGLE3 model checkpoints trained on large-scale datasets. SpecBundle aims to enhance the usability and practical performance of speculative decoding, with Phase 1 focusing on instruction-tuned models.

At the same time, SpecForge v0.2 introduces significant system upgrades, including a comprehensive refactoring to improve ease of use and support for multiple execution backends, further enhancing scalability and production readiness.

Background

Speculative decoding was first proposed in 2023 as a promising technique to accelerate Large Language Model (LLM) inference by using a lightweight draft model to propose multiple tokens, which are then verified by a stronger target model. This approach can, in principle, significantly reduce decoding latency without sacrificing output quality, making it suitable for both local and enterprise deployments. In recent years, the research community has continuously refined this paradigm, with state-of-the-art methods like EAGLE3 demonstrating strong theoretical guarantees and empirical gains in token acceptance rates and end-to-end acceleration.
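The draft-then-verify loop described above can be illustrated with a small sketch. It assumes greedy decoding and simple chain drafting, which is a simplification: EAGLE3 drafts from the target model's hidden features and production engines verify all proposals in one batched forward pass. The function names and the toy models are hypothetical and exist only for illustration.

```python
from typing import Callable, List

# A "model" here is any function mapping a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(prefix: List[int], draft: NextToken, target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify step and return the newly accepted tokens."""
    # 1. The draft model proposes k tokens autoregressively.
    proposed: List[int] = []
    for _ in range(k):
        proposed.append(draft(prefix + proposed))

    # 2. The target checks each proposal. A real engine scores all k positions
    #    in a single forward pass; this loop only emulates the outcome.
    accepted: List[int] = []
    for tok in proposed:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's token and drop the remaining drafts.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every proposal matched, so the target contributes one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def generate(prefix: List[int], draft: NextToken, target: NextToken, k: int, max_new_tokens: int) -> List[int]:
    """Repeat speculative steps until max_new_tokens tokens have been produced."""
    out = list(prefix)
    while len(out) - len(prefix) < max_new_tokens:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new_tokens]


# Toy models (hypothetical): the target counts upward modulo 10,
# and the draft agrees with it except right after token 4.
def toy_target(tokens: List[int]) -> int:
    return (tokens[-1] + 1) % 10


def toy_draft(tokens: List[int]) -> int:
    return 0 if tokens[-1] == 4 else (tokens[-1] + 1) % 10
```

Because every emitted token is either confirmed or supplied by the target, the output under greedy decoding is identical to what the target alone would generate; the draft only changes how many target passes are needed.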

Existing Problems

Despite these advances, speculative decoding—particularly SOTA methods like EAGLE3—has not been widely adopted in the open-source community. We believe this gap stems from three main factors.

Factor 1: Lack of user-friendly, production-ready training tools for speculative decoding models. Most existing implementations remain research prototypes that are poorly maintained, narrowly scoped, or limited to simple implementations without system-level optimizations. These tools struggle to support the diverse model architectures and scales common in today's LLM ecosystem.

Factor 2: The availability of high-quality draft models is a major bottleneck. The effectiveness of speculative decoding heavily depends on the strength of draft models, but such models are scarce in the open-source community. The table below summarizes the current situation. Methods like EAGLE3 require additional training of draft models, and publicly available EAGLE3 checkpoints are mainly limited to those released by the original authors, severely constraining wider adoption.

Model | Native MTP | Community EAGLE3 | SpecBundle
meta-llama/Llama-3.1-8B-Instruct
meta-llama/Llama-3.3-70B-Instruct
meta-llama/Llama-4-Scout-17B-16E-Instruct
Qwen/Qwen3-30B-A3B-Instruct-2507
Qwen/Qwen3-235B-A22B-Instruct-2507
Qwen/Qwen3-Next-80B-A3B-Instruct-FP8
Qwen/Qwen3-Coder-30B-A3B-Instruct
Qwen/Qwen3-Coder-480B-A35B-Instruct
inclusionAI/Ling-flash-2.0
moonshotai/Kimi-K2-Instruct
nex-agi/Qwen3-30B-A3B-Nex-N1
nex-agi/Qwen3-32B-Nex-N1

Factor 3: Most existing draft models are trained only on small or curated datasets rather than on the massive, diverse corpora used in modern LLM training. Consequently, these models generalize poorly when paired with strong target models, resulting in lower token acceptance rates and diminished practical acceleration. Without large-scale, production-grade draft models, the potential of advanced methods like EAGLE3 cannot be fully realized.
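To make the link between acceptance rate and acceleration concrete, the sketch below uses the standard estimate from the speculative decoding literature for chain-style drafting: if each drafted token is accepted independently with probability alpha and the draft proposes gamma tokens per step, the expected number of tokens produced per target forward pass is (1 - alpha^(gamma+1)) / (1 - alpha). The independence assumption is an idealization, and the function name is hypothetical; the numbers are illustrative arithmetic, not SpecBundle measurements.

```python
def expected_tokens_per_step(acceptance_rate: float, draft_len: int) -> float:
    """Expected tokens emitted per target forward pass.

    Assumes each drafted token is accepted independently with probability
    `acceptance_rate` and that the draft proposes `draft_len` tokens per step.
    """
    if not 0.0 <= acceptance_rate <= 1.0:
        raise ValueError("acceptance_rate must be in [0, 1]")
    if draft_len < 0:
        raise ValueError("draft_len must be non-negative")
    if acceptance_rate == 1.0:
        # Every draft token is accepted, plus the target's bonus token.
        return float(draft_len + 1)
    # Geometric series: 1 + a + a^2 + ... + a^draft_len
    return (1.0 - acceptance_rate ** (draft_len + 1)) / (1.0 - acceptance_rate)


if __name__ == "__main__":
    for alpha in (0.6, 0.8):
        print(alpha, round(expected_tokens_per_step(alpha, 4), 4))
```

Under these assumptions, raising the per-token acceptance rate from 0.6 to 0.8 with four drafted tokens lifts the expected yield from about 2.31 to about 3.36 tokens per target pass, which is why draft model quality matters so much for end-to-end speedup.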

Motivation

The above gaps motivated us to release SpecForge v0.2 and SpecBundle. As a neutral open-source community, the SpecForge team aims to actively advance speculative decoding by providing production-grade training frameworks and high-performance draft models, making the technology more practical and accessible.

This initiative brings several key benefits:

  • Expand research possibilities through standardized, scalable baselines, driving innovation in speculative decoding methods.
  • Enable faster local inference and model serving, supporting lightweight deployment scenarios like Ollama.
  • Reduce enterprise deployment costs using inference engines like SGLang, improving throughput without sacrificing output quality.
  • Provide EAGLE3 checkpoints as strong initialization points for efficient fine-tuning on domain-specific tasks.
  • Enhance reinforcement learning workflow efficiency, supporting integration with frameworks like ReSpec and slime.

SpecForge v0.2

SpecForge has been open-sourced for about five months. Thanks to community support, the system has evolved into a more reliable, efficient, and scalable solution. During the two months of training various models for SpecBundle, we identified several limitations in the original design. These insights drove a comprehensive upgrade of the framework, enhancing performance and ease of use. The main changes in SpecForge v0.2 are as follows.

User-Friendliness Improvements

In earlier versions, some features were developed independently without sufficient consideration for long-term maintenance and user experience, leading to user confusion. Over the past two months, we prioritized optimizing usability and undertook a major refactoring of the framework. Key improvements include:

  • Refactored data processing pipeline, eliminating redundancy and improving efficiency. For example, through data parallelism and asynchronous processing, data regeneration is 10x faster than v0.1.
  • Unified online and offline training scripts into a single implementation, ensuring consistent training logic and avoiding mode divergence.
  • Reorganized and clarified the documentation so that users can get started quickly and iterate.
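As a rough illustration of the data-parallel, asynchronous idea mentioned in the first bullet, the sketch below shards a list of prompts and processes the shards concurrently, then restores the original order. It is not SpecForge's actual pipeline: the function names are hypothetical, and a thread pool stands in for what would more plausibly be several inference server replicas.

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List, Optional, Sequence


def shard(items: Sequence[str], num_shards: int) -> List[List[str]]:
    """Split items into num_shards round-robin shards."""
    return [list(items[i::num_shards]) for i in range(num_shards)]


def regenerate_all(
    prompts: Sequence[str],
    regenerate_one: Callable[[str], str],
    num_workers: int,
) -> List[Optional[str]]:
    """Process shards concurrently and return outputs in the original order."""
    shards = shard(prompts, num_workers)

    def process(one_shard: List[str]) -> List[str]:
        # In a real setup this would call a model server; here it is a plain function.
        return [regenerate_one(p) for p in one_shard]

    with ThreadPoolExecutor(max_workers=num_workers) as pool:
        # map() runs shards concurrently and yields results in shard order.
        shard_results = list(pool.map(process, shards))

    # Undo the round-robin split: item j of shard i came from index i + j * num_workers.
    results: List[Optional[str]] = [None] * len(prompts)
    for shard_idx, outputs in enumerate(shard_results):
        for j, out in enumerate(outputs):
            results[shard_idx + j * num_workers] = out
    return results


if __name__ == "__main__":
    print(regenerate_all(["a", "b", "c", "d", "e"], str.upper, num_workers=2))
```

Threads help here only when the per-item work is I/O-bound, such as waiting on a remote generation request; CPU-bound preprocessing would need processes instead.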

Multi-Backend Support

Earlier versions heavily relied on internal target model implementations, making model support cumbersome and error-prone. To address this issue and better leverage the ecosystem, we introduced a unified target model integration interface.

In v0.2, we added the Eagle3TargetModel interface, enabling seamless support for multiple execution backends. SGLang and Hugging Face Transformers are currently integrated as backends. A new backend only needs to implement the Eagle3TargetModel.generate_eagle3_data method, which significantly lowers the barrier to extension and improves long-term maintainability.

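import torch

# get_eagle3_target_model is provided by SpecForge; args and target_model_kwargs
# come from the surrounding training script.
target_model = get_eagle3_target_model(
    pretrained_model_name_or_path="meta-llama/Llama-3.1-8B-Instruct",
    backend="sglang",  # selects the execution backend
    torch_dtype=torch.bfloat16,
    device="cuda",
    cache_dir=args.model_download_dir,
    **target_model_kwargs,
)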

These backends not only relieve developers from the burden of model implementation and performance optimization but also provide users with flexible choices to adapt to different training scenarios.
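The pattern of one abstract interface plus pluggable backends can be sketched as follows. This is a minimal illustration under stated assumptions, not SpecForge's code: the class Eagle3TargetModelSketch, the registry, the argument and return types of generate_eagle3_data, and the "toy" backend are all hypothetical. Only the interface name and the method name come from the text above.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, Dict, List, Type


class Eagle3TargetModelSketch(ABC):
    """Hypothetical stand-in for the Eagle3TargetModel interface."""

    @abstractmethod
    def generate_eagle3_data(self, input_ids: List[List[int]]) -> Dict[str, Any]:
        """Return whatever the draft model needs for training (signature assumed)."""


# Registry mapping backend names to implementations (hypothetical).
_BACKENDS: Dict[str, Type[Eagle3TargetModelSketch]] = {}


def register_backend(name: str) -> Callable[[Type[Eagle3TargetModelSketch]], Type[Eagle3TargetModelSketch]]:
    """Class decorator that makes a backend selectable by name."""
    def decorator(cls: Type[Eagle3TargetModelSketch]) -> Type[Eagle3TargetModelSketch]:
        _BACKENDS[name] = cls
        return cls
    return decorator


@register_backend("toy")
class ToyTargetModel(Eagle3TargetModelSketch):
    """Fake backend that fabricates hidden states instead of running a model."""

    def __init__(self, hidden_size: int = 4) -> None:
        self.hidden_size = hidden_size

    def generate_eagle3_data(self, input_ids: List[List[int]]) -> Dict[str, Any]:
        # One fake hidden vector per token, filled with the token id as a float.
        hidden_states = [[[float(t)] * self.hidden_size for t in seq] for seq in input_ids]
        return {"input_ids": input_ids, "hidden_states": hidden_states}


def get_target_model_sketch(backend: str, **kwargs: Any) -> Eagle3TargetModelSketch:
    """Look up a backend by name and construct it, mirroring a factory-style entry point."""
    try:
        cls = _BACKENDS[backend]
    except KeyError:
        raise ValueError(f"unknown backend {backend!r}; available: {sorted(_BACKENDS)}") from None
    return cls(**kwargs)
```

With this shape, the training loop depends only on the abstract method, so adding a backend means writing one class and registering it, without touching the training code.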

[Figure: Multi-backend support diagram]

SpecBundle Initiative

As mentioned earlier, the open-source community still faces significant bottlenecks in the availability and performance of speculative decoding solutions. SpecBundle is a direct response to these challenges—an initiative jointly driven by the open-source community and industry partners, aimed at equipping mainstream open-source models with high-performance EAGLE3 draft model weights to democratize speculative decoding. To our knowledge, this is the first public effort of its kind.

👉 Check out the SpecBundle documentation.

For this initial release, the SpecBundle roadmap focuses on instruction-tuned models. We believe that extending speculative decoding support to a broader range of models will further reduce costs for local and enterprise deployments.