From Research to Production: EAGLE-3 Accelerates Open Source LLM Inference 2-3x on Vertex AI

EAGLE-3 Vertex cover image

For practitioners working with LLM development, the bottleneck of generating one token at a time is a common pain point. Standard autoregressive generation is inherently a sequential process, leading to a memory bandwidth-limited compute pattern: GPU cores sit idle while time is mainly spent reading massive model weights from memory for each step.

The solution is speculative decoding. This optimization technique accelerates the slow token-by-token generation process of large LLMs (target models) by introducing a draft mechanism. The draft mechanism quickly proposes multiple subsequent tokens, which the target model then validates in parallel batches, accepting the longest matching prefix and continuing generation from there.

But not all draft mechanisms are efficient. The classic draft-target approach requires separately deploying a small LLM as a drafter, increasing hosting resources and costs.

Traditional draft model architecture

Enter EAGLE-3 (Extrapolative Attention Guided LEarning). Instead of requiring a complete independent model, it directly attaches an extremely lightweight draft head (only 2-5% of the target model size) to internal layers of the target model. This head operates at both feature and token levels, using hidden states to infer future token tree structures.

The result? All the advantages of speculative decoding are retained while eliminating the overhead of training and running a second model. Compared to complex multi-hundred-million parameter draft models, EAGLE-3 trains only a lightweight head, achieving 2x-3x decoding acceleration for models like Llama 70B (depending on workloads like multi-turn dialogue, code, long context, etc.).

EAGLE-3 target model architecture

Taking EAGLE-3 from paper to large-scale production cloud services still requires overcoming numerous engineering challenges. This article shares the technical pipeline, key issues, and valuable lessons learned.

Challenge 1: Data Preparation

EAGLE-3 heads require training, but public datasets often have issues:

  • Strict usage terms: Generated models prohibit use for competitive development.
  • PII contamination: Contains names, locations, financial identifiers, etc.
  • Unguaranteed quality: Only suitable for demos, not customer production workloads.

Lesson 1: Build a synthetic data generation pipeline

Solution: Select high-quality datasets based on customer scenarios, extract user prompts, apply DLP and PII filtering, apply chat templates, tokenize, then input to target models (e.g., Llama 3.3 70B) to generate responses. This method produces compliant, clean data matching model distributions, perfectly suited for draft head training.

Data generation pipeline

Challenge 2: Engineering the Training Pipeline

Data input methods are divided into online training (real-time embedding generation) and offline training (pre-computed embeddings). We chose offline due to lower hardware requirements: pre-computed features/embeddings are stored in GCS and used for training. Given EAGLE-3 heads are small, initial training requires only one host for a day; this extends to several days with expanded datasets.

Training pipeline

Lesson 2: Chat templates are indispensable

During instruction-tuned model training, if the target model's chat template (e.g., Llama 3) isn't used, embeddings are incorrect, and the head learns biased distributions.

Lesson 3: Pay attention to masks

Training inputs contain both prompts and responses, but EAGLE-3 only predicts responses. Prompts must be manually masked in the loss function, otherwise the head wastes capacity predicting known prompts, degrading performance.

Mask processing diagram

Challenge 3: Serving and Scaling

Lesson 4: Serving framework is crucial

Working with the SGLang team, we efficiently deployed EAGLE-3 to production. SGLang's tree attention kernel is specifically designed for parallel verification of EAGLE-3's draft tree (branch paths), avoiding performance loss.

Lesson 5: Don't let CPU bottleneck GPU

After EAGLE-3 acceleration, CPU overhead (e.g., kernel launches, metadata management) becomes the new bottleneck. In synchronous schedulers, GPUs idle after Draft while waiting for CPU to prepare Verify.

Normal scheduler

SGLang's Zero-Overhead Overlap Scheduler solves this: using FutureMap, CPU prepares the next Draft/Extend in parallel while GPU executes Verify, eliminating idle time and achieving an additional 10%-20% acceleration.

Overlap scheduler

Benchmark Results

Results are impressive: On SGLang, using Llama 4 Scout 17B Instruct benchmarks, EAGLE-3 achieves 2x-3x reduction in decoding latency and throughput improvement.

Metric 1: Median Time Per Output Token (TPOT)

TPOT benchmark graph

The green EAGLE-3 line shows consistently lower TPOT than the blue baseline across all concurrency levels, indicating superior latency.

Metric 2: Output Throughput

Throughput benchmark graph

The green line consistently outperforms the baseline significantly across concurrency levels, demonstrating EAGLE-3's advantages. Check the complete notebook to run your own benchmarks.