SGLang Optimizes GLM4-MoE Production Deployment: 65% TTFT Improvement

Feb 4, 2026 - Source: LMSYS

LMSYS, GLM4-MoE, SGLang, Performance Optimization, TTFT, Suffix Decoding

Quick Overview

Novita AI has developed a series of production-proven, high-impact optimizations for deploying GLM4-MoE models on SGLang. We present an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline, from kernel execution efficiency to cross-node data transfer scheduling. By integrating Shared Experts Fusion and Suffix Decoding, we achieved significant improvements in key production metrics under agentic coding workloads:

Up to 65% reduction in TTFT (time to first token)
22% improvement in TPOT (time per output token)

All results were validated on H200 clusters with TP8 and FP8 configurations, providing a battle-tested blueprint for high throughput and low latency in demanding production environments.

Implementation of Core Production Optimizations for GLM4-MoE

1. Shared Experts Fusion

SGLang PR #13873: Shared Experts Fusion

This optimization was inspired by earlier work on DeepSeek models. MoE models like GLM4.7 send every input token through a shared expert, while each token is also routed to the top-k routed experts selected by the model's router. The outputs of all selected experts are then combined as a weighted sum. For example, GLM4.7 has 160 routed experts and 1 shared expert, and each token selects its top 8 routed experts. Early implementations processed these two paths independently. Because they share the same tensor shapes and computation flow, they can be unified: the shared expert is folded into the routed MoE structure, so each token selects 9 of 161 experts, with the shared expert fixed in the 9th slot.
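The selection step can be illustrated with a minimal pure-Python sketch. It is not the kernel from the PR: the function name `fused_topk`, the renormalization of the top-8 scores, and the fixed shared-expert weight of 1.0 are assumptions made for illustration. Only the counts (160 routed experts, top 8, shared expert in the 9th slot) come from the text.

```python
NUM_ROUTED_EXPERTS = 160
TOP_K = 8
SHARED_EXPERT_ID = NUM_ROUTED_EXPERTS  # the shared expert is appended as index 160


def fused_topk(router_scores, shared_weight=1.0):
    """Return (expert_ids, weights) with 9 entries for one token.

    The first 8 slots are the top-scoring routed experts; the 9th slot is
    always the shared expert. Weight handling here is an assumption.
    """
    if len(router_scores) != NUM_ROUTED_EXPERTS:
        raise ValueError("expected one score per routed expert")
    ranked = sorted(range(NUM_ROUTED_EXPERTS),
                    key=lambda e: router_scores[e], reverse=True)
    top_ids = ranked[:TOP_K]
    total = sum(router_scores[e] for e in top_ids)
    weights = [router_scores[e] / total for e in top_ids]
    # Append the shared expert so one grouped MoE call covers all 9 experts.
    return top_ids + [SHARED_EXPERT_ID], weights + [shared_weight]


# Hypothetical positive router scores for a single token.
scores = [((i * 37) % 160 + 1) / 160 for i in range(NUM_ROUTED_EXPERTS)]
ids, weights = fused_topk(scores)
print(ids)
print(weights)
```

With the shared expert expressed as just another entry in the id and weight lists, a single fused MoE kernel can process all nine experts instead of running the shared expert as a separate small matmul.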

As described in the PR, this optimization yields up to 23.7% improvement in TTFT and 20.8% in inter-token latency (ITL). Under TP8 and FP8 configurations (with an intermediate size of only 192, which is small for H200 hardware), the fusion significantly improves streaming multiprocessor (SM) utilization and substantially reduces memory I/O overhead.

2. QK-Norm-RoPE Fusion

SGLang PR #15141: Qknorm Fusion
SGLang PR #15305: Qknorm Fusion Fix

This optimization is ported from the Qwen-MoE implementation. The core idea is simple: QK normalization and RoPE are both head-wise computations, so they merge naturally into a single kernel. Our contribution lies in adapting it for the GLM4-MoE variant, which has a special case where only half of the dimensions within each head are rotated.
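A minimal reference sketch of the per-head math that such a fused kernel would compute is shown below. It assumes an RMS-style normalization followed by a rotary embedding applied to the first half of each head, with a half-split pairing of rotated dimensions and a base of 10000. The pairing convention, the base, and all function names are illustrative assumptions, not details taken from the PRs.

```python
import math


def rms_norm(x, weight, eps=1e-6):
    """RMS-normalize one head vector and apply a per-dimension weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def partial_rope(x, position, rotary_dim, base=10000.0):
    """Rotate only the first `rotary_dim` dimensions; pass the rest through."""
    half = rotary_dim // 2
    out = list(x)
    for i in range(half):
        theta = position * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = x[i], x[i + half]  # half-split pairing (assumed)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def fused_qk_norm_rope(head, weight, position):
    """Norm then partial RoPE for one head, with half the head rotated."""
    rotary_dim = len(head) // 2
    return partial_rope(rms_norm(head, weight), position, rotary_dim)


head = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
weight = [1.0] * len(head)
normed = rms_norm(head, weight)
out = fused_qk_norm_rope(head, weight, position=3)
print(out)
```

Both steps touch each head exactly once and need no data from other heads, which is why a single kernel can do the normalization and the rotation while the head is still in registers.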

3. Async Transfer

SGLang PR #14782: Async Transfer

When prefill-decode (PD) disaggregation is combined with overlap scheduling, throughput improves by approximately 10%, but TTFT degrades significantly. We observed that in the current prefill implementation, the data transfer for a finished batch is delayed until after the next batch's kernels have been launched. For 92-layer models like GLM4.7, launching kernels without CUDA Graph takes considerable time (often hundreds of milliseconds, sometimes over 1 second).

Our modification moves the transfer step earlier: it is scheduled immediately after the corresponding GPU operation completes and runs in a separate thread. By carefully guarding the shared data structures against data races, we avoid blocking the main thread.

For models with frequent kernel launches, the effect is substantial: under high load, TTFT can drop by up to 1 second.
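The scheduling idea can be sketched with a background worker thread and a thread-safe queue. This is an illustration only: the class `AsyncTransferWorker`, the `send_fn` callback, and the string payloads are hypothetical stand-ins, and a real implementation would wait on a GPU completion signal before handing the KV data to the transfer engine.

```python
import queue
import threading


class AsyncTransferWorker:
    """Runs transfers on a separate thread so the scheduler loop never blocks."""

    def __init__(self, send_fn):
        self._queue = queue.Queue()  # thread-safe handoff, no shared mutable state
        self._send_fn = send_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, batch_id, payload):
        """Enqueue a finished batch; returns immediately."""
        self._queue.put((batch_id, payload))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # sentinel: shut down
                break
            self._send_fn(*item)

    def close(self):
        """Drain pending transfers and stop the worker."""
        self._queue.put(None)
        self._thread.join()


sent = []
worker = AsyncTransferWorker(lambda batch_id, payload: sent.append(batch_id))
for batch_id in range(3):
    payload = f"kv-{batch_id}"         # stand-in for the batch's prefill output
    worker.submit(batch_id, payload)   # transfer is queued right after compute
    # ...the next batch's kernel launches would happen here, in parallel...
worker.close()
print(sent)
```

Because the main loop only enqueues work, a slow kernel launch for the next batch no longer delays the transfer of the batch that has already finished.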

Production Benchmark Results

After implementing the above optimizations, GLM4-MoE model performance improved significantly under the benchmark configuration below.

Benchmark Configuration

Input Length: 4096
Output Length: 1000
Request Rate: 14 req/s
Model: GLM-4.7 FP8 (TP8)

These optimizations are no longer experimental: they have been deployed and validated in Novita AI's production inference service.

Suffix Decoding

Agentic coding scenarios (like Cursor and Claude Code) have numerous reusable code patterns, making them suitable for targeted optimizations like Suffix Decoding.

Background: Inference Bottlenecks in Agentic Coding

LLM agents excel at code generation, but latency remains a challenge. Traditional speculative decoding speeds up generation by predicting several tokens ahead, but it requires training an additional draft model, which adds engineering complexity.

How Suffix Decoding Works

Suffix Decoding is completely model-agnostic:

No additional model weights required
Leverages historical output sequence patterns to predict subsequent tokens
When the suffix of the current request matches a historical pattern, speculation follows the historical continuation
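The matching step can be illustrated with a simplified sketch. It swaps the suffix-tree structure that a production implementation would likely use for a plain n-gram dictionary, and the class name `SuffixSpeculator`, its parameters, and the "most recent match wins" rule are assumptions for illustration.

```python
from collections import defaultdict


class SuffixSpeculator:
    """Proposes draft tokens by matching the current suffix against history."""

    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.history = []               # previously generated token sequences
        self.index = defaultdict(list)  # n-gram -> [(sequence id, next position)]

    def add_history(self, tokens):
        seq_id = len(self.history)
        self.history.append(list(tokens))
        for end in range(1, len(tokens)):
            for n in range(1, self.max_depth + 1):
                if end - n < 0:
                    break
                self.index[tuple(tokens[end - n:end])].append((seq_id, end))

    def propose(self, current, num_draft=4):
        """Return up to `num_draft` tokens following the longest matching suffix."""
        for n in range(min(self.max_depth, len(current)), 0, -1):
            hits = self.index.get(tuple(current[-n:]))
            if hits:
                seq_id, pos = hits[-1]  # most recent occurrence (assumed policy)
                return self.history[seq_id][pos:pos + num_draft]
        return []  # no match: fall back to normal decoding


spec = SuffixSpeculator()
spec.add_history(["Let", "me", "read", "the", "file", "."])
draft = spec.propose(["OK", ".", "Let", "me"])
print(draft)
```

The proposed tokens are only drafts: the target model still verifies them, so a wrong guess costs some wasted compute but never changes the output.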

Data Validation: Output Pattern Repetition Analysis

Analysis of 22 Claude Code sessions (17,487 conversation rounds) revealed:

39.3% output pattern repetition: Tool calls and response patterns are frequently similar
Highly structured agent behavior: Fixed phrases like "Let me...", "Now let me..." appear frequently

To support further research, we open-sourced the evaluation dataset on Hugging Face: Agentic Code Dataset.

Performance Comparison

Combined with built-in MTP acceleration, Suffix Decoding further reduces TPOT by about 22% (from 25.13 ms to 19.63 ms):

Metric	MTP	Suffix Decoding	Change
Average TPOT	25.13 ms	19.63 ms	-21.90%
Median TPOT	25.95 ms	20.05 ms	-22.70%
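The Change column is the relative difference between the two TPOT values. A short check, using only the rounded millisecond figures from the table (so the second decimal may differ slightly from the published percentages):

```python
def relative_change_pct(before_ms, after_ms):
    """Relative change from `before_ms` to `after_ms`, in percent."""
    return (after_ms - before_ms) / before_ms * 100.0


avg_change = relative_change_pct(25.13, 19.63)
median_change = relative_change_pct(25.95, 20.05)
print(f"average TPOT change: {avg_change:.2f}%")
print(f"median TPOT change: {median_change:.2f}%")
```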

Conclusion

These combined optimizations provide comprehensive performance improvements for SGLang deployments:

Optimization	Impact/Benefits
Shared Experts Fusion	Addresses computational efficiency in MoE models
QK-Norm-RoPE Fusion	Reduces kernel launch overhead
Async Transfer	Optimizes data movement for disaggregated deployment
Suffix Decoding	Leverages agentic coding pattern repetition for speculative decoding

Most components have been upstreamed or are being integrated; check the SGLang repository for the current status.

Reproduction Guide

Only key performance parameters are listed. Complete launch scripts (baseline vs optimized), benchmark tools, and profiling traces are available on GitHub: novitalabs/sglang (glm_suffix branch).

Core SGLang Runtime Optimization Flags

--tp-size 8
--kv-cache-dtype fp8_e4m3
--attention-backend fa3
--chunked-prefill-size 16384
--enable-flashinfer-allreduce-fusion
--enable-fused-qk-norm-rope
--enable-shared-experts-fusion
--disaggregation-async-transfer

Speculative Decoding Configuration (Agentic Coding Workloads)

--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4

Suffix Decoding Configuration (Optional)

--speculative-algorithm SUFFIX
--speculative-suffix-cache-max-depth 64
--speculative-suffix-max-spec-factor 1.0
--speculative-suffix-min-token-prob 0.1
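As a convenience, the flag lists above can be assembled into a single launch command. The sketch below is a hypothetical helper, not one of the published launch scripts: the model path is a placeholder, the flags are copied verbatim from the lists above and may only exist on the referenced branch, and treating the NEXTN and SUFFIX lists as alternatives is an assumption based on both setting `--speculative-algorithm`. A real PD-disaggregated deployment needs further options that are not listed here.

```python
import shlex

BASE_FLAGS = [
    "--tp-size", "8",
    "--kv-cache-dtype", "fp8_e4m3",
    "--attention-backend", "fa3",
    "--chunked-prefill-size", "16384",
    "--enable-flashinfer-allreduce-fusion",
    "--enable-fused-qk-norm-rope",
    "--enable-shared-experts-fusion",
    "--disaggregation-async-transfer",
]

SPECULATIVE_FLAGS = {
    "NEXTN": [
        "--speculative-algorithm", "NEXTN",
        "--speculative-num-steps", "3",
        "--speculative-eagle-topk", "1",
        "--speculative-num-draft-tokens", "4",
    ],
    "SUFFIX": [
        "--speculative-algorithm", "SUFFIX",
        "--speculative-suffix-cache-max-depth", "64",
        "--speculative-suffix-max-spec-factor", "1.0",
        "--speculative-suffix-min-token-prob", "0.1",
    ],
}


def build_command(model_path, speculative="NEXTN"):
    """Return the launch command as an argv list for the chosen mode."""
    return [
        "python", "-m", "sglang.launch_server",
        "--model-path", model_path,
        *BASE_FLAGS,
        *SPECULATIVE_FLAGS[speculative],
    ]


# "/path/to/GLM-4.7-FP8" is a placeholder, not a real checkpoint location.
command = build_command("/path/to/GLM-4.7-FP8", speculative="SUFFIX")
print(shlex.join(command))
```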