Quick Overview
Novita AI has developed a series of production-proven, high-impact optimizations for deploying GLM4-MoE models on SGLang. We present an end-to-end performance optimization strategy that addresses bottlenecks across the entire inference pipeline—from kernel execution efficiency to cross-node data transfer scheduling. By integrating Shared Experts Fusion and Suffix Decoding, we achieved significant improvements in key production metrics under agentic coding workloads:
- Up to 65% reduction in TTFT
- 22% improvement in TPOT
All results were validated on H200 clusters with TP8 and FP8 configurations, providing a battle-tested blueprint for high throughput and low latency in demanding production environments.
Implementation of Core Production Optimizations for GLM-MoE
1. Shared Experts Fusion
SGLang PR #13873: Shared Experts Fusion

This optimization was inspired by the original work on DeepSeek models. As shown above, MoE models like GLM-4.7 send every input token to a shared expert, and the router additionally sends each token to its top-k routed experts. The expert outputs are then combined as a weighted sum. For example, GLM-4.7 has 160 routed experts and 1 shared expert, and each token selects the top 8 routed experts. Early implementations processed the shared and routed paths independently. Because both paths have the same tensor shapes and computation flow, they can be unified: the shared expert is folded into the routed MoE structure, so each token selects 9 of 161 experts, with the shared expert fixed in the 9th slot.
As described in the PR, this optimization yields up to 23.7% improvement in TTFT and 20.8% in ITL. Under TP8 and FP8 configurations (with an intermediate size of only 192, which is small for H200 hardware), the fusion operation significantly improves Streaming Multiprocessor (SM) utilization and substantially reduces memory I/O overhead.
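The routing change can be sketched in a few lines of plain Python. This is a toy illustration, not the kernel from the PR: the softmax gate, the renormalization over the top 8, the shared-expert weight of 1.0, and the scalar `expert` function are all assumptions made for the example. Only the counts (160 routed experts, top 8, shared expert in the 9th slot) come from the text.

```python
import math
import random

NUM_ROUTED = 160         # routed experts (from the text)
TOP_K = 8                # routed experts selected per token (from the text)
SHARED_ID = NUM_ROUTED   # shared expert appended as expert index 160 (assumed layout)


def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route(logits, shared_weight=1.0):
    """Return (expert_ids, weights) with the shared expert fixed in the 9th slot."""
    probs = softmax(logits)  # assumed gating function
    top = sorted(range(NUM_ROUTED), key=lambda i: probs[i], reverse=True)[:TOP_K]
    norm = sum(probs[i] for i in top)
    ids = top + [SHARED_ID]
    weights = [probs[i] / norm for i in top] + [shared_weight]
    return ids, weights


def expert(expert_id, x):
    # Toy stand-in for an expert MLP: every expert is just a different scaling.
    return (expert_id + 1) * 0.01 * x


def moe_separate(logits, x):
    """Unfused path: routed experts and the shared expert are computed separately."""
    ids, w = route(logits)
    routed = sum(wi * expert(i, x) for i, wi in zip(ids[:TOP_K], w[:TOP_K]))
    return routed + expert(SHARED_ID, x)


def moe_fused(logits, x):
    """Fused path: one weighted sum over 9 of 161 experts."""
    ids, w = route(logits)
    return sum(wi * expert(i, x) for i, wi in zip(ids, w))


if __name__ == "__main__":
    random.seed(0)
    logits = [random.gauss(0.0, 1.0) for _ in range(NUM_ROUTED)]
    print(moe_separate(logits, 2.0), moe_fused(logits, 2.0))
```

Both paths produce the same value. The benefit of the fused form is that a single grouped expert computation covers all nine experts, rather than one MoE pass plus a separate small dense pass.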
2. QK-Norm-RoPE Fusion
SGLang PR #15141: Qknorm Fusion
SGLang PR #15305: Qknorm Fusion Fix

This optimization was ported from the Qwen-MoE implementation. The core idea is simple: QK normalization and RoPE are both head-wise computations, so they merge naturally into a single kernel. Our contribution is adapting the kernel to the GLM4-MoE variant, which has a special case: only half of the dimensions within each head are rotated.
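As a rough illustration of why the two steps fuse cleanly, the sketch below applies RMS normalization and a partial rotary embedding to one head, first as two passes and then as a single pass. It is a plain-Python model of the math, not the CUDA kernel from the PRs. The pairing of dimension `i` with `i + rotary_dim // 2`, the base of 10000, and the function names are assumptions made for the example.

```python
import math


def rms_norm(head, weight, eps=1e-6):
    """Head-wise RMS normalization with a learned per-dimension weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in head) / len(head) + eps)
    return [v * scale * w for v, w in zip(head, weight)]


def partial_rope(head, pos, rotary_dim, base=10000.0):
    """Rotate only the first rotary_dim dimensions; the rest pass through."""
    half = rotary_dim // 2
    out = list(head)
    for i in range(half):
        theta = pos * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        a, b = head[i], head[i + half]
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    return out


def fused_norm_rope(head, weight, pos, rotary_dim, base=10000.0, eps=1e-6):
    """Single pass: each element is normalized and (if rotary) rotated once."""
    scale = 1.0 / math.sqrt(sum(v * v for v in head) / len(head) + eps)
    half = rotary_dim // 2
    out = [0.0] * len(head)
    for i in range(half):
        a = head[i] * scale * weight[i]
        b = head[i + half] * scale * weight[i + half]
        theta = pos * base ** (-2.0 * i / rotary_dim)
        c, s = math.cos(theta), math.sin(theta)
        out[i] = a * c - b * s
        out[i + half] = a * s + b * c
    for i in range(rotary_dim, len(head)):
        out[i] = head[i] * scale * weight[i]  # non-rotary half: norm only
    return out


if __name__ == "__main__":
    head = [0.5, -1.0, 2.0, 0.25, 1.5, -0.75, 3.0, 1.0]
    weight = [1.0] * len(head)
    # rotary_dim is half of head_dim, matching the GLM4-MoE special case in the text.
    print(fused_norm_rope(head, weight, pos=3, rotary_dim=len(head) // 2))
```

The same function would be applied to every Q head and every K head, each with its own norm weight. In a real kernel the single pass saves one read and one write of the Q and K tensors, plus one kernel launch per layer.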
3. Async Transfer
SGLang PR #14782: Async Transfer

With prefill-decode (PD) disaggregation and overlap scheduling enabled, throughput improves by approximately 10%, but TTFT degrades significantly. We observed that the current prefill implementation delays data transfer until after the next batch's kernels have been launched. For a 92-layer model like GLM-4.7, launching kernels without CUDA Graph takes considerable time (often hundreds of milliseconds, sometimes more than 1 second).
Our modification moves the transfer step earlier, so that it is scheduled as soon as the corresponding GPU operation completes, and runs it in a separate thread. Careful handling of the data structures exposed to races keeps this from blocking the main thread.
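The scheduling pattern can be sketched with the Python standard library. This is a simplified model under stated assumptions, not the code from the PR: `TransferWorker` is a hypothetical name, `threading.Event` stands in for a GPU completion event, and `send_fn` stands in for the actual KV cache transfer.

```python
import queue
import threading


class TransferWorker:
    """Runs transfers on a background thread so the scheduler loop never waits."""

    def __init__(self, send_fn):
        self._queue = queue.Queue()  # thread-safe handoff; no shared mutable state
        self._send = send_fn
        self._thread = threading.Thread(target=self._run, daemon=True)
        self._thread.start()

    def submit(self, batch_id, gpu_done):
        """Called by the main thread right after it launches a batch; returns at once."""
        self._queue.put((batch_id, gpu_done))

    def _run(self):
        while True:
            item = self._queue.get()
            if item is None:  # shutdown sentinel
                return
            batch_id, gpu_done = item
            gpu_done.wait()   # stand-in for waiting on a GPU completion event
            self._send(batch_id)

    def close(self):
        self._queue.put(None)
        self._thread.join()


if __name__ == "__main__":
    sent = []
    worker = TransferWorker(sent.append)
    events = [threading.Event() for _ in range(3)]
    for batch_id, event in enumerate(events):
        worker.submit(batch_id, event)  # main thread moves on to the next launch
    for event in events:
        event.set()                     # the "GPU" finishes each batch
    worker.close()
    print(sent)
```

The main thread only enqueues work, so the time it spends launching the next batch's kernels no longer sits between a finished prefill and its transfer.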
For models with frequent kernel launches, this optimization has a large effect: under high load, it cuts TTFT by up to 1 second, as shown below.

Production Benchmark Results
After implementing the above optimizations, GLM-MoE model performance improved significantly, as shown in the benchmark results below.
Benchmark Configuration
- Input Length: 4096
- Output Length: 1000
- Request Rate: 14 req/s
- Model: GLM-4.7 FP8 (TP8)


These optimizations are no longer experimental—they have been deployed and validated in Novita AI's production inference service.
Suffix Decoding
Agentic coding scenarios (like Cursor and Claude Code) have numerous reusable code patterns, making them suitable for targeted optimizations like Suffix Decoding.
Background: Inference Bottlenecks in Agentic Coding
LLM agents excel at code generation, but latency remains a challenge. Traditional speculative decoding speeds up generation by drafting several tokens ahead, but it requires training an additional draft model, which adds engineering complexity.
How Suffix Decoding Works

Suffix Decoding is completely model-agnostic:
- No additional model weights required
- Leverages historical output sequence patterns to predict subsequent tokens
- When current request suffixes match historical patterns, speculation follows the historical sequence
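The matching step in the list above can be sketched as follows. This is a deliberately simplified stand-in: it indexes history with a dictionary of short token windows, whereas SuffixDecoding as published uses suffix trees with frequency statistics. The class name, the `max_depth` default, and the "latest occurrence wins" policy are assumptions made for the example.

```python
class SuffixCache:
    """Toy draft source: propose tokens that followed the same suffix in past outputs."""

    def __init__(self, max_depth=4):
        self.max_depth = max_depth
        self.index = {}  # token window (tuple) -> (past sequence, position after window)

    def add(self, tokens):
        """Index every window of up to max_depth tokens in a finished output."""
        seq = list(tokens)
        for end in range(1, len(seq)):
            for n in range(1, min(self.max_depth, end) + 1):
                self.index[tuple(seq[end - n:end])] = (seq, end)

    def propose(self, context, num_draft=4):
        """Return draft tokens for the longest matching suffix of context, else []."""
        for n in range(min(self.max_depth, len(context)), 0, -1):
            hit = self.index.get(tuple(context[-n:]))
            if hit is not None:
                seq, pos = hit
                return seq[pos:pos + num_draft]
        return []


if __name__ == "__main__":
    cache = SuffixCache()
    cache.add(["Let", "me", "read", "the", "file", "."])
    print(cache.propose(["OK", ".", "Let", "me"]))
```

The target model then verifies the drafted tokens in one forward pass and keeps the longest accepted prefix, so a wrong guess costs little and a correct guess yields several tokens per step.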
Data Validation: Output Pattern Repetition Analysis
Analysis of 22 Claude Code sessions (17,487 conversation rounds) revealed:
- 39.3% output pattern repetition: tool calls and response patterns are frequently similar
- Highly structured agent behavior: fixed phrases like "Let me..." and "Now let me..." appear frequently
To support further research, we open-sourced the evaluation dataset on Hugging Face: Agentic Code Dataset.
Performance Comparison
Combined with built-in MTP acceleration, Suffix Decoding further reduces TPOT by 22% (from 25.13ms to 19.63ms):
| Metric | MTP | Suffix Decoding | Change |
|---|---|---|---|
| Average TPOT | 25.13 ms | 19.63 ms | -21.90% |
| Median TPOT | 25.95 ms | 20.05 ms | -22.70% |
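For readers who want to check the Change column, the relative change is (new - old) / old. A minimal sketch, with hypothetical helper names, that recomputes it from the two TPOT columns:

```python
def pct_change(old_ms, new_ms):
    """Relative change in percent; negative means latency went down."""
    return (new_ms - old_ms) / old_ms * 100.0


def speedup(old_ms, new_ms):
    """Per-token decode speedup implied by the two latencies."""
    return old_ms / new_ms


if __name__ == "__main__":
    for name, old, new in [("Average TPOT", 25.13, 19.63), ("Median TPOT", 25.95, 20.05)]:
        print(name, round(pct_change(old, new), 1), round(speedup(old, new), 2))
```

Rounded to one decimal place, the results match the table's -21.9% and -22.7%.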
Conclusion
These combined optimizations provide comprehensive performance improvements for SGLang deployments:
| Optimization | Impact/Benefits |
|---|---|
| Shared Experts Fusion | Addresses computational efficiency in MoE models |
| QK-Norm-RoPE Fusion | Reduces kernel launch overhead |
| Async Transfer | Optimizes data movement for disaggregated deployment |
| Suffix Decoding | Leverages agentic coding pattern repetition for speculative decoding |
Most components have been upstreamed or are being integrated—check the SGLang repository.
Reproduction Guide
Only key performance parameters are listed. Complete launch scripts (baseline vs optimized), benchmark tools, and profiling traces are available on GitHub: novitalabs/sglang (glm_suffix branch).
Core SGLang Runtime Optimization Flags
--tp-size 8
--kv-cache-dtype fp8_e4m3
--attention-backend fa3
--chunked-prefill-size 16384
--enable-flashinfer-allreduce-fusion
--enable-fused-qk-norm-rope
--enable-shared-experts-fusion
--disaggregation-async-transfer
Speculative Decoding Configuration (Agentic Coding Workloads)
--speculative-algorithm NEXTN
--speculative-num-steps 3
--speculative-eagle-topk 1
--speculative-num-draft-tokens 4
Suffix Decoding Configuration (Optional)
--speculative-algorithm SUFFIX
--speculative-suffix-cache-max-depth 64
--speculative-suffix-max-spec-factor 1.0
--speculative-suffix-min-token-prob 0.1
References
- SGLang PR #13873: Shared Experts Optimization
- Snowflake Engineering Blog: SuffixDecoding at Production Scale
- NeurIPS Paper: SuffixDecoding
- Arctic Inference Repository