SGLang-Diffusion: Two Months of Progress
SGLang-Diffusion has achieved a 2.5x performance improvement since its launch in November 2025, adding support for new models, LoRA, parallel processing, and ComfyUI integration.
We successfully optimized GPT-OSS 20B and 120B models on NVIDIA DGX Spark using SGLang, achieving state-of-the-art performance of ~70 tokens/s and ~50 tokens/s respectively, enabling fully local AI applications including coding agents.
We introduce Mini-SGLang, a lightweight yet high-performance large language model (LLM) inference framework that preserves core state-of-the-art features in just 5k lines of Python code, serving as both a reliable inference engine and a transparent reference implementation for researchers and developers.
SGLang introduces an Encoder-Prefill-Decode (EPD) disaggregation architecture that separates vision encoding from language processing in VLMs, enabling independent scaling and reducing time-to-first-token (TTFT) by 6-8x in image-intensive scenarios.
Novita AI developed production-proven optimizations for deploying GLM4-MoE models on SGLang, achieving up to a 65% TTFT reduction and a 22% time-per-output-token (TPOT) improvement through Shared Experts Fusion and Suffix Decoding techniques.