Mini-SGLang: A Complete Analysis of the Lightweight and Efficient LLM Inference Engine

Feb 4, 2026 925 Views - Read Source LMSYS

LMSYS Mini-SGLang LLM推理 SGLang 性能优化基准测试

We are excited to introduce Mini-SGLang, a lightweight yet high-performance inference framework for Large Language Models (LLMs). Derived from the SGLang project, it aims to demystify the complexity of modern serving systems. Despite its compact codebase, it retains core features that define state-of-the-art performance, including Radix Attention for efficient KV cache reuse, Chunked Prefill for memory footprint control, Overlap Scheduling for reduced CPU overhead, and Tensor Parallelism for scalable distributed serving. Mini-SGLang provides OpenAI-compatible APIs and supports models like Llama-3 and Qwen-3 out of the box, serving as both a reliable inference engine and a transparent reference implementation for researchers and developers.

Source code: https://github.com/sgl-project/mini-sglang.

Motivation: Why Mini-SGLang?

While SGLang has achieved state-of-the-art inference performance and features, its codebase has grown to nearly 300k lines of Python code. To lower the barrier for learning and research, we developed Mini-SGLang with two main goals: providing a learning resource and accelerating research prototyping.

Educational Use

Mini-SGLang features a clean, modular codebase with only 5k lines of Python code, making it easier for beginners to understand the core components of modern LLM serving engines.

Despite its simplicity, it supports both online and offline inference and implements key modern optimizations including Tensor Parallelism, Overlap Scheduling, Chunked Prefill, Radix Cache, and JIT CUDA kernels, making it a comprehensive learning resource.

Rapid Research Prototyping

Many ML and systems researchers struggle to integrate optimizations into existing frameworks. On one hand, injecting new logic into complex frameworks like SGLang is risky and can break implicit invariants leading to subtle bugs; on the other hand, building an inference engine from scratch is tedious and requires substantial infrastructure (frontend servers, tokenizers, NCCL communication) to match baseline performance.

Mini-SGLang strikes a balance. It originated as our prototype for validating new systems ideas without spending weeks dealing with large-scale code or reimplementing infrastructure. It offers high performance out of the box, is easy to inspect and extend with optimizations, while handling infrastructure heavy lifting. We additionally provide OpenAI-compatible benchmarking tools for convenient end-to-end performance analysis and comparison with SGLang, vLLM, and TensorRT-LLM. Kernel developers can leverage fine-grained NVTX annotations to aid debugging and performance profiling.

Core Features

Mini-SGLang shares the high-level system architecture with SGLang, including a frontend API server, tokenizer server, and backend schedulers for each GPU.

System Architecture Diagram

Overlap Scheduling

LLM inference involves more than GPU computation - CPUs handle significant work including batch scheduling, memory management, and token processing. Without optimization, CPU overhead can cause GPU idling, impacting overall performance.

Mini-SGLang implements an overlap scheduling mechanism similar to SGLang, where the CPU prepares the next batch of requests while the GPU processes the current batch, effectively hiding CPU overhead. The Nsight-Systems profile below shows that GPU utilization remains saturated without idle time, improving throughput. See our previous blog post for details.

Overlap scheduling execution example with CPU overhead completely hidden

No overlap scheduling execution example with GPU blocking due to CPU overhead

To run ablation experiments without overlap scheduling: set environment variable MINISGL_DISABLE_OVERLAP_SCHEDULING=1.

High-Performance Kernels

Mini-SGLang integrates state-of-the-art attention kernels to ensure top-tier performance. On NVIDIA Hopper architecture, it uses FlashAttention-3 for prefill and FlashInfer for decode.

Drawing from FlashInfer and SGLang, Mini-SGLang integrates JIT-compiled kernels to boost runtime performance. We adopt TVM FFI for Python bindings, which is faster than the default PyTorch interface due to its lightweight design.

Interactive Shell Mode

For convenient interactive testing, Mini-SGLang includes a simple shell mode where users can directly converse with the LLM from the command line without additional clients.

Shell Mode Example

Performance Benchmarks

We conducted comprehensive experiments covering offline throughput and online service latency.

Offline Inference Throughput

We compare against Nano-vLLM on a single NVIDIA H200 GPU. Following the Nano-vLLM methodology, we use Qwen3-0.6B and Qwen3-14B models to assess scalability (limited to Nano-vLLM's supported models).

Throughput (tokens/s) results:

Offline Benchmark Chart

Mini-SGLang outperforms Nano-vLLM on both models, thanks to overlap scheduling hiding CPU overhead.

Reproducible: Offline script here.

Online Service Latency

Using real-world workload from Qwen trace, we replay 1000 requests to Qwen3-32B with 4-way Tensor Parallelism on 4 H200 GPUs. We measure throughput, P90 TTFT, and TBT.

Online Benchmark Chart

Mini-SGLang's performance is nearly identical to SGLang, demonstrating that the lightweight design doesn't sacrifice throughput or latency.

Reproducible: Launch commands:

# Mini-SGLang
python -m minisgl --model "Qwen/Qwen3-32B" --tp 4 --cache naive 

# SGLang
python3 -m sglang.launch_server --model "Qwen/Qwen3-32B" --tp 4 \
    --disable-radix --port 1919 --decode-attention flashinfer

Online script here.

Conclusion

Mini-SGLang successfully distills a state-of-the-art inference engine into a compact, understandable codebase. By retaining key optimizations like overlap scheduling and high-performance attention kernels, it achieves excellent performance while serving as both an educational tool and flexible research platform.

We invite you to explore the source code, run benchmarks, and experience the newfound accessibility of high-performance LLM inference.

Acknowledgments

Thanks to the SGLang team and community for their support, especially Liangsheng Yin, Lianmin Zheng, and others.
Thanks to MisakaVan for outstanding contributions in testing, documentation, and code improvements, and Yi Pan for the initial PyTorch C++ NCCL communicator implementation.
Thanks to Wenxin Zheng from SJTU for supporting course organization and student mentoring as a TA for the Summer 2025 lab course.
We've benefited greatly from the system designs of SGLang, FlashInfer, vLLM, and Nano-vLLM, which together shaped Mini-SGLang's simplicity and robustness.