SGLang Instantly Supports MiMo-V2-Flash Model

Feb 4, 2026 885 Views - Read Source LMSYS

LMSYS MiMo-V2-Flash SGLang SWA MTP 推理优化

Introduction

The XiaomiMiMo/MiMo-V2-Flash model has 309B total parameters with 15B activated parameters. It is a new model specifically optimized for inference, designed to maximize decoding efficiency. Its two key designs are sliding window attention (SWA) and multi-layer MTP. MiMo-V2-Flash is tailored for real service scenarios, supporting flexible balance between throughput and latency on different hardware. Combined with SGLang's optimized Spec v2 runtime, it supports multi-layer MTP and efficient SWA execution with almost zero overhead, demonstrating balanced TPOT and throughput on H200. This article will introduce the model architecture and SGLang's efficient support.

Inference-Efficient Modeling

MiMo-V2-Flash follows inference efficiency principles, adopting two core designs:

Sliding Window Attention (SWA): Each token's receptive field is limited to a fixed-size constant window, reducing attention complexity from O(N²) to O(Nw) in sequence dimension, where w is the window size.
MTP: Multi-layer MTP uses a chain of prediction heads to sequentially predict next tokens layer by layer, then verifies draft tokens in parallel using extended query.

The following diagram shows the overall architecture of MiMo-V2-Flash:

MiMo-V2-Flash Overall Architecture

SWA

MiMo-V2-Flash alternates one dense GQA layer for every five SWA attention layers. SWA improves inference efficiency from multiple angles: during the prefill phase where computation dominates the cost, O(N²) attention becomes a bottleneck for long sequences. SWA reduces this to linear complexity, significantly shortening TTFT. Meanwhile, KV cache complexity is reduced to constant level, freeing up resources to support larger batches, reducing KV loading operations, and improving TPOT.

The following shows prefill benchmark results:

MiMo-V2-Flash Prefill Benchmark (Radix Cache Disabled)

MTP

The key design of MiMo-V2-Flash is its 3-layer multi-layer MTP. In decoding scenarios, most kernels are memory-bound with query length constant at 1. Increasing parallel decoded tokens is the most intuitive way to improve throughput. However, as batch size increases, KV cache access grows linearly and becomes a bottleneck, while computational potential remains unsaturated, making it difficult to continue with simply increasing batch size.

MTP utilizes remaining computation: multiple tokens are generated simultaneously by sequential prediction heads and verified in parallel with the same query, extending query length without increasing KV access, thus improving arithmetic intensity. When memory-bound and batch effects are marginal, aggressive MTP strategies (high acceptance rate) can fully utilize device potential and optimize TPOT.

Hardware-Aware MTP Configuration

MTP benefits from unsaturated arithmetic intensity, and MiMo-V2-Flash's GQA attention naturally adapts to this. However, deployment requires selecting the right batch size and MTP depth to achieve optimal compute-memory balance. High roofline devices (such as training GPUs) are more suitable for aggressive MTP to utilize abundant computation; inference accelerators (such as H20) have limited FLOPs, requiring careful MTP to avoid becoming compute-bound and reducing throughput.

H200 benchmarks show that MiMo-V2-Flash balances throughput and per-request TPS. Even with 64K long context and batch size 16 per DP rank, decoding throughput still reaches 150 TPS, thanks to SWA and MTP.

MiMo-V2-Flash decode benchmark (DP 2, TP 4, EP 8, MTP acceptance length 3.6, input 16k, varying batch)

MiMo-V2-Flash Decode Benchmark (DP 2, TP 4, EP 8, MTP acceptance length 3.6, input 16k tokens, varying batch)

MiMo-V2-Flash decode benchmark (DP 2, TP 4, EP 8, MTP acceptance length 3.6, batch 16 per DP, varying input length)

MiMo-V2-Flash Decode Benchmark (DP 2, TP 4, EP 8, MTP acceptance length 3.6, batch 16 per DP rank, varying input length)

Fast MTP Serving with SGLang Spec v2

MiMo's multi-layer MTP natively integrates with SGLang Spec v2, utilizing fully overlapped MTP features to improve throughput and latency. In Spec v2, overlapping scheduling merges with speculative decoding: delaying output synchronization/processing, launching the next batch of kernels early, hiding CPU batch/synchronization overhead in GPU forward passes, and reducing GPU bubbles.

Spec v2 Overlapped Speculative Decoding Performance Profiling

Further Discussion

In LLM serving, the decoding phase is mostly memory-bound, with mainstream training GPUs having significant idle computation. While inference-specific accelerators with high bandwidth and low FLOPs are economical, their speed is limited. MiMo-V2-Flash optimizes inference efficiency from the model side, with multi-layer MTP potentially becoming a general solution: optimizing acceptance rates and utilizing GPU computation to accelerate decoding. More adaptive architectures make hardware selection flexible, allowing the same hardware to handle both training and inference, simplifying deployment and reducing costs.

MiMo-V2-Flash support has been implemented through SGLang PRs (#15207, #15208) and will soon be merged into the main branch. The benchmarks in this article are based on MiMo's optimized branch, with optimizations to be upstreamed to SGLang main.

Quick Start

MiMo-V2-Flash is now available through SGLang Docker images and pip installation. Here's a guide to starting the SGLang server.

Docker

# Pull the docker image
docker pull lmsysorg/sglang:dev-pr-15207

# Launch the container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:dev-pr-15207 bash

# Start the server
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
        --model-path XiaomiMiMo/MiMo-V2-Flash \
        --dp-size 2 \
        --enable-dp-attention \
        --tp-size 8 \
        --trust-remote-code \
        --mem-fraction-static 0.75 \
        --max-running-requests 128 \
        --chunked-prefill-size 16384 \
        --reasoning-parser qwen3 \
        --tool-call-parser mimo \
        --model-loader-extra-config '{"enable_multithread_load": "true","num_threads": 64}' \
        --attention-backend fa3 \
        --speculative-algorithm EAGLE \
        --speculative-num-steps=3 \
        --speculative-eagle-topk=1 \
        --speculative-num-draft-tokens=4 \
        --enable-mtp

Pip Installation

# On a machine with SGLang dependencies installed or inside a SGLang nightly container
# Start an SGLang nightly container
docker run -it --gpus all \
  --shm-size=32g \
  --ipc=host \
  --network=host \
  lmsysorg/sglang:nightly-dev-20251215-4449c170 bash

# If you already have SGLang installed, uninstall the current SGLang version
pip uninstall sglang -y

# Install the PyPI Package
pip install sglang==0.5.6.post2.dev8005+pr.15207.g39d5bd57a \
  --index-url https://sgl-project.github.io/whl/pr/ \
  --extra-index-url https://pypi.org/simple

#Launch the server
SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
        --model-path XiaomiMiMo/MiMo-V2-Flash \
        --dp-size 2 \
        --enable-dp-attention \
        --tp-size 8 \
        --trust-remote-code \
        --mem-fraction-static 0.75 \
        --max-running-requests 128 \
        --chunked-prefill-size 16384 \
        --reasoning-parser qwen3 \
        --tool-call-parser mimo \
        --model-loader