AI Reviews | Winzheng

SGLang Instantly Supports MiMo-V2-Flash Model

SGLang now supports MiMo-V2-Flash, a 309B-parameter model optimized for inference with sliding window attention and multi-layer multi-token prediction (MTP). On H200 GPUs, it achieves a balance of throughput and latency.

Mini-SGLang: A Complete Analysis of the Lightweight and Efficient LLM Inference Engine

We introduce Mini-SGLang, a lightweight yet high-performance inference framework for large language models (LLMs). It preserves core state-of-the-art features in just 5k lines of Python code, serving as both a reliable inference engine and a transparent reference implementation for researchers and developers.

SGLang Empowers Diffusion Large Models: LLaDA 2.0 Now Supported

SGLang introduces an integration framework for Diffusion Large Language Models (dLLMs). It enables LLaDA 2.0 support through the existing ChunkedPrefill mechanisms without changes to the core architecture, while keeping the full performance benefits and allowing customizable diffusion decoding algorithms.

SpecBundle & SpecForge v0.2: Production-Ready Speculative Decoding Models and Framework Released

The SpecForge team, in collaboration with industry partners including Ant, Meituan, Nex-AGI, and EigenAI, has released SpecBundle (Phase 1), a collection of production-grade EAGLE-3 model checkpoints trained on large-scale datasets. Alongside it, SpecForge v0.2 brings major system upgrades, including a comprehensive refactoring for improved usability and multi-backend support.

EPD Disaggregation in SGLang: Elastic Encoder Scaling for Vision-Language Models

Deploying 1TB Models on a Single H200: End-to-End INT4 QAT RL Practice

SGLang Optimizes GLM4-MoE Production Deployment: 65% TTFT Improvement