SGLang-Diffusion: Two Months of Progress
SGLang-Diffusion has achieved a 2.5x performance improvement since its launch in November 2025, adding support for new models, LoRA, parallel processing, and ComfyUI integration.
We successfully optimized GPT-OSS 20B and 120B models on NVIDIA DGX Spark using SGLang, achieving state-of-the-art performance of ~70 tokens/s and ~50 tokens/s respectively, enabling fully local AI applications including coding agents.
We introduce Mini-SGLang, a lightweight yet high-performance large language model (LLM) inference framework that preserves core state-of-the-art features in just 5k lines of Python code, serving as both a reliable inference engine and a transparent reference implementation for researchers and developers.
SGLang introduces an Encoder-Prefill-Decode (EPD) disaggregation architecture that separates vision encoding from language processing in VLMs, enabling independent scaling and reducing time-to-first-token (TTFT) by 6-8x in image-intensive scenarios.
Novita AI developed production-proven optimizations for deploying GLM4-MoE models on SGLang, achieving up to a 65% TTFT reduction and a 22% time-per-output-token (TPOT) improvement through Shared Experts Fusion and Suffix Decoding techniques.