MedPerf Adds WebUI Capabilities
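MedPerf, an open-source platform under MLCommons, recently added WebUI support, letting users run privacy-preserving machine learning benchmarks from a browser without a local installation. The new feature integrates backends such as SGLang, simplifies the model evaluation workflow, and supports multiple tasks such as image classification and NLP. The WebUI offers an intuitive interface that displays key metrics such as Elo Rating in real time, helping developers compare model performance quickly. The update marks a step toward better usability for MedPerf and supports progress in federated learning and privacy-preserving computing.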
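MLCommons recently published results for its VLM (vision-language model) inference benchmark, in which the Shopify team performed strongly. The test focuses on the real-time inference performance of models such as LLaVA-1.5-7B in e-commerce scenarios and is evaluated with the MLPerf Inference framework. Using SGLang and custom optimizations, Shopify achieved high throughput and low latency on A100 GPUs, with an Elo Rating ahead of its peers. The test covers multiple tasks, including image captioning and visual question answering, and highlights key challenges and optimization strategies for deploying VLMs in production, providing a useful reference for AI e-commerce applications.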
KTransformers, developed by Tsinghua University's MadSys and Approaching.AI, optimizes CPU/GPU collaborative inference for sparse MoE models through AMX-optimized kernels, efficient device coordination, and expert deferral mechanisms, now integrated into SGLang for enhanced performance.
SGLang-Diffusion has achieved 2.5x performance improvements since its launch in November 2025, with support for new models, LoRA, parallel processing, and ComfyUI integration.
SGLang launches a highly optimized Pipeline Parallelism implementation designed for ultra-long context inference challenges. Through integrated optimizations and a clean design, it achieves a 3.31x speedup in prefill throughput for DeepSeek V3 on multi-node H20 clusters, demonstrating strong scalability for trillion-parameter models.
We developed Petit, a collection of FP16/BF16 × FP4 mixed-precision GPU kernels for AMD GPUs, enabling 1.74× faster Llama 3.3 70B inference on existing MI250/MI300 hardware without upgrades.
SGLang implements fully deterministic inference with only 34.35% performance overhead and enables 100% reproducible RL training in collaboration with slime, providing reliable solutions for rigorous scientific experiments.
The SGLang team shares their optimization progress on DeepSeek V3/R1 inference performance using GB200 NVL72, achieving 26,156 input tokens/s for prefill and 13,386 output tokens/s for decode per NVIDIA Blackwell GPU through techniques like FP8 attention, NVFP4 MoE, and large-scale expert parallelism.
This article presents comprehensive optimization strategies for deploying DeepSeek-R1 on H20 GPUs, achieving state-of-the-art performance of 16.5k input tokens/s and 5.7k output tokens/s per node through hardware-aware parallelization, kernel optimizations, and advanced scheduling techniques.
This article introduces PD-Multiplexing, a new serving paradigm in SGLang that leverages NVIDIA's GreenContext technology to achieve higher goodput for LLM services through efficient intra-GPU resource sharing between prefill and decode phases.
SGLang announces Day 0 support for DeepSeek-V3.2, implementing the DeepSeek Sparse Attention (DSA) mechanism, which significantly improves training and inference efficiency, especially in long-context scenarios.
We conducted an in-depth review of NVIDIA DGX Spark, a compact all-in-one system that brings supercomputing-level performance to a desktop workstation form factor. While its unified memory design enables running ultra-large models, performance is constrained by memory bandwidth, making it better suited to prototyping and experimentation than to production deployment.
SGLang collaborates with NVIDIA to leverage Blackwell architecture innovations, achieving breakthrough performance on DeepSeek models with up to 4x improvements, and is selected as the default inference engine for NVIDIA and AMD hardware in the InferenceMAX benchmark.
We introduce SGLang-Jax, a state-of-the-art open-source inference engine built entirely on JAX and XLA, achieving fast native TPU inference with advanced features like continuous batching, prefix caching, and speculative decoding.
We successfully optimized GPT-OSS 20B and 120B models on NVIDIA DGX Spark using SGLang, achieving state-of-the-art performance of ~70 tokens/s and ~50 tokens/s respectively, enabling fully local AI applications including coding agents.
SGLang announces first-day support for MiniMax M2, a flagship MoE model that returns to full attention after empirical findings show efficient attention methods face significant production deployment challenges.
SGLang Diffusion brings SGLang's top performance to diffusion model image and video generation, supporting mainstream open-source models with 1.2x to 5.9x speedups across diverse workloads.
We are excited to announce the official collaboration between SGLang and AutoRound, supporting low-bit quantization for efficient LLM inference. This integration enables developers to quantize large models using AutoRound's signed gradient optimization techniques and deploy them directly in SGLang's efficient runtime, achieving low-bit model inference while minimizing accuracy loss and significantly reducing latency.
Today we release Miles, an enterprise-grade reinforcement learning framework designed for large-scale MoE training and production workloads, built on the proven foundation of slime.
LMSYS announces its Fellowship Program offering up to $50,000 in funding for U.S. PhD students who have made significant contributions to open-source AI infrastructure.