🚀 AutoRound Partners with SGLang: A New Era of Efficient Quantized Model Inference
We are excited to announce an official collaboration between SGLang and AutoRound, bringing low-bit quantization to efficient LLM inference. With this integration, developers can quantize large models using AutoRound's sign-gradient optimization technique and deploy them directly in SGLang's high-performance runtime, enabling low-bit inference with minimal accuracy loss and significantly reduced latency.
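As a minimal sketch of the workflow described above, the snippet below quantizes a model with AutoRound's Python API and exports it for serving. The model name, output path, and quantization settings (4-bit, group size 128) are illustrative choices, not recommendations from this announcement:

```python
# Illustrative example: quantize a model with AutoRound, then serve it with SGLang.
# Requires: pip install auto-round transformers (and a GPU for practical use).
from transformers import AutoModelForCausalLM, AutoTokenizer
from auto_round import AutoRound

model_name = "Qwen/Qwen2.5-7B-Instruct"  # hypothetical example model
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype="auto")
tokenizer = AutoTokenizer.from_pretrained(model_name)

# AutoRound tunes rounding via signed gradient descent; bits/group_size
# are typical low-bit settings, chosen here for illustration.
autoround = AutoRound(model, tokenizer, bits=4, group_size=128)
autoround.quantize_and_save("./Qwen2.5-7B-Instruct-int4", format="auto_round")
```

The exported checkpoint can then be loaded by SGLang's runtime, e.g. with something like `python -m sglang.launch_server --model-path ./Qwen2.5-7B-Instruct-int4` (consult the SGLang docs for the exact flags supported in your version).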