SGLang Instantly Supports MiMo-V2-Flash Model
SGLang now supports the MiMo-V2-Flash model, a 309B parameter model optimized for inference with sliding window attention and multi-layer MTP, achieving balanced throughput and latency on H200 GPUs.
We introduce Mini-SGLang, a lightweight yet high-performance large language model (LLM) inference framework that preserves core state-of-the-art features in just 5k lines of Python code, serving as both a reliable inference engine and a transparent reference implementation for researchers and developers.
SGLang introduces a seamless integration framework for Diffusion Large Language Models (dLLMs), enabling LLaDA 2.0 support through existing ChunkedPrefill mechanisms without core architecture changes, while maintaining full performance benefits and allowing customizable diffusion decoding algorithms.
The SpecForge team, in collaboration with industry partners including Ant, Meituan, Nex-AGI, and EigenAI, releases SpecBundle (Phase 1), a collection of production-grade EAGLE-3 model checkpoints trained on large-scale datasets. Alongside it, SpecForge v0.2 brings major system upgrades, including a comprehensive refactoring for improved usability and multi-backend support.
SGLang introduces an Encoder-Prefill-Decode (EPD) disaggregation architecture that separates vision encoding from language processing in VLMs, enabling independent scaling and reducing TTFT by 6-8x in image-intensive scenarios.
The SGLang RL team achieves major breakthroughs in RL training stability and efficiency, implementing end-to-end INT4 QAT that enables ~1TB model deployment on a single H200 GPU while maintaining training-inference consistency.
Novita AI developed production-proven optimizations for deploying GLM4-MoE models on SGLang, achieving up to 65% TTFT reduction and 22% TPOT improvement through Shared Experts Fusion and Suffix Decoding techniques.
Mozilla will introduce a global "Block AI enhancements" toggle in Firefox 148, allowing users to disable all current and future generative AI features with a single click, responding to growing demand for AI-free browsing.
With rapid technological advancement, AI applications in education have become a hot topic in China's market, as intelligent learning platforms transform traditional teaching methods.
As AI technology rapidly advances, it brings unprecedented ethical challenges, particularly concerning data privacy and moral boundaries, sparking widespread debate among various stakeholders.
Andrej Karpathy's latest open-source project, karpathy/nanochat, trains a GPT-2-level language model end to end for about $73 (3 hours on a single 8xH100 node), roughly 600x cheaper than OpenAI's 2019 baseline. It quickly topped GitHub Trending and sparked discussion across the global AI community.
A new open-source plugin, Claude-Mem, has exploded on GitHub with over 19.5k stars. It addresses Claude Code's context loss issue by enabling intelligent cross-session memory, reducing token usage by 95% and increasing tool call efficiency by 20x.
Anthropic's Claude 3.5 Sonnet model achieved 92.0% on the SWE-bench software engineering benchmark, surpassing all previous AI models and marking a new milestone in AI coding capabilities. The result sparked heated discussion on X, with over 150,000 interactions, as developers shared real projects built with Claude and debated the future role of AI programmers.
Chinese AI startup DeepSeek has released DeepSeek-V2, an open-source large language model that outperforms OpenAI's GPT-4o on Chinese benchmarks while achieving significant efficiency gains through an innovative architecture.
Elon Musk recently sounded the alarm on AI safety on X, calling for a global pause on training giant AI models and sparking intense debate amid the heated US-China AI race.
Meta has released the Llama 3.2 series, which pairs lightweight text-only models (1B and 3B parameters) optimized for edge devices with the first vision-enabled multimodal models in the Llama family (11B and 90B parameters). This launch marks a significant shift from cloud to edge AI, potentially reshaping the mobile AI ecosystem.
xAI officially launched Grok-2's image generation feature in August 2024, powered by the Flux.1 model. It quickly became a trending topic on X for its high-quality output and free access, while its no-censorship policy is reshaping the AI image generation landscape.
NVIDIA's highly anticipated Blackwell B200 AI chip faces overwhelming demand, with first deliveries pushed to 2025 due to production bottlenecks. This shortage highlights the broader AI infrastructure constraints as hyperscalers and AI companies scramble for computing power.
Anthropic's Claude 3.5 Sonnet achieves over 90% on the SWE-bench software engineering benchmark, marking a milestone in AI coding capabilities. This breakthrough has sparked widespread discussion in the developer community and a surge in practical project implementations.
Google DeepMind launches Gemini 2.0 Flash, a lightweight, high-speed multimodal AI model that has sparked over 100,000 benchmark tests from developers. The model features ultra-low latency and efficient performance, positioning itself as a game-changer for real-time AI applications.