Background: Heterogeneous Inference for Sparse MoE Models
Modern Mixture-of-Experts (MoE) language models, such as DeepSeek-V3, contain hundreds of billions of parameters but only activate a small subset of experts for each token. Due to this sparse activation pattern, MoE models are well-suited for CPU/GPU heterogeneous inference: sparsely activated experts can run efficiently on memory-rich CPUs, while dense computational components can execute on GPUs with higher bandwidth and throughput.
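To make the sparse activation pattern concrete, here is a minimal sketch of top-k expert routing for a single token. It uses a generic softmax-over-top-k router, which is an assumption for illustration and not necessarily the exact gating used by DeepSeek-V3; the function name `route_top_k` and the sizes (64 experts, 4 active) are hypothetical.

```python
import math
import random

def route_top_k(router_logits, k):
    """Pick the k highest-scoring experts for one token and normalize their weights."""
    # Rank expert indices by router score, highest first, and keep the first k.
    ranked = sorted(range(len(router_logits)), key=lambda e: router_logits[e], reverse=True)
    chosen = ranked[:k]
    # Softmax over the chosen experts only, so the weights sum to 1.
    peak = max(router_logits[e] for e in chosen)
    exps = [math.exp(router_logits[e] - peak) for e in chosen]
    total = sum(exps)
    return chosen, [x / total for x in exps]

random.seed(0)
NUM_EXPERTS, TOP_K = 64, 4  # hypothetical sizes, for illustration only
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
experts, weights = route_top_k(logits, TOP_K)
active_fraction = TOP_K / NUM_EXPERTS  # share of routed experts this token touches
print(experts, [round(w, 3) for w in weights], active_fraction)
```

Only the chosen experts' weights are read for this token, which is why the expert layers are limited mainly by memory capacity and bandwidth rather than by raw compute.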

KTransformers: Unlocking the Potential of CPU/GPU Heterogeneous Inference for MoE Models
To address the challenges of heterogeneous inference, Tsinghua University's MadSys and Approaching.AI developed the KTransformers project, a set of optimizations that make CPU/GPU collaborative inference more efficient. The improvements fall into three areas:
1. AMX-Optimized CPU Kernels
KTransformers redesigns CPU computation through Intel AMX-optimized kernels and cache-hierarchy-oriented memory layouts. On a single Xeon socket, the AMX-optimized kernels achieve sustained throughput of up to 21.3 TFLOPS, which is 3.9x faster than PyTorch's native implementation.
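Because these kernels depend on AMX hardware, it can help to confirm that a CPU exposes AMX before expecting this speedup. The sketch below looks for the AMX feature flags that the Linux kernel reports in /proc/cpuinfo (`amx_tile`, `amx_int8`, `amx_bf16`). The helper names are hypothetical, and kt-kernel may perform its own detection differently.

```python
def amx_flags(cpuinfo_text):
    """Return the sorted AMX-related CPU flags found in /proc/cpuinfo-style text."""
    found = set()
    for line in cpuinfo_text.splitlines():
        # Flag lines look like "flags : fpu vme ... amx_bf16 amx_tile amx_int8".
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            found.update(f for f in value.split() if f.startswith("amx"))
    return sorted(found)

def read_cpuinfo(path="/proc/cpuinfo"):
    """Read the kernel's CPU description; return an empty string if unavailable."""
    try:
        with open(path, encoding="utf-8") as fh:
            return fh.read()
    except OSError:
        return ""

# A hand-written sample line, not output captured from a real machine.
sample = "processor\t: 0\nflags\t\t: fpu sse2 avx512f amx_bf16 amx_tile amx_int8\n"
print(amx_flags(sample))          # ['amx_bf16', 'amx_int8', 'amx_tile']
print(amx_flags(read_cpuinfo()))  # depends on the machine running it
```

An empty list on the second line means the kernel does not report AMX on that host.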
2. Efficient Device Coordination
By introducing NUMA-aware tensor parallelism and CUDA graph-enabled scheduling, KTransformers significantly reduces coordination costs between CPU and GPU. NUMA-aware tensor parallelism places expert weight shards in each NUMA node's local memory, avoiding expensive cross-NUMA memory traffic and achieving up to 63% improvement in decoding throughput.
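The partitioning idea behind NUMA-aware tensor parallelism can be sketched in a few lines: split an expert's weight matrix into one shard per NUMA node, let each node multiply only its local shard, and concatenate the partial outputs. This toy version models the data layout only. The row-wise split and all function names are assumptions for illustration; real placement also requires allocating each shard in node-local memory and pinning worker threads to that node, which is not shown here.

```python
def matvec(rows, x):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * v for w, v in zip(row, x)) for row in rows]

def shard_rows(weight, num_nodes):
    """Split a weight matrix into contiguous row blocks, one per NUMA node."""
    base, extra = divmod(len(weight), num_nodes)
    shards, start = [], 0
    for node in range(num_nodes):
        size = base + (1 if node < extra else 0)
        shards.append(weight[start:start + size])
        start += size
    return shards

def sharded_matvec(shards, x):
    """Each node multiplies only its local shard; outputs are joined in node order."""
    out = []
    for local_rows in shards:  # in a real system, one worker pinned to each node
        out.extend(matvec(local_rows, x))
    return out

weight = [[float(r * 4 + c) for c in range(4)] for r in range(6)]  # toy 6x4 weight
x = [1.0, 0.5, -1.0, 2.0]
shards = shard_rows(weight, num_nodes=2)
assert sharded_matvec(shards, x) == matvec(weight, x)
print([len(s) for s in shards])
```

Since every row is read only by the node that owns it, no weight data has to cross the inter-socket link; only the small input and output vectors are shared.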
3. Expert Deferral Mechanism
By deferring the execution of certain experts, KTransformers lets CPU expert computation overlap with GPU attention processing, so both devices stay busy at the same time. This improves decoding throughput by up to 1.45x while keeping accuracy changes within 0.5%.
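A toy latency model shows why overlapping helps. It assumes that immediate experts sit on the critical path after GPU attention, while deferred experts run on the CPU during the next layer's GPU attention. The model, the function `layer_time`, and all timing values are hypothetical and chosen only for illustration; they are not measurements from KTransformers, and the model says nothing about accuracy.

```python
def layer_time(t_gpu, t_cpu_per_expert, num_experts, num_deferred):
    """Toy steady-state latency of one decoder layer, in arbitrary time units.

    Immediate experts run on the critical path; deferred experts run on the
    CPU while the GPU works on the next layer's attention.
    """
    if not 0 <= num_deferred <= num_experts:
        raise ValueError("num_deferred must be between 0 and num_experts")
    t_immediate = t_cpu_per_expert * (num_experts - num_deferred)
    t_deferred = t_cpu_per_expert * num_deferred
    # GPU attention and deferred CPU work overlap, so only the longer one counts.
    return max(t_gpu, t_deferred) + t_immediate

T_GPU, T_CPU_PER_EXPERT, K = 3.0, 1.0, 8  # hypothetical values
baseline = layer_time(T_GPU, T_CPU_PER_EXPERT, K, num_deferred=0)
for d in range(0, K + 1, 2):
    t = layer_time(T_GPU, T_CPU_PER_EXPERT, K, d)
    print(f"deferred={d}: time={t:.1f}, speedup={baseline / t:.2f}x")
```

In this model the gain stops growing once the deferred work takes as long as the GPU attention it hides behind, so deferring more experts than that buys no extra speed.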
Integrating KTransformers into SGLang
SGLang has now integrated KTransformers as a backend library for CPU/GPU heterogeneous inference of MoE models. The integration combines GPU tensor parallelism with CPU/GPU heterogeneous expert parallelism.
Installation Guide
To use SGLang's KTransformers heterogeneous inference, you need to install SGLang and the KTransformers CPU kernel (kt-kernel). Make sure your system meets the following requirements: a Linux x86_64 operating system, CUDA 12.1 or above, gcc and g++ 11 or above, CMake 3.25 or above, and Python 3.11.
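Before installing, the requirements above can be checked with a short script. This is a sketch under a few assumptions: it probes CUDA through `nvcc --version` (so `nvcc` must be on the PATH), it treats "Python 3.11" as an exact minor version, and the helper names are hypothetical rather than part of SGLang or kt-kernel.

```python
import platform
import re
import shutil
import subprocess
import sys

def parse_version(text, pattern=r"(\d+(?:\.\d+)*)"):
    """Return the first version matched by pattern as a tuple of ints, or None."""
    match = re.search(pattern, text)
    return tuple(int(part) for part in match.group(1).split(".")) if match else None

def at_least(found, minimum):
    """True if a version tuple was found and is >= the required minimum."""
    return found is not None and found >= minimum

def tool_output(cmd):
    """Run a command and return its stdout, or '' if the tool is missing or fails."""
    if shutil.which(cmd[0]) is None:
        return ""
    try:
        return subprocess.run(cmd, capture_output=True, text=True, timeout=10).stdout
    except (OSError, subprocess.SubprocessError):
        return ""

def check_environment():
    """Compare local tool versions with the minimums listed above."""
    dump = ["-dumpfullversion", "-dumpversion"]  # prints only the compiler version
    nvcc = tool_output(["nvcc", "--version"])
    return {
        "Linux x86_64": sys.platform.startswith("linux") and platform.machine() == "x86_64",
        "Python 3.11": sys.version_info[:2] == (3, 11),
        "gcc >= 11": at_least(parse_version(tool_output(["gcc"] + dump)), (11,)),
        "g++ >= 11": at_least(parse_version(tool_output(["g++"] + dump)), (11,)),
        "CMake >= 3.25": at_least(parse_version(tool_output(["cmake", "--version"])), (3, 25)),
        "CUDA >= 12.1": at_least(parse_version(nvcc, r"release (\d+\.\d+)"), (12, 1)),
    }

for name, ok in check_environment().items():
    print(f"{'OK     ' if ok else 'MISSING'} {name}")
```

A "MISSING" line means the tool was not found or is older than required; it does not distinguish between the two cases.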
Benchmark Results (Preview)
In a single GPU+CPU configuration, KTransformers outperforms the baseline across all prompt lengths, achieving speedups of up to 20x thanks to AMX-optimized CPU kernels. During the decoding phase, KTransformers also excels, primarily due to reduced CPU/GPU coordination overhead, with speed improvements of up to 4x.