Unified FP8: Beyond Mixed Precision, Achieving Stable Accelerated MoE RL Training

Feb 4, 2026 - Source: LMSYS

LMSYS, FP8, RL training, MoE models, low-precision computing, Tensor Cores

TL;DR: We implemented an end-to-end FP8 sampling and training pipeline for RL. Experiments show that for MoE models, using BF16 training with FP8 rollout leads to severe train-inference inconsistency as model size increases. Unified FP8 for both training and rollout effectively eliminates quantization-induced train-inference inconsistency, improving RL training speed and stability.

The SGLang RL team and Miles community have conducted interesting explorations in RL training stability and acceleration, including aligning SGLang and FSDP backends to achieve strict zero KL divergence, and Speculative Decoding combined with online SFT for draft models.

Building on this, we share a new advance that balances stability and performance: an end-to-end FP8 pipeline for RL training and sampling. The miles framework now fully supports FP8 RL training for Qwen3-4B and Qwen3-30B-A3B (details here) and is ready to use out of the box.

This work was completed jointly by the InfiXAI Team, the Ant Group AQ Team, the SGLang RL Team, and the Miles Team. Special thanks to Verda Cloud for computational support and to NVIDIA for technical support on Transformer Engine (TE).

Hardware Foundation for FP8 Training

Tensor Cores and Low Precision Support

Low-precision computing is a gem of hardware-software co-design. Its hardware foundation is the Tensor Core, a GPU unit designed specifically for large-scale matrix multiply-accumulate operations, the core computation in deep learning. Compared to traditional CUDA cores, Tensor Cores provide higher throughput for low-precision formats (such as FP16, BF16, and FP8). Their lineage starts from basic FMA instructions and DP4A vectorization; the Volta architecture introduced the first dedicated Tensor Cores, followed by successive advances in Turing, Ampere, Hopper, and Blackwell along two axes:

Scale expansion: Processing larger matrices in single operations, improving compute-to-memory ratio.
Precision reduction: Each generation adds support for lower-precision formats such as FP16/BF16 and FP8.

Architecture	FP64	FP16/BF16	INT8	INT4	FP8	MXFP
Volta	❌	✅ FP16	❌	❌	❌	❌
Turing	❌	✅ FP16	✅	✅	❌	❌
Ampere	✅	✅ FP16/BF16	✅	✅	❌	❌
Hopper	✅	✅ FP16/BF16	✅	❌	✅ (FP22 accumulation only)	❌
Blackwell	✅	✅ FP16/BF16	✅	❌	✅	✅ MXFP(8/6/4) NVFP4
Blackwell Ultra	✅ (reduced FLOPS)	✅ FP16/BF16	✅ (reduced FLOPS)	❌	✅	✅ MXFP(8/6/4) NVFP4

Image source: zartbot, SemiAnalysis

This trend makes low precision storage and computation more attractive. Specific advantages include:

Significantly reduced memory footprint: Relative to BF16, FP8 theoretically halves the memory used by model weights and activations, alleviating VRAM pressure.
Theoretical 2× computational throughput: On H100 SXM, FP8 Tensor Cores reach a peak of 1979 TFLOPS (dense), twice that of BF16 (989 TFLOPS).
Alleviated memory bandwidth bottlenecks: More compact data reduces the volume of traffic between HBM and the compute cores.
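As a rough back-of-the-envelope illustration of the first two points, the sketch below compares weight storage and peak throughput for BF16 and FP8. The parameter count of 30e9 is an assumption chosen for illustration, and the calculation ignores the extra memory taken by quantization scales, optimizer state, and activations.

```python
def weight_memory_gib(num_params: float, bytes_per_param: int) -> float:
    """Memory needed to store the weights alone, in GiB."""
    return num_params * bytes_per_param / 2**30

# Assumption for illustration only: a model with 30e9 parameters.
NUM_PARAMS = 30e9
bf16_gib = weight_memory_gib(NUM_PARAMS, 2)  # BF16 stores 2 bytes per parameter
fp8_gib = weight_memory_gib(NUM_PARAMS, 1)   # FP8 stores 1 byte per parameter

# H100 SXM Tensor Core throughput figures quoted in the text above.
FP8_TFLOPS, BF16_TFLOPS = 1979, 989
speedup = FP8_TFLOPS / BF16_TFLOPS

print(f"BF16 weights: {bf16_gib:.1f} GiB, FP8 weights: {fp8_gib:.1f} GiB")
print(f"Theoretical FP8/BF16 throughput ratio: {speedup:.2f}x")
```

These are upper bounds: real workloads are rarely purely compute-bound, so the measured speedup is usually below the theoretical ratio.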

FP8 Format

FP8 is an 8-bit floating-point format. Compared to FP32 (32-bit) and FP16/BF16 (16-bit), it cuts storage and transfer costs to 1/4 and 1/2 respectively, alleviating VRAM and bandwidth bottlenecks and improving training and inference performance. Two formats are in common use:

E4M3: 1 sign bit + 4-bit exponent + 3-bit mantissa. Smaller dynamic range (largest finite value 448) but higher precision.
E5M2: 1 sign bit + 5-bit exponent + 2-bit mantissa. Larger dynamic range (largest finite value 57344) but lower precision.

FP8 E4M3 vs E5M2

Image source: OCP whitepaper

This design maximizes hardware throughput while maintaining sufficient numerical range and precision.
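The range and precision trade-off between the two formats follows directly from their bit layouts. The sketch below derives the largest finite value and the relative spacing of each format, assuming the encodings of the OCP FP8 specification: E5M2 reserves the all-ones exponent for infinity and NaN, while E4M3 reserves only the single all-ones bit pattern for NaN. The function name `fp8_max_finite` is ours, not part of any library.

```python
def fp8_max_finite(exp_bits: int, man_bits: int, ieee_like: bool) -> float:
    """Largest finite value of a 1-sign / exp_bits / man_bits float format.

    ieee_like=True:  the all-ones exponent is reserved for inf/NaN (E5M2).
    ieee_like=False: only the all-ones exponent+mantissa pattern is NaN (E4M3),
                     so the top exponent remains usable for finite values.
    """
    bias = 2 ** (exp_bits - 1) - 1
    if ieee_like:
        max_exp = (2 ** exp_bits - 2) - bias
        max_man = 2 ** man_bits - 1
    else:
        max_exp = (2 ** exp_bits - 1) - bias
        max_man = 2 ** man_bits - 2
    return (1 + max_man / 2 ** man_bits) * 2.0 ** max_exp

e4m3_max = fp8_max_finite(4, 3, ieee_like=False)
e5m2_max = fp8_max_finite(5, 2, ieee_like=True)

# Relative spacing between neighbouring normal values is 2**-man_bits.
e4m3_rel_step = 2.0 ** -3
e5m2_rel_step = 2.0 ** -2

print(f"E4M3: max={e4m3_max}, relative step={e4m3_rel_step}")
print(f"E5M2: max={e5m2_max}, relative step={e5m2_rel_step}")
```

E5M2 covers a range 128 times wider than E4M3, while E4M3 resolves values twice as finely within each power-of-two interval.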

FP8 Scale Selection

Dimension	FP32 Scale (Full precision scaling factor)	E8M0 Scale (Exponent-only scaling)
Format Definition	FP32 (IEEE 754 single precision float)	E8M0 (8-bit exponent, 0-bit mantissa)
Numerical Properties	Fine-grained real-valued scales (24-bit significand, about 7 decimal digits)	Only powers of 2, e.g., 1, 2, 0.5; cannot represent values such as 1.5
Core Idea	High precision management of scaling factors, ensuring numerical stability in training	Incorporate scaling factors into low precision, efficient with bit operations
Main Advantages	1. High precision, stable training: Precisely captures dynamic range, reduces quantization error, prevents divergence. 2. Wide support: Default for NVIDIA Transformer Engine, mature ecosystem	1. Extremely hardware-friendly: Scaling as simple bit shifts, fast and low energy. 2. Unified pipeline: Full 8-bit operation, simplified hardware design
Main Disadvantages	1. Storage overhead: Each quantized tensor needs an additional FP32 scale, consuming VRAM. 2. Computational overhead: Scale computation and conversion require FP32 arithmetic	1. Risk of precision loss: Forced rounding to a power of 2 introduces noise, and its accumulation through backprop can cause divergence. 2. Limited dynamic range resolution: Difficult to adapt finely to complex tensor distributions
Summary	Most common, safe solution in industry	Sacrifices precision for extreme hardware efficiency

After comprehensive evaluation, we chose FP32 as the training scale precision. Reasons:

Precision alignment and training stability: FP32 scale finely captures tensor dynamic range, making FP8 training loss curves close to BF16 baseline.
Consistency with inference ecosystem: Mainstream inference models also use FP32 quantization scales.
Actual hardware benefits:
- Hopper (H100/H800): Supports FP8 Tensor Cores but no dedicated E8M0 units.
- Blackwell (B100/B200): Introduces MXFP8, supporting E8M0-like block-level scaling (arXiv:2506.08027).

Therefore, on current Hopper-series (H100/H800) clusters, forcing E8M0 scales not only provides no significant acceleration but also introduces software emulation overhead and precision risks.
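To make the precision cost of power-of-two scales concrete, the sketch below compares an FP32 scale with an E8M0-style scale for the same tensor maximum. It assumes the E8M0 scale is rounded up to the next power of two so that scaled values never overflow E4M3; this rounding convention and the helper names are illustrative assumptions, not the behaviour of any specific library.

```python
import math

E4M3_MAX = 448.0  # largest finite E4M3 value

def fp32_scale(amax: float) -> float:
    """Scale that maps the tensor maximum exactly onto E4M3_MAX."""
    return amax / E4M3_MAX

def e8m0_scale(amax: float) -> float:
    """Power-of-two scale; assumption: round up so amax / scale <= E4M3_MAX."""
    return 2.0 ** math.ceil(math.log2(amax / E4M3_MAX))

def range_utilization(amax: float, scale: float) -> float:
    """Fraction of the E4M3 range actually reached by the scaled maximum."""
    return amax / (scale * E4M3_MAX)

amax = 3.0  # hypothetical max |x| of a tensor or block
s_fp32 = fp32_scale(amax)
s_e8m0 = e8m0_scale(amax)

print(f"FP32 scale {s_fp32:.6f}: utilization {range_utilization(amax, s_fp32):.3f}")
print(f"E8M0 scale {s_e8m0:.6f}: utilization {range_utilization(amax, s_e8m0):.3f}")
```

With a power-of-two scale, up to half of the representable range can go unused, which is one way to see the "limited dynamic range resolution" listed in the table.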

FP8 Quantization

Common quantization strategies include per-tensor, per-block and per-token. Regardless of granularity, quantization typically involves two steps:

FP8 quantization flow

Image source: InfiR2: A Comprehensive FP8 Training Recipe for Reasoning-Enhanced Language Models

Step 1: Calculate scaling factor S

Take the maximum absolute value of the tensor (or block), max|X|, and divide it by the maximum representable FP8 value V_max (448 for E4M3):

S = max|X| / V_max

Step 2: Calculate quantized value Q

Divide each element x of the original tensor X by S and round the result to a representable FP8 value:

Q(x) = round(x / S)
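The two steps can be sketched in pure Python as follows. This is a minimal per-tensor simulation, not a production kernel: it assumes the E4M3 format, interprets round(·) as round-to-nearest onto the E4M3 grid with saturation at ±448, and keeps the scale in full floating-point precision. All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0      # V_max: largest finite E4M3 value
MAN_BITS = 3          # explicit mantissa bits in E4M3
MIN_NORMAL_EXP = -6   # exponent of the smallest normal E4M3 value

def round_to_e4m3(v: float) -> float:
    """Round v to the nearest E4M3-representable value, saturating at +-448."""
    if v == 0.0:
        return 0.0
    sign = -1.0 if v < 0 else 1.0
    a = min(abs(v), E4M3_MAX)
    # frexp gives a = m * 2**ex with 0.5 <= m < 1, so floor(log2(a)) == ex - 1.
    _, ex = math.frexp(a)
    e = max(ex - 1, MIN_NORMAL_EXP)   # clamp so tiny values use the subnormal grid
    step = 2.0 ** (e - MAN_BITS)      # spacing of representable values near a
    return sign * min(round(a / step) * step, E4M3_MAX)

def quantize_per_tensor(xs):
    amax = max(abs(x) for x in xs)
    scale = amax / E4M3_MAX if amax > 0 else 1.0   # Step 1: S = max|X| / V_max
    q = [round_to_e4m3(x / scale) for x in xs]     # Step 2: Q(x) = round(x / S)
    return q, scale

def dequantize(q, scale):
    """Recover approximate original values: x_hat = Q(x) * S."""
    return [v * scale for v in q]

x = [0.02, -1.5, 3.0, 0.7]
q, s = quantize_per_tensor(x)
x_hat = dequantize(q, s)
for orig, rec in zip(x, x_hat):
    print(f"{orig:+.4f} -> {rec:+.4f}")
```

Because the largest element is mapped onto V_max, the full E4M3 range is used, and every element in the normal range is recovered with a relative error of at most 2^-4.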

Because FP8 has lower precision than FP16/BF16, practical recipes trade off stability against efficiency, and the tensors in the forward and backward passes are often quantized with different strategies and granularities:

Activations: Usually per-token quantization. Activations often contain significant outliers; fine granularity can localize outlier impact while preserving overall precision.
Weights: Usually per-block quantization. After convergence, the weight distribution is smooth (near Gaussian) with few outliers, but weights are sensitive to quantization errors. Block-wise scaling (e.g., block_size × block_size) balances precision, hardware optimization, efficiency, and memory savings.
Gradients: Usually per-token quantization.
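The difference between these granularities comes down to how many elements share one scale. The sketch below computes per-token scales for a small activation matrix and per-block scales for a small weight matrix. It is a toy illustration under stated assumptions: matrices are plain lists of rows, each token gets a single scale, and the block size of 2 is chosen only to keep the example readable.

```python
E4M3_MAX = 448.0  # largest finite E4M3 value

def per_token_scales(x):
    """One scale per row (token) of a [tokens][hidden] matrix."""
    return [max(abs(v) for v in row) / E4M3_MAX for row in x]

def per_block_scales(w, block):
    """One scale per block x block tile of a weight matrix."""
    rows, cols = len(w), len(w[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            amax = max(abs(w[r][c])
                       for r in range(r0, min(r0 + block, rows))
                       for c in range(c0, min(c0 + block, cols)))
            row_scales.append(amax / E4M3_MAX)
        scales.append(row_scales)
    return scales

# Activations: the second token contains an outlier (50.0).
acts = [[0.1, -0.2, 0.05, 0.3],
        [0.2, 50.0, -0.1, 0.4]]
act_scales = per_token_scales(acts)
per_tensor_scale = max(abs(v) for row in acts for v in row) / E4M3_MAX

# Weights: a 4x4 matrix split into 2x2 tiles gives a 2x2 grid of scales.
weights = [[0.01, -0.02, 0.30, 0.10],
           [0.03, 0.02, -0.20, 0.05],
           [0.50, -0.40, 0.02, 0.01],
           [0.10, 0.20, -0.03, 0.04]]
w_scales = per_block_scales(weights, block=2)

print("per-token scales:", act_scales)
print("per-tensor scale:", per_tensor_scale)
print("per-block scales:", w_scales)
```

With a single per-tensor scale, the outlier would set the scale for both tokens and push the first token's small values toward the coarse end of the FP8 grid. With per-token scales, the outlier affects only its own row.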