Introduction
State-of-the-art large language models (LLMs) continue to grow in scale, driving ever-increasing demands for GPU compute and memory bandwidth. GPU vendors and model developers are turning to lower-precision floating-point formats, with FP4 (4-bit floating point) quantization being particularly compelling. For example, the FP4-quantized Llama 3.3 70B model is 3.5× smaller with minimal quality loss on benchmarks like MMLU.
However, current hardware support has notable limitations. While next-generation GPUs like NVIDIA GB200 and AMD MI350 natively support FP4 matrix multiplication, the widely deployed AMD Instinct MI250 and MI300 series GPUs lack this capability, preventing users from efficiently running FP4 models on existing AMD hardware.
To bridge this gap, we developed Petit, a collection of FP16/BF16 × FP4 mixed-precision GPU kernels designed for AMD GPUs. Petit enables serving FP4 models on MI200 and MI300 series GPUs without hardware upgrades, delivering significant performance improvements:
- 1.74× faster Llama 3.3 70B end-to-end inference when using SGLang;
- Up to 3.7× faster equivalent matrix multiplication compared to AMD's state-of-the-art GEMM library hipBLASLt.
Petit is open-sourced under a BSD license and integrated into SGLang 0.4.10. Users can launch Llama 3.3 70B FP4 model serving on AMD MI250/MI300X with the following command:
python -m sglang.launch_server --model-path nvidia/Llama-3.3-70B-Instruct-FP4 --host 0.0.0.0 --port 30000
This article details our optimization journey and technical insights. Petit fully leverages AMD's open-source software ecosystem while introducing innovations like offline reordering and hardware-specific low-level optimizations.
Co-designing Efficient GPU Kernels with Hardware Architecture
Modern GPUs achieve massive computational throughput by stacking compact compute units (CUs), but realizing peak performance requires deep co-design between applications and underlying architecture. As shown in Figure 1, Petit's development follows several key co-design principles.
Efficient Dequantization through Preprocessing
Petit uses the dedicated MatrixCore hardware on AMD GPUs to accelerate matrix multiplication. A wavefront (a group of 64 threads) can collectively multiply two 16×16 BF16/FP16 matrices. However, MI300X GPUs have no native MatrixCore support for FP4 weights, so Petit must dequantize FP4 weights to BF16/FP16 while keeping both memory loads and MatrixCore operand preparation efficient.
This creates a core challenge: memory loading and MatrixCore preparation require different data layouts. Memory efficiency demands contiguous loads of 1024-byte blocks, while MatrixCore expects 16×16 tiles distributed across the wavefront. Traditional on-GPU reordering incurs significant overhead.
The Marlin implementation on NVIDIA GPUs avoids this by pre-arranging elements on disk. Simply packing 8 consecutive FP4 values into a 32-bit integer requires 31 instructions for dequantization. Petit further customizes the bit-packing format: the first 4 FP4 values are reordered into BF8 layout, and the rest are stored in the remaining bits. Leveraging AMD's v_bfrev_b32 and v_cvt_pk_f32_bf8 instructions (with sub-dword addressing, or SDWA, support), Petit dequantizes 8 FP4 values using only 15 instructions, achieving 30% faster multiplication.
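As a reference point for what dequantization has to compute, here is a minimal Python sketch that packs eight FP4 codes into one 32-bit word and decodes them with a lookup table. It assumes the E2M1 encoding (1 sign bit, 2 exponent bits, 1 mantissa bit) and a low-bits-first packing order. The function names are hypothetical, and this is not Petit's reordered bit layout or its instruction sequence.

```python
# E2M1 magnitudes indexed by the low 3 bits of the code (assumed encoding).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)


def decode_fp4(code: int) -> float:
    """Decode one 4-bit E2M1 code: bit 3 is the sign, bits 0-2 the magnitude."""
    magnitude = E2M1_MAGNITUDES[code & 0x7]
    return -magnitude if code & 0x8 else magnitude


def pack8(codes) -> int:
    """Pack eight 4-bit codes into a 32-bit word, element 0 in the lowest bits."""
    assert len(codes) == 8
    word = 0
    for i, code in enumerate(codes):
        word |= (code & 0xF) << (4 * i)
    return word


def unpack_and_dequantize(word: int):
    """Extract the eight 4-bit codes from a word and decode each one."""
    return [decode_fp4((word >> (4 * i)) & 0xF) for i in range(8)]


codes = [0x0, 0x1, 0x2, 0x7, 0x8, 0x9, 0xE, 0xF]
word = pack8(codes)
values = unpack_and_dequantize(word)
print(hex(word), values)
```

A GPU kernel cannot afford a table lookup per element, which is why the choice of on-disk bit layout matters: the layout decides how many shift, mask, and convert instructions the unpacking step costs.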
Mastering Memory Hierarchy
GPUs like MI300X have extremely high arithmetic density (>500): CUs must execute hundreds of operations per byte loaded to reach peak FLOPS, so maximizing effective memory bandwidth is crucial. Petit employs established techniques like tiling and double buffering, along with AMD-specific optimizations:
- Avoiding LDS Bank Conflicts: The LDS on AMD GPUs is divided into 32 banks, so up to 32 accesses can be served concurrently per cycle as long as each targets a different bank. Accesses that collide on the same bank are serialized, which becomes a bottleneck, especially with 64-thread wavefronts. Petit implements permuted data layouts based on the bank design, achieving conflict-free LDS usage.
- Chiplets and Interconnect: Each MI300 chiplet (XCD) has a 4 MB local L2 cache, with a 256 MB L3 cache shared across the entire GPU. Interconnect bandwidth is high, but latency is significant. Petit implements topology-aware workload partitioning to reduce interconnect traffic, preferring naive grid partitioning over global striping when profiling shows that striping incurs higher interconnect overhead.
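The effect of a permuted layout on bank conflicts can be illustrated with a small Python model. This is a sketch under simplifying assumptions: one 4-byte word per bank per cycle, 32 threads per access, and a hypothetical 32×32 tile. The XOR swizzle shown is a common generic technique, not necessarily the permutation Petit uses, and real conflict rules for 64-thread wavefronts and wider loads are more involved.

```python
NUM_BANKS = 32  # from the text: LDS is divided into 32 banks
TILE = 32       # hypothetical 32x32 tile of 4-byte words


def linear(row: int, col: int) -> int:
    """Row-major word address inside the tile."""
    return row * TILE + col


def swizzled(row: int, col: int) -> int:
    """Row-major address with the column XOR-ed by the row index."""
    return row * TILE + (col ^ row)


def max_conflict(layout, coords) -> int:
    """Largest number of accesses landing in one bank (1 means conflict-free)."""
    hits = [0] * NUM_BANKS
    for row, col in coords:
        hits[layout(row, col) % NUM_BANKS] += 1
    return max(hits)


column_read = [(r, 5) for r in range(TILE)]  # 32 threads read one column
row_read = [(5, c) for c in range(TILE)]     # 32 threads read one row

for name, layout in (("linear", linear), ("swizzled", swizzled)):
    print(name, max_conflict(layout, column_read), max_conflict(layout, row_read))
```

In this model the linear layout sends every element of a column to the same bank, while the swizzled layout spreads both row and column reads across all 32 banks.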
Generating High-Quality Machine Code
GPUs use simple in-order execution units to maximize CU density, so branches and pipeline stalls are costly. AMD GPUs provide conditional moves and bounded memory instructions that make it possible to eliminate branches entirely. For example, Petit uses range-specified buffer load/store instructions, where the GPU automatically drops out-of-bounds accesses; LDS accesses beyond 64KB are also handled automatically without performance penalty. Additionally, Petit provides compiler hints to overlap MFMA (matrix fused multiply-add) instructions with memory accesses, effectively hiding memory latency.
Standard compilers may not fully exploit advanced GPU ISA features (for example, an intentional out-of-bounds access is undefined behavior at the language level). These optimizations therefore have to be constructed and verified by hand.
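The semantics of range-checked buffer accesses can be emulated in Python to show why they remove edge-case branches from kernel code. This is only an illustration: `buffer_load` and `buffer_store` are hypothetical helper names, not AMD intrinsics, and the sketch assumes that an out-of-range load returns zero and an out-of-range store is discarded. On the GPU the range check is done by the memory instruction itself, so the kernel needs no conditional.

```python
def buffer_load(buf, index, num_records):
    """Emulated range-checked load; out-of-range reads return 0 (assumed)."""
    return buf[index] if 0 <= index < num_records else 0


def buffer_store(buf, index, value, num_records):
    """Emulated range-checked store; out-of-range writes are discarded."""
    if 0 <= index < num_records:
        buf[index] = value


def tiled_sum(data, tile):
    """Sum `data` in fixed-size tiles with no special case for the ragged tail."""
    total = 0
    num_tiles = (len(data) + tile - 1) // tile
    for t in range(num_tiles):
        for lane in range(tile):
            # The last tile reads past the end; those lanes contribute 0.
            total += buffer_load(data, t * tile + lane, len(data))
    return total


data = list(range(1, 14))  # 13 elements, not a multiple of the tile size
print(tiled_sum(data, 8))

out = [0, 0, 0, 0]
buffer_store(out, 7, 5, len(out))  # dropped: index 7 is out of range
buffer_store(out, 2, 5, len(out))  # applied
print(out)
```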
Performance Results
End-to-End Inference Performance
We evaluate Petit's practical effectiveness by comparing FP4 vs BF16 model end-to-end inference performance. Tests use two Llama 3.3 70B variants with SGLang v0.4.10, measuring input/output token throughput at batch sizes 10 and 64. Environment: AMD Developer Cloud VM (1× MI300X GPU, 240 GB RAM, 5 TB SSD) running ROCm 6.4.2 on Ubuntu 24.04.1.
Figure 2 shows offline generation benchmark results (using real ShareGPT traces reflecting production performance). Overall, Petit serves Llama 3.3 70B FP4 models 1.74× faster (batch 10) and 1.60× faster (batch 64) than SGLang's native BF16 models. In memory-bandwidth-limited production scenarios with small batches, Petit efficiently utilizes the 3.5× smaller FP4 models, achieving higher throughput. Reproduce with:
python -m sglang.bench_offline_throughput --model-path nvidia/Llama-3.3-70B-Instruct-FP4 --num-prompts 10
python -m sglang.bench_offline_throughput --model-path nvidia/Llama-3.3-70B-Instruct-FP4 --num-prompts 64
Detailed Performance Analysis
We further compare Petit's performance against hipBLASLt (AMD's state-of-the-art low-level assembly GEMM library). Note the libraries have slightly different objectives:
- Petit: BF16 matrix × NVFP4 matrix (16 elements sharing 1 FP8 scale).
- hipBLASLt: Two BF16 matrices.
Although the two workloads are not identical, the results provide quantitative insight. We use the actual weight matrix dimensions that arise when serving Llama 3.3 70B, measuring performance for m=16 (decoding workload) and m=256 (prefilling workload) and averaging over 100 runs (after 50 warmup runs). Both libraries are tuned to their optimal configurations.
Figures 3a and 3b show GEMM performance. For m=16 (decoding-dominant), Petit is up to 3.7× faster than hipBLASLt, with average speedup of 2.56×. For m=256 (prefilling), Petit is up to 1.09× faster.
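A back-of-the-envelope calculation helps explain why the gain is much larger at m=16 than at m=256. The Python sketch below estimates bytes per weight and FLOPs per byte for a single GEMM, assuming BF16 activations, perfect data reuse, and a hypothetical 8192×8192 weight matrix. These figures are illustrative and are not measurements from the benchmarks above.

```python
def weight_bytes_per_element(bits_per_weight, scale_bytes=0.0, group_size=1):
    """Average storage per weight, including a shared per-group scale."""
    return bits_per_weight / 8 + scale_bytes / group_size


def gemm_flops_per_byte(m, n, k, w_bytes, act_bytes=2.0):
    """FLOPs per byte moved for an (m x k) @ (k x n) GEMM with ideal reuse."""
    flops = 2 * m * n * k
    bytes_moved = n * k * w_bytes + (m * k + m * n) * act_bytes
    return flops / bytes_moved


bf16_bytes = weight_bytes_per_element(16)
# NVFP4 as described above: 4-bit values, one 1-byte FP8 scale per 16 elements.
nvfp4_bytes = weight_bytes_per_element(4, scale_bytes=1.0, group_size=16)
print("weight byte ratio:", bf16_bytes / nvfp4_bytes)

N = K = 8192  # hypothetical weight shape
for m in (16, 256):
    print(m,
          gemm_flops_per_byte(m, N, K, bf16_bytes),
          gemm_flops_per_byte(m, N, K, nvfp4_bytes))
```

The weight byte ratio comes out at about 3.56, in line with the 3.5× size reduction quoted earlier. At m=16 both variants sit far below the arithmetic density of several hundred operations per byte mentioned above, so moving fewer weight bytes translates fairly directly into speed. At m=256 the intensity is much higher and the byte savings matter less.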