Letting Tensors Soar: R-Fork Accelerates Large Model Weight Loading

Feb 4, 2026, LMSYS

LMSYS, SGLang, Tensor R-Fork, GPU-Direct RDMA, Large Model Optimization, Weight Loading

TL;DR

We introduce Tensor R-Fork (Tensor Remote Fork), a novel weight loading method that leverages efficient cross-node device-to-device interconnects to achieve zero-copy loading of tensors from running SGLang instances to new instances.

This method offers three key advantages:

Significantly improves weight loading performance;
Eliminates redundant model weight storage on local disk/DRAM;
Ensures inference service remains undisturbed.

Taking DeepSeek-R1 as an example, loading time drops from several minutes to just seconds, approximately 600 GB of local disk/DRAM storage is saved, and inference service quality remains stable.

Background

As LLM services scale and model weights grow, the cold start time of SGLang instances has become a critical bottleneck for production efficiency. Of the cold start phases, weight loading is the most time-consuming.

Taking DeepSeek-R1 as an example, loading from local disk takes several minutes, while loading from remote storage can take tens of minutes. As model sizes continue to grow, initialization and data transfer times will get even longer.

How to optimize weight loading? The most direct approach is to maximize the bottleneck bandwidth of the weight data flow. Common model loading methods in the industry and their bottlenecks are shown in the table below:

Loading Source	Data Flow	Bottleneck
Remote Storage Center	Remote Storage → Remote Ethernet NIC → Ethernet → Local Ethernet NIC → Local DRAM → Local GPU Memory	NVMe / Ethernet NIC
Local Disk	Disk → DRAM → GPU Memory	NVMe
Local DRAM	DRAM → GPU Memory	PCIe
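The bottleneck reasoning in the table can be made concrete with a rough estimate: the effective bandwidth of a path is the minimum over its hops, and a lower bound on load time is the model size divided by that minimum. The sketch below is illustrative only. The per-hop bandwidth figures are placeholder assumptions, not measurements from this post; only the approximate 600 GB size comes from the text. It also ignores parallelism across GPUs and NICs.

```python
# Rough load-time estimate: a path is only as fast as its slowest hop.
# All bandwidth figures are illustrative assumptions (GB/s), not measurements.
ASSUMED_HOP_BANDWIDTH_GBPS = {
    "ethernet_nic": 3.0,  # assumed Ethernet NIC throughput
    "nvme": 7.0,          # assumed NVMe sequential read throughput
    "pcie": 50.0,         # assumed host-to-GPU PCIe throughput
}

# Hops that can limit each loading source, following the table above.
PATHS = {
    "remote_storage": ["nvme", "ethernet_nic", "pcie"],
    "local_disk": ["nvme", "pcie"],
    "local_dram": ["pcie"],
}


def bottleneck(path):
    """Return (hop_name, bandwidth) of the slowest hop on the path."""
    hop = min(path, key=lambda h: ASSUMED_HOP_BANDWIDTH_GBPS[h])
    return hop, ASSUMED_HOP_BANDWIDTH_GBPS[hop]


def estimated_load_seconds(model_size_gb, path):
    """Lower bound on load time: size divided by the bottleneck bandwidth."""
    _, bandwidth = bottleneck(path)
    return model_size_gb / bandwidth


if __name__ == "__main__":
    model_size_gb = 600  # approximate DeepSeek-R1 weight size quoted above
    for name, path in PATHS.items():
        hop, bw = bottleneck(path)
        secs = estimated_load_seconds(model_size_gb, path)
        print(f"{name}: bottleneck={hop} ({bw} GB/s), at least {secs:.0f} s")
```

Raising the bandwidth of any hop other than the bottleneck leaves the estimate unchanged, which is why the approach below targets the slowest link rather than the path as a whole.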

Can we use a higher-bandwidth data flow to transfer tensors? The answer is YES: cross-node device-to-device interconnects can provide hundreds of GB/s of throughput. The key question is how to fully utilize this interconnect bandwidth for efficient weight loading in SGLang.

To address this, we developed Tensor R-Fork, a new weight loading framework that reduces DeepSeek-R1 loading time to seconds and is production-ready.

Design

The core concept of Tensor R-Fork is to utilize GPU-Direct RDMA to build a point-to-point (P2P) weight storage architecture.

Traditional loading paths are slow because each contains a hop whose bandwidth is far lower than that of device-to-device interconnects. In a running instance, however, the weight tensors already reside on each GPU and can be transferred directly across nodes via GPU-Direct RDMA.

To maximize RDMA NIC bandwidth utilization, we designed a per-GPU-pair data transfer strategy: local GPUs directly transfer data to/from paired remote GPUs. This design bypasses the PCIe bottleneck between GPU and CPU, achieving high-throughput communication without CPU/host memory dependencies.

The process of loading weights from a remote SGLang instance is as follows:

Implementation

To enable each running instance to serve as a weight source for new instances of the same model while minimizing (or eliminating) interference with the running instance's inference service, we implemented two backends: NCCL and TransferEngine. Using running instance A (source instance) and new instance B (target instance) as examples, we detail the weight transfer mechanism.

NCCL Backend

Using NCCL as the backend [1], the process consists of two phases:

Establish communication groups between source and target instances.
Transfer weights through the communication groups.

During initialization, the target instance sends an HTTP request to the specified source instance to initiate communication group creation. Each target TPWorker establishes an NCCL communication group with the corresponding source TPWorker (source rank 0 pairs with target rank 0, and so on), so each group has only two members.

After the communication groups are established, each source TPWorker sends its GPU-resident weight tensors via NCCL broadcast, and the paired target TPWorker receives them directly into its GPU memory without intermediate copies.
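The pairing described above can be sketched in plain Python. This is a planning-only illustration, not SGLang's implementation: the `PairGroup` type, the `plan_pair_groups` helper, and the one-rendezvous-port-per-pair layout are all hypothetical, and no NCCL calls are made.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PairGroup:
    """One two-member communication group: a source worker and its target peer."""
    tp_rank: int             # same TP rank on the source and target instance
    port: int                # rendezvous port for this pair (assumed layout)
    src_group_rank: int = 0  # source acts as the broadcast root
    dst_group_rank: int = 1  # target receives the broadcast


def plan_pair_groups(tp_size, group_ports):
    """Pair source rank i with target rank i, using one distinct port per pair."""
    ports = list(group_ports)
    if len(ports) != tp_size:
        raise ValueError("expected exactly one port per TP rank")
    if len(set(ports)) != tp_size:
        raise ValueError("ports must be distinct")
    return [PairGroup(tp_rank=rank, port=port) for rank, port in enumerate(ports)]


if __name__ == "__main__":
    for group in plan_pair_groups(8, range(60000, 60008)):
        print(group)
```

Keeping every group at two members means each broadcast has exactly one receiver, so the transfers for different TP ranks are independent of one another.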

While NCCL utilizes GPU-Direct RDMA, it has a key limitation: transfers interfere with the source instance's inference service, for two reasons:

Communication group establishment: Source instances need to actively participate.
CUDA kernel interference: NCCL broadcast triggers CUDA kernels, competing for GPU resources and causing latency spikes in generation tasks.

TransferEngine Backend

To achieve interference-free transfers, we introduce the TransferEngine backend [2], a lightweight RDMA-based transfer runtime that runs in parallel with each TPWorker of the source instance, exposing GPU-resident weight tensors to remote readers without requiring CUDA kernels on the source side.

During source SGLang instance initialization:

Each TPWorker starts a TransferEngine instance.
TransferEngine registers the GPU memory addresses of the weights with RDMA channels.

During target instance initialization:

The target sends an HTTP request to obtain the source TransferEngine metadata, including the RDMA keys mapped to GPU memory addresses.
Using the RDMA keys, the target loads weights directly from source GPU memory without interrupting the source service.
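The register-then-read flow above can be modeled with a toy in-process simulation. This is a sketch of the protocol shape only, not the Mooncake TransferEngine API: `ToySourceEngine`, `register`, `remote_read`, and `load_weights` are hypothetical names, host byte buffers stand in for GPU memory, and a method call stands in for a one-sided RDMA read.

```python
import secrets


class ToySourceEngine:
    """Stand-in for a source-side engine; NOT the Mooncake TransferEngine API."""

    def __init__(self):
        self._regions = {}  # rkey -> registered buffer

    def register(self, name, buffer):
        """Register a weight buffer and return the metadata a target would fetch."""
        rkey = secrets.token_hex(8)
        self._regions[rkey] = memoryview(buffer)
        return {"name": name, "rkey": rkey, "nbytes": len(buffer)}

    def remote_read(self, rkey, offset, nbytes):
        """Model a one-sided read: data is served by key, with no source-side compute."""
        region = self._regions[rkey]
        if offset < 0 or offset + nbytes > len(region):
            raise ValueError("read outside registered region")
        return bytes(region[offset:offset + nbytes])


def load_weights(source, metadata, chunk_bytes=4):
    """Target side: pull every registered tensor chunk by chunk using its key."""
    loaded = {}
    for entry in metadata:
        dst = bytearray(entry["nbytes"])
        for off in range(0, entry["nbytes"], chunk_bytes):
            n = min(chunk_bytes, entry["nbytes"] - off)
            dst[off:off + n] = source.remote_read(entry["rkey"], off, n)
        loaded[entry["name"]] = bytes(dst)
    return loaded


if __name__ == "__main__":
    source = ToySourceEngine()
    metadata = [source.register("layer0.weight", bytes(range(10)))]
    print(load_weights(source, metadata))
```

The point of the model is that the source does its work once, at registration time. Afterwards the target drives every read using only the metadata it fetched.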

Want to learn more about TransferEngine? Check out the appendix 🚀

NCCL vs. TransferEngine

	NCCL	TransferEngine
Deployment Complexity	✅ No additional dependencies	❌ Requires the additional Mooncake library
Transfer Setup Overhead	✅ Building communication group only takes hundreds of milliseconds	➖ Memory region registration takes seconds, but can overlap with other initialization
No Interference to GPU Load	❌ Transfer launches CUDA kernels	✅ No CUDA kernels

Usage

See the R-Fork documentation for details.

Using NCCL Backend

Seed Instance:

python -m sglang.launch_server [args]

Client Instance:

python -m sglang.launch_server [args] \
  --load-format remote_instance  \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-send-weights-group-ports [send_weights_nccl_group_ports_list]  \
  --remote-instance-weight-loader-backend nccl

Using TransferEngine Backend

Seed Instance:

python -m sglang.launch_server [args] \
  --remote-instance-weight-loader-start-seed-via-transfer-engine

Client Instance:

python -m sglang.launch_server [args] \
  --load-format remote_instance  \
  --remote-instance-weight-loader-seed-instance-ip [seed_instance_ip] \
  --remote-instance-weight-loader-seed-instance-service-port [seed_instance_service_port] \
  --remote-instance-weight-loader-backend transfer_engine
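When client instances are launched from a script rather than by hand, the TransferEngine client command can be assembled programmatically. The helper below is a minimal sketch that uses only the flags shown above; the function name, the example IP address and port, and the `/path/to/model` placeholder are illustrative assumptions.

```python
import shlex


def transfer_engine_client_command(seed_ip, seed_service_port, extra_args=()):
    """Assemble the client launch command from the flags shown above."""
    return [
        "python", "-m", "sglang.launch_server",
        *extra_args,
        "--load-format", "remote_instance",
        "--remote-instance-weight-loader-seed-instance-ip", str(seed_ip),
        "--remote-instance-weight-loader-seed-instance-service-port",
        str(seed_service_port),
        "--remote-instance-weight-loader-backend", "transfer_engine",
    ]


if __name__ == "__main__":
    # Placeholder seed address, port, and model path; substitute real values.
    cmd = transfer_engine_client_command(
        "10.0.0.1", 30000, ["--model-path", "/path/to/model"]
    )
    print(shlex.join(cmd))
```

Returning an argument list rather than a single string lets the result be passed directly to `subprocess.run` without shell quoting concerns.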

Performance

We measured the time to load the DeepSeek-R1 model from different sources on a new SGLang instance equipped with 8 NVIDIA H20 GPUs.

Memory region registration can overlap with other initialization phases, further optimizing total startup time.
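The overlap can be pictured as running registration in a background thread while the rest of startup proceeds. The sketch below is a generic illustration, not SGLang's startup code: both functions are hypothetical placeholders, and `time.sleep` stands in for the real work.

```python
import time
from concurrent.futures import ThreadPoolExecutor


def register_memory_regions():
    """Placeholder for the seconds-long registration step (simulated with sleep)."""
    time.sleep(0.2)
    return "regions_registered"


def other_initialization():
    """Placeholder for unrelated startup work (simulated with sleep)."""
    time.sleep(0.2)
    return "other_init_done"


def start_with_overlap():
    """Run registration in a background thread while other init proceeds."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        registration = pool.submit(register_memory_regions)
        other = other_initialization()
        # Block only at the point where the registered regions are needed.
        return registration.result(), other


if __name__ == "__main__":
    start = time.perf_counter()
    print(start_with_overlap())
    print(f"elapsed: {time.perf_counter() - start:.2f} s")
```

With the two steps overlapped, total time approaches the longer of the two rather than their sum.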

Industrial Practice

The manual seed instance configuration described above is suitable for testing, but identifying available seed instances in a production deployment incurs significant operational overhead.

To address this challenge, we propose the Ten solution