KV Cache Transfer in Disaggregated Serving

In disaggregated serving architectures, the KV cache must be transferred from prefill workers to decode workers. TensorRT-LLM supports two methods for this transfer:

Default Method: UCX

By default, TensorRT-LLM uses UCX (Unified Communication X) for KV cache transfer between prefill and decode workers. UCX provides high-performance communication optimized for GPU-to-GPU transfers.

Beta Method: NIXL

TensorRT-LLM also supports using NIXL (NVIDIA Inference Xfer Library) for KV cache transfer. NIXL is NVIDIA's high-performance communication library designed for efficient data transfer in distributed GPU environments.

Note: NIXL support in TensorRT-LLM is currently in beta and may have rough edges.

Using NIXL for KV Cache Transfer

Note: NIXL backend for TensorRT-LLM is currently only supported on AMD64 (x86_64) architecture. If you're running on ARM64, you'll need to use the default UCX method for KV cache transfer.
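The architecture restriction above can be checked before starting a build. The following is a minimal sketch, not part of the project's tooling: the function name `kv_transfer_backend_for_arch` and the labels it prints are hypothetical and chosen only for illustration.

```shell
# Map a machine architecture string to the KV cache transfer options
# described in the note above.
# Assumption: the function name and the printed labels are illustrative only.
kv_transfer_backend_for_arch() {
  case "$1" in
    x86_64|amd64) echo "nixl-or-ucx" ;;  # NIXL is supported on AMD64 (x86_64)
    *)            echo "ucx" ;;          # e.g. aarch64/arm64: use the default UCX method
  esac
}

arch="$(uname -m)"
echo "Architecture: ${arch} -> $(kv_transfer_backend_for_arch "${arch}")"
```

On an ARM64 host `uname -m` typically reports `aarch64`, which the sketch maps to the UCX-only case.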

To enable NIXL for KV cache transfer in disaggregated serving:

Build the container with NIXL support: The TensorRT-LLM wheel must be built from source with NIXL support. The ./container/build.sh script caches previously built TensorRT-LLM wheels to reduce build time. If you have previously built a TensorRT-LLM wheel without NIXL support, you must delete the cached wheel to force a rebuild with NIXL support.

Remove cached TensorRT-LLM wheel (only if previously built without NIXL support):
```
rm -rf /tmp/trtllm_wheel
```
Build the container with NIXL support:
```
./container/build.sh --framework trtllm \
  --tensorrtllm-git-url https://github.com/NVIDIA/TensorRT-LLM.git \
  --tensorrtllm-commit main
```
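The cache-clearing step above only matters when a cached wheel actually exists. As a sketch under that assumption, a small guard can report whether anything was removed. The helper name `clear_wheel_cache` is hypothetical; the default path is the one used in the `rm -rf` command above.

```shell
# Remove a cached wheel directory if present, so the next build starts clean.
# Assumption: clear_wheel_cache is a hypothetical helper, not part of build.sh.
clear_wheel_cache() {
  cache_dir="${1:-/tmp/trtllm_wheel}"   # default: the cache path shown above
  if [ -e "${cache_dir}" ]; then
    rm -rf -- "${cache_dir}"
    echo "removed ${cache_dir}"
  else
    echo "no cached wheel at ${cache_dir}"
  fi
}
```

Calling `clear_wheel_cache` with no argument before running `./container/build.sh` has the same effect as the manual `rm -rf`, with a message indicating whether a stale wheel was found.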
Run the containerized environment: See the run container section to learn how to start the container image built in the previous step.
Start the disaggregated service: See the disaggregated serving section to learn how to start the deployment.
Send the request: See the client section to learn how to send a request to the deployment.

Important: Ensure that ETCD and NATS services are running before starting the service.
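One rough way to confirm that both services are up is to test whether their client ports accept TCP connections. The sketch below assumes both run on `localhost` with their default client ports (2379 for etcd, 4222 for NATS); the helper `port_open` is hypothetical and relies on bash's `/dev/tcp` and the coreutils `timeout` command. An open port shows only that something is listening, not that the service is healthy.

```shell
# Return 0 if a TCP connection to host:port succeeds within 2 seconds.
# Assumption: port_open is a hypothetical helper; it needs bash and coreutils timeout.
port_open() {
  timeout 2 bash -c "exec 3<>/dev/tcp/$1/$2" 2>/dev/null
}

# Assumption: both services run locally on their default client ports
# (2379 for etcd, 4222 for NATS); adjust hosts and ports for your deployment.
for svc in "etcd:localhost:2379" "nats:localhost:4222"; do
  name="${svc%%:*}"; rest="${svc#*:}"
  host="${rest%%:*}"; port="${rest##*:}"
  if port_open "${host}" "${port}"; then
    echo "${name} reachable at ${host}:${port}"
  else
    echo "${name} NOT reachable at ${host}:${port}"
  fi
done
```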
