A complete GPU KV-cache offload solution that moves KV tensors from Host GPU memory to BlueField DPU-backed storage tiers without Host CPU involvement.
This project provides an end-to-end pipeline for offloading GPU-resident data, primarily LLM KV caches, to storage attached to a local BlueField DPU. It is built from three integrated pieces:
- DPU Agent (dpu-agent/): Runs on the BlueField DPU ARM cores. It imports the remote GPU memory map, executes DOCA DMA operations, and writes incoming data to DPU-side storage backends.
- NIXL Plugin (nixl-plugin/): A host-side NIXL backend named DOCA_DMA_PROXY. It registers GPU buffers as VRAM_SEG, exports them over PCIe with DOCA DMA, and forwards transfer requests to the DPU agent.
- LMCache Integration (examples/lmcache/): A patch set and configuration example that enables LMCache v0.4.3 to use the DOCA_DMA_PROXY backend for transparent KV-cache tiering.
Together these components let an application such as LMCache express a transfer as VRAM_SEG → OBJ_SEG and have the actual PCIe DMA and storage I/O executed by the DPU.
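To make the VRAM_SEG → OBJ_SEG idea concrete, here is a minimal Python sketch of how a transfer could be described and classified. It is a model for illustration only, not the NIXL or plugin API: the `Segment` class, the `direction` helper, and the `gpu0:kv_block_17` label are hypothetical names, and treating OBJ_SEG → VRAM_SEG as the "pull" direction is an assumption based on the push/pull wording used by the Python example.

```python
from dataclasses import dataclass
from enum import Enum


class SegType(Enum):
    VRAM_SEG = "VRAM_SEG"  # host GPU memory
    OBJ_SEG = "OBJ_SEG"    # DPU-resident object or file


@dataclass(frozen=True)
class Segment:
    seg_type: SegType
    ref: str     # hypothetical: a GPU buffer label, or an object path/key
    length: int  # size in bytes


def direction(src: Segment, dst: Segment) -> str:
    """Classify a transfer as 'push' (offload) or 'pull' (reload)."""
    if src.length != dst.length:
        raise ValueError("source and destination lengths differ")
    pair = (src.seg_type, dst.seg_type)
    if pair == (SegType.VRAM_SEG, SegType.OBJ_SEG):
        return "push"
    if pair == (SegType.OBJ_SEG, SegType.VRAM_SEG):
        return "pull"
    raise ValueError(f"unsupported pair: {pair[0].value} -> {pair[1].value}")


# A 64 MiB KV block on the GPU, offloaded to an object on the DPU side.
kv_block = Segment(SegType.VRAM_SEG, "gpu0:kv_block_17", 64 << 20)
obj = Segment(SegType.OBJ_SEG, "/data/test_obj", 64 << 20)
print(direction(kv_block, obj))
```

The application only names the two endpoints; which DMA jobs and storage writes are issued is left to the plugin and the DPU agent.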
The DPU agent can land data in multiple backend types, allowing the same offload path to target different cost/performance tiers:
| Target | How it is used | Typical use case |
|---|---|---|
| DPU DRAM | Pre-allocated staging buffer; can also serve as a fast transient tier | Low-latency cache spill |
| DPU-local disk | POSIX files via the agent's posix_storage_backend | Capacity tier on BlueField NVMe |
| Remote / object storage | NIXL OBJ_SEG backend (e.g. xdfs_storage_backend) | Shared object store, distributed cache |
Bulk data always moves over DOCA DMA between Host GPU and DPU. Only small control messages travel over DOCA Comch or TCP.
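The split between bulk data and control messages can be illustrated with a toy request encoding. The layout below is hypothetical and is not the project's wire protocol in common/dma_transfer.h; the field names and the opcode value are assumptions chosen only to show that a request describing a large payload can itself be a few dozen bytes.

```python
import struct

# Hypothetical control-message layout (NOT the layout in dma_transfer.h):
#   opcode (u8), 3 pad bytes, GPU offset (u64), length (u64), key length (u16)
# followed by the UTF-8 object key.
HEADER = struct.Struct("<B3xQQH")


def encode_request(opcode: int, gpu_offset: int, length: int, key: str) -> bytes:
    key_bytes = key.encode("utf-8")
    return HEADER.pack(opcode, gpu_offset, length, len(key_bytes)) + key_bytes


def decode_request(msg: bytes):
    opcode, gpu_offset, length, key_len = HEADER.unpack_from(msg)
    key = msg[HEADER.size:HEADER.size + key_len].decode("utf-8")
    return opcode, gpu_offset, length, key


payload_bytes = 64 << 20  # the bulk data itself would move over DOCA DMA
msg = encode_request(1, 0, payload_bytes, "/data/test_obj")
print(len(msg), "control bytes describe", payload_bytes, "payload bytes")
```

Because the message carries only offsets, lengths, and a key, the choice of control transport (Comch or TCP) should have little effect on bulk throughput.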
In LLM serving, the KV cache is large, grows with sequence length, and competes with model weights for limited GPU HBM. Existing offload paths often route data through the Host CPU or across the network, which:
- consumes host CPU cycles that could run the inference engine,
- adds extra memory copies,
- and is hard to integrate cleanly with a tiered cache.
By using the BlueField DPU's dedicated DOCA DMA engine, this solution:
- moves data directly between GPU and DPU storage across the PCIe complex,
- keeps the host CPU out of the data path,
- and exposes the offload path through the standard NIXL API so applications like LMCache do not need to know DOCA details.
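For a sense of scale, the KV-cache footprint of a standard transformer can be estimated as 2 (key and value) × layers × KV heads × head dimension × tokens × bytes per element. The sketch below applies that formula to an illustrative model shape (32 layers, 32 KV heads, head dimension 128, FP16); the shape is an example, not a statement about any model this project was tested with.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch=1, dtype_bytes=2):
    """Estimate KV-cache size in bytes for a standard attention stack."""
    # Factor 2: one key tensor and one value tensor per layer.
    return 2 * num_layers * num_kv_heads * head_dim * seq_len * batch * dtype_bytes


# Illustrative shape: 32 layers, 32 KV heads, head_dim 128, FP16 (2 bytes).
per_token = kv_cache_bytes(32, 32, 128, seq_len=1)
print(per_token // 1024, "KiB per token")                                # 512
print(kv_cache_bytes(32, 32, 128, seq_len=4096) / 2**30, "GiB at 4096 tokens")  # 2.0
```

At half a MiB per token in this example, a single 4096-token sequence already occupies 2 GiB of HBM, which is the pressure a lower-cost offload tier is meant to relieve.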
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                      Host                                       │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                     │
│  │ LMCache / vLLM      │    │ NIXL Agent                  │                     │
│  │ (KV-cache manager)  │◄──►│ + DOCA_DMA_PROXY backend    │                     │
│  └─────────────────────┘    │   - registers GPU VRAM      │                     │
│                             │   - exports GPU mmap        │                     │
│                             │   - sends transfer requests │                     │
│                             └──────────────┬──────────────┘                     │
│                                            │                                    │
│                              Control plane │ (DOCA Comch / TCP)                 │
│                                            ▼                                    │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                     │
│  │ GPU HBM (VRAM_SEG)  │◄──►│ DOCA DMA over PCIe          │                     │
│  └─────────────────────┘    └─────────────────────────────┘                     │
└─────────────────────────────────────────────────────────────────────────────────┘
                                             │
                                             ▼
┌─────────────────────────────────────────────────────────────────────────────────┐
│                                  BlueField DPU                                  │
│  ┌───────────────────────────────────────────────────────────────────────────┐  │
│  │ dpu_dma_copy agent                                                        │  │
│  │ ┌──────────┐    ┌────────────────┐    ┌────────────────────────────────┐  │  │
│  │ │ DOCA DMA │◄──►│ staging buffer │◄──►│ NIXL storage backend           │  │  │
│  │ │ engine   │    │ (DPU DRAM)     │    │ (posix / xdfs / xdfs_kv / ...) │  │  │
│  │ └──────────┘    └────────────────┘    └───────────────┬────────────────┘  │  │
│  │                                                       │                   │  │
│  │                                                       ▼                   │  │
│  │                                                 ┌────────────┐            │  │
│  │                                                 │ DPU-local  │            │  │
│  │                                                 │ NVMe / OBJ │            │  │
│  │                                                 └────────────┘            │  │
│  └───────────────────────────────────────────────────────────────────────────┘  │
└─────────────────────────────────────────────────────────────────────────────────┘
The DPU agent is the piece that executes the offload. It runs as a service on the BlueField DPU and is intentionally separate from the NIXL library so it can evolve independently.
Responsibilities:
- Import the host GPU mmap from the PCI export descriptor sent by the plugin.
- Maintain a reusable DPU-side staging buffer.
- Execute chunked, pipelined DOCA DMA with configurable queue depth.
- Forward received data to a NIXL storage backend running on the DPU, which in turn writes to local files or object storage.
Build and run instructions are in dpu-agent/README.md.
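The "chunked, pipelined DMA with configurable queue depth" responsibility can be sketched as a small simulation. This is not the agent's code: `plan_chunks`, `run_pipeline`, and the `submit`/`wait` callbacks are hypothetical stand-ins for posting a DOCA DMA job and polling its completion, and the 4 MiB chunk size is an arbitrary example.

```python
from collections import deque


def plan_chunks(total_bytes, chunk_bytes):
    """Split [0, total_bytes) into (offset, length) chunks."""
    return [(off, min(chunk_bytes, total_bytes - off))
            for off in range(0, total_bytes, chunk_bytes)]


def run_pipeline(chunks, queue_depth, submit, wait):
    """Keep at most queue_depth chunks in flight; return the peak in-flight count."""
    in_flight = deque()
    peak = 0
    for chunk in chunks:
        if len(in_flight) == queue_depth:
            wait(in_flight.popleft())  # retire the oldest chunk before posting more
        in_flight.append(submit(chunk))
        peak = max(peak, len(in_flight))
    while in_flight:                   # drain whatever is still outstanding
        wait(in_flight.popleft())
    return peak


completed = []
chunks = plan_chunks(10 << 20, 4 << 20)  # 10 MiB split into 4 MiB chunks
peak = run_pipeline(chunks, queue_depth=2,
                    submit=lambda c: c,      # stand-in for posting a DMA job
                    wait=completed.append)   # stand-in for polling completion
print(chunks)
print("peak in flight:", peak, "all completed in order:", completed == chunks)
```

Bounding the in-flight count is what lets a fixed-size staging buffer be reused: a slot is only handed to a new chunk once the chunk that previously occupied it has been retired.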
The host plugin implements the NIXL nixlBackendEngine interface. It exposes two memory types:
- VRAM_SEG: Host GPU memory, exported via doca_mmap_export_pci().
- OBJ_SEG: DPU-resident object/file, identified by a path or key string.
The backend is local-only (supportsRemote() == false): both the GPU and the DPU must be reachable through the same host-side BlueField PCI function.
Because NIXL loads backends dynamically, the plugin source is injected into a NIXL source tree with scripts/patch_nixl.sh and built together with NIXL.
examples/lmcache/ contains:
- lmcache_integration.patch: modifications to LMCache v0.4.3 to recognize and use the DOCA_DMA_PROXY backend.
- lmcache-config.yaml: sample configuration.
- patch_lmcache.sh: helper that applies the patch idempotently.
After patching LMCache, you can configure a storage backend that points to the DPU agent and offload KV tensors transparently.
.
├── common/          # Shared host-DPU control channel + wire protocol (dma_transfer.h)
├── nixl-plugin/     # NIXL backend plugin source (patch into NIXL)
├── dpu-agent/       # BlueField DPU proxy service
├── examples/
│   ├── cpp/         # NIXL C++ example
│   ├── python/      # NIXL Python example
│   ├── standalone/  # Standalone host test tool (no NIXL required)
│   └── lmcache/     # LMCache v0.4.3 integration patch
├── scripts/         # patch_nixl.sh and build helpers
├── docs/            # Architecture and integration docs
├── CMakeLists.txt
├── LICENSE
└── CONTRIBUTING.md
On the BlueField DPU:
export DOCA_DIR=/opt/mellanox/doca
export NIXL_ROOT=/opt/nvidia/nvda_nixl
mkdir -p build && cd build
cmake .. -DBUILD_EXAMPLES=OFF
make -j$(nproc) dpu_dma_copy
Run the agent (TCP fallback mode for the easiest first test):
./dpu-agent/dpu_dma_copy -p 0000:03:00.0 -m 256 -q 4 -b posix -T
Omit -T to use DOCA Comch mode.
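If the agent is launched from a script or service wrapper, the command line above can be assembled programmatically. The helper below is a hypothetical convenience, not part of the repository. It reuses only the flags shown in the command above; this README does not spell out what -m and -q mean, so the parameter names `mem` and `queue` are deliberately neutral (a staging size and a DMA queue depth are plausible readings, but that is an assumption).

```python
import shlex


def agent_command(pci_addr, backend="posix", mem=256, queue=4, tcp=True,
                  binary="./dpu-agent/dpu_dma_copy"):
    """Build the dpu_dma_copy argument list from the flags shown in this README."""
    # -m / -q semantics are assumed, not documented here; values mirror the example.
    args = [binary, "-p", pci_addr, "-m", str(mem), "-q", str(queue), "-b", backend]
    if tcp:
        args.append("-T")  # TCP fallback; leave it out for DOCA Comch mode
    return args


print(shlex.join(agent_command("0000:03:00.0")))
print(shlex.join(agent_command("0000:03:00.0", tcp=False)))
```

Returning a list rather than a string keeps the result directly usable with `subprocess.run` without shell quoting concerns.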
On the host where NIXL is built:
./scripts/patch_nixl.sh /path/to/nixl/source
cd /path/to/nixl/source
meson setup build -Denable_plugins=DOCA_DMA_PROXY
ninja -C build
The patch script is idempotent; running it multiple times is safe.
export NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib/plugins
python3 examples/python/nixl_doca_dma_proxy_example.py \
-o push \
-p 0000:ba:00.0 \
-g 0 \
-f /data/test_obj \
-s 64 \
-d 10.75.70.125 \
  -m tcp
See examples/python/README.md for push/pull examples and COMCH-mode usage.
This project has been verified against NIXL v1.1.0. Other NIXL versions may require minor adjustments to scripts/patch_nixl.sh or the plugin source.
- docs/ARCHITECTURE.md: Host plugin, DPU agent, control plane, and data plane design.
- docs/LMCache_INTEGRATION.md: KV-cache offload reference architecture.
- dpu-agent/README.md: Build, run, and tune the DPU agent.
- examples/python/README.md: Python end-to-end example.
- examples/standalone/: Standalone host test tool that does not require NIXL.
- CONTRIBUTING.md: Build, test, and NIXL upstreaming workflow.
NIXL 1.1.0 uses tomlplusplus as a required dependency. If the telemetry plugin is enabled, the include path may not be propagated correctly.
Disable telemetry plugins before building:
cd /path/to/nixl/source
sed -i "s/^subdir('telemetry')/# subdir('telemetry')/" src/plugins/meson.build
meson setup build --wipe -Denable_plugins=DOCA_DMA_PROXY
ninja -C build
The C++ examples require the CUDA Toolkit. On a machine without CUDA, disable the examples:
cmake .. -DBUILD_EXAMPLES=OFF
make dpu_dma_copy
Or build the DPU agent directly from the dpu-agent/ directory:
cd dpu-agent
./scripts/build_dpu.sh
If NIXL cannot locate the DOCA_DMA_PROXY plugin at runtime, set the plugin search path:
export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/plugins:$LD_LIBRARY_PATH
Or in Python/C++ code:
agent.add_plugin_directory("/opt/nvidia/nvda_nixl/lib/plugins")
If NIXL was built with -Dstatic_plugins=DOCA_DMA_PROXY, the plugin is linked into libnixl.so and no search path is needed.
If the build cannot find DOCA, the DOCA SDK is either not installed or DOCA_DIR is incorrect. Pass the path explicitly:
cmake .. -DDOCA_DIR=/opt/mellanox/doca
Verify that /opt/mellanox/doca/include/doca_dma.h exists.
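The header check can be scripted as a quick preflight step before running CMake. This is a small sketch, not a tool shipped with the project: `doca_header_path` and `check_doca` are hypothetical helper names, and falling back to /opt/mellanox/doca when DOCA_DIR is unset simply mirrors the path used in this README.

```python
import os
from pathlib import Path


def doca_header_path(doca_dir=None):
    """Return the expected location of doca_dma.h under the DOCA install root."""
    root = Path(doca_dir or os.environ.get("DOCA_DIR", "/opt/mellanox/doca"))
    return root / "include" / "doca_dma.h"


def check_doca(doca_dir=None):
    """Return (found, path) for the DOCA DMA header."""
    header = doca_header_path(doca_dir)
    return header.is_file(), str(header)


found, header = check_doca()
print(("found " if found else "missing ") + header)
```

A "missing" result points at either an absent DOCA SDK or a DOCA_DIR that does not match the install location.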
Apache-2.0. See LICENSE.