BaizeAI/BlueCache


BlueCache

An end-to-end GPU KV-cache offload solution that moves KV tensors from host GPU memory to BlueField DPU-backed storage tiers while keeping the host CPU out of the data path.

Overview

This project provides an end-to-end pipeline for offloading GPU-resident data, primarily LLM KV caches, to storage attached to a local BlueField DPU. It is built from three integrated pieces:

  1. DPU Agent (dpu-agent/): Runs on the BlueField DPU ARM cores. It imports the remote GPU memory map, executes DOCA DMA operations, and writes incoming data to DPU-side storage backends.
  2. NIXL Plugin (nixl-plugin/): A host-side NIXL backend named DOCA_DMA_PROXY. It registers GPU buffers as VRAM_SEG, exports them over PCIe with DOCA DMA, and forwards transfer requests to the DPU agent.
  3. LMCache Integration (examples/lmcache/): A patch set and configuration example that enables LMCache v0.4.3 to use the DOCA_DMA_PROXY backend for transparent KV-cache tiering.

Together, these components let an application such as LMCache express a transfer as VRAM_SEG ↔ OBJ_SEG and have the actual PCIe DMA and storage I/O executed by the DPU.
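The VRAM_SEG ↔ OBJ_SEG transfer model can be pictured with a small data model. This is only an illustrative sketch, not the plugin's wire protocol: `SegType`, `Transfer`, and `direction` are hypothetical names, and the mapping of "push" to GPU-to-DPU and "pull" to DPU-to-GPU is an assumption based on the push/pull wording used by the Python example.

```python
from dataclasses import dataclass
from enum import Enum


class SegType(Enum):
    """The two memory types the backend exposes, as named in this README."""
    VRAM_SEG = "VRAM_SEG"  # host GPU memory
    OBJ_SEG = "OBJ_SEG"    # DPU-resident object or file


@dataclass(frozen=True)
class Transfer:
    """Hypothetical description of one transfer request."""
    src: SegType
    dst: SegType
    num_bytes: int


def direction(transfer: Transfer) -> str:
    """Classify a transfer as 'push' or 'pull' (assumed naming, see lead-in)."""
    if transfer.num_bytes <= 0:
        raise ValueError("num_bytes must be positive")
    pair = (transfer.src, transfer.dst)
    if pair == (SegType.VRAM_SEG, SegType.OBJ_SEG):
        return "push"  # GPU memory -> DPU-side object
    if pair == (SegType.OBJ_SEG, SegType.VRAM_SEG):
        return "pull"  # DPU-side object -> GPU memory
    raise ValueError("only VRAM_SEG <-> OBJ_SEG transfers are modelled")
```

Rejecting VRAM-to-VRAM and OBJ-to-OBJ pairs keeps the model limited to the two directions the text describes.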

Supported DPU storage targets

The DPU agent can land data in multiple backend types, allowing the same offload path to target different cost/performance tiers:

| Target | How it is used | Typical use case |
| --- | --- | --- |
| DPU DRAM | Pre-allocated staging buffer; can also serve as a fast transient tier | Low-latency cache spill |
| DPU-local disk | POSIX files via the agent's posix_storage_backend | Capacity tier on BlueField NVMe |
| Remote / object storage | NIXL OBJ_SEG backend (e.g. xdfs_storage_backend) | Shared object store, distributed cache |

Bulk data always moves over DOCA DMA between Host GPU and DPU. Only small control messages travel over DOCA Comch or TCP.

Why this matters

In LLM serving, the KV cache is large, grows with sequence length, and competes with model weights for limited GPU HBM. Existing offload paths often route data through the Host CPU or across the network, which:

  • consumes host CPU cycles that could run the inference engine,
  • adds extra memory copies,
  • and is hard to integrate cleanly with a tiered cache.

By using the BlueField DPU's dedicated DOCA DMA engine, this solution:

  • moves data directly between GPU and DPU storage across the PCIe complex,
  • keeps the host CPU out of the data path,
  • and exposes the offload path through the standard NIXL API so applications like LMCache do not need to know DOCA details.

Architecture

┌─────────────────────────────────────────────────────────────────────────────────────────┐
│  Host                                                                                   │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                             │
│  │ LMCache / vLLM      │    │ NIXL Agent                  │                             │
│  │ (KV-cache manager)  │───►│ + DOCA_DMA_PROXY backend    │                             │
│  └─────────────────────┘    │   - registers GPU VRAM      │                             │
│                             │   - exports GPU mmap        │                             │
│                             │   - sends transfer requests │                             │
│                             └─────────────┬───────────────┘                             │
│                                           │                                             │
│                              Control plane│(DOCA Comch / TCP)                           │
│                                           ▼                                             │
│  ┌─────────────────────┐    ┌─────────────────────────────┐                             │
│  │ GPU HBM (VRAM_SEG)  │◄──►│ DOCA DMA over PCIe          │                             │
│  └─────────────────────┘    └─────────────────────────────┘                             │
└─────────────────────────────────────────────────────────────────────────────────────────┘
                                            │
                                            ▼
┌─────────────────────────────────────────────────────────────────────────────────────────┐
│  BlueField DPU                                                                          │
│  ┌─────────────────────────────────────────────────────────────────────────────────┐    │
│  │ dpu_dma_copy agent                                                              │    │
│  │  ┌───────────────┐    ┌───────────────┐    ┌─────────────────────────────────┐  │    │
│  │  │ DOCA DMA      │───►│ staging buffer│───►│ NIXL storage backend            │  │    │
│  │  │ engine        │    │ (DPU DRAM)    │    │ (posix / xdfs / xdfs_kv / ...)  │  │    │
│  │  └───────────────┘    └───────────────┘    └─────────────────────────────────┘  │    │
│  │                                               │                                 │    │
│  │                                               ▼                                 │    │
│  │                                       ┌───────────────┐                         │    │
│  │                                       │ DPU-local     │                         │    │
│  │                                       │ NVMe / OBJ    │                         │    │
│  │                                       └───────────────┘                         │    │
│  └─────────────────────────────────────────────────────────────────────────────────┘    │
└─────────────────────────────────────────────────────────────────────────────────────────┘

DPU Agent

The DPU agent is the piece that executes the offload. It runs as a service on the BlueField DPU and is intentionally separate from the NIXL library so it can evolve independently.

Responsibilities:

  • Import the host GPU mmap from the PCI export descriptor sent by the plugin.
  • Maintain a reusable DPU-side staging buffer.
  • Execute chunked, pipelined DOCA DMA with configurable queue depth.
  • Forward received data to a NIXL storage backend running on the DPU, which in turn writes to local files or object storage.

Build and run instructions are in dpu-agent/README.md.

NIXL Plugin

The host plugin implements the NIXL nixlBackendEngine interface. It exposes two memory types:

  • VRAM_SEG β€” Host GPU memory, exported via doca_mmap_export_pci().
  • OBJ_SEG β€” DPU-resident object/file, identified by a path or key string.

The backend is local-only (supportsRemote() == false): both the GPU and the DPU must be reachable through the same host-side BlueField PCI function.

Because NIXL loads backends dynamically, the plugin source is injected into a NIXL source tree with scripts/patch_nixl.sh and built together with NIXL.

LMCache Integration

examples/lmcache/ contains:

  • lmcache_integration.patch β€” modifications to LMCache v0.4.3 to recognize and use the DOCA_DMA_PROXY backend.
  • lmcache-config.yaml β€” sample configuration.
  • patch_lmcache.sh β€” helper that applies the patch idempotently.

After patching LMCache, you can configure a storage backend that points to the DPU agent and offload KV tensors transparently.

Repository Layout

.
├── common/              # Shared host-DPU control channel + wire protocol (dma_transfer.h)
├── nixl-plugin/         # NIXL backend plugin source (patch into NIXL)
├── dpu-agent/           # BlueField DPU proxy service
├── examples/
│   ├── cpp/             # NIXL C++ example
│   ├── python/          # NIXL Python example
│   ├── standalone/      # Standalone host test tool (no NIXL required)
│   └── lmcache/         # LMCache v0.4.3 integration patch
├── scripts/             # patch_nixl.sh and build helpers
├── docs/                # Architecture and integration docs
├── CMakeLists.txt
├── LICENSE
└── CONTRIBUTING.md

Quick Start

1. Build the DPU Agent

On the BlueField DPU:

export DOCA_DIR=/opt/mellanox/doca
export NIXL_ROOT=/opt/nvidia/nvda_nixl

mkdir -p build && cd build
cmake .. -DBUILD_EXAMPLES=OFF
make -j$(nproc) dpu_dma_copy

Run the agent (TCP fallback mode for the easiest first test):

./dpu-agent/dpu_dma_copy -p 0000:03:00.0 -m 256 -q 4 -b posix -T

Omit -T to use DOCA Comch mode.

2. Patch NIXL with the Plugin

On the host where NIXL is built:

./scripts/patch_nixl.sh /path/to/nixl/source

cd /path/to/nixl/source
meson setup build -Denable_plugins=DOCA_DMA_PROXY
ninja -C build

The patch script is idempotent; running it multiple times is safe.

3. Run the Python Example

export NIXL_PLUGIN_DIR=/opt/nvidia/nvda_nixl/lib/plugins

python3 examples/python/nixl_doca_dma_proxy_example.py \
    -o push \
    -p 0000:ba:00.0 \
    -g 0 \
    -f /data/test_obj \
    -s 64 \
    -d 10.75.70.125 \
    -m tcp

See examples/python/README.md for push/pull examples and COMCH-mode usage.

Compatibility

This project has been verified against NIXL v1.1.0. Other NIXL versions may require minor adjustments to scripts/patch_nixl.sh or the plugin source.

Documentation

Troubleshooting

NIXL build fails with fatal error: toml++/toml.hpp: No such file or directory

NIXL 1.1.0 uses tomlplusplus as a required dependency. If the telemetry plugin is enabled, the include path may not be propagated correctly.

Disable telemetry plugins before building:

cd /path/to/nixl/source
sed -i "s/^subdir('telemetry')/# subdir('telemetry')/" src/plugins/meson.build

meson setup build --wipe -Denable_plugins=DOCA_DMA_PROXY
ninja -C build
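The same edit can be expressed in Python, which may be convenient inside a build script. A minimal sketch, assuming the `subdir('telemetry')` line from the `sed` command above; `disable_telemetry` and `patch_meson_file` are hypothetical helper names. Lines that are already commented out do not match, so running it repeatedly leaves the file unchanged.

```python
import re
from pathlib import Path

# Matches an uncommented "subdir('telemetry')" at the start of a line,
# mirroring the pattern used by the sed command.
_TELEMETRY_LINE = re.compile(r"^subdir\('telemetry'\)", flags=re.MULTILINE)


def disable_telemetry(meson_text: str) -> str:
    """Return meson_text with the telemetry subdir line commented out."""
    return _TELEMETRY_LINE.sub("# subdir('telemetry')", meson_text)


def patch_meson_file(path: str) -> bool:
    """Rewrite the file in place; return True if anything changed."""
    meson_file = Path(path)
    original = meson_file.read_text()
    patched = disable_telemetry(original)
    if patched != original:
        meson_file.write_text(patched)
        return True
    return False
```

For example, `patch_meson_file("src/plugins/meson.build")` run from the NIXL source directory would apply the change once and report `False` on later runs.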

Could not find nvcc, please set CUDAToolkit_ROOT

The C++ examples require CUDA Toolkit. On a machine without CUDA, disable examples:

cmake .. -DBUILD_EXAMPLES=OFF
make dpu_dma_copy

Or build the DPU agent directly from the dpu-agent/ directory:

cd dpu-agent
./scripts/build_dpu.sh

DOCA_DMA_PROXY plugin not found at runtime

Set the plugin search path:

export LD_LIBRARY_PATH=/opt/nvidia/nvda_nixl/lib/plugins:$LD_LIBRARY_PATH

Or in Python/C++ code:

agent.add_plugin_directory("/opt/nvidia/nvda_nixl/lib/plugins")

If NIXL was built with -Dstatic_plugins=DOCA_DMA_PROXY, the plugin is linked into libnixl.so and no search path is needed.
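Before changing environment variables, it can help to confirm that a plugin file is actually present in the directory. The sketch below lists shared objects whose file name mentions the backend name. `find_plugin_files` is a hypothetical helper, and the `libplugin_<NAME>.so` naming mentioned in the comment is an assumption about how NIXL names plugin libraries, so the check matches on the backend name only.

```python
from pathlib import Path
from typing import List


def find_plugin_files(plugin_dir: str, backend: str = "DOCA_DMA_PROXY") -> List[str]:
    """Return sorted names of shared objects in plugin_dir that mention backend.

    Assumption: plugin libraries carry the backend name in their file name
    (for example libplugin_<NAME>.so). An empty list means nothing matched
    or the directory does not exist.
    """
    directory = Path(plugin_dir)
    if not directory.is_dir():
        return []
    return sorted(p.name for p in directory.glob("*.so*") if backend in p.name)
```

An empty result for `/opt/nvidia/nvda_nixl/lib/plugins` would suggest the plugin was not built or was installed elsewhere, which no search-path setting can fix.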

doca_dma.h not found

DOCA SDK is not installed or DOCA_DIR is incorrect:

cmake .. -DDOCA_DIR=/opt/mellanox/doca

Verify that /opt/mellanox/doca/include/doca_dma.h exists.
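The header check can be scripted as a small pre-flight step. A minimal sketch: the function names are hypothetical, and falling back to the `DOCA_DIR` environment variable and then to `/opt/mellanox/doca` is an assumption that mirrors the paths used in this README.

```python
import os
from pathlib import Path
from typing import Optional

DEFAULT_DOCA_DIR = "/opt/mellanox/doca"  # default location used in this README


def doca_header_path(doca_dir: Optional[str] = None) -> Path:
    """Build the expected path of doca_dma.h under the DOCA install root."""
    root = doca_dir or os.environ.get("DOCA_DIR", DEFAULT_DOCA_DIR)
    return Path(root) / "include" / "doca_dma.h"


def has_doca_dma_header(doca_dir: Optional[str] = None) -> bool:
    """Return True if doca_dma.h exists as a regular file."""
    return doca_header_path(doca_dir).is_file()
```

If `has_doca_dma_header()` returns `False`, either the DOCA SDK is missing or `DOCA_DIR` points at the wrong directory.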

License

Apache-2.0. See LICENSE.
