OpenVINO backend: GPU compile-time memory optimizations by cavusmustafa · Pull Request #236 · ravi9/llama.cpp

cavusmustafa · 2026-07-01T19:57:54Z

Reduces host memory use of the OpenVINO GPU backend. All features are opt-in (default off) — behavior is unchanged unless the flags are set.

Changes

GGML_OPENVINO_RELEASE_WEIGHTS — drop host weight copies after compile once the plugin holds its device copy (madvise(MADV_DONTNEED)).
Stream weight requantization to cut the compile-time RSS peak.
Skip redundant token_embd requantization at compile.
GGML_OPENVINO_REDUCE_COMPILE_MEM — gate for the compile-memory optimizations.
GGML_OPENVINO_MODEL_CACHE_DIR — frontend model cache; skips requant + convert + compile on repeat runs. Auto-creates the cache dir.

Results (Llama-3.2-1B / Meta-Llama-3.1-8B Q4_K_M, Intel Arc GPU)

Steady-state host RSS: 1B 1467→971 MB, 8B 4684→643 MB.
Perplexity unchanged vs. baseline (PPL 1.0369); output verified correct.

Scope: OpenVINO backend only. Verified with llama-bench, llama-cli, and llama-perplexity.

…ht RSS on GPU

The OpenVINO weight Constants are zero-copy views into host buffers allocated by the backend (ggml_aligned_malloc, anonymous memory). On GPU the plugin holds its own device copy after compile_model, so these host pages are dead weight for inference. For a 1B Q4_K_M model this leaves ~850 MB of host RSS resident that the GPU path never reads again.

Add an opt-in GGML_OPENVINO_RELEASE_WEIGHTS mode that madvise(MADV_DONTNEED)s the registered host weight buffers once the model is compiled, dropping their resident pages while keeping the mappings valid (ggml still owns the lifetime; tensors still point in). Measured steady-state RSS drops from ~1555 MB to ~710 MB on Llama-3.2-1B-Q4_K_M (Arc iGPU) with unchanged throughput and correct output.

The GPU backend uses a single dynamic-shape model for both prefill and decode, so a graph is compiled once and reused; the only event that forces a recompile is clear_caches() on backend teardown. The change therefore:

- releases on the first cache-hit (model compiled, plugin has its copy);
- pins the compiled-model cache across backend teardown so a later context reuses it instead of recompiling against the dropped pages;
- fails loud (GGML_ABORT) on a cache-miss recompile or on a second model load, both of which would otherwise read zeroed weights or silently reuse the wrong compiled graph.

Scope/limitations (all fail loud, never silently wrong): GPU only (the CPU plugin reads the host Constants at inference time), one model per process, and stable graph shapes. This reduces steady-state RSS, not the transient compile-time peak. All changes are confined to the OpenVINO backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
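The release step described in this commit can be illustrated with a minimal sketch. It is not the backend's code: `release_resident_pages` is a hypothetical helper, and the inward page rounding is an assumption made here so that only pages lying entirely inside the buffer are dropped. On Linux, private anonymous pages released this way read back as zeros on the next access, which is why the commit aborts on any path that would re-read the host weights.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <sys/mman.h>
#include <unistd.h>

// Hypothetical helper (not from the PR): drop the resident pages backing
// [addr, addr + size) while leaving the mapping itself valid.
// The range is shrunk inward to page boundaries, so a partial page shared
// with a neighbouring allocation is never released.
bool release_resident_pages(void * addr, size_t size) {
    assert(addr != nullptr);
    const uintptr_t page  = (uintptr_t) sysconf(_SC_PAGESIZE);
    const uintptr_t begin = ((uintptr_t) addr + page - 1) & ~(page - 1);  // round up
    const uintptr_t end   = ((uintptr_t) addr + size) & ~(page - 1);      // round down
    if (end <= begin) {
        return true;  // no whole page inside the range, nothing to release
    }
    return madvise((void *) begin, end - begin, MADV_DONTNEED) == 0;
}
```

Pointers into the buffer stay dereferenceable after the call; only the contents are gone, so the caller has to guarantee that nothing reads the weights again.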

…SS peak

requantize_to_buffers() dequantized the entire tensor to a temporary std::vector<float> of n_elements before requantizing. For token_embd.weight (128256 x 2048) that transient is ~1 GB (1B model) / ~2 GB (8B), and it is the single largest contributor to the OpenVINO compile-time memory peak -- it also fires twice for token_embd (once at load, once at graph build, because token_embd is loaded via a CPU/mmap buffer and not cached as an OV weight extra).

Stream the dequant instead: process a fixed window of complete rows (CHUNK_ROWS=256) into a small scratch buffer and quantize/convert each chunk straight into the output buffers. The transient F32 footprint is now CHUNK_ROWS*ne0 floats regardless of tensor size. quantize_q8_0/q8_1 gain an optional block_offset arg (default 0) so a chunk writes its weights/scales/zp at the correct block.

Streaming is applied to the Q8_0_C / Q8_1_C / F16 targets (the large requant cases); the u4 (Q4_0) path keeps the whole-array call because it packs two weights per byte with running zp ORs, and a fallback handles any future target whose block size does not divide a row.

Measured peak RSS (cold compile, GPU): 1B 2868 -> 1809 MB (-1.06 GB); 8B 11618 -> 9608 MB (-2.0 GB). Output verified unchanged ("capital of France is Paris"); throughput unchanged.

Unlike GGML_OPENVINO_RELEASE_WEIGHTS this reduces the transient peak, not just steady-state, and needs no env flag. All changes confined to the OpenVINO backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
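The chunked scheme above can be sketched as follows. This is a simplified illustration, not the backend's requantize_to_buffers(): `QuantOut`, `quantize_blocks`, `DequantRows`, and `requantize_streamed` are hypothetical names, and the target is a plain symmetric int8 format with one scale per 32 weights (no zero points). What it shows is the role of the block offset: each chunk of complete rows is quantized straight into its final position, so the F32 scratch never exceeds chunk_rows * ne0 floats.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <functional>
#include <vector>

constexpr size_t BLOCK = 32;  // assumed block size for this sketch

struct QuantOut {
    std::vector<int8_t> weights;  // one int8 per element
    std::vector<float>  scales;   // one scale per block
};

// Quantize n floats (a whole number of blocks) into out, starting at block_offset.
void quantize_blocks(const float * src, size_t n, QuantOut & out, size_t block_offset = 0) {
    assert(n % BLOCK == 0);
    for (size_t b = 0; b < n / BLOCK; ++b) {
        const float * x = src + b * BLOCK;
        float amax = 0.0f;
        for (size_t i = 0; i < BLOCK; ++i) amax = std::max(amax, std::fabs(x[i]));
        const float d   = amax / 127.0f;
        const float inv = d != 0.0f ? 1.0f / d : 0.0f;
        out.scales[block_offset + b] = d;
        for (size_t i = 0; i < BLOCK; ++i) {
            out.weights[(block_offset + b) * BLOCK + i] = (int8_t) std::lround(x[i] * inv);
        }
    }
}

// Callback that dequantizes rows [row0, row0 + n_rows) into dst as F32.
using DequantRows = std::function<void(size_t row0, size_t n_rows, float * dst)>;

// Stream the requantization: the scratch buffer holds at most chunk_rows rows.
QuantOut requantize_streamed(const DequantRows & dequant, size_t n_rows, size_t ne0,
                             size_t chunk_rows = 256) {
    assert(ne0 % BLOCK == 0 && chunk_rows > 0);
    QuantOut out{std::vector<int8_t>(n_rows * ne0), std::vector<float>(n_rows * ne0 / BLOCK)};
    std::vector<float> scratch(std::min(chunk_rows, n_rows) * ne0);
    for (size_t r = 0; r < n_rows; r += chunk_rows) {
        const size_t rows = std::min(chunk_rows, n_rows - r);
        dequant(r, rows, scratch.data());
        quantize_blocks(scratch.data(), rows * ne0, out, r * ne0 / BLOCK);
    }
    return out;
}
```

Because blocks never straddle a row in this sketch (ne0 is a multiple of the block size), the streamed result matches a single whole-array call. A target whose block size does not divide a row would need a whole-array fallback like the one the commit mentions.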

token_embd.weight is referenced twice in the graph path: as the GET_ROWS embedding (a CPU/mmap-buffer tensor) it was re-extracted/re-requantized on every weight-node build, and is_model_splitted() built a full (naive) set of weight nodes just to test name membership. Each requant is a ~1-2 GB F32 dequant of the 262M-element embedding.

Two changes:

- Add collect_weight_names(): a name-only collector for topology checks. is_model_splitted() now uses it instead of create_weight_nodes(cgraph, true), so the splitted-check no longer triggers any weight extraction.
- Memoize weight nodes built from non-OpenVINO buffers in a process-lifetime cache keyed by tensor->data. These tensors have no OV buffer context to own a cached extra, so without this they were rebuilt on every (re)compile; prefill and decode graphs now share one build (verified: 2nd graph hits the cache instead of re-requantizing).

Peak RSS is unchanged (the streaming-requant commit already removed the F32 transient); this removes redundant compile-time work. Output verified unchanged ("capital of France is Paris"). Confined to the OpenVINO backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
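The two changes can be sketched in isolation. Everything here is a stand-in: `Tensor` and `WeightNode` are hypothetical simplifications of ggml_tensor and an OpenVINO Constant node, `get_or_build_weight_node` is an illustrative name, and the signature of `collect_weight_names` is assumed. Keying the cache by the data pointer only works if the underlying buffer outlives the cache and is never reused for different contents, which is an assumption of this sketch.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>
#include <unordered_set>
#include <vector>

// Hypothetical stand-ins for ggml_tensor and an OpenVINO weight node.
struct Tensor     { std::string name; const void * data; bool is_weight; };
struct WeightNode { std::string name; };

// Name-only collector: enough for membership tests, touches no weight data.
std::unordered_set<std::string> collect_weight_names(const std::vector<Tensor> & graph) {
    std::unordered_set<std::string> names;
    for (const Tensor & t : graph) {
        if (t.is_weight) names.insert(t.name);
    }
    return names;
}

using BuildFn = std::function<std::shared_ptr<WeightNode>(const Tensor &)>;

// Process-lifetime memo keyed by the tensor's data pointer: the expensive
// build (extract + requantize) runs once per distinct buffer.
std::shared_ptr<WeightNode> get_or_build_weight_node(const Tensor & t, const BuildFn & build) {
    static std::mutex mtx;
    static std::unordered_map<const void *, std::shared_ptr<WeightNode>> cache;
    assert(t.data != nullptr);
    std::lock_guard<std::mutex> lock(mtx);
    auto it = cache.find(t.data);
    if (it != cache.end()) {
        return it->second;  // later graphs reuse the first build
    }
    auto node = build(t);
    cache.emplace(t.data, node);
    return node;
}
```

With this shape, a second graph that references the same embedding buffer gets the cached node back instead of triggering another requantization.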

…_REDUCE_COMPILE_MEM

The streaming requantization and the non-OpenVINO-buffer weight-node cache (plus the name-only is_model_splitted path that pairs with it) are now opt-in via GGML_OPENVINO_REDUCE_COMPILE_MEM. When unset, requantize_to_buffers() fully materializes the F32 buffer and weights are rebuilt per compile exactly as before; when set, the streaming path and the cross-compile weight cache are used. Default off keeps behavior identical to upstream unless explicitly enabled.

Verified: flag off -> peak RSS 2800 MB (original), flag on -> 1810 MB; output "capital of France is Paris" in both modes. (GGML_OPENVINO_RELEASE_WEIGHTS, added earlier, remains a separate opt-in for the steady-state release.)

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
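A minimal sketch of such a gate is shown below. The commit does not say how the backend parses the variable, so the rule used here (any non-empty value enables the flag) is an assumption, and `env_flag_enabled`, `RequantPath`, and `select_requant_path` are hypothetical names introduced for illustration.

```cpp
#include <cassert>
#include <cstdlib>
#include <stdlib.h>  // setenv/unsetenv (POSIX), used when exercising the gate

// Assumed rule for this sketch: a flag is on when the variable is set to any
// non-empty value. The backend's actual parsing may differ.
bool env_flag_enabled(const char * name) {
    assert(name != nullptr);
    const char * value = std::getenv(name);
    return value != nullptr && value[0] != '\0';
}

enum class RequantPath { FullMaterialize, Streamed };

// Default off: without the flag the original whole-tensor path is selected.
RequantPath select_requant_path() {
    return env_flag_enabled("GGML_OPENVINO_REDUCE_COMPILE_MEM") ? RequantPath::Streamed
                                                                 : RequantPath::FullMaterialize;
}
```

Reading the flag at the point of use keeps the default path byte-for-byte identical to the ungated code when the variable is absent.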

The plugin-level ov::cache_dir caches the compiled blob keyed by the OV model, but producing that model still runs the full frontend every time: weight requantization (incl. the large token_embd F32 transient) and the ggml->OV graph conversion. This adds an opt-in frontend cache keyed off a fingerprint computed directly from the ggml cgraph, so a hit imports a previously exported CompiledModel and skips requant + convert + compile entirely.

Key (model-cache.{h,cpp}) = 64-bit FNV-1a of:

- graph topology (n_nodes + per node op/name);
- a sampled per-weight fingerprint (name/shape/type + bounded head+tail byte sample);
- blob-affecting config (device, flash-attn, rope params, REDUCE_COMPILE_MEM/stateful flags, OpenVINO version).

A sidecar manifest stores every weight's fingerprint and is re-verified on load, so a sampled-hash collision cannot cause a wrong-model hit (verified: two different quantizations of the same model produce distinct cache entries).

Flow (dynamic single-model path only; split models defer to ov::cache_dir): on a verified hit, core.import_model() restores the CompiledModel and a lightweight decoder is built with a names-only weight map (membership is all the decoder needs for I/O mapping; weights live in the imported model). On a miss, compile as usual then export the blob (atomic temp+rename, manifest written first).

The frontend cache supersedes ov::cache_dir, so CACHE_DIR/CACHE_MODE are stripped from the config used for the cached compile and the import, because a blob compiled with cache_dir set cannot be re-imported.

Measured 8B Q4_K_M (GPU): full requant+convert+compile 15.3s -> import 6.3s (~2.4x faster compile phase). Output verified unchanged on cold and warm, standalone and combined with REDUCE_COMPILE_MEM + RELEASE_WEIGHTS. Default off; confined to the OpenVINO backend.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
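The sampled per-weight fingerprint can be sketched as below. The FNV-1a constants are the standard 64-bit ones; everything else is an assumption for illustration: `weight_fingerprint` is a hypothetical function, the 4096-byte sample size is arbitrary, and the field order is not taken from model-cache.cpp.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr uint64_t FNV_OFFSET = 14695981039346656037ull;  // standard 64-bit offset basis
constexpr uint64_t FNV_PRIME  = 1099511628211ull;         // standard 64-bit prime

// 64-bit FNV-1a; h lets several fields be chained into one running hash.
uint64_t fnv1a(const void * data, size_t n, uint64_t h = FNV_OFFSET) {
    assert(data != nullptr || n == 0);
    const unsigned char * p = static_cast<const unsigned char *>(data);
    for (size_t i = 0; i < n; ++i) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

// Hypothetical per-weight fingerprint: name, shape, type, plus a bounded
// sample of the first and last `sample` bytes of the weight data.
uint64_t weight_fingerprint(const std::string & name, const std::array<int64_t, 4> & shape,
                            int type, const void * data, size_t nbytes, size_t sample = 4096) {
    uint64_t h = fnv1a(name.data(), name.size());
    h = fnv1a(shape.data(), shape.size() * sizeof(int64_t), h);
    h = fnv1a(&type, sizeof(type), h);
    const unsigned char * bytes = static_cast<const unsigned char *>(data);
    const size_t head = std::min(sample, nbytes);
    h = fnv1a(bytes, head, h);
    const size_t tail = std::min(sample, nbytes - head);  // never overlaps the head
    h = fnv1a(bytes + nbytes - tail, tail, h);
    return h;
}
```

Sampling bounds the hashing cost for multi-gigabyte tensors, at the price of not seeing bytes in the middle of a weight: a change confined to the unsampled region leaves this fingerprint unchanged.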

Mustafa Cavus and others added 5 commits July 1, 2026 23:05

cavusmustafa wants to merge 5 commits into ravi9:dev_backend_openvino from cavusmustafa:ov-mem-optimizations
