OpenVINO backend: GPU compile-time memory optimizations#236
Draft
cavusmustafa wants to merge 5 commits into
…ht RSS on GPU
The OpenVINO weight Constants are zero-copy views into host buffers
allocated by the backend (ggml_aligned_malloc, anonymous memory). On GPU
the plugin holds its own device copy after compile_model, so these host
pages are dead weight for inference. For a 1B Q4_K_M model this leaves
~850 MB of host RSS resident that the GPU path never reads again.
Add an opt-in GGML_OPENVINO_RELEASE_WEIGHTS mode that madvise(MADV_DONTNEED)s
the registered host weight buffers once the model is compiled, dropping
their resident pages while keeping the mappings valid (ggml still owns the
lifetime; tensors still point in). Measured steady-state RSS drops from
~1555 MB to ~710 MB on Llama-3.2-1B-Q4_K_M (Arc iGPU) with unchanged
throughput and correct output.
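The release step described above can be sketched as follows. This is a minimal, Linux-only illustration, not the backend's code: `release_resident_pages` and `demo_release_reads_zero` are hypothetical names, and the buffer here comes from an anonymous `mmap` rather than `ggml_aligned_malloc`. It shows the property the commit relies on: after `madvise(MADV_DONTNEED)` the mapping stays valid, but a private anonymous region reads back as zeros, which is why any later host-side read of the weights has to be treated as an error.

```cpp
#include <sys/mman.h>  // mmap, madvise, munmap (Linux/POSIX)
#include <cassert>
#include <cstddef>
#include <cstring>

// Hypothetical helper: drop the resident pages of a page-aligned region.
// The mapping itself stays valid; only the physical pages are returned.
inline bool release_resident_pages(void * addr, size_t len) {
    assert(addr != nullptr && len > 0);
    return madvise(addr, len, MADV_DONTNEED) == 0;
}

// Demonstrates the Linux semantics for a private anonymous mapping:
// after the release the old contents are gone and reads see zero pages.
inline bool demo_release_reads_zero() {
    const size_t len = 1u << 20;  // 1 MiB, a multiple of the page size
    void * p = mmap(nullptr, len, PROT_READ | PROT_WRITE,
                    MAP_PRIVATE | MAP_ANONYMOUS, -1, 0);
    if (p == MAP_FAILED) {
        return false;
    }
    unsigned char * bytes = static_cast<unsigned char *>(p);
    std::memset(bytes, 0xAB, len);            // make every page resident
    bool ok = release_resident_pages(p, len);  // RSS drops by about len
    ok = ok && bytes[0] == 0 && bytes[len - 1] == 0;  // still mapped, now zero
    munmap(p, len);
    return ok;
}
```

Because the pointer remains dereferenceable, nothing crashes if a stale reader touches the buffer; it simply sees zeros, so a guard has to detect that case explicitly.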
The GPU backend uses a single dynamic-shape model for both prefill and
decode, so a graph is compiled once and reused; the only event that forces
a recompile is clear_caches() on backend teardown. The change therefore:
- releases on the first cache-hit (model compiled, plugin has its copy);
- pins the compiled-model cache across backend teardown so a later
context reuses it instead of recompiling against the dropped pages;
- fails loud (GGML_ABORT) on a cache-miss recompile or on a second model
load, both of which would otherwise read zeroed weights or silently
reuse the wrong compiled graph.
Scope/limitations (all fail loud, never silently wrong): GPU only (the CPU
plugin reads the host Constants at inference time), one model per process,
and stable graph shapes. This reduces steady-state RSS, not the transient
compile-time peak. All changes are confined to the OpenVINO backend.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…SS peak
requantize_to_buffers() dequantized the entire tensor to a temporary
std::vector<float> of n_elements before requantizing. For token_embd.weight
(128256 x 2048) that transient is ~1 GB (1B model) / ~2 GB (8B), and it is
the single largest contributor to the OpenVINO compile-time memory peak --
it also fires twice for token_embd (once at load, once at graph build,
because token_embd is loaded via a CPU/mmap buffer and not cached as an OV
weight extra).
Stream the dequant instead: process a fixed window of complete rows
(CHUNK_ROWS=256) into a small scratch buffer and quantize/convert each chunk
straight into the output buffers. The transient F32 footprint is now
CHUNK_ROWS*ne0 floats regardless of tensor size.
quantize_q8_0/q8_1 gain an optional block_offset arg (default 0) so a chunk
writes its weights/scales/zp at the correct block. Streaming is applied to
the Q8_0_C / Q8_1_C / F16 targets (the large requant cases); the u4 (Q4_0)
path keeps the whole-array call because it packs two weights per byte with
running zp ORs, and a fallback handles any future target whose block size
does not divide a row.
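The streaming idea and the role of `block_offset` can be sketched as below. This is an illustration under stated assumptions, not the backend's implementation: `quantize_blocks` is a simplified stand-in for `quantize_q8_0` (int8 weights with one float scale per 32-element block, which is not ggml's actual storage layout), and `DequantRows` is a hypothetical callback that produces F32 rows on demand. The scratch buffer holds only `chunk_rows * ne0` floats, and each chunk is written at the block index where it would have landed in a whole-tensor pass.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <vector>

constexpr size_t BLOCK = 32;  // illustrative quantization block size

// Hypothetical callback: dequantize rows [row0, row0 + n_rows) into dst.
using DequantRows = void (*)(size_t row0, size_t n_rows, size_t ne0, float * dst);

// Simplified symmetric int8 quantizer. block_offset selects the first
// output block, so a chunk can write into the middle of the output.
void quantize_blocks(const float * src, size_t n, int8_t * weights,
                     float * scales, size_t block_offset = 0) {
    assert(n % BLOCK == 0);
    for (size_t b = 0; b < n / BLOCK; ++b) {
        const float * in = src + b * BLOCK;
        float amax = 0.0f;
        for (size_t i = 0; i < BLOCK; ++i) {
            amax = std::max(amax, std::fabs(in[i]));
        }
        const float scale = amax / 127.0f;
        const float inv   = scale > 0.0f ? 1.0f / scale : 0.0f;
        const size_t ob   = block_offset + b;  // absolute output block
        scales[ob] = scale;
        for (size_t i = 0; i < BLOCK; ++i) {
            weights[ob * BLOCK + i] = static_cast<int8_t>(std::lround(in[i] * inv));
        }
    }
}

// Process chunk_rows complete rows at a time through a small scratch buffer.
void requantize_streaming(DequantRows dequant, size_t n_rows, size_t ne0,
                          size_t chunk_rows, int8_t * weights, float * scales) {
    assert(ne0 % BLOCK == 0 && chunk_rows > 0);
    std::vector<float> scratch(chunk_rows * ne0);  // the only F32 transient
    for (size_t r = 0; r < n_rows; r += chunk_rows) {
        const size_t rows = std::min(chunk_rows, n_rows - r);
        dequant(r, rows, ne0, scratch.data());
        quantize_blocks(scratch.data(), rows * ne0, weights, scales,
                        r * ne0 / BLOCK);
    }
}
```

Since blocks never straddle a row boundary when the block size divides `ne0`, chunking by whole rows yields the same output as a single whole-tensor call, which is the property the fallback in the commit protects when that condition does not hold.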
Measured peak RSS (cold compile, GPU): 1B 2868 -> 1809 MB (-1.06 GB);
8B 11618 -> 9608 MB (-2.0 GB). Output verified unchanged
("capital of France is Paris"); throughput unchanged. Unlike
GGML_OPENVINO_RELEASE_WEIGHTS this reduces the transient peak, not just
steady-state, and needs no env flag. All changes confined to the OpenVINO
backend.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
token_embd.weight is handled twice on the graph path. As the GET_ROWS
embedding (a CPU/mmap-buffer tensor) it was re-extracted and re-requantized
on every weight-node build, and is_model_splitted() built a full (naive) set
of weight nodes just to test name membership. Each requant is a ~1-2 GB F32
dequant of the embedding (262M elements on the 1B model).
Two changes:
- Add collect_weight_names(): a name-only collector for topology checks.
is_model_splitted() now uses it instead of create_weight_nodes(cgraph,
true), so the split check no longer triggers any weight extraction.
- Memoize weight nodes built from non-OpenVINO buffers in a process-lifetime
cache keyed by tensor->data. These tensors have no OV buffer context to own
a cached extra, so without this they were rebuilt on every (re)compile;
prefill and decode graphs now share one build (verified: 2nd graph hits the
cache instead of re-requantizing).
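The memoization in the second bullet can be sketched as a map from the tensor's data pointer to the built node. This is a hedged illustration only: `WeightNode`, `WeightNodeCache`, and `get_or_build` are hypothetical names, and the real cache stores OpenVINO nodes rather than this placeholder struct.

```cpp
#include <cassert>
#include <cstddef>
#include <functional>
#include <memory>
#include <mutex>
#include <string>
#include <unordered_map>

// Placeholder for whatever the backend builds per weight (assumption).
struct WeightNode {
    std::string name;
    size_t      n_elements;
};

// Process-lifetime cache keyed by the tensor's data pointer.
class WeightNodeCache {
public:
    using Builder = std::function<std::shared_ptr<WeightNode>()>;

    // Returns the cached node for `data`, or runs `build` once and stores it.
    std::shared_ptr<WeightNode> get_or_build(const void * data, const Builder & build) {
        assert(data != nullptr);
        std::lock_guard<std::mutex> lock(mutex_);
        auto it = cache_.find(data);
        if (it != cache_.end()) {
            ++hits_;
            return it->second;  // second graph reuses the first build
        }
        auto node = build();    // the expensive extract/requantize step
        cache_.emplace(data, node);
        return node;
    }

    size_t hits() const {
        std::lock_guard<std::mutex> lock(mutex_);
        return hits_;
    }

private:
    mutable std::mutex mutex_;
    std::unordered_map<const void *, std::shared_ptr<WeightNode>> cache_;
    size_t hits_ = 0;
};
```

Keying by pointer is only sound while the underlying buffer stays alive and is not reused for different data. That appears consistent with the one-model-per-process limitation stated earlier, but it is an assumption of this sketch.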
Peak RSS is unchanged (the streaming-requant commit already removed the F32
transient); this removes redundant compile-time work. Output verified
unchanged ("capital of France is Paris"). Confined to the OpenVINO backend.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…_REDUCE_COMPILE_MEM
The streaming requantization and the non-OpenVINO-buffer weight-node cache
(plus the name-only is_model_splitted path that pairs with it) are now
opt-in via GGML_OPENVINO_REDUCE_COMPILE_MEM. When unset,
requantize_to_buffers() fully materializes the F32 buffer and weights are
rebuilt per compile exactly as before; when set, the streaming path and the
cross-compile weight cache are used. Default off keeps behavior identical to
upstream unless explicitly enabled.
Verified: flag off -> peak RSS 2800 MB (original), flag on -> 1810 MB;
output "capital of France is Paris" in both modes.
(GGML_OPENVINO_RELEASE_WEIGHTS, added earlier, remains a separate opt-in for
the steady-state release.)
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
The plugin-level ov::cache_dir caches the compiled blob keyed by the OV
model, but producing that model still runs the full frontend every time:
weight requantization (incl. the large token_embd F32 transient) and the
ggml->OV graph conversion. This adds an opt-in frontend cache keyed off a
fingerprint computed directly from the ggml cgraph, so a hit imports a
previously exported CompiledModel and skips requant + convert + compile
entirely.
Key (model-cache.{h,cpp}) = 64-bit FNV-1a of: graph topology (n_nodes + per
node op/name), a sampled per-weight fingerprint (name/shape/type + bounded
head+tail byte sample), and blob-affecting config (device, flash-attn, rope
params, REDUCE_COMPILE_MEM/stateful flags, OpenVINO version). A sidecar
manifest stores every weight's fingerprint and is re-verified on load, so a
sampled-hash collision cannot cause a wrong-model hit (verified: two
different quantizations of the same model produce distinct cache entries).
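A sampled fingerprint of the kind described above might look like the following sketch. The FNV-1a constants are the standard 64-bit ones; everything else is an assumption for illustration, including the `weight_fingerprint` name, the inputs it mixes in, and the `sample` default (the commit does not state the window size).

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <string>

constexpr uint64_t FNV_OFFSET = 14695981039346656037ULL;  // 0xcbf29ce484222325
constexpr uint64_t FNV_PRIME  = 1099511628211ULL;         // 0x100000001b3

// Standard 64-bit FNV-1a; `h` allows chaining several inputs into one hash.
inline uint64_t fnv1a(const void * data, size_t len, uint64_t h = FNV_OFFSET) {
    assert(data != nullptr || len == 0);
    const unsigned char * p = static_cast<const unsigned char *>(data);
    for (size_t i = 0; i < len; ++i) {
        h ^= p[i];
        h *= FNV_PRIME;
    }
    return h;
}

// Hypothetical per-weight fingerprint: name, byte size, and a bounded sample
// of the first and last `sample` bytes instead of the whole tensor.
inline uint64_t weight_fingerprint(const std::string & name, const void * data,
                                   size_t nbytes, size_t sample = 4096) {
    uint64_t h = fnv1a(name.data(), name.size());
    h = fnv1a(&nbytes, sizeof(nbytes), h);
    const unsigned char * p = static_cast<const unsigned char *>(data);
    const size_t head = std::min(sample, nbytes);
    h = fnv1a(p, head, h);
    if (nbytes > head) {
        const size_t tail = std::min(sample, nbytes - head);
        h = fnv1a(p + nbytes - tail, tail, h);
    }
    return h;
}
```

The trade-off is visible in the sketch: hashing cost is bounded regardless of tensor size, but bytes outside the head and tail windows do not influence the result, so the sampled hash alone is a heuristic identity check rather than a content hash.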
Flow (dynamic single-model path only; split models defer to ov::cache_dir):
on a verified hit, core.import_model() restores the CompiledModel and a
lightweight decoder is built with a names-only weight map (membership is all
the decoder needs for I/O mapping; weights live in the imported model). On a
miss, compile as usual then export the blob (atomic temp+rename, manifest
written first). The frontend cache supersedes ov::cache_dir, so CACHE_DIR/
CACHE_MODE are stripped from the config used for the cached compile and the
import — a blob compiled with cache_dir set cannot be re-imported.
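The miss path's "atomic temp+rename, manifest written first" step can be sketched with `std::filesystem` (C++17). The file naming (`<key>.manifest`, `<key>.blob`, a `.tmp` suffix) and both function names are hypothetical; the sketch only shows the ordering and the rename pattern, not how the blob is produced.

```cpp
#include <cassert>
#include <filesystem>
#include <fstream>
#include <string>
#include <system_error>

namespace fs = std::filesystem;

// Write to a temporary sibling, then rename over the final path, so a reader
// never observes a partially written file under the final name.
inline bool write_atomically(const fs::path & final_path, const std::string & bytes) {
    assert(!final_path.empty());
    fs::path tmp = final_path;
    tmp += ".tmp";
    {
        std::ofstream out(tmp, std::ios::binary | std::ios::trunc);
        if (!out) {
            return false;
        }
        out.write(bytes.data(), static_cast<std::streamsize>(bytes.size()));
        out.flush();
        if (!out) {
            std::error_code ignore;
            fs::remove(tmp, ignore);
            return false;
        }
    }  // stream closed before the rename
    std::error_code ec;
    fs::rename(tmp, final_path, ec);
    if (ec) {
        std::error_code ignore;
        fs::remove(tmp, ignore);
        return false;
    }
    return true;
}

// Manifest first, blob second: if the blob exists, its manifest does too.
inline bool store_cache_entry(const fs::path & dir, const std::string & key,
                              const std::string & manifest, const std::string & blob) {
    std::error_code ec;
    fs::create_directories(dir, ec);
    if (ec) {
        return false;
    }
    return write_atomically(dir / (key + ".manifest"), manifest) &&
           write_atomically(dir / (key + ".blob"), blob);
}
```

With this ordering, a loader that finds a blob can assume a manifest is available to verify it against, and a crash between the two writes leaves only an orphaned manifest rather than an unverifiable blob.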
Measured 8B Q4_K_M (GPU): full requant+convert+compile 15.3s -> import 6.3s
(~2.4x faster compile phase). Output verified unchanged on cold and warm,
standalone and combined with REDUCE_COMPILE_MEM + RELEASE_WEIGHTS. Default
off; confined to the OpenVINO backend.
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Reduces host memory use of the OpenVINO GPU backend. All features are opt-in (default off) — behavior is unchanged unless the flags are set.
Changes
Results (Llama-3.2-1B / Meta-Llama-3.1-8B Q4_K_M, Intel Arc GPU)
Scope: OpenVINO backend only. Verified with llama-bench, llama-cli, and llama-perplexity.