
fix(GgufInsights): correct KV cache VRAM estimate for quantized types #608

Open

andreinknv wants to merge 1 commit into withcatai:master from andreinknv:fix/kv-cache-vram-estimator-quant

Conversation

@andreinknv

Summary

The KV cache size estimator in GgufInsights.estimateContextResourceRequirements (and its helper _estimateKVCacheMemorySizeInBytes) overestimates by ~32× for block-quantized KV cache types (Q4_0, Q5_0, Q6_K, Q8_0, etc.). The cause: getTypeSizeForGgmlType() returns bytes per block, not bytes per element. For block-quantized types one block holds 32 elements, so the per-element cost is 32× smaller than what the estimator currently uses.

Why this matters

The overestimate trips the VRAM rejection branch in GgufInsightsConfigurationResolver.resolveConfigForUsage. Valid configurations with quantized KV cache (a common deployment trick for models that hover near the VRAM ceiling at FP16) get refused with a "not enough VRAM" result.

Example: a 7B model at 8192 ctx with experimentalDefaultContextKvCacheKeyType: 'Q8_0' should fit comfortably on hardware that handles FP16 at 4096 ctx. The estimator instead reports the Q8 cache as larger than the FP16 cache and rejects it.
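
For scale, a back-of-envelope check (the architecture numbers below are my assumption of a Llama-style 7B — 32 layers, 4096 hidden dimension, no GQA — not values read from the estimator):

// Illustrative arithmetic for the 7B / 8192-ctx scenario above.
// Assumed dims (not from the model file): 32 layers, 4096 hidden, MHA.
const layers = 32, hidden = 4096, ctx = 8192;
const kvElements = 2 * layers * hidden * ctx; // K + V ≈ 2.15e9 elements

const f16Bytes = kvElements * 2;        // 4 GiB
const q8Fixed = kvElements * (34 / 32); // ≈ 2.1 GiB — about half of F16
const q8Buggy = kvElements * 34;        // 68 GiB — hence the VRAM rejection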

The bug

addonGetTypeSizeForGgmlType (llama/addon/addon.cpp:76) calls ggml_type_size(), which is documented as "block size in bytes":

// returns bytes per BLOCK — 34 for Q8_0, not 34/32 per element
const auto typeSize = ggml_type_size(static_cast<ggml_type>(ggmlType));

For F16: blockSize=1, so blockBytes=2, per-element=2. Correct.
For Q8_0: blockSize=32, so blockBytes=34, per-element=34/32 ≈ 1.0625. The current estimator multiplies element count by 34 instead of ~1.0625.
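
The per-element conversion the estimator needs is a one-line division (illustrative sketch, not the library's API; the numbers mirror the cases above):

// Per-element cost derived from ggml's per-block metadata.
function bytesPerElement(blockBytes: number, blockSize: number): number {
    return blockBytes / Math.max(1, blockSize);
}

console.log(bytesPerElement(2, 1));   // F16  -> 2
console.log(bytesPerElement(34, 32)); // Q8_0 -> 1.0625 (the buggy path used 34 directly)
console.log(bytesPerElement(18, 32)); // Q4_0 -> 0.5625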

The other existing consumer of this binding (calculateTensorSize for general tensor sizing in the same file, ~line 827) already pairs getTypeSizeForGgmlType with getBlockSizeForGgmlType correctly. The KV-cache estimator was the only path missing the block-size division.

Fix

In the KV-cache estimator, also fetch getBlockSizeForGgmlType for each KV type, then divide:

// keyTypeSize / valueTypeSize come from getTypeSizeForGgmlType() (bytes per block);
// keyBlockSize / valueBlockSize come from getBlockSizeForGgmlType() (elements per block)
const keyBytesPerElement = keyTypeSize / Math.max(1, keyBlockSize);
const valueBytesPerElement = valueTypeSize / Math.max(1, valueBlockSize);
// ...
const gpuKVCacheSize = usingGpu
    ? ((gpuKvElementsK * keyBytesPerElement) + (gpuKvElementsV * valueBytesPerElement))
    : 0;

For F16 / F32 (blockSize=1) the division is a no-op — no behavior change on the most common configs.
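
Putting it together, a minimal self-contained sketch of the corrected arithmetic (all names here are hypothetical — the real estimator reads these sizes through the addon bindings and also splits elements across GPU and CPU layers):

// Minimal sketch of the fixed KV-cache estimate; names are illustrative.
type GgmlTypeInfo = {blockBytes: number, blockSize: number};

function estimateKvCacheBytes(
    keyType: GgmlTypeInfo,
    valueType: GgmlTypeInfo,
    kvElementsPerTensor: number // layers × kv-dim × context size
): number {
    const keyBytesPerElement = keyType.blockBytes / Math.max(1, keyType.blockSize);
    const valueBytesPerElement = valueType.blockBytes / Math.max(1, valueType.blockSize);
    return kvElementsPerTensor * (keyBytesPerElement + valueBytesPerElement);
}

// Q8_0 for both K and V in the 7B / 8k example: ≈ 2.28e9 bytes (~2.1 GiB)
const q8: GgmlTypeInfo = {blockBytes: 34, blockSize: 32};
console.log(estimateKvCacheBytes(q8, q8, 32 * 4096 * 8192));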

Test plan

  • Manual: estimator output for Q8_0 KV at 8192 ctx now matches the actual allocation reported by llama.cpp on context creation (within ~5%, the residual being the graph-overhead estimate which is intentionally rough per the comment at line 232).
  • F16 / F32 estimates unchanged.
  • CI on this PR.

Related

This is paired with #607 (resolveGgmlTypeOption case-insensitivity). Together they unlock a real workflow: pass experimentalKvCacheKeyType: 'q8_0' from a config file and have it both take effect and pass the VRAM check.
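
A hedged sketch of that workflow (getLlama/loadModel/createContext are node-llama-cpp's real entry points, but the placement of the experimental KV-type option is my guess from its name — verify against the actual option typings):

// Hypothetical usage — the experimental option name is copied from the
// summary above; where it is accepted may differ from this sketch.
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf",
    // lowercase now resolves thanks to #607; placement here is an assumption
    experimentalDefaultContextKvCacheKeyType: "q8_0"
});
const context = await model.createContext({contextSize: 8192}); // passes the VRAM check after this fix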

🤖 Generated with Claude Code

The KV cache size estimator multiplies element count by the result of
`getTypeSizeForGgmlType()`. That binding wraps llama.cpp's
`ggml_type_size()`, which returns bytes per BLOCK — not bytes per
element. For block-quantized types (Q4_0, Q5_0, Q8_0, etc.) one block
holds 32 elements, so the per-element cost is 32× smaller than the
block size.

Before this fix:
  - Q8_0 KV cache: estimate is 32× too large (block size 34, true
    bytes/element ≈ 1.0625)
  - Q4_0 KV cache: estimate is 32× too large (block size 18, true
    bytes/element ≈ 0.5625)
  - F16 / F32: correct (block size = 1, no scaling)

The overestimate trips the VRAM rejection path
(`GgufInsightsConfigurationResolver.resolveConfigForUsage`), so valid
configurations with quantized KV (e.g. Q8_0 at 8k context on a model
that easily fits with FP16 + 8k) get refused with a "not enough VRAM"
result.

Fix: also fetch `getBlockSizeForGgmlType()` for each KV type and
compute `keyBytesPerElement = blockBytes / blockSize`. The other
existing consumer of these bindings (`calculateTensorSize` for general
tensor size estimation, lines 827+) already uses both functions
together — the KV-cache estimator was the only path missing the
block-size division.

For F16 / F32 (blockSize=1) the division is a no-op so no behavior
changes there.