
fix(GgufInsights): correct KV cache VRAM estimate for quantized types #608

Open

andreinknv wants to merge 1 commit into withcatai:master from andreinknv:fix/kv-cache-vram-estimator-quant

Conversation

@andreinknv

Summary

The KV cache size estimator in GgufInsights.estimateContextResourceRequirements (and its helper _estimateKVCacheMemorySizeInBytes) overestimates by ~32× for block-quantized KV cache types (Q4_0, Q5_0, Q6_K, Q8_0, etc.). The cause: getTypeSizeForGgmlType() returns bytes per block, not bytes per element. For block-quantized types one block holds 32 elements, so the per-element cost is 32× smaller than what the estimator currently uses.

Why this matters

The overestimate trips the VRAM rejection branch in GgufInsightsConfigurationResolver.resolveConfigForUsage. Valid configurations with quantized KV cache (a common deployment trick for models that hover near the VRAM ceiling at FP16) get refused with a "not enough VRAM" result.

Example: a 7B model at 8192 ctx with experimentalDefaultContextKvCacheKeyType: 'Q8_0' should fit comfortably on hardware that handles FP16 at 4096 ctx. The estimator instead reports the Q8 cache as larger than the FP16 cache and rejects it.
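
For scale, a back-of-envelope check (the architecture numbers below are my assumption of a Llama-style 7B — 32 layers, 4096 hidden dimension, no GQA — not values read from the estimator):

// Illustrative arithmetic for the 7B / 8192-ctx scenario above.
// Assumed dims (not from the model file): 32 layers, 4096 hidden, MHA.
const layers = 32, hidden = 4096, ctx = 8192;
const kvElements = 2 * layers * hidden * ctx; // K + V ≈ 2.15e9 elements

const f16Bytes = kvElements * 2;        // 4 GiB
const q8Fixed = kvElements * (34 / 32); // ≈ 2.1 GiB — about half of F16
const q8Buggy = kvElements * 34;        // 68 GiB — hence the VRAM rejection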

The bug

addonGetTypeSizeForGgmlType (llama/addon/addon.cpp:76) calls ggml_type_size(), which is documented as "block size in bytes":

// returns bytes per BLOCK — 34 for Q8_0, not 34/32 per element
const auto typeSize = ggml_type_size(static_cast<ggml_type>(ggmlType));

For F16: blockSize=1, so blockBytes=2, per-element=2. Correct.
For Q8_0: blockSize=32, so blockBytes=34, per-element=34/32 ≈ 1.0625. The current estimator multiplies element count by 34 instead of ~1.0625.
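
The per-element conversion the estimator needs is a one-line division (illustrative sketch, not the library's API; the numbers mirror the cases above):

// Per-element cost derived from ggml's per-block metadata.
function bytesPerElement(blockBytes: number, blockSize: number): number {
    return blockBytes / Math.max(1, blockSize);
}

console.log(bytesPerElement(2, 1));   // F16  -> 2
console.log(bytesPerElement(34, 32)); // Q8_0 -> 1.0625 (the buggy path used 34 directly)
console.log(bytesPerElement(18, 32)); // Q4_0 -> 0.5625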

The other existing consumer of this binding (calculateTensorSize for general tensor sizing in the same file, ~line 827) already pairs getTypeSizeForGgmlType with getBlockSizeForGgmlType correctly. The KV-cache estimator was the only path missing the block-size division.

Fix

In the KV-cache estimator, also fetch getBlockSizeForGgmlType for each KV type, then divide:

// keyTypeSize / valueTypeSize come from getTypeSizeForGgmlType() (bytes per block);
// keyBlockSize / valueBlockSize come from getBlockSizeForGgmlType() (elements per block)
const keyBytesPerElement = keyTypeSize / Math.max(1, keyBlockSize);
const valueBytesPerElement = valueTypeSize / Math.max(1, valueBlockSize);
// ...
const gpuKVCacheSize = usingGpu
    ? ((gpuKvElementsK * keyBytesPerElement) + (gpuKvElementsV * valueBytesPerElement))
    : 0;

For F16 / F32 (blockSize=1) the division is a no-op — no behavior change on the most common configs.
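
Putting it together, a minimal self-contained sketch of the corrected arithmetic (all names here are hypothetical — the real estimator reads these sizes through the addon bindings and also splits elements across GPU and CPU layers):

// Minimal sketch of the fixed KV-cache estimate; names are illustrative.
type GgmlTypeInfo = {blockBytes: number, blockSize: number};

function estimateKvCacheBytes(
    keyType: GgmlTypeInfo,
    valueType: GgmlTypeInfo,
    kvElementsPerTensor: number // layers × kv-dim × context size
): number {
    const keyBytesPerElement = keyType.blockBytes / Math.max(1, keyType.blockSize);
    const valueBytesPerElement = valueType.blockBytes / Math.max(1, valueType.blockSize);
    return kvElementsPerTensor * (keyBytesPerElement + valueBytesPerElement);
}

// Q8_0 for both K and V in the 7B / 8k example: ≈ 2.28e9 bytes (~2.1 GiB)
const q8: GgmlTypeInfo = {blockBytes: 34, blockSize: 32};
console.log(estimateKvCacheBytes(q8, q8, 32 * 4096 * 8192));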

Test plan

  • Manual: estimator output for Q8_0 KV at 8192 ctx now matches the actual allocation reported by llama.cpp on context creation (within ~5%, the residual being the graph-overhead estimate which is intentionally rough per the comment at line 232).
  • F16 / F32 estimates unchanged.
  • CI on this PR.

Related

This is paired with #607 (resolveGgmlTypeOption case-insensitivity). Together they unlock a real workflow: pass experimentalKvCacheKeyType: 'q8_0' from a config file and have it both take effect and pass the VRAM check.
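
A hedged sketch of that workflow (getLlama/loadModel/createContext are node-llama-cpp's real entry points, but the placement of the experimental KV-type option is my guess from its name — verify against the actual option typings):

// Hypothetical usage — the experimental option name is copied from the
// summary above; where it is accepted may differ from this sketch.
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf",
    // lowercase now resolves thanks to #607; placement here is an assumption
    experimentalDefaultContextKvCacheKeyType: "q8_0"
});
const context = await model.createContext({contextSize: 8192}); // passes the VRAM check after this fix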

🤖 Generated with Claude Code

The KV cache size estimator multiplies element count by the result of
`getTypeSizeForGgmlType()`. That binding wraps llama.cpp's
`ggml_type_size()`, which returns bytes per BLOCK — not bytes per
element. For block-quantized types (Q4_0, Q5_0, Q8_0, etc.) one block
holds 32 elements, so the per-element cost is 32× smaller than the
block size.

Before this fix:
  - Q8_0 KV cache: estimate is 32× too large (block size 34, true
    bytes/element ≈ 1.0625)
  - Q4_0 KV cache: estimate is 32× too large (block size 18, true
    bytes/element ≈ 0.5625)
  - F16 / F32: correct (block size = 1, no scaling)

The overestimate trips the VRAM rejection path
(`GgufInsightsConfigurationResolver.resolveConfigForUsage`), so valid
configurations with quantized KV (e.g. Q8_0 at 8k context on a model
that easily fits with FP16 + 8k) get refused with a "not enough VRAM"
result.

Fix: also fetch `getBlockSizeForGgmlType()` for each KV type and
compute `keyBytesPerElement = blockBytes / blockSize`. The other
existing consumer of these bindings (`calculateTensorSize` for general
tensor size estimation, lines 827+) already uses both functions
together — the KV-cache estimator was the only path missing the
block-size division.

For F16 / F32 (blockSize=1) the division is a no-op so no behavior
changes there.