fix(GgufInsights): correct KV cache VRAM estimate for quantized types #608
Open
andreinknv wants to merge 1 commit into
The KV cache size estimator multiplies element count by the result of
`getTypeSizeForGgmlType()`. That binding wraps llama.cpp's
`ggml_type_size()`, which returns bytes per BLOCK — not bytes per
element. For block-quantized types (Q4_0, Q5_0, Q8_0, etc.) one block
holds 32 elements, so the per-element cost is 32× smaller than the
block size.
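To make the distinction concrete, here is a minimal sketch of the per-element math (the block constants are ggml's type traits; the helper itself is hypothetical, not one of the bindings):

```typescript
// ggml type traits: bytes per block and elements per block.
// F16/F32 are "blocks" of one element; these quantized types pack 32.
const ggmlTypeTraits = {
    F32:  {blockBytes: 4,  blockSize: 1},
    F16:  {blockBytes: 2,  blockSize: 1},
    Q4_0: {blockBytes: 18, blockSize: 32},
    Q8_0: {blockBytes: 34, blockSize: 32}
} as const;

// The per-element cost is blockBytes / blockSize, never blockBytes alone.
function bytesPerElement(type: keyof typeof ggmlTypeTraits): number {
    const {blockBytes, blockSize} = ggmlTypeTraits[type];
    return blockBytes / blockSize;
}

bytesPerElement("F16");  // 2
bytesPerElement("Q8_0"); // 1.0625, not 34
bytesPerElement("Q4_0"); // 0.5625, not 18
```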
Before this fix:
- Q8_0 KV cache: estimate is 32× too large (block size 34, true
bytes/element ≈ 1.0625)
- Q4_0 KV cache: estimate is 32× too large (block size 18, true
bytes/element ≈ 0.5625)
- F16 / F32: correct (block size = 1, no scaling)
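As a rough cross-check of those factors, a sketch with hypothetical 7B-style dimensions (for illustration only; the real estimator derives shapes from the GGUF metadata):

```typescript
// Hypothetical model shape: 32 layers, 4096-wide K and V, 8192 context.
const layers = 32, kvWidth = 4096, contextSize = 8192;

// K and V each store one element per (layer, position, dimension).
const elements = 2 * layers * contextSize * kvWidth; // ≈ 2.15e9

const q8BlockBytes = 34; // bytes per Q8_0 block
const q8BlockSize = 32;  // elements per Q8_0 block

const buggyEstimate = elements * q8BlockBytes;                 // ≈ 73 GB: rejected
const fixedEstimate = elements * (q8BlockBytes / q8BlockSize); // ≈ 2.3 GB: fits
```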
The overestimate trips the VRAM rejection path
(`GgufInsightsConfigurationResolver.resolveConfigForUsage`), so valid
configurations with quantized KV (e.g. Q8_0 at 8k context on a model
that easily fits with FP16 + 8k) get refused with a "not enough VRAM"
result.
Fix: also fetch `getBlockSizeForGgmlType()` for each KV type and
compute `keyBytesPerElement = blockBytes / blockSize`. The other
existing consumer of these bindings (`calculateTensorSize` for general
tensor size estimation, lines 827+) already uses both functions
together — the KV-cache estimator was the only path missing the
block-size division.
For F16 / F32 (blockSize=1) the division is a no-op so no behavior
changes there.
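A minimal before/after sketch of that change (binding and variable names follow the text above; the surrounding estimator code is assumed, not quoted):

```typescript
declare function getTypeSizeForGgmlType(t: number): number;  // bytes per block
declare function getBlockSizeForGgmlType(t: number): number; // elements per block
declare const keyType: number; // ggml type id of the KV key cache

// Before: treated the whole block's byte size as the per-element cost.
// const keyBytesPerElement = getTypeSizeForGgmlType(keyType);

// After: divide bytes per block by elements per block.
const keyBytesPerElement =
    getTypeSizeForGgmlType(keyType) / getBlockSizeForGgmlType(keyType);
```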
Summary
The KV cache size estimator in `GgufInsights.estimateContextResourceRequirements` (and its helper `_estimateKVCacheMemorySizeInBytes`) overestimates by ~32× for block-quantized KV cache types (Q4_0, Q5_0, Q6_K, Q8_0, etc.). The cause: `getTypeSizeForGgmlType()` returns bytes per block, not bytes per element. For most block-quantized types one block holds 32 elements (K-quants such as Q6_K use even larger blocks), so the per-element cost is 32× smaller than what the estimator currently uses.

Why this matters
The overestimate trips the VRAM rejection branch in `GgufInsightsConfigurationResolver.resolveConfigForUsage`. Valid configurations with a quantized KV cache (a common deployment trick for models that hover near the VRAM ceiling at FP16) get refused with a "not enough VRAM" result.

Example: a 7B model at 8192 ctx with `experimentalDefaultContextKvCacheKeyType: 'Q8_0'` should fit comfortably on hardware that handles FP16 at 4096 ctx. The estimator instead reports the Q8_0 cache as larger than the FP16 cache and rejects it.
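A sketch of the workflow this unblocks (the KV-type option name is taken from this description; its exact placement in the API may differ):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();

// Hypothetical placement of the option named in this PR.
const model = await llama.loadModel({
    modelPath: "path/to/7b-model.gguf",
    experimentalDefaultContextKvCacheKeyType: "Q8_0"
});

// Before the fix this was rejected with "not enough VRAM", even though
// a Q8_0 KV cache is roughly half the size of an FP16 one.
const context = await model.createContext({contextSize: 8192});
```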
The bug

`addonGetTypeSizeForGgmlType` (llama/addon/addon.cpp:76) calls `ggml_type_size()`, which is documented as "block size in bytes":

- For F16: blockSize = 1, so blockBytes = 2, per-element = 2. Correct.
- For Q8_0: blockSize = 32, so blockBytes = 34, per-element = 34/32 ≈ 1.0625. The current estimator multiplies element count by 34 instead of ~1.0625.
The other existing consumer of this binding (`calculateTensorSize`, for general tensor sizing in the same file, ~line 827) already pairs `getTypeSizeForGgmlType` with `getBlockSizeForGgmlType` correctly. The KV-cache estimator was the only path missing the block-size division.
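For reference, that pairing looks roughly like this (declared bindings stand in for the real ones; this is the arithmetic pattern, not the actual `calculateTensorSize` code):

```typescript
declare function getTypeSizeForGgmlType(t: number): number;  // bytes per block
declare function getBlockSizeForGgmlType(t: number): number; // elements per block

function tensorSizeBytes(elementCount: number, ggmlType: number): number {
    // Elements are grouped into blocks, so divide the count by the
    // block size before multiplying by the block's byte size.
    const blocks = elementCount / getBlockSizeForGgmlType(ggmlType);
    return blocks * getTypeSizeForGgmlType(ggmlType);
}
```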
Fix

In the KV-cache estimator, also fetch `getBlockSizeForGgmlType` for each KV type, then divide (sketch below). For F16 / F32 (blockSize = 1) the division is a no-op, so there is no behavior change on the most common configs.
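A sketch of the corrected KV-cache arithmetic (the dimension variables are hypothetical; only the division is the point):

```typescript
declare function getTypeSizeForGgmlType(t: number): number;  // bytes per block
declare function getBlockSizeForGgmlType(t: number): number; // elements per block

// Hypothetical inputs; the real estimator derives these from the model.
declare const keyType: number, valueType: number;
declare const contextSize: number, layerCount: number;
declare const keyEmbeddingSize: number, valueEmbeddingSize: number;

const keyBytesPerElement =
    getTypeSizeForGgmlType(keyType) / getBlockSizeForGgmlType(keyType);
const valueBytesPerElement =
    getTypeSizeForGgmlType(valueType) / getBlockSizeForGgmlType(valueType);

const kvCacheBytes = contextSize * layerCount * (
    keyEmbeddingSize * keyBytesPerElement +
    valueEmbeddingSize * valueBytesPerElement
);
```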
Test plan
Related
This is paired with #607 (`resolveGgmlTypeOption` case-insensitivity). Together they unlock a real workflow: pass `experimentalKvCacheKeyType: 'q8_0'` from a config file and have it actually take effect AND pass the VRAM check.

🤖 Generated with Claude Code