feat(LlamaContext): expose kvUnified option #605
Open
andreinknv wants to merge 1 commit into
Conversation
Adds a `kvUnified?: boolean` field to LlamaContextOptions, plumbing
through to `llama_context_params.kv_unified` in llama.cpp.
When enabled, llama.cpp uses a single contiguous KV buffer indexed by
sequence id rather than per-sequence buffers. On GPU backends this
can significantly improve multi-sequence prefill/decode throughput by
reducing per-sequence buffer juggling and enabling more efficient
batched attention.
The default remains "whatever llama.cpp picks": the upstream default
currently depends on whether the number of sequences is auto-detected.
llama-server exposes the same toggle via `--kv-unified`.
Plumbing:
- LlamaContextOptions.kvUnified (types.ts) — public option with docstring.
- LlamaContext._kvUnified (LlamaContext.ts) — destructured in the
constructor, forwarded to AddonContext options.
- AddonContext.cpp — when `options.Has("kvUnified")`, sets
`context_params.kv_unified` to the boolean value.
No default change. Existing callers see no behavior shift.
Summary
Adds a `kvUnified?: boolean` option on `LlamaContextOptions`, plumbing through to `llama_context_params.kv_unified` in llama.cpp.

When enabled, llama.cpp uses a single contiguous KV buffer indexed by sequence id rather than per-sequence buffers. On GPU backends (Metal, CUDA) this can significantly improve multi-sequence prefill/decode throughput by reducing per-sequence buffer juggling and enabling more efficient batched attention.
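As a usage sketch, assuming the option lands as described here (the model path is a placeholder):

```typescript
import {getLlama, LlamaChatSession} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "models/my-model.gguf"});

// One context serving several parallel sequences from a single
// contiguous KV buffer (llama_context_params.kv_unified = true).
const context = await model.createContext({
    sequences: 4,
    kvUnified: true
});

const session = new LlamaChatSession({
    contextSequence: context.getSequence()
});
console.log(await session.prompt("Hello!"));
```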
Why
`llama-server` already exposes this toggle as `--kv-unified`, and most multi-slot deployments turn it on. Without a corresponding option in `node-llama-cpp`, callers running a single `LlamaContext` with `sequences: N` can't reach the same configuration, even when the underlying model and hardware would clearly benefit. The KV cache dtype controls were already reachable via the `experimental*KvCache*Type` options, but the unified-buffer flag itself was not.

Plumbing
- `src/evaluator/LlamaContext/types.ts` — new `kvUnified?: boolean` option on `LlamaContextOptions` with full docstring (sibling to the existing `swaFullCache` field).
- `src/evaluator/LlamaContext/LlamaContext.ts` — `_kvUnified` member, constructor destructure, forwarded into the `AddonContext` options bag alongside `swaFullCache`, `kvCacheKeyType`, etc.
- `llama/addon/AddonContext.cpp` — when `options.Has("kvUnified")`, sets `context_params.kv_unified` to the unwrapped boolean.
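A condensed sketch of the TypeScript-side wiring, with simplified shapes; the `AddonContext` construction is shown as a comment and is illustrative:

```typescript
// types.ts: the new public option, sibling to swaFullCache (simplified)
export type LlamaContextOptions = {
    sequences?: number,
    swaFullCache?: boolean,

    /**
     * When enabled, llama.cpp uses a single contiguous KV buffer indexed
     * by sequence id instead of per-sequence buffers (maps to
     * `llama_context_params.kv_unified`). llama.cpp's own default applies
     * when this is unset.
     */
    kvUnified?: boolean
};

// LlamaContext.ts: destructured in the constructor and forwarded to the
// addon only when set, so the llama.cpp default is otherwise untouched
class LlamaContext {
    private readonly _kvUnified?: boolean;

    public constructor({kvUnified, swaFullCache}: LlamaContextOptions) {
        this._kvUnified = kvUnified;

        const addonOptions: Record<string, unknown> = {swaFullCache};
        if (this._kvUnified != null)
            addonOptions.kvUnified = this._kvUnified;
        // new AddonContext(model, addonOptions): on the native side,
        // options.Has("kvUnified") sets context_params.kv_unified
    }
}
```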
Compatibility

`kvUnified` is optional. When unset, the existing default ("whatever llama.cpp picks for this configuration") is preserved exactly: `context_params.kv_unified` is not touched.

Test plan
- `kvUnified: true` — context constructs cleanly; per-decode logs confirm `kv_unified=1` reaches llama.cpp.

🤖 Generated with Claude Code
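For reference, a rough version of that smoke test, with a placeholder model path; the `kv_unified=1` confirmation comes from llama.cpp's own per-decode logs rather than anything asserted here:

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({modelPath: "models/test-model.gguf"});

// Should construct cleanly with the unified KV buffer enabled.
const context = await model.createContext({
    sequences: 2,
    kvUnified: true
});
console.log("context constructed with", context.totalSequences, "sequences");

await context.dispose();
await model.dispose();
```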