
feat(LlamaContext): expose kvUnified option #605

Open
andreinknv wants to merge 1 commit into withcatai:master from andreinknv:feat/kv-unified-option

Conversation

@andreinknv

Summary

Adds a `kvUnified?: boolean` option on `LlamaContextOptions`, plumbing through to `llama_context_params.kv_unified` in llama.cpp.

When enabled, llama.cpp uses a single contiguous KV buffer indexed by sequence id rather than per-sequence buffers. On GPU backends (Metal, CUDA) this can significantly improve multi-sequence prefill/decode throughput by reducing per-sequence buffer juggling and enabling more efficient batched attention.
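
For context, here is how this would look from the public API (a minimal usage sketch assuming the current `getLlama`/`loadModel`/`createContext` flow; the model path and sizes are placeholders):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf" // placeholder
});

// One context, several sequences sharing a single unified KV buffer
const context = await model.createContext({
    contextSize: 8192,
    sequences: 8,
    kvUnified: true // the new option from this PR
});
```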

Why

llama-server already exposes this toggle as `--kv-unified`, and most multi-slot deployments turn it on. Without a corresponding option in node-llama-cpp, callers running a single `LlamaContext` with `sequences: N` can't reach the same configuration, even when the underlying model and hardware would clearly benefit. The dtype controls were already reachable via the `experimental*KvCache*Type` options, but the unified-buffer flag itself was not.

Plumbing

  • `src/evaluator/LlamaContext/types.ts` — new `kvUnified?: boolean` option on `LlamaContextOptions` with a full docstring (sibling to the existing `swaFullCache` field).
  • `src/evaluator/LlamaContext/LlamaContext.ts` — `_kvUnified` member, destructured in the constructor and forwarded into the `AddonContext` options bag alongside `swaFullCache`, `kvCacheKeyType`, etc. (see the sketch after this list).
  • `llama/addon/AddonContext.cpp` — when `options.Has("kvUnified")`, sets `context_params.kv_unified` to the unwrapped boolean.
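
Roughly, the TypeScript side of the plumbing looks like this (a simplified, compilable sketch, not the exact diff; apart from the `kvUnified` field itself, the names and shapes here are illustrative):

```typescript
// Sketch of the new option and how it's forwarded to the native layer.
interface LlamaContextOptions {
    contextSize?: number;
    sequences?: number;
    /**
     * Use one unified KV buffer indexed by sequence id instead of
     * per-sequence buffers (maps to llama_context_params.kv_unified).
     * When unset, llama.cpp's own default for the configuration is kept.
     */
    kvUnified?: boolean;
}

// Builds the options bag handed to the native AddonContext.
function buildAddonContextOptions(options: LlamaContextOptions) {
    const {kvUnified, ...rest} = options;
    return {
        ...rest,
        // Include the key only when explicitly set, so the native side
        // leaves context_params.kv_unified untouched by default.
        ...(kvUnified != null ? {kvUnified} : {})
    };
}
```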

Compatibility

  • `kvUnified` is optional. When unset, the existing default ("whatever llama.cpp picks for this configuration") is preserved exactly — `context_params.kv_unified` is not touched.
  • No public-API breakage. Existing callers see no behavior shift.

Test plan

  • Local build + smoke test against Qwen2.5-Coder-3B-Instruct Q4_K_M (Metal, `contextSize: 8192`, `sequences: 8`) with `kvUnified: true` — context constructs cleanly; per-decode logs confirm `kv_unified=1` reaches llama.cpp.
  • Existing tests pass locally.
  • CI on this PR.

🤖 Generated with Claude Code

Adds a `kvUnified?: boolean` field to LlamaContextOptions, plumbing
through to `llama_context_params.kv_unified` in llama.cpp.

When enabled, llama.cpp uses a single contiguous KV buffer indexed by
sequence id rather than per-sequence buffers. On GPU backends this
can significantly improve multi-sequence prefill/decode throughput by
reducing per-sequence buffer juggling and enabling more efficient
batched attention.

The default remains "whatever llama.cpp picks" — currently the upstream
default depends on whether the number of sequences is auto-detected,
and llama-server exposes the same toggle via `--kv-unified`.

Plumbing:
  - `LlamaContextOptions.kvUnified` (types.ts) — public option with docstring.
  - `LlamaContext._kvUnified` (LlamaContext.ts) — destructured in the
    constructor, forwarded to AddonContext options.
  - AddonContext.cpp — when `options.Has("kvUnified")`, sets
    `context_params.kv_unified` to the boolean value.

No default change. Existing callers see no behavior shift.
