
feat(LlamaContext): expose kvUnified option #605

Open
andreinknv wants to merge 1 commit into withcatai:master from andreinknv:feat/kv-unified-option

Conversation

@andreinknv

Summary

Adds a `kvUnified?: boolean` option on `LlamaContextOptions`, plumbing through to `llama_context_params.kv_unified` in llama.cpp.

When enabled, llama.cpp uses a single contiguous KV buffer indexed by sequence id rather than per-sequence buffers. On GPU backends (Metal, CUDA) this can significantly improve multi-sequence prefill/decode throughput by reducing per-sequence buffer juggling and enabling more efficient batched attention.
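
For context, here is how this would look from the public API (a minimal usage sketch assuming the current `getLlama`/`loadModel`/`createContext` flow; the model path and sizes are placeholders):

```typescript
import {getLlama} from "node-llama-cpp";

const llama = await getLlama();
const model = await llama.loadModel({
    modelPath: "path/to/model.gguf" // placeholder
});

// One context, several sequences sharing a single unified KV buffer
const context = await model.createContext({
    contextSize: 8192,
    sequences: 8,
    kvUnified: true // the new option from this PR
});
```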

Why

llama-server already exposes this toggle as `--kv-unified`, and most multi-slot deployments turn it on. Without a corresponding option in node-llama-cpp, callers running a single `LlamaContext` with `sequences: N` can't reach the same configuration, even when the underlying model and hardware would clearly benefit. The dtype controls were already reachable via the `experimental*KvCache*Type` options, but the unified-buffer flag itself was not.

Plumbing

  • `src/evaluator/LlamaContext/types.ts` — new `kvUnified?: boolean` option on `LlamaContextOptions` with a full docstring (sibling to the existing `swaFullCache` field).
  • `src/evaluator/LlamaContext/LlamaContext.ts` — `_kvUnified` member, destructured in the constructor and forwarded into the `AddonContext` options bag alongside `swaFullCache`, `kvCacheKeyType`, etc. (see the sketch after this list).
  • `llama/addon/AddonContext.cpp` — when `options.Has("kvUnified")`, sets `context_params.kv_unified` to the unwrapped boolean.
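
Roughly, the TypeScript side of the plumbing looks like this (a simplified, compilable sketch, not the exact diff; apart from the `kvUnified` field itself, the names and shapes here are illustrative):

```typescript
// Sketch of the new option and how it's forwarded to the native layer.
interface LlamaContextOptions {
    contextSize?: number;
    sequences?: number;
    /**
     * Use one unified KV buffer indexed by sequence id instead of
     * per-sequence buffers (maps to llama_context_params.kv_unified).
     * When unset, llama.cpp's own default for the configuration is kept.
     */
    kvUnified?: boolean;
}

// Builds the options bag handed to the native AddonContext.
function buildAddonContextOptions(options: LlamaContextOptions) {
    const {kvUnified, ...rest} = options;
    return {
        ...rest,
        // Include the key only when explicitly set, so the native side
        // leaves context_params.kv_unified untouched by default.
        ...(kvUnified != null ? {kvUnified} : {})
    };
}
```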

Compatibility

  • `kvUnified` is optional. When unset, the existing default ("whatever llama.cpp picks for this configuration") is preserved exactly — `context_params.kv_unified` is not touched.
  • No public-API breakage. Existing callers see no behavior shift.

Test plan

  • Local build + smoke test against Qwen2.5-Coder-3B-Instruct Q4_K_M (Metal, `contextSize: 8192`, `sequences: 8`) with `kvUnified: true` — context constructs cleanly; per-decode logs confirm `kv_unified=1` reaches llama.cpp.
  • Existing tests pass locally.
  • CI on this PR.

🤖 Generated with Claude Code

Adds a `kvUnified?: boolean` field to LlamaContextOptions, plumbing
through to `llama_context_params.kv_unified` in llama.cpp.

When enabled, llama.cpp uses a single contiguous KV buffer indexed by
sequence id rather than per-sequence buffers. On GPU backends this
can significantly improve multi-sequence prefill/decode throughput by
reducing per-sequence buffer juggling and enabling more efficient
batched attention.

The default remains "whatever llama.cpp picks" — currently the upstream
default depends on whether the number of sequences is auto-detected,
and llama-server exposes the same toggle via `--kv-unified`.

Plumbing:
  - `LlamaContextOptions.kvUnified` (types.ts) — public option with docstring.
  - `LlamaContext._kvUnified` (LlamaContext.ts) — destructured in the
    constructor, forwarded to AddonContext options.
  - AddonContext.cpp — when `options.Has("kvUnified")`, sets
    `context_params.kv_unified` to the boolean value.

No default change. Existing callers see no behavior shift.
