perf(advisor): recommend Turbo KV-cache modes per family and context by inureyes · Pull Request #343 · lablup/mlxcel

inureyes · 2026-06-17T18:50:10Z

Summary

Extend the existing quant/cache advisor so mlxcel generate --recommend-quant also emits an advisory KV-cache-mode suggestion per model family and context range, among fp16, int8, fp16+turbo4 (asymmetric Turbo4), turbo4 (symmetric), and fp16+turbo3. This is a P3 backlog item: the goal is to give operators a benchmark starting point for long-context and memory-constrained serving, not to change any default.

The recommendations are advisory and opt-in only. The default inference path is unchanged: with no flags resolve_kv_cache_mode still returns fp16, and nothing is auto-applied. The advisor reads only config.json, prints suggestions, and exits.

What changed

New module src/execution/kv_cache_advisor.rs: the pure recommendation core recommend_kv_cache_mode(arch_kind, model_type, range), the config-driven advise_kv_cache_modes(model_path), the KvContextRange bucket (short <=4K, medium 4K-32K, long >32K), the KvCacheModeAdvice value, and render_kv_cache_advice / print_kv_cache_advice.
It keys on the KV-architecture class already computed by src/execution/kv_arch.rs (standard, sliding-window, MLA, hybrid, pure SSM) plus the raw model_type for the symmetric-Turbo4 PPL allowlist (src/lib/mlxcel-core/src/cache/turbo/allowlist.rs).
QuantAdvice gains a kv_cache_advice: Vec<KvCacheModeAdvice> field, populated in advise_quantization and rendered by print_quant_advice, so --recommend-quant surfaces the suggestions with no behavior change to any other caller.
docs/turbo-kv-cache.md documents the new --recommend-quant surface: the family/context keying, the conservative rule table, the advisory/opt-in framing, and the no-weight-promotion guarantee.
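The context-range bucketing listed above can be sketched as follows. This is a minimal illustration, not the code in src/execution/kv_cache_advisor.rs: the variant names, the `from_tokens` constructor, and the exact token boundaries (4K taken as 4096, 32K as 32768, each boundary belonging to the lower bucket) are assumptions. Only the three bucket labels and their ranges come from the description.

```rust
/// Context-length bucket used to key a KV-cache-mode suggestion.
/// Variant names are illustrative and may differ from the real type.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvContextRange {
    Short,  // <= 4K tokens
    Medium, // 4K to 32K tokens
    Long,   // > 32K tokens
}

impl KvContextRange {
    /// Hypothetical constructor: bucket a context length given in tokens.
    /// Assumes 4K = 4096 and 32K = 32768.
    pub fn from_tokens(tokens: usize) -> Self {
        const SHORT_MAX: usize = 4 * 1024;
        const MEDIUM_MAX: usize = 32 * 1024;
        if tokens <= SHORT_MAX {
            KvContextRange::Short
        } else if tokens <= MEDIUM_MAX {
            KvContextRange::Medium
        } else {
            KvContextRange::Long
        }
    }
}
```

Keeping the bucketing separate from the recommendation rules means the rule table only has to reason about three cases rather than raw token counts.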

No bf16-to-f16 quantized-weight promotion (the #289 landmine)

A recommendation is always one of the five KVCacheMode values. KV-cache modes quantize only the K/V cache tensors and dequantize back to FP16 for SDPA; they never touch model weights or their scales/biases. The advisor never calls into the weight-loading / sanitize.rs path, so it cannot reintroduce the bf16-to-f16 quantized-weight promotion that caused the #289 decode regression. Symmetric turbo4 is suggested only for families on the PPL allowlist, and MLA / pure-SSM families never receive a Walsh-Hadamard Turbo mode (their cache dimension is not a power of two, or is absent). A unit test enumerates every (architecture, family, range) combination and asserts that the output stays within the five allowed modes and is never turbo4-delegated.
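The invariants in this section can be illustrated with a small, self-contained sketch of a pure recommendation function and an exhaustive check over it. Everything here is a stand-in: the type and function names, the placeholder allowlist entry, and several rule choices (sliding-window and hybrid treated like standard attention, MLA mapped to int8 for every non-short range, allowlisted families receiving symmetric turbo4 only for long context) are assumptions made for illustration. The real rule table lives in kv_cache_advisor.rs and the real allowlist in allowlist.rs.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvArch { Standard, SlidingWindow, Mla, Hybrid, PureSsm }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvRange { Short, Medium, Long }

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvMode { Fp16, Int8, Fp16Turbo4, Turbo4, Fp16Turbo3 }

/// Stand-in for the PPL allowlist lookup; the entry below is a placeholder.
fn on_symmetric_allowlist(model_type: &str) -> bool {
    const PLACEHOLDER_ALLOWLIST: &[&str] = &["example_allowlisted_family"];
    PLACEHOLDER_ALLOWLIST.contains(&model_type)
}

/// Pure function: no I/O, no weight loading, only a mode suggestion.
pub fn recommend(arch: KvArch, model_type: &str, range: KvRange) -> KvMode {
    match (arch, range) {
        // Short context: the fp16 default is kept.
        (_, KvRange::Short) => KvMode::Fp16,
        // Pure SSM has no attention KV cache to quantize.
        (KvArch::PureSsm, _) => KvMode::Fp16,
        // MLA never gets a Walsh-Hadamard Turbo mode.
        (KvArch::Mla, _) => KvMode::Int8,
        // Symmetric turbo4 only for allowlisted families (assumed: long only).
        (_, KvRange::Long) if on_symmetric_allowlist(model_type) => KvMode::Turbo4,
        // Everything else: asymmetric Turbo4.
        _ => KvMode::Fp16Turbo4,
    }
}

/// Enumerate every (architecture, family, range) combination and check
/// the invariants described above.
pub fn invariants_hold() -> bool {
    let archs = [
        KvArch::Standard, KvArch::SlidingWindow, KvArch::Mla,
        KvArch::Hybrid, KvArch::PureSsm,
    ];
    let ranges = [KvRange::Short, KvRange::Medium, KvRange::Long];
    let families = ["example_allowlisted_family", "some_other_family"];
    for arch in archs {
        for family in families {
            for range in ranges {
                let mode = recommend(arch, family, range);
                let is_turbo = matches!(
                    mode,
                    KvMode::Fp16Turbo4 | KvMode::Turbo4 | KvMode::Fp16Turbo3
                );
                if range == KvRange::Short && mode != KvMode::Fp16 {
                    return false;
                }
                if mode == KvMode::Turbo4 && !on_symmetric_allowlist(family) {
                    return false;
                }
                if matches!(arch, KvArch::Mla | KvArch::PureSsm) && is_turbo {
                    return false;
                }
            }
        }
    }
    true
}
```

Because the return type is a closed enum, "always one of the five modes" holds by construction in this sketch; the enumeration is what guards the allowlist and MLA / pure-SSM rules against later edits to the match arms.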

Test plan

cargo check -p mlxcel --lib --release --features metal,accelerate
cargo test --release -p mlxcel --lib --features metal,accelerate kv_cache_advisor:: (15 tests: range bucketing, short-context-is-fp16, symmetric-turbo4-only-for-allowlisted, MLA-uses-int8-never-turbo, pure-SSM-fp16, always-an-allowed-mode, default-unchanged-without-opt-in, advisory-framing render)
cargo clippy -p mlxcel --lib --tests --release --features metal,accelerate clean for the changed files
cargo fmt --check clean for the changed files
python3 scripts/ci/check_cross_repo_refs.py

Remaining acceptance-criteria items (recorded 4K/16K/32K sweeps across two dense and two MoE/VLM families, and the recorded per-family quality guard) are intentionally left open: they are the validation coverage that keeps the recommendations advisory. The orchestrator runs the real-model KV-mode smoke after merge.

Closes #327

Extend the quant/cache advisor so `mlxcel generate --recommend-quant` also suggests a KV-cache mode (fp16, int8, fp16+turbo4, turbo4, fp16+turbo3) per model family and context range. The suggestion is advisory and opt-in: the default inference path is unchanged (resolve_kv_cache_mode still returns fp16 with no flags) and nothing is auto-applied.

New module src/execution/kv_cache_advisor.rs holds the pure recommendation core (recommend_kv_cache_mode) plus config-driven helpers (advise_kv_cache_modes) and the printable advisory block. It keys on the KV-architecture class from src/execution/kv_arch.rs (standard, sliding-window, MLA, hybrid, pure SSM) and the raw model_type for the symmetric-Turbo4 PPL allowlist, bucketed by context range (short <=4K, medium 4K-32K, long >32K). Long context and memory-constrained serving are prioritized over raw short-decode tok/s.

QuantAdvice gains a kv_cache_advice field, populated in advise_quantization and printed by print_quant_advice, so --recommend-quant surfaces the suggestions with no behavior change elsewhere.

Safety: a recommendation is always one of the five KV-cache modes, which quantize only the K/V cache tensors and dequantize to FP16 for SDPA. They never touch model weights or their scales/biases, so they cannot reintroduce the bf16-to-f16 quantized-weight promotion from #289. Symmetric turbo4 is suggested only for allowlisted families; MLA and pure-SSM families never receive a Walsh-Hadamard Turbo mode. These invariants are covered by unit tests, including a default-unchanged test.

Docs: docs/turbo-kv-cache.md documents the advisor surface, its family/context keying, the advisory/opt-in framing, and the no-weight-promotion guarantee.

Split the inline `#[cfg(test)] mod tests` out of kv_cache_advisor.rs into kv_cache_advisor_tests.rs, wired with `#[path = "kv_cache_advisor_tests.rs"] mod tests;`. This follows the CLAUDE.md convention (unit tests in `_tests.rs` files beside the implementation) and keeps the implementation file under the 500-line limit. No code or test behavior changes; the tests still run as execution::kv_cache_advisor::tests::*.

… dims

TurboQuantParams::new and TurboQuantParams3::new assert that the attention head dimension is a power of two, panicking on families like Phi-2 where head_dim is 80. The advisor now reads the head dimension from config.json (explicit head_dim/head_size or derived from hidden_size/num_attention_heads) and, when the value is present and not a power of two, downgrades any Turbo (Walsh-Hadamard) suggestion to int8 or fp16 for that family, matching the existing treatment of MLA families. Two new unit tests cover the Phi-2 non-power-of-two path and the Llama power-of-two path.
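The guard described in this commit can be sketched as below. This is an illustration under stated assumptions, not the advisor's code: `AttnConfig`, `head_dim`, `turbo_ok`, and `apply_head_dim_guard` are hypothetical names, the struct stands in for the parsed config.json, and the downgrade target is fixed to int8 here even though the commit says the real advisor picks int8 or fp16.

```rust
/// Minimal stand-in for the config.json fields the guard needs.
#[derive(Debug, Default, Clone, Copy)]
pub struct AttnConfig {
    pub head_dim: Option<u64>,
    pub hidden_size: Option<u64>,
    pub num_attention_heads: Option<u64>,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Mode { Fp16, Int8, Fp16Turbo4, Turbo4, Fp16Turbo3 }

/// Prefer an explicit head_dim; otherwise derive hidden_size / num_attention_heads.
pub fn head_dim(cfg: &AttnConfig) -> Option<u64> {
    if let Some(d) = cfg.head_dim {
        return Some(d);
    }
    match (cfg.hidden_size, cfg.num_attention_heads) {
        (Some(hidden), Some(heads)) if heads > 0 => Some(hidden / heads),
        _ => None,
    }
}

/// Turbo (Walsh-Hadamard) modes need a power-of-two head dimension.
/// When the value is absent the suggestion is left alone.
pub fn turbo_ok(cfg: &AttnConfig) -> bool {
    match head_dim(cfg) {
        Some(d) => d.is_power_of_two(),
        None => true,
    }
}

/// Downgrade a Turbo suggestion when the head dimension rules it out.
/// Assumption: int8 is used as the fallback in this sketch.
pub fn apply_head_dim_guard(suggested: Mode, cfg: &AttnConfig) -> Mode {
    let is_turbo = matches!(
        suggested,
        Mode::Fp16Turbo4 | Mode::Turbo4 | Mode::Fp16Turbo3
    );
    if is_turbo && !turbo_ok(cfg) {
        Mode::Int8
    } else {
        suggested
    }
}
```

With a Phi-2-style config (hidden_size 2560, 32 heads) the derived head dimension is 80, so a Turbo suggestion is withheld; with a Llama-style config (4096 and 32) it is 128 and the suggestion passes through unchanged.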

The non-power-of-two Turbo guard in advise_kv_cache_modes derived the head dim via read_head_dim, which checked only hidden_size/d_model for the hidden size and num_attention_heads/num_heads for the head count. Alternate config naming used by supported families (OLMo and MPT-style d_model + n_heads) made read_head_dim return None, which defaults turbo_ok to true and bypasses the guard, even though kv_arch::attn_dims still derives the head dim and classifies the model as standard attention. OLMo avoids a live panic only because its head dim is 128; a non-power-of-two head dim under the same naming (for example MPT-30B at 112) would yield a Walsh-Hadamard Turbo suggestion that panics in TurboQuantParams::new if a user opts into it.

Broaden read_head_dim's hidden-size and head-count field lists to mirror kv_arch::attn_dims (adding dim, model_dim, n_heads, n_head) so the guard sees the same head dim the classifier used. The change is additive and strictly more conservative: power-of-two models are unaffected, and alternate-naming models with a non-power-of-two head dim now correctly downgrade to int8/fp16. Add regression tests covering the non-power-of-two case (MPT-30B-style, head_dim 112, Turbo withheld) and the power-of-two case (OLMo-1B-style, head_dim 128, Turbo kept).
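The broadened field lookup can be sketched as a first-match search over alias lists. This is a simplified stand-in for read_head_dim, not its actual code: the function names are hypothetical, the config is modeled as a slice of (key, value) pairs instead of parsed JSON, and the precedence order among aliases is an assumption. The alias names themselves are the ones listed in the commit message.

```rust
/// Hidden-size aliases, per the commit message (order is assumed).
const HIDDEN_KEYS: &[&str] = &["hidden_size", "d_model", "dim", "model_dim"];
/// Head-count aliases, per the commit message (order is assumed).
const HEAD_COUNT_KEYS: &[&str] = &["num_attention_heads", "num_heads", "n_heads", "n_head"];

/// Return the value of the first alias that is present in the config.
fn first_present(cfg: &[(&str, u64)], keys: &[&str]) -> Option<u64> {
    keys.iter()
        .find_map(|k| cfg.iter().find(|(name, _)| name == k).map(|(_, v)| *v))
}

/// Explicit head_dim/head_size wins; otherwise derive hidden / heads.
pub fn derive_head_dim(cfg: &[(&str, u64)]) -> Option<u64> {
    if let Some(d) = first_present(cfg, &["head_dim", "head_size"]) {
        return Some(d);
    }
    let hidden = first_present(cfg, HIDDEN_KEYS)?;
    let heads = first_present(cfg, HEAD_COUNT_KEYS)?;
    if heads == 0 {
        None
    } else {
        Some(hidden / heads)
    }
}

/// A Turbo suggestion is kept only when the head dim is a power of two,
/// or when it cannot be derived at all.
pub fn turbo_allowed(cfg: &[(&str, u64)]) -> bool {
    derive_head_dim(cfg).map_or(true, |d| d.is_power_of_two())
}
```

An MPT-30B-style config (d_model 7168, n_heads 64) derives a head dimension of 112 and Turbo is withheld, while an OLMo-1B-style config (d_model 2048, n_heads 16) derives 128 and Turbo is kept. Sharing one alias list between the guard and the classifier is what keeps the two paths from disagreeing again.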

The head-dimension derivation in the advisor now covers the full set of alternate field names added in 0b24f0f (OLMo/MPT-style d_model + n_heads, etc.), mirroring the kv_arch classifier so both paths agree. The previous sentence only mentioned hidden_size / num_attention_heads and was stale.

inureyes · 2026-06-17T19:59:39Z

PR Finalization

Tests

19 advisor unit tests pass. No gaps found; the existing suite covers context-range bucketing, per-family recommendations, allowlist gating, non-power-of-two head-dim downgrade, default-unchanged invariant, render output, and the alternate field-name guard added in 0b24f0f.

Documentation

docs/turbo-kv-cache.md had one stale sentence: the head-dimension derivation description mentioned only hidden_size / num_attention_heads, but 0b24f0f broadened the field-name coverage to also handle d_model / dim / model_dim and num_heads / n_heads / n_head. Updated to list the full set and note that it mirrors the kv_arch classifier. The advisory requirements are all present and accurate: the section describes --recommend-quant, marks the output as advisory/opt-in, states the default is unchanged, notes KV modes never touch model weights (no bf16->f16 weight promotion), and covers the non-power-of-two head-dim fallback.

No Korean mirror of this doc exists in the repo.

Lint / Format

cargo fmt --check clean. cargo clippy -p mlxcel --lib --tests --features metal,accelerate -- -D warnings exits 0 with no diagnostics.

Commit: d4a0445

inureyes · 2026-06-17T20:05:40Z

Real-model validation (orchestrator)

--recommend-quant smoke on M1 Ultra:

qwen2.5-7b-4bit (non-allowlisted dense): per-range KV advice prints. short = fp16; medium/long = fp16+turbo4 (also int8 / fp16+turbo3). Symmetric turbo4 is correctly withheld because the family is not on the PPL allowlist.
qwen3.5-4b-4bit (allowlisted): long = turbo4 (symmetric, ~73% KV savings), correctly differentiated by is_symmetric_turbo_allowed; fp16+turbo4 offered as the lower-risk fallback.
The output states the advice is advisory/opt-in and that the default inference path is unchanged (fp16).
Recommended modes load and generate coherently: qwen2.5-7b fp16+turbo4 OK, qwen3.5-4b turbo4 OK.
Default (no kv flag) unchanged: qwen2.5-7b runs fp16, coherent, 85 tok/s.

Reviewer and security-checker each fixed a MEDIUM head_dim guard (a non-power-of-2 head_dim family could otherwise be recommended a Turbo mode that panics on opt-in; now downgraded to int8/fp16, with the field-name set broadened to match the arch classifier). Finalizer corrected a stale doc sentence. 0 unresolved CRITICAL/HIGH. Acceptance criteria met (advisory/opt-in, no bf16->f16 quant-weight promotion, sound per-family/context recommendations, integrated into --recommend-quant).

inureyes added the labels status:review (Under review), type:performance (Performance improvements), priority:backlog (Future considerations), area:core (mlxcel-core: MLX FFI, primitives, KV cache, layers), and area:inference (Generation, sampling, decoding, incl. speculative, DRY) on Jun 17, 2026

inureyes added 4 commits June 18, 2026 04:07

inureyes added the status:done (Completed) label and removed the status:review (Under review) label on Jun 17, 2026

inureyes merged commit 568766f into main Jun 17, 2026
5 checks passed

inureyes deleted the perf/327-kv-cache-turbo-advisor branch June 17, 2026 20:06

inureyes merged 5 commits into main from perf/327-kv-cache-turbo-advisor