
perf(advisor): recommend Turbo KV-cache modes per family and context #343

Merged
inureyes merged 5 commits into main from perf/327-kv-cache-turbo-advisor
Jun 17, 2026


@inureyes

Member

Summary

Extend the existing quant/cache advisor so mlxcel generate --recommend-quant also emits an advisory KV-cache-mode suggestion per model family and context range, among fp16, int8, fp16+turbo4 (asymmetric Turbo4), turbo4 (symmetric), and fp16+turbo3. This is a P3 backlog item: the goal is to give operators a benchmark starting point for long-context and memory-constrained serving, not to change any default.

The recommendations are advisory and opt-in only. The default inference path is unchanged: with no flags resolve_kv_cache_mode still returns fp16, and nothing is auto-applied. The advisor reads only config.json, prints suggestions, and exits.

What changed

  • New module src/execution/kv_cache_advisor.rs: the pure recommendation core recommend_kv_cache_mode(arch_kind, model_type, range), the config-driven advise_kv_cache_modes(model_path), the KvContextRange bucket (short <=4K, medium 4K-32K, long >32K), the KvCacheModeAdvice value, and render_kv_cache_advice / print_kv_cache_advice.
  • It keys on the KV-architecture class already computed by src/execution/kv_arch.rs (standard, sliding-window, MLA, hybrid, pure SSM) plus the raw model_type for the symmetric-Turbo4 PPL allowlist (src/lib/mlxcel-core/src/cache/turbo/allowlist.rs).
  • QuantAdvice gains a kv_cache_advice: Vec<KvCacheModeAdvice> field, populated in advise_quantization and rendered by print_quant_advice, so --recommend-quant surfaces the suggestions with no behavior change to any other caller.
  • docs/turbo-kv-cache.md documents the new --recommend-quant surface: the family/context keying, the conservative rule table, the advisory/opt-in framing, and the no-weight-promotion guarantee.
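
The context bucketing described above can be sketched as follows. This is a minimal illustration, not the code in kv_cache_advisor.rs: the `from_tokens` constructor is a hypothetical name, and the sketch assumes 4K means 4096 tokens, 32K means 32768 tokens, and that each upper bound is inclusive.

```rust
/// Context-length bucket used to key the advice (type name taken from the PR text).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvContextRange {
    Short,  // <= 4K tokens
    Medium, // 4K-32K tokens
    Long,   // > 32K tokens
}

impl KvContextRange {
    /// Hypothetical constructor: the real module may bucket differently.
    /// Assumes 4K = 4096 and 32K = 32768, with inclusive upper bounds.
    pub fn from_tokens(tokens: usize) -> Self {
        match tokens {
            0..=4096 => Self::Short,
            4097..=32768 => Self::Medium,
            _ => Self::Long,
        }
    }
}
```

Under these assumptions a 4096-token context is still "short", 4097 tokens is "medium", and anything past 32768 tokens is "long".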

No bf16-to-f16 quantized-weight promotion (the #289 landmine)

A recommendation is always one of the five KVCacheMode values. KV-cache modes quantize only the K/V cache tensors and dequantize back to FP16 for SDPA; they never touch model weights or their scales/biases. The advisor never calls into the weight-loading / sanitize.rs path, so it cannot reintroduce the bf16-to-f16 quantized-weight promotion that caused the #289 decode regression. Symmetric turbo4 is suggested only for families on the PPL allowlist, and MLA / pure-SSM families never receive a Walsh-Hadamard Turbo mode (their cache dimension is not a power of two, or is absent). A unit test enumerates every (architecture, family, range) combination and asserts that the output stays within the five allowed modes and is never turbo4-delegated.
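
A rough sketch of a rule table that satisfies these invariants is shown below. It is an illustration under stated assumptions, not the merged implementation: the enum names, the argument types, and the allowlist entry `example_allowlisted_family` are placeholders, and the treatment of sliding-window and hybrid families (handled like standard attention here) and of allowlisted families at medium context is a guess.

```rust
// Variant names are illustrative; not every variant is constructed in this sketch.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvArchKind { Standard, SlidingWindow, Mla, Hybrid, PureSsm }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvCacheMode { Fp16, Int8, Fp16Turbo4, Turbo4, Fp16Turbo3 }

#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum KvContextRange { Short, Medium, Long }

/// Stand-in for the PPL allowlist check; the entry below is a placeholder,
/// not a family taken from allowlist.rs.
pub fn is_symmetric_turbo_allowed(model_type: &str) -> bool {
    const ALLOWLIST: &[&str] = &["example_allowlisted_family"];
    ALLOWLIST.contains(&model_type)
}

/// Pure recommendation core: no I/O, no weight loading, only a mode suggestion.
pub fn recommend_kv_cache_mode(
    arch: KvArchKind,
    model_type: &str,
    range: KvContextRange,
) -> KvCacheMode {
    match (arch, range) {
        // Pure SSM has no K/V cache to quantize.
        (KvArchKind::PureSsm, _) => KvCacheMode::Fp16,
        // Short context: the memory savings do not justify any quality risk.
        (_, KvContextRange::Short) => KvCacheMode::Fp16,
        // MLA never receives a Walsh-Hadamard Turbo mode.
        (KvArchKind::Mla, _) => KvCacheMode::Int8,
        // Symmetric turbo4 only for allowlisted families at long context.
        (_, KvContextRange::Long) if is_symmetric_turbo_allowed(model_type) => {
            KvCacheMode::Turbo4
        }
        // Assumed default for the remaining cases: asymmetric Turbo4.
        _ => KvCacheMode::Fp16Turbo4,
    }
}

/// True for the modes that rely on the Walsh-Hadamard transform.
pub fn is_turbo(mode: KvCacheMode) -> bool {
    matches!(
        mode,
        KvCacheMode::Fp16Turbo4 | KvCacheMode::Turbo4 | KvCacheMode::Fp16Turbo3
    )
}
```

Because the function is pure and its input space is small, an exhaustive test over every (architecture, family, range) triple is cheap, which is what makes the invariants above practical to enforce.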

Test plan

  • cargo check -p mlxcel --lib --release --features metal,accelerate
  • cargo test --release -p mlxcel --lib --features metal,accelerate kv_cache_advisor:: (15 tests: range bucketing, short-context-is-fp16, symmetric-turbo4-only-for-allowlisted, MLA-uses-int8-never-turbo, pure-SSM-fp16, always-an-allowed-mode, default-unchanged-without-opt-in, advisory-framing render)
  • cargo clippy -p mlxcel --lib --tests --release --features metal,accelerate clean for the changed files
  • cargo fmt --check clean for the changed files
  • python3 scripts/ci/check_cross_repo_refs.py

Remaining acceptance-criteria items (recorded 4K/16K/32K sweeps across two dense and two MoE/VLM families, and the recorded per-family quality guard) are intentionally left open: they are the validation coverage that keeps the recommendations advisory. The orchestrator runs the real-model KV-mode smoke after merge.

Closes #327

Extend the quant/cache advisor so `mlxcel generate --recommend-quant` also suggests a KV-cache mode (fp16, int8, fp16+turbo4, turbo4, fp16+turbo3) per model family and context range. The suggestion is advisory and opt-in: the default inference path is unchanged (resolve_kv_cache_mode still returns fp16 with no flags) and nothing is auto-applied.

New module src/execution/kv_cache_advisor.rs holds the pure recommendation core (recommend_kv_cache_mode) plus config-driven helpers (advise_kv_cache_modes) and the printable advisory block. It keys on the KV-architecture class from src/execution/kv_arch.rs (standard, sliding-window, MLA, hybrid, pure SSM) and the raw model_type for the symmetric-Turbo4 PPL allowlist, bucketed by context range (short <=4K, medium 4K-32K, long >32K). Long context and memory-constrained serving are prioritized over raw short-decode tok/s.

QuantAdvice gains a kv_cache_advice field, populated in advise_quantization and printed by print_quant_advice, so --recommend-quant surfaces the suggestions with no behavior change elsewhere.

Safety: a recommendation is always one of the five KV-cache modes, which quantize only the K/V cache tensors and dequantize to FP16 for SDPA. They never touch model weights or their scales/biases, so they cannot reintroduce the bf16-to-f16 quantized-weight promotion from #289. Symmetric turbo4 is suggested only for allowlisted families; MLA and pure-SSM families never receive a Walsh-Hadamard Turbo mode. These invariants are covered by unit tests, including a default-unchanged test.

Docs: docs/turbo-kv-cache.md documents the advisor surface, its family/context keying, the advisory/opt-in framing, and the no-weight-promotion guarantee.
@inureyes added the status:review, type:performance, priority:backlog, area:core, and area:inference labels Jun 17, 2026
inureyes added 4 commits June 18, 2026 04:07
Split the inline `#[cfg(test)] mod tests` out of kv_cache_advisor.rs into kv_cache_advisor_tests.rs, wired with `#[path = "kv_cache_advisor_tests.rs"] mod tests;`. This follows the CLAUDE.md convention (unit tests in `_tests.rs` files beside the implementation) and keeps the implementation file under the 500-line limit. No code or test behavior changes; the tests still run as execution::kv_cache_advisor::tests::*.
… dims

TurboQuantParams::new and TurboQuantParams3::new assert that the attention head dimension is a power of two, panicking on families like Phi-2 where head_dim is 80. The advisor now reads the head dimension from config.json (explicit head_dim/head_size or derived from hidden_size/num_attention_heads) and, when the value is present and not a power of two, downgrades any Turbo (Walsh-Hadamard) suggestion to int8 or fp16 for that family, matching the existing treatment of MLA families. Two new unit tests cover the Phi-2 non-power-of-two path and the Llama power-of-two path.

The non-power-of-two Turbo guard in advise_kv_cache_modes derived the head dim via read_head_dim, which checked only hidden_size/d_model for the hidden size and num_attention_heads/num_heads for the head count. Alternate config naming used by supported families (OLMo and MPT-style d_model + n_heads) made read_head_dim return None, which defaults turbo_ok to true and bypasses the guard, even though kv_arch::attn_dims still derives the head dim and classifies the model as standard attention. OLMo avoids a live panic only because its head dim is 128; a non-power-of-two head dim under the same naming (for example MPT-30B at 112) would yield a Walsh-Hadamard Turbo suggestion that panics in TurboQuantParams::new if a user opts into it.

Broaden read_head_dim's hidden-size and head-count field lists to mirror kv_arch::attn_dims (adding dim, model_dim, n_heads, n_head) so the guard sees the same head dim the classifier used. The change is additive and strictly more conservative: power-of-two models are unaffected, and alternate-naming models with a non-power-of-two head dim now correctly downgrade to int8/fp16.

Add regression tests covering the non-power-of-two case (MPT-30B-style, head_dim 112, Turbo withheld) and the power-of-two case (OLMo-1B-style, head_dim 128, Turbo kept).
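
The head-dimension guard described in the two commits above can be sketched roughly as follows. This is a simplified illustration, not the merged code: a `HashMap<&str, u64>` stands in for the parsed config.json, the `first_present` helper is hypothetical, and the lookup order of the field names is an assumption. The field-name lists and the "unknown head dim defaults to allowing Turbo" behavior follow the commit messages.

```rust
use std::collections::HashMap;

// Field names listed in the commit messages; the lookup order is an assumption.
const HEAD_DIM_KEYS: &[&str] = &["head_dim", "head_size"];
const HIDDEN_KEYS: &[&str] = &["hidden_size", "d_model", "dim", "model_dim"];
const HEAD_COUNT_KEYS: &[&str] = &["num_attention_heads", "num_heads", "n_heads", "n_head"];

/// Hypothetical helper: returns the value of the first key present in the config.
fn first_present(cfg: &HashMap<&str, u64>, keys: &[&str]) -> Option<u64> {
    keys.iter().find_map(|k| cfg.get(*k).copied())
}

/// Explicit head dim if present, otherwise hidden size divided by head count.
pub fn read_head_dim(cfg: &HashMap<&str, u64>) -> Option<u64> {
    if let Some(dim) = first_present(cfg, HEAD_DIM_KEYS) {
        return Some(dim);
    }
    let hidden = first_present(cfg, HIDDEN_KEYS)?;
    let heads = first_present(cfg, HEAD_COUNT_KEYS)?;
    if heads == 0 {
        return None; // avoid dividing by zero on a malformed config
    }
    Some(hidden / heads)
}

/// Turbo (Walsh-Hadamard) modes need a power-of-two head dim.
/// An unknown head dim leaves Turbo allowed, as the commit message describes.
pub fn turbo_ok(cfg: &HashMap<&str, u64>) -> bool {
    read_head_dim(cfg).map_or(true, |dim| dim.is_power_of_two())
}
```

With these field lists, an MPT-30B-style config (d_model 7168, n_heads 64) resolves to a head dim of 112 and has Turbo withheld, while an OLMo-1B-style config (d_model 2048, n_heads 16) resolves to 128 and keeps Turbo.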

The head-dimension derivation in the advisor now covers the full set of alternate field names added in 0b24f0f (OLMo/MPT-style d_model + n_heads, etc.), mirroring the kv_arch classifier so both paths agree. The previous sentence in docs/turbo-kv-cache.md mentioned only hidden_size / num_attention_heads and was stale.
@inureyes

Member Author

PR Finalization

Tests

19 advisor unit tests pass. No gaps found; the existing suite covers context-range bucketing, per-family recommendations, allowlist gating, non-power-of-two head-dim downgrade, default-unchanged invariant, render output, and the alternate field-name guard added in 0b24f0f.

Documentation

docs/turbo-kv-cache.md had one stale sentence: the head-dimension derivation description mentioned only hidden_size / num_attention_heads, but 0b24f0f broadened the field-name coverage to also handle d_model / dim / model_dim and num_heads / n_heads / n_head. Updated to list the full set and note that it mirrors the kv_arch classifier. All four advisory requirements are present and accurate: the section describes --recommend-quant, marks the output as advisory/opt-in, states the default is unchanged, notes KV modes never touch model weights (no bf16->f16 weight promotion), and covers the non-power-of-two head-dim fallback.

No Korean mirror of this doc exists in the repo.

Lint / Format

cargo fmt --check clean. cargo clippy -p mlxcel --lib --tests --features metal,accelerate -- -D warnings exits 0 with no diagnostics.

Commit: d4a0445

@inureyes

Member Author

Real-model validation (orchestrator)

--recommend-quant smoke on M1 Ultra:

  • qwen2.5-7b-4bit (non-allowlisted dense): per-range KV advice prints. short = fp16; medium/long = fp16+turbo4 (also int8 / fp16+turbo3). Symmetric turbo4 is correctly withheld because the family is not on the PPL allowlist.
  • qwen3.5-4b-4bit (allowlisted): long = turbo4 (symmetric, ~73% KV savings), correctly differentiated by is_symmetric_turbo_allowed; fp16+turbo4 offered as the lower-risk fallback.
  • The output states the advice is advisory/opt-in and that the default inference path is unchanged (fp16).
  • Recommended modes load and generate coherently: qwen2.5-7b fp16+turbo4 OK, qwen3.5-4b turbo4 OK.
  • Default (no kv flag) unchanged: qwen2.5-7b runs fp16, coherent, 85 tok/s.

Reviewer and security-checker each fixed a MEDIUM head_dim guard (a non-power-of-2 head_dim family could otherwise be recommended a Turbo mode that panics on opt-in; now downgraded to int8/fp16, with the field-name set broadened to match the arch classifier). Finalizer corrected a stale doc sentence. 0 unresolved CRITICAL/HIGH. Acceptance criteria met (advisory/opt-in, no bf16->f16 quant-weight promotion, sound per-family/context recommendations, integrated into --recommend-quant).

@inureyes added the status:done label and removed the status:review label Jun 17, 2026
@inureyes merged commit 568766f into main Jun 17, 2026
5 checks passed
@inureyes deleted the perf/327-kv-cache-turbo-advisor branch June 17, 2026 20:06

Development

Successfully merging this pull request may close these issues.

perf(core): extend the quant/KV-cache advisor to recommend Turbo KV modes per family and context
