perf(advisor): recommend Turbo KV-cache modes per family and context #343
Conversation
Extend the quant/cache advisor so `mlxcel generate --recommend-quant` also suggests a KV-cache mode (fp16, int8, fp16+turbo4, turbo4, fp16+turbo3) per model family and context range. The suggestion is advisory and opt-in: the default inference path is unchanged (resolve_kv_cache_mode still returns fp16 with no flags) and nothing is auto-applied.

New module src/execution/kv_cache_advisor.rs holds the pure recommendation core (recommend_kv_cache_mode) plus config-driven helpers (advise_kv_cache_modes) and the printable advisory block. It keys on the KV-architecture class from src/execution/kv_arch.rs (standard, sliding-window, MLA, hybrid, pure SSM) and the raw model_type for the symmetric-Turbo4 PPL allowlist, bucketed by context range (short <=4K, medium 4K-32K, long >32K). Long context and memory-constrained serving are prioritized over raw short-decode tok/s.

QuantAdvice gains a kv_cache_advice field, populated in advise_quantization and printed by print_quant_advice, so --recommend-quant surfaces the suggestions with no behavior change elsewhere.

Safety: a recommendation is always one of the five KV-cache modes, which quantize only the K/V cache tensors and dequantize to FP16 for SDPA. They never touch model weights or their scales/biases, so they cannot reintroduce the bf16-to-f16 quantized-weight promotion from #289. Symmetric turbo4 is suggested only for allowlisted families; MLA and pure-SSM families never receive a Walsh-Hadamard Turbo mode. These invariants are covered by unit tests, including a default-unchanged test.

Docs: docs/turbo-kv-cache.md documents the advisor surface, its family/context keying, the advisory/opt-in framing, and the no-weight-promotion guarantee.
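For reviewers, the bucketing and rule order described in this commit can be sketched roughly as follows. This is a minimal illustration, not the code in kv_cache_advisor.rs: the names `Mode`, `Range`, `Arch`, `bucket`, and `recommend` are hypothetical, 4K and 32K are assumed to mean 4096 and 32768 tokens, and picking fp16+turbo4 for every non-allowlisted medium or long context is an assumption (the rule table documented in docs/turbo-kv-cache.md is the reference).

```rust
/// The five KV-cache modes named in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Fp16,
    Int8,
    Fp16Turbo4, // asymmetric Turbo4
    Turbo4,     // symmetric Turbo4
    Fp16Turbo3,
}

/// Context-range buckets: short <=4K, medium 4K-32K, long >32K.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Range {
    Short,
    Medium,
    Long,
}

/// KV-architecture classes listed in the commit message.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Arch {
    Standard,
    SlidingWindow,
    Mla,
    Hybrid,
    PureSsm,
}

/// Bucket a context length in tokens (assumes 4K = 4096, 32K = 32768).
fn bucket(context_tokens: usize) -> Range {
    match context_tokens {
        0..=4096 => Range::Short,
        4097..=32768 => Range::Medium,
        _ => Range::Long,
    }
}

/// Illustrative rule order. Only the first four arms reflect invariants
/// stated in the PR; the final fallback is an assumption of this sketch.
fn recommend(arch: Arch, allowlisted: bool, range: Range) -> Mode {
    match (arch, range) {
        // Pure SSM: stays on fp16.
        (Arch::PureSsm, _) => Mode::Fp16,
        // Short context: fp16 regardless of family.
        (_, Range::Short) => Mode::Fp16,
        // MLA: int8, never a Walsh-Hadamard Turbo mode.
        (Arch::Mla, _) => Mode::Int8,
        // Symmetric turbo4 only for families on the PPL allowlist.
        (_, Range::Long) if allowlisted => Mode::Turbo4,
        // Assumed fallback for everything else.
        _ => Mode::Fp16Turbo4,
    }
}
```

Because the function is pure, the invariants (short context is fp16, MLA never gets Turbo, symmetric turbo4 only when allowlisted) can be checked by enumerating every input combination, which is how the unit tests in this PR are described.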
Split the inline `#[cfg(test)] mod tests` out of kv_cache_advisor.rs into kv_cache_advisor_tests.rs, wired with `#[path = "kv_cache_advisor_tests.rs"] mod tests;`. This follows the CLAUDE.md convention (unit tests in `_tests.rs` files beside the implementation) and keeps the implementation file under the 500-line limit. No code or test behavior changes; the tests still run as execution::kv_cache_advisor::tests::*.
… dims

TurboQuantParams::new and TurboQuantParams3::new assert that the attention head dimension is a power of two, panicking on families like Phi-2 where head_dim is 80. The advisor now reads the head dimension from config.json (explicit head_dim/head_size or derived from hidden_size/num_attention_heads) and, when the value is present and not a power of two, downgrades any Turbo (Walsh-Hadamard) suggestion to int8 or fp16 for that family, matching the existing treatment of MLA families. Two new unit tests cover the Phi-2 non-power-of-two path and the Llama power-of-two path.
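The downgrade step can be pictured with a small sketch. The names `Mode`, `uses_walsh_hadamard`, and `guard_turbo` are hypothetical, and always falling back to int8 is an assumption made here for brevity; the commit only says the fallback is int8 or fp16.

```rust
/// The five KV-cache modes named in this PR.
#[allow(dead_code)]
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Mode {
    Fp16,
    Int8,
    Fp16Turbo4,
    Turbo4,
    Fp16Turbo3,
}

impl Mode {
    /// True for the Turbo modes, which rely on a Walsh-Hadamard transform
    /// and therefore need a power-of-two head dimension.
    fn uses_walsh_hadamard(self) -> bool {
        matches!(self, Mode::Fp16Turbo4 | Mode::Turbo4 | Mode::Fp16Turbo3)
    }
}

/// Withhold a Turbo suggestion when the head dimension is known and is
/// not a power of two. An unknown head dimension (None) leaves the
/// suggestion untouched, mirroring "present and not a power of two".
fn guard_turbo(suggested: Mode, head_dim: Option<u32>) -> Mode {
    let turbo_ok = head_dim.map_or(true, |d| d.is_power_of_two());
    if suggested.uses_walsh_hadamard() && !turbo_ok {
        Mode::Int8 // assumption: this sketch always picks int8
    } else {
        suggested
    }
}
```

With a Phi-2-style head_dim of 80 a Turbo suggestion is downgraded, while a head_dim of 128 passes through unchanged.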
The non-power-of-two Turbo guard in advise_kv_cache_modes derived the head dim via read_head_dim, which checked only hidden_size/d_model for the hidden size and num_attention_heads/num_heads for the head count. Alternate config naming used by supported families (OLMo and MPT-style d_model + n_heads) made read_head_dim return None, which defaults turbo_ok to true and bypasses the guard, even though kv_arch::attn_dims still derives the head dim and classifies the model as standard attention. OLMo avoids a live panic only because its head dim is 128; a non-power-of-two head dim under the same naming (for example MPT-30B at 112) would yield a Walsh-Hadamard Turbo suggestion that panics in TurboQuantParams::new if a user opts into it. Broaden read_head_dim's hidden-size and head-count field lists to mirror kv_arch::attn_dims (adding dim, model_dim, n_heads, n_head) so the guard sees the same head dim the classifier used. The change is additive and strictly more conservative: power-of-two models are unaffected, and alternate-naming models with a non-power-of-two head dim now correctly downgrade to int8/fp16. Add regression tests covering the non-power-of-two case (MPT-30B-style, head_dim 112, Turbo withheld) and the power-of-two case (OLMo-1B-style, head_dim 128, Turbo kept).
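A rough sketch of the broadened lookup, for illustration only. The real read_head_dim parses config.json; here the config is modeled as a map of integer fields, and the helper `first_present`, the constants, and the function signatures are assumptions of the sketch. The field names are the ones listed in this commit.

```rust
use std::collections::HashMap;

/// Hidden-size field names, including the alternates added in this commit.
const HIDDEN_KEYS: [&str; 4] = ["hidden_size", "d_model", "dim", "model_dim"];
/// Head-count field names, including the alternates added in this commit.
const HEAD_KEYS: [&str; 4] = ["num_attention_heads", "num_heads", "n_heads", "n_head"];

/// Return the value of the first key that is present in the config.
fn first_present(cfg: &HashMap<&str, u64>, keys: &[&str]) -> Option<u64> {
    keys.iter().find_map(|k| cfg.get(*k).copied())
}

/// Prefer an explicit head dimension, else derive hidden_size / heads.
fn read_head_dim(cfg: &HashMap<&str, u64>) -> Option<u64> {
    if let Some(d) = first_present(cfg, &["head_dim", "head_size"]) {
        return Some(d);
    }
    let hidden = first_present(cfg, &HIDDEN_KEYS)?;
    let heads = first_present(cfg, &HEAD_KEYS)?;
    if heads == 0 {
        return None; // avoid dividing by zero on a malformed config
    }
    Some(hidden / heads)
}

/// Turbo is allowed when the head dim is unknown or a power of two.
/// The None-defaults-to-true branch is why a missed field name used to
/// bypass the guard.
fn turbo_ok(cfg: &HashMap<&str, u64>) -> bool {
    read_head_dim(cfg).map_or(true, |d| d.is_power_of_two())
}
```

For an MPT-30B-style config (illustrative values: d_model 7168 with n_heads 64, giving the head dim of 112 mentioned above), the narrower field list would have returned None and let Turbo through; with the alternates included the guard sees 112 and withholds it.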
The head-dimension derivation in the advisor now covers the full set of alternate field names added in 0b24f0f (OLMo/MPT-style d_model + n_heads, etc.), mirroring the kv_arch classifier so both paths agree. The previous sentence only mentioned hidden_size / num_attention_heads and was stale.
PR Finalization

Tests
19 advisor unit tests pass. No gaps found; the existing suite covers context-range bucketing, per-family recommendations, allowlist gating, non-power-of-two head-dim downgrade, default-unchanged invariant, render output, and the alternate field-name guard added in 0b24f0f.

Documentation
No Korean mirror of this doc exists in the repo.

Lint / Format
Commit: d4a0445 |
Real-model validation (orchestrator)
Reviewer and security-checker each fixed a MEDIUM head_dim guard (a non-power-of-2 head_dim family could otherwise be recommended a Turbo mode that panics on opt-in; now downgraded to int8/fp16, with the field-name set broadened to match the arch classifier). Finalizer corrected a stale doc sentence. 0 unresolved CRITICAL/HIGH. Acceptance criteria met (advisory/opt-in, no bf16->f16 quant-weight promotion, sound per-family/context recommendations, integrated into |
Summary
Extend the existing quant/cache advisor so `mlxcel generate --recommend-quant` also emits an advisory KV-cache-mode suggestion per model family and context range, among `fp16`, `int8`, `fp16+turbo4` (asymmetric Turbo4), `turbo4` (symmetric), and `fp16+turbo3`. This is a P3 backlog item: the goal is to give operators a benchmark starting point for long-context and memory-constrained serving, not to change any default.

The recommendations are advisory and opt-in only. The default inference path is unchanged: with no flags `resolve_kv_cache_mode` still returns `fp16`, and nothing is auto-applied. The advisor reads only `config.json`, prints suggestions, and exits.

What changed
- `src/execution/kv_cache_advisor.rs`: the pure recommendation core `recommend_kv_cache_mode(arch_kind, model_type, range)`, the config-driven `advise_kv_cache_modes(model_path)`, the `KvContextRange` bucket (short `<=4K`, medium `4K-32K`, long `>32K`), the `KvCacheModeAdvice` value, and `render_kv_cache_advice` / `print_kv_cache_advice`.
- Recommendations key on the KV-architecture class from `src/execution/kv_arch.rs` (standard, sliding-window, MLA, hybrid, pure SSM) plus the raw `model_type` for the symmetric-Turbo4 PPL allowlist (`src/lib/mlxcel-core/src/cache/turbo/allowlist.rs`).
- `QuantAdvice` gains a `kv_cache_advice: Vec<KvCacheModeAdvice>` field, populated in `advise_quantization` and rendered by `print_quant_advice`, so `--recommend-quant` surfaces the suggestions with no behavior change to any other caller.
- `docs/turbo-kv-cache.md` documents the new `--recommend-quant` surface: the family/context keying, the conservative rule table, the advisory/opt-in framing, and the no-weight-promotion guarantee.

No bf16-to-f16 quantized-weight promotion (the #289 landmine)
A recommendation is always one of the five `KVCacheMode` values. KV-cache modes quantize only the K/V cache tensors and dequantize back to FP16 for SDPA; they never touch model weights or their scales/biases. The advisor never calls into the weight-loading / `sanitize.rs` path, so it cannot reintroduce the bf16-to-f16 quantized-weight promotion that caused the #289 decode regression. Symmetric `turbo4` is suggested only for families on the PPL allowlist, and MLA / pure-SSM families never receive a Walsh-Hadamard Turbo mode (their cache dimension is not a power of two, or absent). A unit test enumerates every (architecture, family, range) combination and asserts the output stays within the five allowed modes and is never `turbo4-delegated`.

Test plan
- `cargo check -p mlxcel --lib --release --features metal,accelerate`
- `cargo test --release -p mlxcel --lib --features metal,accelerate kv_cache_advisor::` (15 tests: range bucketing, short-context-is-fp16, symmetric-turbo4-only-for-allowlisted, MLA-uses-int8-never-turbo, pure-SSM-fp16, always-an-allowed-mode, default-unchanged-without-opt-in, advisory-framing render)
- `cargo clippy -p mlxcel --lib --tests --release --features metal,accelerate` clean for the changed files
- `cargo fmt --check` clean for the changed files
- `python3 scripts/ci/check_cross_repo_refs.py`

Remaining acceptance-criteria items (recorded 4K/16K/32K sweeps across two dense and two MoE/VLM families, and the recorded per-family quality guard) are intentionally left open: they are the validation coverage that keeps the recommendations advisory. The orchestrator runs the real-model KV-mode smoke after merge.
Closes #327