[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality#1052
Closed
OCWC22 wants to merge 6 commits into SemiAnalysisAI:main from
Conversation
Contributor
Pull request overview
This PR extends the ISB1 replay “consumer” surface with mechanism/quality-eval metadata, adds supporting registries + workflows, and introduces additional tooling/scripts for replay validation and KV stress experimentation across runners.
Changes:
- Add mechanism/quality-eval registries and workflow plumbing to run ISB1 mechanism sweeps and (intended) gate them via gate_isb1.py.
- Add runner/script-selection improvements (shared resolver) and multiple new single-node benchmark scripts for ISB1 replay/KV stress variants.
- Add a set of ISB1 ops/dev tools (producer/consumer export sync verifier, GMI sweep utilities, Pareto/sweep analysis scripts) plus docs and LFS-tracked export bundles.
Reviewed changes
Copilot reviewed 130 out of 134 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| utils/verify_producer_sync.py | New script to byte-compare producer vs consumer ISB1 export JSON subtrees. |
| utils/test_verify_producer_sync.py | Tests for producer/consumer sync verification behavior. |
| utils/test_summarize_isb1.py | Tests for ISB1 operator summary generation. |
| utils/test_process_result_isb1_mechanism.py | Subprocess integration tests ensuring mechanism_eval fields land in aggregated ISB1 results. |
| utils/test_process_result.py | Updates process_result tests (incl. ISB1 replay guards) and base env requirements. |
| utils/process_result.py | Adds guards to prevent ISB1 replay payloads/benchmarks from using the throughput processor. |
| runners/lib_single_node_script.sh | New helper to resolve the correct single-node benchmark script path (including ISB1 replay framework-specific scripts). |
| runners/launch_h200-nb.sh | Switches to using the shared script resolver. |
| runners/launch_h200-dgxc-slurm.sh | Switches to using the shared script resolver. |
| runners/launch_h200-cw.sh | Switches to using the shared script resolver. |
| runners/launch_h100-dgxc-slurm.sh | Switches to using the shared script resolver. |
| runners/launch_h100-cw.sh | Switches to using the shared script resolver. |
| runners/launch_h100-cr.sh | Switches to using the shared script resolver and expands env propagation for ISB1 replay. |
| runners/launch_b200-nb.sh | Switches to using the shared script resolver. |
| runners/launch_b200-dgxc.sh | Switches to using the shared script resolver and expands env propagation for ISB1 replay. |
| runners/launch_b200-dgxc-slurm.sh | Switches to using the shared script resolver. |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, H200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, B200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, H200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, B200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200, DSR1). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200, DSR1). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_h200.sh | New experimental LMCache-enabled vLLM launcher (H200). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_b200.sh | New experimental LMCache-enabled vLLM launcher (B200). |
| experimental/multiturn/vllm_benchmark/launch/README.md | Notes for experimental LMCache launch helpers. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/traces/.gitkeep | Placeholder to keep traces directory in repo. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/README.md | Placeholder README describing expected external kv-cache-tester population. |
| experimental/multiturn/vllm_benchmark/aiperf_traces/generate_aiperf_traces.py | Utility to generate synthetic AIPerf-style traces. |
| experimental/multiturn/vllm_benchmark/README.md | High-level README for experimental multiturn benchmark parity surface. |
| experimental/multiturn/vllm_benchmark/.gitignore | Ignore rules for experimental multiturn artifacts. |
| experimental/multiturn/README.md | Rewrites experimental multiturn notes with explicit “not official ISB1” boundary and pointers to canonical docs. |
| experimental/README.md | Clarifies experimental directory intent and points to official ISB1 support docs. |
| datasets/isb1/scripts/plot_pareto.py | New script to compute Pareto frontier (throughput vs p99 TTFT) from DB/JSON. |
| datasets/isb1/scripts/gpu_profile_collector.sh | New helper to poll nvidia-smi into a CSV until terminated. |
| datasets/isb1/scripts/gmi_test_matrix.sh | New curated GMI test-matrix runner. |
| datasets/isb1/scripts/gmi_kv_sweep.sh | New GMI KV stress sweep harness (users × offload modes). |
| datasets/isb1/scripts/gmi_full_suite.sh | New “full suite” runner for key ISB1 replay combinations. |
| datasets/isb1/scripts/generate_qwen35_low_band_exports.py | Generator to derive Qwen3.5 low-band exports from runnable GPT-OSS cells. |
| datasets/isb1/scripts/collect_sweep_results.py | Aggregates sweep results from SQLite or agg_*.json directory and emits CSV/JSON summaries. |
| datasets/isb1/scripts/analyze_benchmark_distributions.py | Analyzer for ISL/OSL/turn distributions from ISB1 exports or trace directories. |
| datasets/isb1/scripts/adapt_trace_replay_result.py | Adapter from kv-cache-tester CSV outputs into ISB1 replay JSON-like schema. |
| datasets/isb1/registry/quality_eval_registry.json | Registry of allowed quality-eval harness IDs and metadata. |
| datasets/isb1/registry/mechanism_variant_registry.json | Registry of allowed mechanism×variant pairs + mechanism sets. |
| datasets/isb1/exports/preview/long_context_500k/manifest_qwen3.5.json | New LFS-tracked preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_500k/manifest.json | New LFS-tracked preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/README.md | Documentation for bounded 500k preview lane claim boundary and consumer contract. |
| datasets/isb1/exports/preview/long_context_1m/manifest.json | New LFS-tracked 1m preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1.json | New LFS-tracked 1m preview export pointer. |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1.json | New LFS-tracked 1m preview export pointer. |
| datasets/isb1/exports/preview/long_context_1m/README.md | Documentation for gated 1m preview lane claim boundary. |
| datasets/isb1/exports/extension_64k/code_64k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/code_64k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/chat_64k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/chat_64k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/code_32k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/code_32k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/chat_32k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/chat_32k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/code_131k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/code_131k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k_dsr1.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/core/code_8k1k_qwen3.5.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/code_8k1k.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/chat_8k1k_qwen3.5.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/chat_8k1k.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/README.md | New ISB1 consumer package README (coverage, inventory, claim boundary). |
| datasets/isb1/GMI_EXECUTION_PLAN.md | New bare-metal execution plan/runbook for GMI. |
| datasets/isb1/COEXISTENCE_WITH_KV_CACHE_TESTER.md | New coexistence doc clarifying ISB1 vs kv-cache-tester roles. |
| datasets/isb1/.gitattributes | Adds linguist-generated marker and (currently conflicting) EOL/text attributes for exports. |
| benchmarks/single_node/qwen3.5triattn_fp8_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (Qwen3.5, H200). |
| benchmarks/single_node/qwen3.5triattn_fp8_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (Qwen3.5, H100). |
| benchmarks/single_node/qwen3.5_fp8_h200_vllm.sh | New/updated vLLM single-node script with ISB1 replay handling hooks. |
| benchmarks/single_node/qwen3.5_fp8_h200_sglang.sh | New/updated SGLang single-node script with ISB1 replay handling hooks. |
| benchmarks/single_node/qwen3.5_fp8_h100_vllm.sh | New vLLM single-node script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h100_sglang.sh | New SGLang single-node script (H100). |
| benchmarks/single_node/qwen3.5_fp8_b200_vllm.sh | New vLLM single-node script (B200). |
| benchmarks/single_node/qwen3.5_fp8_b200_sglang.sh | New SGLang single-node script (B200). |
| benchmarks/single_node/gptosstriattn_fp4_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (GPT-OSS, H200). |
| benchmarks/single_node/gptosstriattn_fp4_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (GPT-OSS, H100). |
| benchmarks/single_node/gptoss_fp4_h200_sglang.sh | New SGLang single-node script (GPT-OSS, H200). |
| benchmarks/single_node/gptoss_fp4_h200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_h100_sglang.sh | New SGLang single-node script (GPT-OSS, H100). |
| benchmarks/single_node/gptoss_fp4_h100.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_b200_sglang.sh | New SGLang single-node script (GPT-OSS, B200). |
| benchmarks/single_node/gptoss_fp4_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1triattn_fp8_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (DSR1, H200). |
| benchmarks/single_node/dsr1triattn_fp8_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (DSR1, H100). |
| benchmarks/single_node/dsr1_fp8_h200_vllm.sh | New vLLM single-node script (DSR1, H200). |
| benchmarks/single_node/dsr1_fp8_h200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp8_b200_vllm.sh | New vLLM single-node script (DSR1, B200). |
| benchmarks/single_node/dsr1_fp8_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp4_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| .github/workflows/run-isb1-mechanism-eval.yml | New workflow to dispatch ISB1 mechanism_eval sweeps. |
| .github/workflows/run-isb1-kv-stress-sweep.yml | New workflow to dispatch ISB1 KV stress sweeps. |
| .github/workflows/collect-results.yml | Adds ISB1 operator summary + gate report steps (but currently only for exact prefix isb1). |
| .github/configs/isb1-qwen-1m-preview.yaml | Adds gated/manual 1m preview config for Qwen3.5. |
| .github/configs/isb1-mechanism-fp8-kv.yaml | Adds FP8 KV mechanism_eval config (quality eval pending). |
| .github/configs/isb1-mechanism-baseline.yaml | Adds baseline mechanism_eval config entries. |
| .gitignore | Adds additional ignore patterns (.DS_Store, prompt-exports/, .claude). |
| .gitattributes | Configures Git LFS tracking for datasets/isb1 exports. |
…races

Add ISB1 (Inference Standard Benchmark, v1) — a multi-turn, long-context KV cache stress testing dataset targeting realistic production KV-cache pressure patterns.

## What this adds

35 synthetic multi-turn traces across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

Export bundles for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

10 KV stress sweep configs:
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Benchmark infrastructure

- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level
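The Pareto-frontier computation named above (throughput vs p99 TTFT at each concurrency level) reduces to a standard non-dominated-point scan. A minimal sketch — illustrative field names, not the actual plot_pareto.py code:

```python
def pareto_frontier(rows):
    """Return the non-dominated rows: higher throughput is better,
    lower p99 TTFT is better. Rows are dicts with 'throughput' and
    'p99_ttft' keys (illustrative names, not the real schema)."""
    # Sort by throughput descending (ties broken by TTFT ascending),
    # then keep a row only if its p99 TTFT beats every kept row so far.
    frontier = []
    best_ttft = float("inf")
    for row in sorted(rows, key=lambda r: (-r["throughput"], r["p99_ttft"])):
        if row["p99_ttft"] < best_ttft:
            frontier.append(row)
            best_ttft = row["p99_ttft"]
    return frontier
```

A concurrency point that is both slower and higher-latency than another (e.g. a 128-user cell dominated by a 64-user cell) drops off the frontier automatically.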
- Keep only configs whose (runtime, hardware, model) triples exist in the export files — eliminates sweep generator failures
- Fix canonical-model-id to match export metadata (e.g., gpt_oss_120b not gptoss)
- Fix support-status to match export tiers (reviewed_preview vs unsupported)
- Remove configs for engines/GPUs not yet in exports (SGLang, Dynamo, TRT, Atom, AMD) — these need export metadata updates before they can be added back
- Add workload-type field required by sweep generator schema
- Remove disagg/multinode fields not in KV stress schema

Sweep generator now passes: exit code 0, produces valid matrix rows.
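The pruning rule above — keep only configs whose (runtime, hardware, model) triple exists in the export metadata — is a set-membership filter. A hedged sketch with hypothetical key names, not the sweep generator's actual code:

```python
def filter_configs(configs, exports):
    """Drop sweep configs whose (runtime, hardware, model) triple has no
    matching cell in the export metadata. Key names are illustrative."""
    valid = {(e["runtime"], e["hardware"], e["model"]) for e in exports}
    return [
        c for c in configs
        if (c["runtime"], c["hardware"], c["model"]) in valid
    ]
```

Anything that survives the filter is guaranteed to resolve to a real export cell, which is what makes the generator exit 0.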
…mbos

Export metadata now includes all valid (runtime, hardware, model) triples from nvidia-master.yaml + amd-master.yaml:
- 8 runtimes: vllm, sglang, trt, atom, sglang-disagg, dynamo-*
- 9 GPU types: H100, H200, B200, B300, GB200, GB300, MI300X, MI325X, MI355X
- 6 models: DSR1, GPT-OSS, Qwen 3.5, GLM-5, Kimi K2.5, MiniMax M2.5

87 KV stress configs with correct canonical-model-id and support-status matching export metadata. Sweep generator passes (exit code 0). MI355X configs sweep to 512 concurrent users (288GB HBM advantage).
…2.0 manifests, prefix-aware replay

Final closure pass landing PR SemiAnalysisAI#1032 end-to-end across every (runtime, hardware, canonical-model) triple currently in the export metadata.

Sweep configs:
- Consolidate the sweep config under its canonical name isb1-kv-stress.yaml
- Rewrite isb1-master / isb1-triattn-preview / isb1-qwen-1m-preview: drop/demote dead stanzas, flatten paths (strip /vllm/ and /sglang/ subdirs and __vllm/__sglang suffixes), repoint qwen3.5 to _qwen3.5 basename
- isb1-master shrinks 1723 -> 863 lines (50 -> 26 stanzas); 1M preview drops the vllm stanza (sglang-only in reality)
- All produced rows resolve to real bundle cells at declared tier

Manifests: manifest_version 0.2.0 with single-bundle exports for preview/long_context_500k (gptoss + qwen3.5) and preview/long_context_1m.

Consumer replay (utils/bench_serving/benchmark_export_replay.py): hydrate v0.2.0 prefix-aware bundles — thin per-cell deltas join a shared workload prefix via prefix_ref, LRU-cached (max 8) across cells in the same bundle. Pre-0.2.0 bundles replay unchanged.

Producer-sync verifier (utils/verify_producer_sync.py): extend coverage to core + extension_32k + extension_64k; silently skip subtrees absent on both sides, report asymmetric ones.

Docs: coexistence and both preview READMEs updated with flat paths, canonical config name, and the sglang-only preview reality.

Tests: 262/262 pass across utils/ (107 sweep-config + new test_benchmark_export_replay.py for the prefix-aware consumer + test_verify_producer_sync.py for broadened verifier coverage).
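The prefix-aware hydration described above can be sketched as follows — an illustrative reconstruction, not the actual benchmark_export_replay.py code; the file layout and key names (`prefix_ref`, flat per-cell dicts) are assumptions:

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=8)  # mirrors the "LRU-cached (max 8)" behavior described above
def load_prefix(bundle_dir: str, prefix_ref: str) -> dict:
    """Read a shared workload prefix once per (bundle, ref); later cells
    in the same bundle hit the cache instead of re-reading the file."""
    return json.loads(Path(bundle_dir, prefix_ref).read_text())

def hydrate_cell(bundle_dir: str, cell: dict) -> dict:
    """Join a thin per-cell delta onto its shared prefix. Pre-0.2.0
    cells (no prefix_ref key) pass through unchanged."""
    ref = cell.get("prefix_ref")
    if ref is None:
        return cell
    prefix = load_prefix(bundle_dir, ref)
    # Cell fields override prefix fields; the pointer itself is dropped.
    return {**prefix, **{k: v for k, v in cell.items() if k != "prefix_ref"}}
```

The `lru_cache` key is the (bundle_dir, prefix_ref) pair, so distinct bundles never share cache slots.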
… clean support vocabulary

README.md:
- Remove dead links to docs removed in the prior cleanup pass (COVERAGE_AUDIT, LONG_CONTEXT_TRUTH_MATRIX, SUPPORT_MATRIX, RUNBOOKs, INVESTIGATION)
- Replace stale 50-export-files count with post-flatten per-subtree inventory (23 bundles + 3 manifests = 26 total, consolidating framework-specific variants into flat single files)
- Add explicit five-class support-status vocabulary section
- Keep safe/unsafe claim boundary

Coexistence doc:
- Strip planning/negotiation sections (Recommended PR Structure and maintainer-request list) — not coexistence-technical
- Update data-directory layout to show flat paths
- Update ISB1 workflow name to run-isb1-kv-stress-sweep.yml
- Add support-status vocabulary section

GMI_EXECUTION_PLAN.md:
- Prepend support-status framing (reviewed_preview, dataset_replay_verified, not live-serving certification)
- Fix stale nested paths to flat: extension_131k/vllm/ -> extension_131k/
- Fix preview bundle names: strip __vllm/__sglang suffixes
- Update final result-pipeline sentence to cite actual analyzer scripts
…istries + hard gate

Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled support_status=supported for a lossy technique unless a registered quality benchmark has completed. Follow-up to PR SemiAnalysisAI#1032.

All new fields default to NULL (mechanism defaults to "baseline") so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first connect_db().

New files:
- utils/mechanism_eval.py — env-driven field catalog (14 fields), registry loaders, validation helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json — 9 registered mechanism/variant pairs covering baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash.
- datasets/isb1/registry/quality_eval_registry.json — 4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval, math_500.
- .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run completes (the gate blocks promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing: (1) any non-baseline mechanism_variant must resolve in the registry; (2) quality_eval_status in {pending, completed, failed, not_required}; (3) supported + compression mechanism ⇒ quality_eval_status == completed with a registered quality_eval_id; (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- eagle3: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.
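The four gate rules enumerated above can be condensed into a single predicate along these lines — a simplified sketch, not the real gate_isb1.py implementation; the registry contents shown are an illustrative subset:

```python
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
REGISTERED_VARIANTS = {  # illustrative subset of mechanism_variant_registry.json
    ("kv_quantization", "fp8_e4m3"),
    ("speculative_decoding", "eagle3"),
}
REGISTERED_QUALITY_EVALS = {"ruler_v1", "longbench_v2", "humaneval", "math_500"}

def gate_mechanism_compression_quality(row: dict) -> list:
    """Return the list of failed criteria; an empty list means the gate passes."""
    failed = []
    mech = row.get("mechanism", "baseline")
    # (1) any non-baseline mechanism/variant must resolve in the registry
    if mech != "baseline" and (mech, row.get("mechanism_variant")) not in REGISTERED_VARIANTS:
        failed.append("mechanism_variant not registered")
    # (2) quality_eval_status must use the fixed vocabulary
    if row.get("quality_eval_status", "not_required") not in {
        "pending", "completed", "failed", "not_required"
    }:
        failed.append("invalid quality_eval_status")
    # (3) the hard rule: supported + lossy compression requires a completed, registered eval
    if row.get("support_status") == "supported" and mech in COMPRESSION_MECHANISMS:
        if (row.get("quality_eval_status") != "completed"
                or row.get("quality_eval_id") not in REGISTERED_QUALITY_EVALS):
            failed.append("supported+compression ⇒ quality_eval_status == completed")
    # (4) speculative decoding must report its draft model and acceptance rate
    if mech == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            failed.append("speculative_decoding requires draft_model_id + speculative_acceptance_rate")
    return failed
```

A baseline row trivially passes; an FP8-KV row marked supported without a completed RULER run trips rule (3), which is exactly the promotion the gate exists to block.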
Author
TL;DR
Adds a way for ISB1 replay rows to declare which optimization technique they exercise (baseline, FP8 KV quantization, KV compression, compressed attention, speculative decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled `supported` for a lossy technique unless a registered quality benchmark (RULER, LongBench v2, HumanEval, or MATH-500) has completed.

This is a schema-only, backward-compatible change. Existing ISB1 rows, configs, and result databases keep working unchanged — new fields default to `null` and legacy SQLite databases auto-migrate on first open.

Follow-up to #1032.
What is ISB1?
ISB1 (Inference Standard Benchmark, v1) is this repo's replay-based inference benchmark. It lives at `datasets/isb1/`. Each replay bundle captures a synthetic multi-turn trace (coding or chat workload, 8K–1M context band) together with enough metadata to faithfully re-run it against any vLLM/SGLang/Dynamo stack. Results go through:

- `utils/benchmark_export_replay.py` — replays the bundle against a running server
- `utils/process_result_isb1.py` — aggregates the raw result into a single JSON row
- `utils/gate_isb1.py` — evaluates advisory gates (control-lane health, 131K/500K/1M preview coverage)
- `datasets/isb1/scripts/isb1_results_db.py` — ingests rows into SQLite for cross-run analysis

All current ISB1 rows are implicitly "baseline" — no quantization, no compression, no speculative decoding. Once people start running the same benchmark with FP8 KV cache, KV compression, EAGLE-class drafts, etc., you need a way to tell those rows apart from baseline and to gate quality claims. That is what this PR adds.
Why this matters
Today, nothing in the ISB1 schema stops someone from publishing a throughput chart comparing a baseline run against a KV-quantized run as if they are on the same accuracy axis. The KV-quantized run is faster but can degrade chain-of-thought quality on reasoning models (see the References section below — this is a well-studied failure mode, not a hypothesis). Without a gate, a faster-but-worse config can silently be marked `supported` and cited in downstream material.

This PR closes that gap by requiring any `support_status=supported` row that claims a lossy technique (KV quantization / KV compression / compressed attention) to ship with a completed, registered accuracy benchmark. Rows with the technique applied but no completed benchmark stay at `reviewed_preview` and can't be cited as `supported` until the eval lands.

What changed, in plain language
1. Every ISB1 row now carries a `mechanism` field.

Defaults to `"baseline"`. Other values: `kv_quantization`, `kv_compression`, `compressed_attention`, `speculative_decoding`. A `mechanism_variant` sub-field names the specific technique (e.g. `fp8_e4m3`, `eagle3`).

2. Two JSON registries whitelist accepted values.

- `datasets/isb1/registry/mechanism_variant_registry.json` — the 9 mechanism/variant pairs we accept. Adding new ones requires a PR that amends the registry.
- `datasets/isb1/registry/quality_eval_registry.json` — the 4 accepted accuracy benchmarks (RULER v1 for long-context retrieval, LongBench v2 for long-context reasoning, HumanEval for code, MATH-500 for math).

Unregistered mechanism/variant pairs do not break the pipeline — they just fail the gate and show up in the gate report.
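For illustration, registry lookup can work along these lines — the JSON shape shown here is an assumption for the sketch, not the actual registry schema:

```python
import json

# Hypothetical registry shape — the real file's schema may differ.
MECHANISM_VARIANT_REGISTRY = json.loads("""
{
  "pairs": [
    {"mechanism": "baseline",             "variant": "baseline"},
    {"mechanism": "kv_quantization",      "variant": "fp8_e4m3"},
    {"mechanism": "speculative_decoding", "variant": "eagle3"}
  ]
}
""")

def is_registered(mechanism: str, variant: str) -> bool:
    """True iff the mechanism/variant pair appears in the registry.
    An unregistered pair doesn't crash anything — it just fails the gate."""
    return any(p["mechanism"] == mechanism and p["variant"] == variant
               for p in MECHANISM_VARIANT_REGISTRY["pairs"])
```

Because acceptance is a pure lookup, widening the vocabulary is a one-line registry PR rather than a code change.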
3. A new hard gate in `gate_isb1.py`: `mechanism_compression_quality`.

Four rules:

- `mechanism_variant` registered — any non-baseline row must resolve in the registry.
- `quality_eval_status` must be one of `pending | completed | failed | not_required`.
- `support_status=supported` + compression technique ⇒ `quality_eval_status=completed` with a registered `quality_eval_id`. This is the hard rule.
- `mechanism=speculative_decoding` ⇒ must carry `draft_model_id` and `speculative_acceptance_rate`.

4. SQLite schema gains 16 optional columns.
All default to `NULL`. Existing databases migrate in place on first `connect_db()` (idempotent `ALTER TABLE` pattern that matches how ISB1 already handles schema evolution).

5. Two example configs + one dispatch workflow.
- `.github/configs/isb1-mechanism-baseline.yaml` — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- `.github/configs/isb1-mechanism-fp8-kv.yaml` — same two cells with FP8 KV quantization, wired to `ruler_v1` and held at `reviewed_preview` until the RULER run lands (the gate will fail if you try to flip them to `supported` without a completed eval).
- `.github/workflows/run-isb1-mechanism-eval.yml` — dispatch workflow that routes mechanism configs through the existing `benchmark-isb1-tmpl.yml`.

Who this is for
If you are running ISB1 today and all you care about is baseline throughput, nothing changes — your configs continue to work and your rows are auto-classified as `mechanism=baseline`.

Worked example
An operator wants to publish an FP8-KV throughput number for DeepSeek-R1 on H100 at the `supported` tier.

1. Copy `.github/configs/isb1-mechanism-fp8-kv.yaml` and keep `mechanism: kv_quantization`, `mechanism_variant: fp8_e4m3`, `kv-cache-dtype: fp8`.
2. Set `support-status: supported` in the config.
3. Run `gate_isb1.py` on the aggregated result.
4. The `mechanism_compression_quality` gate fails with `failed_criteria: ["supported+compression ⇒ quality_eval_status == completed"]` because `quality_eval_status` is still `pending` and no RULER run has landed.
5. The operator runs RULER, records `quality_eval_status: completed` and the quality delta, and re-runs the gate. It passes.

Without this PR, step 4 silently succeeds and the row ships as `supported`.

Complete file list
New files (7):
utils/mechanism_eval.py— env-driven field catalog, registry loaders, validatorsdatasets/isb1/registry/mechanism_variant_registry.jsondatasets/isb1/registry/quality_eval_registry.json.github/configs/isb1-mechanism-baseline.yaml.github/configs/isb1-mechanism-fp8-kv.yaml.github/workflows/run-isb1-mechanism-eval.ymlutils/test_mechanism_eval.py(13 tests)utils/test_process_result_isb1_mechanism.py(3 subprocess integration tests)Extended files (4):
- `utils/process_result_isb1.py` — emits 14 mechanism fields + a `mechanism_eval_validation` record.
- `utils/gate_isb1.py` — new `mechanism_compression_quality` gate.
- `datasets/isb1/scripts/isb1_results_db.py` — 16 additive ALTER TABLE migrations + matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, CLI ingest flags.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests.

Total: +1514 / −1 lines, 12 files.
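The additive migration pattern referenced above can be sketched as follows. This is a sketch, not the actual implementation: the column subset, defaults, and table name are illustrative assumptions, while the real migration in `datasets/isb1/scripts/isb1_results_db.py` covers all 16 mechanism fields.

```python
import sqlite3

# Illustrative subset of the additive columns; the real migration in
# isb1_results_db.py covers all 16 mechanism_eval fields.
MECHANISM_COLUMNS = {
    "mechanism": "TEXT DEFAULT 'baseline'",
    "mechanism_variant": "TEXT",
    "quality_eval_id": "TEXT",
    "quality_eval_status": "TEXT",
}

def migrate(conn: sqlite3.Connection, table: str = "results") -> None:
    """Add any missing columns; safe to call on every connect."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, decl in MECHANISM_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {decl}")
    conn.commit()
```

Calling this on every database open is what makes the migration idempotent: columns already present are simply skipped, so old and new databases converge on the same schema.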
### Backward compatibility notes
- New env vars (`MECHANISM`, `QUALITY_EVAL_ID`) are all optional. Unset = `null` (except `MECHANISM`, which defaults to `"baseline"`).
- Registry load failures degrade gracefully (`"registry_load_error: ..."` appears in the validation record's `issues` list) rather than breaking the pipeline.

### Claim boundary
- Rows produced here keep `benchmark_certification_status=dataset_replay_verified`.
- Nothing in this PR touches `live_benchmark_certification`.
- This PR does not claim the `supported` tier for a compression mechanism — the included FP8-KV configs stay at `reviewed_preview` specifically because RULER has not run yet. The gate is the mechanism that keeps them there.

### References — what we read for KV cache and compression
The mechanism and quality-eval registries are grounded in public research. Each registered value in `mechanism_variant_registry.json` and `quality_eval_registry.json` maps to one of the sources below.

#### KV cache quantization (`mechanism: kv_quantization`)

- `fp8_e4m3` — Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA / Intel / Arm, 2022), arXiv:2209.05433. Defines the E4M3 / E5M2 formats used by the engine-native FP8 KV paths in vLLM and SGLang.
- `turboquant_class` — umbrella slot for Hadamard-rotated 4-bit KV schemes in the TurboQuant / KVQuant lineage (e.g. Hooper et al., "KVQuant", 2024, arXiv:2401.18079). Operators submitting rows in this class cite their specific implementation in `mechanism_notes`.

#### KV cache compression (`mechanism: kv_compression`)

- `kvtc_class` — umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; the exact paper citation travels with each submitted row.

#### Compressed attention (`mechanism: compressed_attention`)

- `triattention_class` — umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format. Operator-submitted rows carry the specific implementation citation.

#### Speculative decoding (`mechanism: speculative_decoding`)

- `mtp` — Multi-Token Prediction head, as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- `eagle3` — EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- `medusa` — Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- `dflash` — umbrella slot for DeepFlash-style draft stacks; the specific citation travels with each submitted row.

#### Quality benchmarks (`quality_eval_registry.json`)

- `ruler_v1` — Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- `longbench_v2` — Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- `humaneval` — Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374. Code-generation accuracy under compression.
- `math_500` — 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Math reasoning — detects chain-of-thought degradation from aggressive KV quantization, which is the specific failure mode the hard gate is designed to catch.

### Test plan
- `utils/test_mechanism_eval.py` — 13 tests covering env parsing, numeric coercion, registry validation (registered / unregistered / bad status / missing draft), baseline pass-through, registry file validity, the hard-rule predicate matrix, idempotent migrations, and the legacy-schema upgrade path.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests (baseline trivially passes; supported+FP8 without eval fails; reviewed_preview+FP8 passes; supported+FP8 with a completed registered eval passes; unregistered variant fails; speculative without draft fails; speculative with full fields passes).
- `utils/test_process_result_isb1_mechanism.py` — 3 subprocess integration tests verifying env-driven defaults, registered FP8-KV field surfacing, and unregistered-variant flagging.
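The hard rule those gate tests exercise reduces to a small predicate. The following is a sketch under assumed row-field names, not the actual `utils/gate_isb1.py` implementation, and it deliberately omits the registry and speculative-draft checks the full gate also performs:

```python
# Mechanisms that change what is stored/computed and therefore require a
# completed quality eval before a row can claim the supported tier.
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}

def mechanism_compression_quality(row: dict) -> list[str]:
    """Return failed_criteria for the hard rule (empty list = gate passes).

    Sketch only: field names are assumptions, and the real gate also
    validates the variant against the registry and checks draft fields
    for speculative decoding.
    """
    failed = []
    if (
        row.get("support_status") == "supported"
        and row.get("mechanism", "baseline") in COMPRESSION_MECHANISMS
        and row.get("quality_eval_status") != "completed"
    ):
        failed.append("supported+compression ⇒ quality_eval_status == completed")
    return failed
```

Note the default of `"baseline"` for a missing `mechanism` field: that is what lets existing throughput-only configs pass the gate trivially, which is the backward-compatibility behavior the baseline test asserts.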