[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality#1052
Closed
OCWC22 wants to merge 6 commits into SemiAnalysisAI:main from
Conversation
Contributor
Pull request overview
This PR extends the ISB1 replay “consumer” surface with mechanism/quality-eval metadata, adds supporting registries + workflows, and introduces additional tooling/scripts for replay validation and KV stress experimentation across runners.
Changes:
- Add mechanism/quality-eval registries and workflow plumbing to run ISB1 mechanism sweeps and (intended) gate them via gate_isb1.py.
- Add runner/script-selection improvements (shared resolver) and multiple new single-node benchmark scripts for ISB1 replay/KV stress variants.
- Add a set of ISB1 ops/dev tools (producer/consumer export sync verifier, GMI sweep utilities, Pareto/sweep analysis scripts) plus docs and LFS-tracked export bundles.
Reviewed changes
Copilot reviewed 130 out of 134 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| utils/verify_producer_sync.py | New script to byte-compare producer vs consumer ISB1 export JSON subtrees. |
| utils/test_verify_producer_sync.py | Tests for producer/consumer sync verification behavior. |
| utils/test_summarize_isb1.py | Tests for ISB1 operator summary generation. |
| utils/test_process_result_isb1_mechanism.py | Subprocess integration tests ensuring mechanism_eval fields land in aggregated ISB1 results. |
| utils/test_process_result.py | Updates process_result tests (incl. ISB1 replay guards) and base env requirements. |
| utils/process_result.py | Adds guards to prevent ISB1 replay payloads/benchmarks from using the throughput processor. |
| runners/lib_single_node_script.sh | New helper to resolve the correct single-node benchmark script path (including ISB1 replay framework-specific scripts). |
| runners/launch_h200-nb.sh | Switches to using the shared script resolver. |
| runners/launch_h200-dgxc-slurm.sh | Switches to using the shared script resolver. |
| runners/launch_h200-cw.sh | Switches to using the shared script resolver. |
| runners/launch_h100-dgxc-slurm.sh | Switches to using the shared script resolver. |
| runners/launch_h100-cw.sh | Switches to using the shared script resolver. |
| runners/launch_h100-cr.sh | Switches to using the shared script resolver and expands env propagation for ISB1 replay. |
| runners/launch_b200-nb.sh | Switches to using the shared script resolver. |
| runners/launch_b200-dgxc.sh | Switches to using the shared script resolver and expands env propagation for ISB1 replay. |
| runners/launch_b200-dgxc-slurm.sh | Switches to using the shared script resolver. |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, H200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, B200). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, H200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_sglang.sh | New experimental kv-cache-tester trace replay runner (SGLang, B200, GPT-OSS). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_h200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, H200, DSR1). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_b200_vllm.sh | New experimental kv-cache-tester trace replay runner (vLLM, B200, DSR1). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_h200.sh | New experimental LMCache-enabled vLLM launcher (H200). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_b200.sh | New experimental LMCache-enabled vLLM launcher (B200). |
| experimental/multiturn/vllm_benchmark/launch/README.md | Notes for experimental LMCache launch helpers. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/traces/.gitkeep | Placeholder to keep traces directory in repo. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/README.md | Placeholder README describing expected external kv-cache-tester population. |
| experimental/multiturn/vllm_benchmark/aiperf_traces/generate_aiperf_traces.py | Utility to generate synthetic AIPerf-style traces. |
| experimental/multiturn/vllm_benchmark/README.md | High-level README for experimental multiturn benchmark parity surface. |
| experimental/multiturn/vllm_benchmark/.gitignore | Ignore rules for experimental multiturn artifacts. |
| experimental/multiturn/README.md | Rewrites experimental multiturn notes with explicit “not official ISB1” boundary and pointers to canonical docs. |
| experimental/README.md | Clarifies experimental directory intent and points to official ISB1 support docs. |
| datasets/isb1/scripts/plot_pareto.py | New script to compute Pareto frontier (throughput vs p99 TTFT) from DB/JSON. |
| datasets/isb1/scripts/gpu_profile_collector.sh | New helper to poll nvidia-smi into a CSV until terminated. |
| datasets/isb1/scripts/gmi_test_matrix.sh | New curated GMI test-matrix runner. |
| datasets/isb1/scripts/gmi_kv_sweep.sh | New GMI KV stress sweep harness (users × offload modes). |
| datasets/isb1/scripts/gmi_full_suite.sh | New “full suite” runner for key ISB1 replay combinations. |
| datasets/isb1/scripts/generate_qwen35_low_band_exports.py | Generator to derive Qwen3.5 low-band exports from runnable GPT-OSS cells. |
| datasets/isb1/scripts/collect_sweep_results.py | Aggregates sweep results from SQLite or agg_*.json directory and emits CSV/JSON summaries. |
| datasets/isb1/scripts/analyze_benchmark_distributions.py | Analyzer for ISL/OSL/turn distributions from ISB1 exports or trace directories. |
| datasets/isb1/scripts/adapt_trace_replay_result.py | Adapter from kv-cache-tester CSV outputs into ISB1 replay JSON-like schema. |
| datasets/isb1/registry/quality_eval_registry.json | Registry of allowed quality-eval harness IDs and metadata. |
| datasets/isb1/registry/mechanism_variant_registry.json | Registry of allowed mechanism×variant pairs + mechanism sets. |
| datasets/isb1/exports/preview/long_context_500k/manifest_qwen3.5.json | New LFS-tracked preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_500k/manifest.json | New LFS-tracked preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1.json | New LFS-tracked 500k preview export pointer. |
| datasets/isb1/exports/preview/long_context_500k/README.md | Documentation for bounded 500k preview lane claim boundary and consumer contract. |
| datasets/isb1/exports/preview/long_context_1m/manifest.json | New LFS-tracked 1m preview manifest pointer. |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1.json | New LFS-tracked 1m preview export pointer. |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1.json | New LFS-tracked 1m preview export pointer. |
| datasets/isb1/exports/preview/long_context_1m/README.md | Documentation for gated 1m preview lane claim boundary. |
| datasets/isb1/exports/extension_64k/code_64k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/code_64k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/chat_64k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_64k/chat_64k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/code_32k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/code_32k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/chat_32k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_32k/chat_32k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/code_131k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/code_131k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k_qwen3.5.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k_dsr1.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/extension_131k/chat_131k1k.json | New/updated LFS-tracked extension export pointer. |
| datasets/isb1/exports/core/code_8k1k_qwen3.5.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/code_8k1k.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/chat_8k1k_qwen3.5.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/exports/core/chat_8k1k.json | New/updated LFS-tracked core export pointer. |
| datasets/isb1/README.md | New ISB1 consumer package README (coverage, inventory, claim boundary). |
| datasets/isb1/GMI_EXECUTION_PLAN.md | New bare-metal execution plan/runbook for GMI. |
| datasets/isb1/COEXISTENCE_WITH_KV_CACHE_TESTER.md | New coexistence doc clarifying ISB1 vs kv-cache-tester roles. |
| datasets/isb1/.gitattributes | Adds linguist-generated marker and (currently conflicting) EOL/text attributes for exports. |
| benchmarks/single_node/qwen3.5triattn_fp8_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (Qwen3.5, H200). |
| benchmarks/single_node/qwen3.5triattn_fp8_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (Qwen3.5, H100). |
| benchmarks/single_node/qwen3.5_fp8_h200_vllm.sh | New/updated vLLM single-node script with ISB1 replay handling hooks. |
| benchmarks/single_node/qwen3.5_fp8_h200_sglang.sh | New/updated SGLang single-node script with ISB1 replay handling hooks. |
| benchmarks/single_node/qwen3.5_fp8_h100_vllm.sh | New vLLM single-node script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h100_sglang.sh | New SGLang single-node script (H100). |
| benchmarks/single_node/qwen3.5_fp8_b200_vllm.sh | New vLLM single-node script (B200). |
| benchmarks/single_node/qwen3.5_fp8_b200_sglang.sh | New SGLang single-node script (B200). |
| benchmarks/single_node/gptosstriattn_fp4_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (GPT-OSS, H200). |
| benchmarks/single_node/gptosstriattn_fp4_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (GPT-OSS, H100). |
| benchmarks/single_node/gptoss_fp4_h200_sglang.sh | New SGLang single-node script (GPT-OSS, H200). |
| benchmarks/single_node/gptoss_fp4_h200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_h100_sglang.sh | New SGLang single-node script (GPT-OSS, H100). |
| benchmarks/single_node/gptoss_fp4_h100.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_b200_sglang.sh | New SGLang single-node script (GPT-OSS, B200). |
| benchmarks/single_node/gptoss_fp4_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1triattn_fp8_h200_vllm.sh | New TriAttention-enabled vLLM benchmark script (DSR1, H200). |
| benchmarks/single_node/dsr1triattn_fp8_h100_vllm.sh | New TriAttention-enabled vLLM benchmark script (DSR1, H100). |
| benchmarks/single_node/dsr1_fp8_h200_vllm.sh | New vLLM single-node script (DSR1, H200). |
| benchmarks/single_node/dsr1_fp8_h200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp8_b200_vllm.sh | New vLLM single-node script (DSR1, B200). |
| benchmarks/single_node/dsr1_fp8_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp4_b200.sh | Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark. |
| .github/workflows/run-isb1-mechanism-eval.yml | New workflow to dispatch ISB1 mechanism_eval sweeps. |
| .github/workflows/run-isb1-kv-stress-sweep.yml | New workflow to dispatch ISB1 KV stress sweeps. |
| .github/workflows/collect-results.yml | Adds ISB1 operator summary + gate report steps (but currently only for exact prefix isb1). |
| .github/configs/isb1-qwen-1m-preview.yaml | Adds gated/manual 1m preview config for Qwen3.5. |
| .github/configs/isb1-mechanism-fp8-kv.yaml | Adds FP8 KV mechanism_eval config (quality eval pending). |
| .github/configs/isb1-mechanism-baseline.yaml | Adds baseline mechanism_eval config entries. |
| .gitignore | Adds additional ignore patterns (.DS_Store, prompt-exports/, .claude). |
| .gitattributes | Configures Git LFS tracking for datasets/isb1 exports. |
…races

Add ISB1 (Inference Standard Benchmark, v1) — a multi-turn, long-context KV cache stress testing dataset targeting realistic production KV-cache pressure patterns.

## What this adds

35 synthetic multi-turn traces across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

Export bundles for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

10 KV stress sweep configs:
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Benchmark infrastructure

- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level
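The Pareto-frontier computation named above (throughput vs p99 TTFT at each concurrency level) reduces to a standard non-dominated-point scan. A minimal sketch — illustrative field names, not the actual plot_pareto.py code:

```python
def pareto_frontier(rows):
    """Return the non-dominated rows: higher throughput is better,
    lower p99 TTFT is better. Rows are dicts with 'throughput' and
    'p99_ttft' keys (illustrative names, not the real schema)."""
    # Sort by throughput descending (ties broken by TTFT ascending),
    # then keep a row only if its p99 TTFT beats every kept row so far.
    frontier = []
    best_ttft = float("inf")
    for row in sorted(rows, key=lambda r: (-r["throughput"], r["p99_ttft"])):
        if row["p99_ttft"] < best_ttft:
            frontier.append(row)
            best_ttft = row["p99_ttft"]
    return frontier
```

A concurrency point that is both slower and higher-latency than another (e.g. a 128-user cell dominated by a 64-user cell) drops off the frontier automatically.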
- Keep only configs whose (runtime, hardware, model) triples exist in the export files — eliminates sweep generator failures
- Fix canonical-model-id to match export metadata (e.g., gpt_oss_120b not gptoss)
- Fix support-status to match export tiers (reviewed_preview vs unsupported)
- Remove configs for engines/GPUs not yet in exports (SGLang, Dynamo, TRT, Atom, AMD) — these need export metadata updates before they can be added back
- Add workload-type field required by sweep generator schema
- Remove disagg/multinode fields not in KV stress schema

Sweep generator now passes: exit code 0, produces valid matrix rows.
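The pruning rule above — keep only configs whose (runtime, hardware, model) triple exists in the export metadata — is a set-membership filter. A hedged sketch with hypothetical key names, not the sweep generator's actual code:

```python
def filter_configs(configs, exports):
    """Drop sweep configs whose (runtime, hardware, model) triple has no
    matching cell in the export metadata. Key names are illustrative."""
    valid = {(e["runtime"], e["hardware"], e["model"]) for e in exports}
    return [
        c for c in configs
        if (c["runtime"], c["hardware"], c["model"]) in valid
    ]
```

Anything that survives the filter is guaranteed to resolve to a real export cell, which is what makes the generator exit 0.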
…mbos

Export metadata now includes all valid (runtime, hardware, model) triples from nvidia-master.yaml + amd-master.yaml:
- 8 runtimes: vllm, sglang, trt, atom, sglang-disagg, dynamo-*
- 9 GPU types: H100, H200, B200, B300, GB200, GB300, MI300X, MI325X, MI355X
- 6 models: DSR1, GPT-OSS, Qwen 3.5, GLM-5, Kimi K2.5, MiniMax M2.5

87 KV stress configs with correct canonical-model-id and support-status matching export metadata. Sweep generator passes (exit code 0). MI355X configs sweep to 512 concurrent users (288GB HBM advantage).
…2.0 manifests, prefix-aware replay

Final closure pass landing PR SemiAnalysisAI#1032 end-to-end across every (runtime, hardware, canonical-model) triple currently in the export metadata.

Sweep configs:
- Consolidate the sweep config under its canonical name isb1-kv-stress.yaml
- Rewrite isb1-master / isb1-triattn-preview / isb1-qwen-1m-preview: drop/demote dead stanzas, flatten paths (strip /vllm/ and /sglang/ subdirs and __vllm/__sglang suffixes), repoint qwen3.5 to _qwen3.5 basename
- isb1-master shrinks 1723 -> 863 lines (50 -> 26 stanzas); 1M preview drops the vllm stanza (sglang-only in reality)
- All produced rows resolve to real bundle cells at declared tier

Manifests: manifest_version 0.2.0 with single-bundle exports for preview/long_context_500k (gptoss + qwen3.5) and preview/long_context_1m.

Consumer replay (utils/bench_serving/benchmark_export_replay.py): hydrate v0.2.0 prefix-aware bundles — thin per-cell deltas join a shared workload prefix via prefix_ref, LRU-cached (max 8) across cells in the same bundle. Pre-0.2.0 bundles replay unchanged.

Producer-sync verifier (utils/verify_producer_sync.py): extend coverage to core + extension_32k + extension_64k; silently skip subtrees absent on both sides, report asymmetric ones.

Docs: coexistence and both preview READMEs updated with flat paths, canonical config name, and the sglang-only preview reality.

Tests: 262/262 pass across utils/ (107 sweep-config + new test_benchmark_export_replay.py for the prefix-aware consumer + test_verify_producer_sync.py for broadened verifier coverage).
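The prefix-aware hydration described above can be sketched as follows — an illustrative reconstruction, not the actual benchmark_export_replay.py code; the file layout and key names (`prefix_ref`, flat per-cell dicts) are assumptions:

```python
import json
from functools import lru_cache
from pathlib import Path

@lru_cache(maxsize=8)  # mirrors the "LRU-cached (max 8)" behavior described above
def load_prefix(bundle_dir: str, prefix_ref: str) -> dict:
    """Read a shared workload prefix once per (bundle, ref); later cells
    in the same bundle hit the cache instead of re-reading the file."""
    return json.loads(Path(bundle_dir, prefix_ref).read_text())

def hydrate_cell(bundle_dir: str, cell: dict) -> dict:
    """Join a thin per-cell delta onto its shared prefix. Pre-0.2.0
    cells (no prefix_ref key) pass through unchanged."""
    ref = cell.get("prefix_ref")
    if ref is None:
        return cell
    prefix = load_prefix(bundle_dir, ref)
    # Cell fields override prefix fields; the pointer itself is dropped.
    return {**prefix, **{k: v for k, v in cell.items() if k != "prefix_ref"}}
```

The `lru_cache` key is the (bundle_dir, prefix_ref) pair, so distinct bundles never share cache slots.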
… clean support vocabulary

README.md:
- Remove dead links to docs removed in the prior cleanup pass (COVERAGE_AUDIT, LONG_CONTEXT_TRUTH_MATRIX, SUPPORT_MATRIX, RUNBOOKs, INVESTIGATION)
- Replace stale 50-export-files count with post-flatten per-subtree inventory (23 bundles + 3 manifests = 26 total, consolidating framework-specific variants into flat single files)
- Add explicit five-class support-status vocabulary section
- Keep safe/unsafe claim boundary

Coexistence doc:
- Strip planning/negotiation sections (Recommended PR Structure and maintainer-request list) — not coexistence-technical
- Update data-directory layout to show flat paths
- Update ISB1 workflow name to run-isb1-kv-stress-sweep.yml
- Add support-status vocabulary section

GMI_EXECUTION_PLAN.md:
- Prepend support-status framing (reviewed_preview, dataset_replay_verified, not live-serving certification)
- Fix stale nested paths to flat: extension_131k/vllm/ -> extension_131k/
- Fix preview bundle names: strip __vllm/__sglang suffixes
- Update final result-pipeline sentence to cite actual analyzer scripts
…istries + hard gate

Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled support_status=supported for a lossy technique unless a registered quality benchmark has completed. Follow-up to PR SemiAnalysisAI#1032.

All new fields default to NULL (mechanism defaults to "baseline") so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first connect_db().

New files:
- utils/mechanism_eval.py — env-driven field catalog (14 fields), registry loaders, validation helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json — 9 registered mechanism/variant pairs covering baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash.
- datasets/isb1/registry/quality_eval_registry.json — 4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval, math_500.
- .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run completes (the gate blocks promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing: (1) any non-baseline mechanism_variant must resolve in the registry; (2) quality_eval_status in {pending, completed, failed, not_required}; (3) supported + compression mechanism ⇒ quality_eval_status == completed with a registered quality_eval_id; (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- eagle3: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.
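The four gate rules enumerated above can be condensed into a single predicate along these lines — a simplified sketch, not the real gate_isb1.py implementation; the registry contents shown are an illustrative subset:

```python
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
REGISTERED_VARIANTS = {  # illustrative subset of mechanism_variant_registry.json
    ("kv_quantization", "fp8_e4m3"),
    ("speculative_decoding", "eagle3"),
}
REGISTERED_QUALITY_EVALS = {"ruler_v1", "longbench_v2", "humaneval", "math_500"}

def gate_mechanism_compression_quality(row: dict) -> list:
    """Return the list of failed criteria; an empty list means the gate passes."""
    failed = []
    mech = row.get("mechanism", "baseline")
    # (1) any non-baseline mechanism/variant must resolve in the registry
    if mech != "baseline" and (mech, row.get("mechanism_variant")) not in REGISTERED_VARIANTS:
        failed.append("mechanism_variant not registered")
    # (2) quality_eval_status must use the fixed vocabulary
    if row.get("quality_eval_status", "not_required") not in {
        "pending", "completed", "failed", "not_required"
    }:
        failed.append("invalid quality_eval_status")
    # (3) the hard rule: supported + lossy compression requires a completed, registered eval
    if row.get("support_status") == "supported" and mech in COMPRESSION_MECHANISMS:
        if (row.get("quality_eval_status") != "completed"
                or row.get("quality_eval_id") not in REGISTERED_QUALITY_EVALS):
            failed.append("supported+compression ⇒ quality_eval_status == completed")
    # (4) speculative decoding must report its draft model and acceptance rate
    if mech == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            failed.append("speculative_decoding requires draft_model_id + speculative_acceptance_rate")
    return failed
```

A baseline row trivially passes; an FP8-KV row marked supported without a completed RULER run trips rule (3), which is exactly the promotion the gate exists to block.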
Author
TL;DR
Adds a way for ISB1 replay rows to declare which optimization technique they exercise (baseline, FP8 KV quantization, KV compression, compressed attention, speculative decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled `supported` for a lossy technique unless a registered quality benchmark (RULER, LongBench v2, HumanEval, or MATH-500) has completed.

This is a schema-only, backward-compatible change. Existing ISB1 rows, configs, and result databases keep working unchanged — new fields default to `null` and legacy SQLite databases auto-migrate on first open.

Follow-up to #1032.
What is ISB1?
ISB1 (Inference Standard Benchmark, v1) is this repo's replay-based inference benchmark. It lives at `datasets/isb1/`. Each replay bundle captures a synthetic multi-turn trace (coding or chat workload, 8K–1M context band) together with enough metadata to faithfully re-run it against any vLLM/SGLang/Dynamo stack. Results go through:

- `utils/benchmark_export_replay.py` — replays the bundle against a running server
- `utils/process_result_isb1.py` — aggregates the raw result into a single JSON row
- `utils/gate_isb1.py` — evaluates advisory gates (control-lane health, 131K/500K/1M preview coverage)
- `datasets/isb1/scripts/isb1_results_db.py` — ingests rows into SQLite for cross-run analysis

All current ISB1 rows are implicitly "baseline" — no quantization, no compression, no speculative decoding. Once people start running the same benchmark with FP8 KV cache, KV compression, EAGLE-class drafts, etc., you need a way to tell those rows apart from baseline and to gate quality claims. That is what this PR adds.
Why this matters
Today, nothing in the ISB1 schema stops someone from publishing a throughput chart comparing a baseline run against a KV-quantized run as if they are on the same accuracy axis. The KV-quantized run is faster but can degrade chain-of-thought quality on reasoning models (see the References section below — this is a well-studied failure mode, not a hypothesis). Without a gate, a faster-but-worse config can silently be marked `supported` and cited in downstream material.

This PR closes that gap by requiring any `support_status=supported` row that claims a lossy technique (KV quantization / KV compression / compressed attention) to ship with a completed, registered accuracy benchmark. Rows with the technique applied but no completed benchmark stay at `reviewed_preview` and can't be cited as `supported` until the eval lands.

What changed, in plain language
1. Every ISB1 row now carries a `mechanism` field.

Defaults to `"baseline"`. Other values: `kv_quantization`, `kv_compression`, `compressed_attention`, `speculative_decoding`. A `mechanism_variant` sub-field names the specific technique (e.g. `fp8_e4m3`, `eagle3`).

2. Two JSON registries whitelist accepted values.

- `datasets/isb1/registry/mechanism_variant_registry.json` — the 9 mechanism/variant pairs we accept. Adding new ones requires a PR that amends the registry.
- `datasets/isb1/registry/quality_eval_registry.json` — the 4 accepted accuracy benchmarks (RULER v1 for long-context retrieval, LongBench v2 for long-context reasoning, HumanEval for code, MATH-500 for math).

Unregistered mechanism/variant pairs do not break the pipeline — they just fail the gate and show up in the gate report.
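For illustration, registry lookup can work along these lines — the JSON shape shown here is an assumption for the sketch, not the actual registry schema:

```python
import json

# Hypothetical registry shape — the real file's schema may differ.
MECHANISM_VARIANT_REGISTRY = json.loads("""
{
  "pairs": [
    {"mechanism": "baseline",             "variant": "baseline"},
    {"mechanism": "kv_quantization",      "variant": "fp8_e4m3"},
    {"mechanism": "speculative_decoding", "variant": "eagle3"}
  ]
}
""")

def is_registered(mechanism: str, variant: str) -> bool:
    """True iff the mechanism/variant pair appears in the registry.
    An unregistered pair doesn't crash anything — it just fails the gate."""
    return any(p["mechanism"] == mechanism and p["variant"] == variant
               for p in MECHANISM_VARIANT_REGISTRY["pairs"])
```

Because acceptance is a pure lookup, widening the vocabulary is a one-line registry PR rather than a code change.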
3. A new hard gate in `gate_isb1.py`: `mechanism_compression_quality`.

Four rules:

- `mechanism_variant` registered — any non-baseline row must resolve in the registry.
- `quality_eval_status` must be one of `pending | completed | failed | not_required`.
- `support_status=supported` + compression technique ⇒ `quality_eval_status=completed` with a registered `quality_eval_id`. This is the hard rule.
- `mechanism=speculative_decoding` ⇒ must carry `draft_model_id` and `speculative_acceptance_rate`.

4. SQLite schema gains 16 optional columns.
All default to `NULL`. Existing databases migrate in place on first `connect_db()` (idempotent `ALTER TABLE` pattern that matches how ISB1 already handles schema evolution).

5. Two example configs + one dispatch workflow.
- `.github/configs/isb1-mechanism-baseline.yaml` — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- `.github/configs/isb1-mechanism-fp8-kv.yaml` — same two cells with FP8 KV quantization, wired to `ruler_v1` and held at `reviewed_preview` until the RULER run lands (the gate will fail if you try to flip them to `supported` without a completed eval).
- `.github/workflows/run-isb1-mechanism-eval.yml` — dispatch workflow that routes mechanism configs through the existing `benchmark-isb1-tmpl.yml`.

Who this is for
If you are running ISB1 today and all you care about is baseline throughput, nothing changes — your configs continue to work and your rows are auto-classified as `mechanism=baseline`.

Worked example
An operator wants to publish an FP8-KV throughput number for DeepSeek-R1 on H100 at the `supported` tier.

1. Copy `.github/configs/isb1-mechanism-fp8-kv.yaml` and keep `mechanism: kv_quantization`, `mechanism_variant: fp8_e4m3`, `kv-cache-dtype: fp8`.
2. Set `support-status: supported` in the config.
3. Run `gate_isb1.py` on the aggregated result.
4. The `mechanism_compression_quality` gate fails with `failed_criteria: ["supported+compression ⇒ quality_eval_status == completed"]` because `quality_eval_status` is still `pending` and no RULER run has landed.
5. The operator runs RULER, records `quality_eval_status: completed` and the quality delta, and re-runs the gate. It passes.

Without this PR, step 4 silently succeeds and the row ships as `supported`.

Complete file list
New files (7):
utils/mechanism_eval.py— env-driven field catalog, registry loaders, validatorsdatasets/isb1/registry/mechanism_variant_registry.jsondatasets/isb1/registry/quality_eval_registry.json.github/configs/isb1-mechanism-baseline.yaml.github/configs/isb1-mechanism-fp8-kv.yaml.github/workflows/run-isb1-mechanism-eval.ymlutils/test_mechanism_eval.py(13 tests)utils/test_process_result_isb1_mechanism.py(3 subprocess integration tests)Extended files (4):
- `utils/process_result_isb1.py` — emits 14 mechanism fields + a `mechanism_eval_validation` record.
- `utils/gate_isb1.py` — new `mechanism_compression_quality` gate.
- `datasets/isb1/scripts/isb1_results_db.py` — 16 additive ALTER TABLE migrations + matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, CLI ingest flags.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests.

Total: +1514 / −1 lines, 12 files.
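The additive migration pattern referenced above can be sketched as follows. This is a sketch, not the actual implementation: the column subset, defaults, and table name are illustrative assumptions, while the real migration in `datasets/isb1/scripts/isb1_results_db.py` covers all 16 mechanism fields.

```python
import sqlite3

# Illustrative subset of the additive columns; the real migration in
# isb1_results_db.py covers all 16 mechanism_eval fields.
MECHANISM_COLUMNS = {
    "mechanism": "TEXT DEFAULT 'baseline'",
    "mechanism_variant": "TEXT",
    "quality_eval_id": "TEXT",
    "quality_eval_status": "TEXT",
}

def migrate(conn: sqlite3.Connection, table: str = "results") -> None:
    """Add any missing columns; safe to call on every connect."""
    # PRAGMA table_info rows are (cid, name, type, notnull, dflt_value, pk).
    existing = {row[1] for row in conn.execute(f"PRAGMA table_info({table})")}
    for name, decl in MECHANISM_COLUMNS.items():
        if name not in existing:
            conn.execute(f"ALTER TABLE {table} ADD COLUMN {name} {decl}")
    conn.commit()
```

Calling this on every database open is what makes the migration idempotent: columns already present are simply skipped, so old and new databases converge on the same schema.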
### Backward compatibility notes
- New env vars (`MECHANISM`, `QUALITY_EVAL_ID`) are all optional. Unset = `null` (except `MECHANISM`, which defaults to `"baseline"`).
- Registry load failures degrade gracefully (`"registry_load_error: ..."` appears in the validation record's `issues` list) rather than breaking the pipeline.

### Claim boundary
- Rows produced here keep `benchmark_certification_status=dataset_replay_verified`.
- Nothing in this PR touches `live_benchmark_certification`.
- This PR does not claim the `supported` tier for a compression mechanism — the included FP8-KV configs stay at `reviewed_preview` specifically because RULER has not run yet. The gate is the mechanism that keeps them there.

### References — what we read for KV cache and compression
The mechanism and quality-eval registries are grounded in public research. Each registered value in `mechanism_variant_registry.json` and `quality_eval_registry.json` maps to one of the sources below.

#### KV cache quantization (`mechanism: kv_quantization`)

- `fp8_e4m3` — Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA / Intel / Arm, 2022), arXiv:2209.05433. Defines the E4M3 / E5M2 formats used by the engine-native FP8 KV paths in vLLM and SGLang.
- `turboquant_class` — umbrella slot for Hadamard-rotated 4-bit KV schemes in the TurboQuant / KVQuant lineage (e.g. Hooper et al., "KVQuant", 2024, arXiv:2401.18079). Operators submitting rows in this class cite their specific implementation in `mechanism_notes`.

#### KV cache compression (`mechanism: kv_compression`)

- `kvtc_class` — umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; the exact paper citation travels with each submitted row.

#### Compressed attention (`mechanism: compressed_attention`)

- `triattention_class` — umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format. Operator-submitted rows carry the specific implementation citation.

#### Speculative decoding (`mechanism: speculative_decoding`)

- `mtp` — Multi-Token Prediction head, as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- `eagle3` — EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- `medusa` — Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- `dflash` — umbrella slot for DeepFlash-style draft stacks; the specific citation travels with each submitted row.

#### Quality benchmarks (`quality_eval_registry.json`)

- `ruler_v1` — Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- `longbench_v2` — Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- `humaneval` — Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374. Code-generation accuracy under compression.
- `math_500` — 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Math reasoning — detects chain-of-thought degradation from aggressive KV quantization, which is the specific failure mode the hard gate is designed to catch.

### Test plan
- `utils/test_mechanism_eval.py` — 13 tests covering env parsing, numeric coercion, registry validation (registered / unregistered / bad status / missing draft), baseline pass-through, registry file validity, the hard-rule predicate matrix, idempotent migrations, and the legacy-schema upgrade path.
- `utils/test_gate_isb1.py` — 7 new mechanism-gate tests (baseline trivially passes; supported+FP8 without eval fails; reviewed_preview+FP8 passes; supported+FP8 with a completed registered eval passes; unregistered variant fails; speculative without draft fails; speculative with full fields passes).
- `utils/test_process_result_isb1_mechanism.py` — 3 subprocess integration tests verifying env-driven defaults, registered FP8-KV field surfacing, and unregistered-variant flagging.
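The hard rule those gate tests exercise reduces to a small predicate. The following is a sketch under assumed row-field names, not the actual `utils/gate_isb1.py` implementation, and it deliberately omits the registry and speculative-draft checks the full gate also performs:

```python
# Mechanisms that change what is stored/computed and therefore require a
# completed quality eval before a row can claim the supported tier.
COMPRESSION_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}

def mechanism_compression_quality(row: dict) -> list[str]:
    """Return failed_criteria for the hard rule (empty list = gate passes).

    Sketch only: field names are assumptions, and the real gate also
    validates the variant against the registry and checks draft fields
    for speculative decoding.
    """
    failed = []
    if (
        row.get("support_status") == "supported"
        and row.get("mechanism", "baseline") in COMPRESSION_MECHANISMS
        and row.get("quality_eval_status") != "completed"
    ):
        failed.append("supported+compression ⇒ quality_eval_status == completed")
    return failed
```

Note the default of `"baseline"` for a missing `mechanism` field: that is what lets existing throughput-only configs pass the gate trivially, which is the backward-compatibility behavior the baseline test asserts.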