
[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality#1052

Closed
OCWC22 wants to merge 6 commits into SemiAnalysisAI:main from OCWC22:isb1/mechanism-eval-schema

Conversation

@OCWC22 OCWC22 commented Apr 17, 2026

TL;DR

Adds a way for ISB1 replay rows to declare which optimization technique they exercise (baseline, FP8 KV quantization, KV compression, compressed attention, speculative decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled supported for a lossy technique unless a registered quality benchmark (RULER, LongBench v2, HumanEval, or MATH-500) has completed.

This is a schema-only, backward-compatible change. Existing ISB1 rows, configs, and result databases keep working unchanged — new fields default to null and legacy SQLite databases auto-migrate on first open.

Follow-up to #1032.

What is ISB1?

ISB1 (Inference Standard Benchmark, v1) is this repo's replay-based inference benchmark. It lives at datasets/isb1/. Each replay bundle captures a synthetic multi-turn trace (coding or chat workload, 8K–1M context band) together with enough metadata to faithfully re-run it against any vLLM/SGLang/Dynamo stack. Results go through:

  1. utils/benchmark_export_replay.py — replays the bundle against a running server
  2. utils/process_result_isb1.py — aggregates the raw result into a single JSON row
  3. utils/gate_isb1.py — evaluates advisory gates (control-lane health, 131K/500K/1M preview coverage)
  4. datasets/isb1/scripts/isb1_results_db.py — ingests rows into SQLite for cross-run analysis

All current ISB1 rows are implicitly "baseline" — no quantization, no compression, no speculative decoding. Once people start running the same benchmark with FP8 KV cache, KV compression, EAGLE-class drafts, etc., you need a way to tell those rows apart from baseline and to gate quality claims. That is what this PR adds.

Why this matters

Today, nothing in the ISB1 schema stops someone from publishing a throughput chart comparing:

  • a baseline FP8 weights + FP16 KV run, and
  • a more-aggressive FP8 weights + FP8 E4M3 KV run

as if they are on the same accuracy axis. The KV-quantized run is faster but can degrade chain-of-thought quality on reasoning models (see the References section below — this is a well-studied failure mode, not a hypothesis). Without a gate, a faster-but-worse config can silently be marked supported and cited in downstream material.

This PR closes that gap by requiring any support_status=supported row that claims a lossy technique (KV quantization / KV compression / compressed attention) to ship with a completed, registered accuracy benchmark. Rows with the technique applied but no completed benchmark stay at reviewed_preview and can't be cited as supported until the eval lands.

What changed, in plain language

1. Every ISB1 row now carries a mechanism field.
Defaults to "baseline". Other values: kv_quantization, kv_compression, compressed_attention, speculative_decoding. A mechanism_variant sub-field names the specific technique (e.g. fp8_e4m3, eagle3).
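As a hypothetical sketch (field names taken from this description; the exact row shape in the aggregated JSON may differ), a non-baseline row might carry:

```python
# Hypothetical sketch of the mechanism fields on a non-baseline ISB1 row.
# Field names come from this PR description; the real aggregated row
# emitted by process_result_isb1.py may be shaped differently.
row = {
    "mechanism": "kv_quantization",     # defaults to "baseline" when unset
    "mechanism_variant": "fp8_e4m3",    # names the specific technique
    "quality_eval_status": "pending",   # pending | completed | failed | not_required
    "quality_eval_id": None,            # e.g. "ruler_v1" once the eval lands
}
```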

2. Two JSON registries whitelist accepted values.

  • datasets/isb1/registry/mechanism_variant_registry.json — the 9 mechanism/variant pairs we accept. Adding new ones requires a PR that amends the registry.
  • datasets/isb1/registry/quality_eval_registry.json — the 4 accepted accuracy benchmarks (RULER v1 for long-context retrieval, LongBench v2 for long-context reasoning, HumanEval for code, MATH-500 for math).

Unregistered mechanism/variant pairs do not break the pipeline — they just fail the gate and show up in the gate report.

3. A new hard gate in gate_isb1.py: mechanism_compression_quality.
Four rules:

  • mechanism_variant registered — any non-baseline row must resolve in the registry.
  • quality_eval_status must be one of pending | completed | failed | not_required.
  • support_status=supported + compression technique ⇒ quality_eval_status=completed with a registered quality_eval_id. This is the hard rule.
  • mechanism=speculative_decoding ⇒ must carry draft_model_id and speculative_acceptance_rate.
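The hard rule (the third bullet) can be sketched as a predicate. This is an assumed reconstruction using the field names above, not the actual code in utils/gate_isb1.py, which also evaluates the other three criteria and checks the eval ID against the registry:

```python
# Sketch of the hard rule only: supported + lossy technique requires a
# completed quality eval with a registered ID. Field names are assumed
# from this PR description.
LOSSY = {"kv_quantization", "kv_compression", "compressed_attention"}

def violates_hard_rule(row: dict) -> bool:
    if row.get("support_status") != "supported":
        return False                       # rule only binds supported rows
    if row.get("mechanism") not in LOSSY:
        return False                       # baseline / speculative exempt
    return not (
        row.get("quality_eval_status") == "completed"
        and row.get("quality_eval_id") is not None
    )

# A supported FP8-KV row with the eval still pending trips the rule:
assert violates_hard_rule(
    {"support_status": "supported", "mechanism": "kv_quantization",
     "quality_eval_status": "pending"}
)
# The same row with a completed registered eval passes:
assert not violates_hard_rule(
    {"support_status": "supported", "mechanism": "kv_quantization",
     "quality_eval_status": "completed", "quality_eval_id": "ruler_v1"}
)
```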

4. SQLite schema gains 16 optional columns.
All default to NULL. Existing databases migrate in place on first connect_db() (idempotent ALTER TABLE pattern that matches how ISB1 already handles schema evolution).
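The idempotent ALTER TABLE pattern looks roughly like this (column names illustrative; the real migration in isb1_results_db.py adds 16 columns):

```python
import sqlite3

# Sketch of an idempotent additive migration: check PRAGMA table_info,
# then ALTER TABLE only for columns that are missing. Column names here
# are illustrative, not the actual 16-column list.
NEW_COLUMNS = {"mechanism": "TEXT", "mechanism_variant": "TEXT",
               "quality_eval_status": "TEXT"}

def migrate(conn: sqlite3.Connection) -> None:
    existing = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
    for name, sql_type in NEW_COLUMNS.items():
        if name not in existing:  # already-added columns are skipped
            conn.execute(f"ALTER TABLE results ADD COLUMN {name} {sql_type}")

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (row_id TEXT)")
migrate(conn)
migrate(conn)  # second run is a no-op, so re-opening a DB is safe
cols = [row[1] for row in conn.execute("PRAGMA table_info(results)")]
print(cols)  # ['row_id', 'mechanism', 'mechanism_variant', 'quality_eval_status']
```

Because every new column defaults to NULL, rows written before the migration remain queryable unchanged.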

5. Two example configs + one dispatch workflow.

  • .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
  • .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run lands (the gate will fail if you try to flip them to supported without a completed eval).
  • .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow that routes mechanism configs through the existing benchmark-isb1-tmpl.yml.

Who this is for

  • ISB1 maintainers who want to start reporting non-baseline configs without losing the accuracy guarantee.
  • Framework authors (vLLM / SGLang / Dynamo / llm-d) who want to submit replay rows for their custom KV paths and need a clear way to declare "this is still preview — full quality eval pending."
  • Hardware evaluators who want to compare a baseline row against a quantized row and need to know, at a glance, whether the quantized row has a completed accuracy benchmark.

If you are running ISB1 today and all you care about is baseline throughput, nothing changes — your configs continue to work and your rows are auto-classified as mechanism=baseline.

Worked example

An operator wants to publish an FP8-KV throughput number for DeepSeek-R1 on H100 at the supported tier.

  1. They clone .github/configs/isb1-mechanism-fp8-kv.yaml and keep mechanism: kv_quantization, mechanism_variant: fp8_e4m3, kv-cache-dtype: fp8.
  2. They try to flip support-status: supported in the config.
  3. They run gate_isb1.py on the aggregated result.
  4. The mechanism_compression_quality gate fails with failed_criteria: ["supported+compression ⇒ quality_eval_status == completed"] because quality_eval_status is still pending and no RULER run has landed.
  5. The operator runs RULER against the FP8-KV config, records quality_eval_status: completed and the quality delta, and re-runs the gate. It passes.

Without this PR, step 4 silently succeeds and the row ships as supported.

Complete file list

New files (8):

  • utils/mechanism_eval.py — env-driven field catalog, registry loaders, validators
  • datasets/isb1/registry/mechanism_variant_registry.json
  • datasets/isb1/registry/quality_eval_registry.json
  • .github/configs/isb1-mechanism-baseline.yaml
  • .github/configs/isb1-mechanism-fp8-kv.yaml
  • .github/workflows/run-isb1-mechanism-eval.yml
  • utils/test_mechanism_eval.py (13 tests)
  • utils/test_process_result_isb1_mechanism.py (3 subprocess integration tests)

Extended files (4):

  • utils/process_result_isb1.py — emits 14 mechanism fields + mechanism_eval_validation record.
  • utils/gate_isb1.py — new mechanism_compression_quality gate.
  • datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations + matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, CLI ingest flags.
  • utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Total: +1514 / −1 lines, 12 files.

Backward compatibility notes

  • Env vars controlling the new fields (e.g. MECHANISM, QUALITY_EVAL_ID) are all optional. Unset = null (except MECHANISM which defaults to "baseline").
  • Database migration is additive and idempotent. Existing databases get the new columns on next open; all existing queries and columns keep working.
  • Gate evaluation handles baseline rows explicitly — they pass every criterion trivially and do not require any new env vars or registry lookups.
  • Registry load failures degrade to advisory ("registry_load_error: ..." appears in the validation record's issues list) rather than breaking the pipeline.
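The degrade-to-advisory behavior for registry loads can be sketched like this (function and list names assumed; only the "registry_load_error: ..." issue string comes from this description):

```python
import json

# Assumed sketch: a registry that fails to load is recorded as an issue
# in the validation record rather than raising and breaking the pipeline.
def load_registry(path: str, issues: list) -> dict:
    try:
        with open(path) as f:
            return json.load(f)
    except (OSError, json.JSONDecodeError) as exc:
        issues.append(f"registry_load_error: {exc}")
        return {}  # empty registry: non-baseline rows fail the gate, advisory only

issues = []
registry = load_registry("/nonexistent/registry.json", issues)
print(registry)           # {}
print(issues[0][:20])     # registry_load_error:
```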

Claim boundary

  • Every row in this PR's configs is benchmark_certification_status=dataset_replay_verified.
  • No row claims live_benchmark_certification.
  • No row in this PR is promoted to supported tier for a compression mechanism — the included FP8-KV configs stay at reviewed_preview specifically because RULER has not run yet. The gate is the mechanism that keeps them there.

References — what we read for KV cache and compression

The mechanism and quality-eval registries are grounded in public research. Each registered value in mechanism_variant_registry.json and quality_eval_registry.json maps to one of the sources below.

KV cache quantization (mechanism: kv_quantization)

  • fp8_e4m3 — Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA / Intel / Arm, 2022), arXiv:2209.05433. Defines the E4M3 / E5M2 formats used by the engine-native FP8 KV paths in vLLM and SGLang.
  • turboquant_class — umbrella slot for Hadamard-rotated 4-bit KV schemes in the TurboQuant / KVQuant lineage (e.g. Hooper et al., "KVQuant", 2024, arXiv:2401.18079). Operators submitting rows in this class cite their specific implementation in mechanism_notes.

KV cache compression (mechanism: kv_compression)

  • kvtc_class — umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; the exact paper citation travels with each submitted row.

Compressed attention (mechanism: compressed_attention)

  • triattention_class — umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format. Operator-submitted rows carry the specific implementation citation.

Speculative decoding (mechanism: speculative_decoding)

  • mtp — Multi-Token Prediction head, as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
  • eagle3 — EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
  • medusa — Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
  • dflash — umbrella slot for DeepFlash-style draft stacks; specific citation travels with each submitted row.

Quality benchmarks (quality_eval_registry.json)

  • ruler_v1 — Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
  • longbench_v2 — Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
  • humaneval — Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374. Code-generation accuracy under compression.
  • math_500 — 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Math reasoning — detects chain-of-thought degradation from aggressive KV quantization, which is the specific failure mode the hard gate is designed to catch.

Test plan

  • utils/test_mechanism_eval.py — 13 tests covering env parsing, numeric coercion, registry validation (registered / unregistered / bad status / missing draft), baseline pass-through, registry file validity, the hard-rule predicate matrix, idempotent migrations, and legacy-schema upgrade path.
  • utils/test_gate_isb1.py — 7 new mechanism-gate tests (baseline trivially passes, supported+FP8 without eval fails, reviewed_preview+FP8 passes, supported+FP8 with completed registered eval passes, unregistered variant fails, speculative without draft fails, speculative with full fields passes).
  • utils/test_process_result_isb1_mechanism.py — 3 subprocess integration tests verifying env-driven defaults, registered FP8-KV field surfacing, and unregistered-variant flagging.
  • Full suite: 285 passed, 2 pre-existing warnings.


@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

OCWC22 changed the title from "feat(isb1): mechanism_eval schema — registries + hard gate for compression quality" to "[experimental] feat(isb1): mechanism_eval schema — registries + hard gate for compression quality" on Apr 17, 2026.

Copilot AI left a comment


Pull request overview

This PR extends the ISB1 replay “consumer” surface with mechanism/quality-eval metadata, adds supporting registries + workflows, and introduces additional tooling/scripts for replay validation and KV stress experimentation across runners.

Changes:

  • Add mechanism/quality-eval registries and workflow plumbing to run ISB1 mechanism sweeps and (intended) gate them via gate_isb1.py.
  • Add runner/script-selection improvements (shared resolver) and multiple new single-node benchmark scripts for ISB1 replay/KV stress variants.
  • Add a set of ISB1 ops/dev tools (producer/consumer export sync verifier, GMI sweep utilities, Pareto/sweep analysis scripts) plus docs and LFS-tracked export bundles.

Reviewed changes

Copilot reviewed 130 out of 134 changed files in this pull request and generated 3 comments.

Summary per file (file — description):
utils/verify_producer_sync.py New script to byte-compare producer vs consumer ISB1 export JSON subtrees.
utils/test_verify_producer_sync.py Tests for producer/consumer sync verification behavior.
utils/test_summarize_isb1.py Tests for ISB1 operator summary generation.
utils/test_process_result_isb1_mechanism.py Subprocess integration tests ensuring mechanism_eval fields land in aggregated ISB1 results.
utils/test_process_result.py Updates process_result tests (incl. ISB1 replay guards) and base env requirements.
utils/process_result.py Adds guards to prevent ISB1 replay payloads/benchmarks from using the throughput processor.
runners/lib_single_node_script.sh New helper to resolve the correct single-node benchmark script path (including ISB1 replay framework-specific scripts).
runners/launch_h200-nb.sh Switches to using the shared script resolver.
runners/launch_h200-dgxc-slurm.sh Switches to using the shared script resolver.
runners/launch_h200-cw.sh Switches to using the shared script resolver.
runners/launch_h100-dgxc-slurm.sh Switches to using the shared script resolver.
runners/launch_h100-cw.sh Switches to using the shared script resolver.
runners/launch_h100-cr.sh Switches to using the shared script resolver and expands env propagation for ISB1 replay.
runners/launch_b200-nb.sh Switches to using the shared script resolver.
runners/launch_b200-dgxc.sh Switches to using the shared script resolver and expands env propagation for ISB1 replay.
runners/launch_b200-dgxc-slurm.sh Switches to using the shared script resolver.
experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, H200).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_sglang.sh New experimental kv-cache-tester trace replay runner (SGLang, H200).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, B200).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_sglang.sh New experimental kv-cache-tester trace replay runner (SGLang, B200).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, H200, GPT-OSS).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_sglang.sh New experimental kv-cache-tester trace replay runner (SGLang, H200, GPT-OSS).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, B200, GPT-OSS).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_sglang.sh New experimental kv-cache-tester trace replay runner (SGLang, B200, GPT-OSS).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_h200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, H200, DSR1).
experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_b200_vllm.sh New experimental kv-cache-tester trace replay runner (vLLM, B200, DSR1).
experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_h200.sh New experimental LMCache-enabled vLLM launcher (H200).
experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_b200.sh New experimental LMCache-enabled vLLM launcher (B200).
experimental/multiturn/vllm_benchmark/launch/README.md Notes for experimental LMCache launch helpers.
experimental/multiturn/vllm_benchmark/kv-cache-tester/traces/.gitkeep Placeholder to keep traces directory in repo.
experimental/multiturn/vllm_benchmark/kv-cache-tester/README.md Placeholder README describing expected external kv-cache-tester population.
experimental/multiturn/vllm_benchmark/aiperf_traces/generate_aiperf_traces.py Utility to generate synthetic AIPerf-style traces.
experimental/multiturn/vllm_benchmark/README.md High-level README for experimental multiturn benchmark parity surface.
experimental/multiturn/vllm_benchmark/.gitignore Ignore rules for experimental multiturn artifacts.
experimental/multiturn/README.md Rewrites experimental multiturn notes with explicit “not official ISB1” boundary and pointers to canonical docs.
experimental/README.md Clarifies experimental directory intent and points to official ISB1 support docs.
datasets/isb1/scripts/plot_pareto.py New script to compute Pareto frontier (throughput vs p99 TTFT) from DB/JSON.
datasets/isb1/scripts/gpu_profile_collector.sh New helper to poll nvidia-smi into a CSV until terminated.
datasets/isb1/scripts/gmi_test_matrix.sh New curated GMI test-matrix runner.
datasets/isb1/scripts/gmi_kv_sweep.sh New GMI KV stress sweep harness (users × offload modes).
datasets/isb1/scripts/gmi_full_suite.sh New “full suite” runner for key ISB1 replay combinations.
datasets/isb1/scripts/generate_qwen35_low_band_exports.py Generator to derive Qwen3.5 low-band exports from runnable GPT-OSS cells.
datasets/isb1/scripts/collect_sweep_results.py Aggregates sweep results from SQLite or agg_*.json directory and emits CSV/JSON summaries.
datasets/isb1/scripts/analyze_benchmark_distributions.py Analyzer for ISL/OSL/turn distributions from ISB1 exports or trace directories.
datasets/isb1/scripts/adapt_trace_replay_result.py Adapter from kv-cache-tester CSV outputs into ISB1 replay JSON-like schema.
datasets/isb1/registry/quality_eval_registry.json Registry of allowed quality-eval harness IDs and metadata.
datasets/isb1/registry/mechanism_variant_registry.json Registry of allowed mechanism×variant pairs + mechanism sets.
datasets/isb1/exports/preview/long_context_500k/manifest_qwen3.5.json New LFS-tracked preview manifest pointer.
datasets/isb1/exports/preview/long_context_500k/manifest.json New LFS-tracked preview manifest pointer.
datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1.json New LFS-tracked 500k preview export pointer.
datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1.json New LFS-tracked 500k preview export pointer.
datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1.json New LFS-tracked 500k preview export pointer.
datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1.json New LFS-tracked 500k preview export pointer.
datasets/isb1/exports/preview/long_context_500k/README.md Documentation for bounded 500k preview lane claim boundary and consumer contract.
datasets/isb1/exports/preview/long_context_1m/manifest.json New LFS-tracked 1m preview manifest pointer.
datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1.json New LFS-tracked 1m preview export pointer.
datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1.json New LFS-tracked 1m preview export pointer.
datasets/isb1/exports/preview/long_context_1m/README.md Documentation for gated 1m preview lane claim boundary.
datasets/isb1/exports/extension_64k/code_64k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_64k/code_64k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_64k/chat_64k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_64k/chat_64k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_32k/code_32k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_32k/code_32k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_32k/chat_32k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_32k/chat_32k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_131k/code_131k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_131k/code_131k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_131k/chat_131k1k_qwen3.5.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_131k/chat_131k1k_dsr1.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/extension_131k/chat_131k1k.json New/updated LFS-tracked extension export pointer.
datasets/isb1/exports/core/code_8k1k_qwen3.5.json New/updated LFS-tracked core export pointer.
datasets/isb1/exports/core/code_8k1k.json New/updated LFS-tracked core export pointer.
datasets/isb1/exports/core/chat_8k1k_qwen3.5.json New/updated LFS-tracked core export pointer.
datasets/isb1/exports/core/chat_8k1k.json New/updated LFS-tracked core export pointer.
datasets/isb1/README.md New ISB1 consumer package README (coverage, inventory, claim boundary).
datasets/isb1/GMI_EXECUTION_PLAN.md New bare-metal execution plan/runbook for GMI.
datasets/isb1/COEXISTENCE_WITH_KV_CACHE_TESTER.md New coexistence doc clarifying ISB1 vs kv-cache-tester roles.
datasets/isb1/.gitattributes Adds linguist-generated marker and (currently conflicting) EOL/text attributes for exports.
benchmarks/single_node/qwen3.5triattn_fp8_h200_vllm.sh New TriAttention-enabled vLLM benchmark script (Qwen3.5, H200).
benchmarks/single_node/qwen3.5triattn_fp8_h100_vllm.sh New TriAttention-enabled vLLM benchmark script (Qwen3.5, H100).
benchmarks/single_node/qwen3.5_fp8_h200_vllm.sh New/updated vLLM single-node script with ISB1 replay handling hooks.
benchmarks/single_node/qwen3.5_fp8_h200_sglang.sh New/updated SGLang single-node script with ISB1 replay handling hooks.
benchmarks/single_node/qwen3.5_fp8_h100_vllm.sh New vLLM single-node script (H100).
benchmarks/single_node/qwen3.5_fp8_h100_sglang.sh New SGLang single-node script (H100).
benchmarks/single_node/qwen3.5_fp8_b200_vllm.sh New vLLM single-node script (B200).
benchmarks/single_node/qwen3.5_fp8_b200_sglang.sh New SGLang single-node script (B200).
benchmarks/single_node/gptosstriattn_fp4_h200_vllm.sh New TriAttention-enabled vLLM benchmark script (GPT-OSS, H200).
benchmarks/single_node/gptosstriattn_fp4_h100_vllm.sh New TriAttention-enabled vLLM benchmark script (GPT-OSS, H100).
benchmarks/single_node/gptoss_fp4_h200_sglang.sh New SGLang single-node script (GPT-OSS, H200).
benchmarks/single_node/gptoss_fp4_h200.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
benchmarks/single_node/gptoss_fp4_h100_sglang.sh New SGLang single-node script (GPT-OSS, H100).
benchmarks/single_node/gptoss_fp4_h100.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
benchmarks/single_node/gptoss_fp4_b200_sglang.sh New SGLang single-node script (GPT-OSS, B200).
benchmarks/single_node/gptoss_fp4_b200.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
benchmarks/single_node/dsr1triattn_fp8_h200_vllm.sh New TriAttention-enabled vLLM benchmark script (DSR1, H200).
benchmarks/single_node/dsr1triattn_fp8_h100_vllm.sh New TriAttention-enabled vLLM benchmark script (DSR1, H100).
benchmarks/single_node/dsr1_fp8_h200_vllm.sh New vLLM single-node script (DSR1, H200).
benchmarks/single_node/dsr1_fp8_h200.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
benchmarks/single_node/dsr1_fp8_b200_vllm.sh New vLLM single-node script (DSR1, B200).
benchmarks/single_node/dsr1_fp8_b200.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
benchmarks/single_node/dsr1_fp4_b200.sh Updates baseline script to align with ISB1 replay/offload hooks and run_single_node_benchmark.
.github/workflows/run-isb1-mechanism-eval.yml New workflow to dispatch ISB1 mechanism_eval sweeps.
.github/workflows/run-isb1-kv-stress-sweep.yml New workflow to dispatch ISB1 KV stress sweeps.
.github/workflows/collect-results.yml Adds ISB1 operator summary + gate report steps (but currently only for exact prefix isb1).
.github/configs/isb1-qwen-1m-preview.yaml Adds gated/manual 1m preview config for Qwen3.5.
.github/configs/isb1-mechanism-fp8-kv.yaml Adds FP8 KV mechanism_eval config (quality eval pending).
.github/configs/isb1-mechanism-baseline.yaml Adds baseline mechanism_eval config entries.
.gitignore Adds additional ignore patterns (.DS_Store, prompt-exports/, .claude).
.gitattributes Configures Git LFS tracking for datasets/isb1 exports.


Comment on lines +42 to +50:

    - name: ISB1 operator summary
      if: inputs.result-prefix == 'isb1'
      run: |
        pip install tabulate
        python3 utils/summarize_isb1.py results/ >> $GITHUB_STEP_SUMMARY

    - name: ISB1 gate report
      if: inputs.result-prefix == 'isb1'
      run: |

Comment on lines 31 to 33:

    - name: Print summary
      if: inputs.result-prefix != 'isb1'
      run: |

Comment on lines +1 to +2:

    exports/**/*.json linguist-generated=true
    exports/**/*.json text eol=lf
OCWC22 force-pushed the isb1/mechanism-eval-schema branch from 7f8cd0c to 5c6b82f on April 17, 2026 at 07:21.
OCWC22 added 6 commits April 17, 2026 00:27
…races

Add ISB1 (Inference Standard Benchmark, v1) — a multi-turn, long-context
KV cache stress testing dataset targeting realistic production KV-cache
pressure patterns.

## What this adds

35 synthetic multi-turn traces across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60-95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

Export bundles for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

10 KV stress sweep configs:
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Benchmark infrastructure
- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level
- Keep only configs whose (runtime, hardware, model) triples exist in
  the export files — eliminates sweep generator failures
- Fix canonical-model-id to match export metadata (e.g., gpt_oss_120b
  not gptoss)
- Fix support-status to match export tiers (reviewed_preview vs
  unsupported)
- Remove configs for engines/GPUs not yet in exports (SGLang, Dynamo,
  TRT, Atom, AMD) — these need export metadata updates before they
  can be added back
- Add workload-type field required by sweep generator schema
- Remove disagg/multinode fields not in KV stress schema

Sweep generator now passes: exit code 0, produces valid matrix rows.
…mbos

Export metadata now includes all valid (runtime, hardware, model) triples
from nvidia-master.yaml + amd-master.yaml:
- 8 runtimes: vllm, sglang, trt, atom, sglang-disagg, dynamo-*
- 9 GPU types: H100, H200, B200, B300, GB200, GB300, MI300X, MI325X, MI355X
- 6 models: DSR1, GPT-OSS, Qwen 3.5, GLM-5, Kimi K2.5, MiniMax M2.5

87 KV stress configs with correct canonical-model-id and support-status
matching export metadata. Sweep generator passes (exit code 0).

MI355X configs sweep to 512 concurrent users (288GB HBM advantage).
…2.0 manifests, prefix-aware replay

Final closure pass landing PR SemiAnalysisAI#1032 end-to-end across every (runtime,
hardware, canonical-model) triple currently in the export metadata.

Sweep configs:
- Consolidate the sweep config under its canonical name isb1-kv-stress.yaml
- Rewrite isb1-master / isb1-triattn-preview / isb1-qwen-1m-preview:
  drop/demote dead stanzas, flatten paths (strip /vllm//sglang/ subdirs
  and __vllm/__sglang suffixes), repoint qwen3.5 to _qwen3.5 basename
- isb1-master shrinks 1723 -> 863 lines (50 -> 26 stanzas); 1M preview
  drops the vllm stanza (sglang-only in reality)
- All produced rows resolve to real bundle cells at declared tier

Manifests -> manifest_version 0.2.0 with single-bundle exports for
preview/long_context_500k (gptoss + qwen3.5) and preview/long_context_1m.

Consumer replay (utils/bench_serving/benchmark_export_replay.py):
hydrate v0.2.0 prefix-aware bundles — thin per-cell deltas join a
shared workload prefix via prefix_ref, LRU-cached (max 8) across cells
in the same bundle. Pre-0.2.0 bundles replay unchanged.

Producer-sync verifier (utils/verify_producer_sync.py): extend coverage
to core + extension_32k + extension_64k; silently skip subtrees absent
on both sides, report asymmetric ones.

Docs: coexistence and both preview READMEs updated with flat paths,
canonical config name, and the sglang-only preview reality.

Tests: 262/262 pass across utils/ (107 sweep-config + new
test_benchmark_export_replay.py for the prefix-aware consumer +
test_verify_producer_sync.py for broadened verifier coverage).
… clean support vocabulary

README.md:
- Remove dead links to docs removed in the prior cleanup pass
  (COVERAGE_AUDIT, LONG_CONTEXT_TRUTH_MATRIX, SUPPORT_MATRIX,
  RUNBOOKs, INVESTIGATION)
- Replace stale 50-export-files count with post-flatten per-subtree
  inventory (23 bundles + 3 manifests = 26 total, consolidating
  framework-specific variants into flat single files)
- Add explicit five-class support-status vocabulary section
- Keep safe/unsafe claim boundary

Coexistence doc:
- Strip planning/negotiation sections (Recommended PR Structure and
  maintainer-request list) — not coexistence-technical
- Update data-directory layout to show flat paths
- Update ISB1 workflow name to run-isb1-kv-stress-sweep.yml
- Add support-status vocabulary section

GMI_EXECUTION_PLAN.md:
- Prepend support-status framing (reviewed_preview,
  dataset_replay_verified, not live-serving certification)
- Fix stale nested paths to flat: extension_131k/vllm/ -> extension_131k/
- Fix preview bundle names: strip __vllm/__sglang suffixes
- Update final result-pipeline sentence to cite actual analyzer scripts
…istries + hard gate

Extends the ISB1 replay result schema with a backward-compatible set of
optional fields so every row declares which optimization technique it
exercises (baseline, kv_quantization, kv_compression, compressed_attention,
speculative_decoding) and which quality benchmark backs any lossy-technique
claim. A hard gate then prevents a row from being labeled support_status=
supported for a lossy technique unless a registered quality benchmark has
completed.

Follow-up to PR SemiAnalysisAI#1032.

All new fields default to NULL (mechanism defaults to "baseline") so
pre-existing rows, configs, and SQLite databases are unaffected until they
opt into the mechanism_eval vocabulary. The database migration is
idempotent; legacy schemas upgrade in place on first connect_db().
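An idempotent additive migration of this kind can be sketched as below. The column names shown are a hypothetical subset of the mechanism_eval fields, and the table/function names are assumptions; the pattern (check `PRAGMA table_info` before each `ALTER TABLE ... ADD COLUMN`) is what makes repeated runs safe.

```python
import sqlite3

# Hypothetical subset of the additive mechanism_eval columns.
MECHANISM_COLUMNS = {
    "mechanism": "TEXT DEFAULT 'baseline'",
    "mechanism_variant": "TEXT",
    "quality_eval_id": "TEXT",
    "quality_eval_status": "TEXT",
}

def migrate(conn: sqlite3.Connection) -> None:
    """Add any missing mechanism_eval columns; safe to run on every connect."""
    existing = {row[1] for row in conn.execute("PRAGMA table_info(results)")}
    for name, decl in MECHANISM_COLUMNS.items():
        if name not in existing:  # skip columns an earlier run already added
            conn.execute(f"ALTER TABLE results ADD COLUMN {name} {decl}")
    conn.commit()

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE results (run_id TEXT)")  # legacy pre-migration schema
migrate(conn)
migrate(conn)  # idempotent: the second run is a no-op
```

Because the new columns default to NULL (and `mechanism` to `'baseline'`), legacy rows read back with baseline semantics without any data rewrite.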

New files:
- utils/mechanism_eval.py
  Env-driven field catalog (14 fields), registry loaders, validation
  helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json
  9 registered mechanism/variant pairs covering baseline, fp8_e4m3,
  turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa,
  dflash.
- datasets/isb1/registry/quality_eval_registry.json
  4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval,
  math_500.
- .github/configs/isb1-mechanism-baseline.yaml
  DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml
  Same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and
  held at reviewed_preview until the RULER run completes (the gate
  blocks promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml
  Dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).
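A registry loader along the lines described above might look like this. The JSON shape shown is an assumption (the actual registry files' keys may differ); only the variant names and their mechanism classes come from the text.

```python
import json

# Hypothetical on-disk shape of mechanism_variant_registry.json.
REGISTRY_JSON = """
{
  "baseline":   {"mechanism": "baseline",             "lossy": false},
  "fp8_e4m3":   {"mechanism": "kv_quantization",      "lossy": true},
  "kvtc_class": {"mechanism": "kv_compression",       "lossy": true},
  "eagle3":     {"mechanism": "speculative_decoding", "lossy": false}
}
"""

def load_variant_registry(raw: str) -> dict:
    """Parse the registry and reject entries missing required keys."""
    registry = json.loads(raw)
    for variant, entry in registry.items():
        if not {"mechanism", "lossy"} <= entry.keys():
            raise ValueError(f"incomplete registry entry: {variant}")
    return registry

REGISTRY = load_variant_registry(REGISTRY_JSON)

def resolve_variant(variant: str) -> dict:
    """Look up a mechanism/variant pair; unregistered variants are errors."""
    if variant not in REGISTRY:
        raise KeyError(f"unregistered mechanism_variant: {variant}")
    return REGISTRY[variant]
```

Keeping the registry as data (rather than code) lets new umbrella classes like `turboquant_class` be added without touching the gate logic.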

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a
  mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing:
  (1) any non-baseline mechanism_variant must resolve in the registry;
  (2) quality_eval_status in {pending, completed, failed, not_required};
  (3) supported + compression mechanism ⇒ quality_eval_status == completed
      with a registered quality_eval_id;
  (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE
  migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS,
  and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.
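The four gate rules above can be sketched as a single predicate over a processed row. Function, field, and constant names are assumptions standing in for the real `utils/gate_isb1.py` implementation; the rule logic follows the enumeration in the commit message.

```python
# Hypothetical stand-ins for the registry contents.
REGISTERED_VARIANTS = {"fp8_e4m3", "kvtc_class", "triattention_class", "eagle3"}
REGISTERED_QUALITY_EVALS = {"ruler_v1", "longbench_v2", "humaneval", "math_500"}
LOSSY_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}
VALID_STATUSES = {"pending", "completed", "failed", "not_required"}

def mechanism_compression_quality_gate(row: dict) -> list:
    """Return gate failures for one processed ISB1 row (empty list = pass)."""
    failures = []
    mech = row.get("mechanism") or "baseline"
    status = row.get("quality_eval_status")
    # (1) any non-baseline mechanism_variant must resolve in the registry
    if mech != "baseline" and row.get("mechanism_variant") not in REGISTERED_VARIANTS:
        failures.append("unregistered mechanism_variant")
    # (2) quality_eval_status must come from the closed vocabulary
    if status is not None and status not in VALID_STATUSES:
        failures.append("invalid quality_eval_status")
    # (3) supported + lossy mechanism requires a completed, registered eval
    if row.get("support_status") == "supported" and mech in LOSSY_MECHANISMS:
        if status != "completed" or row.get("quality_eval_id") not in REGISTERED_QUALITY_EVALS:
            failures.append("lossy mechanism marked supported without completed quality eval")
    # (4) speculative decoding must report draft model and acceptance rate
    if mech == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            failures.append("missing draft_model_id or speculative_acceptance_rate")
    return failures
```

Under these rules a baseline row passes untouched, while an FP8 KV row claiming `supported` with a still-pending RULER run is blocked, matching the held-at-reviewed_preview behavior of the fp8-kv config.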

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning"
  (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2
  formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV
  schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a
  representative reference. Specific implementation citations travel
  with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization
  KV compressors. The class label reflects the architecture pattern;
  each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention
  variants that change the attention-computation surface rather than
  the stored KV format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3
  (DeepSeek-AI, 2024, arXiv:2412.19437).
- eagle3: EAGLE-family speculative decoding (Li et al., original
  EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent
  iterations of the same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration
  Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of
  Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654.
  Primary long-context retrieval signal for KV quantization and
  compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper
  Understanding and Reasoning on Realistic Long-context Multitasks"
  (THUDM, 2024), arXiv:2412.15204. Complements RULER for
  reasoning-heavy long-context workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained
  on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al.,
  "Measuring Mathematical Problem Solving With the MATH Dataset",
  2021, arXiv:2103.03874). Detects chain-of-thought degradation from
  aggressive KV quantization — the specific failure mode the hard
  gate is designed to catch.
@OCWC22 OCWC22 force-pushed the isb1/mechanism-eval-schema branch from 5c6b82f to 2df18f2 Compare April 17, 2026 07:28

OCWC22 commented Apr 17, 2026

Closing — this PR should not target the upstream repo as a copy of #1032. Reopening as a fork-local stacked PR on OCWC22/InferenceX with PR #1032's branch as the base, so the diff shows only the new mechanism_eval work.
