[experimental] add multi-turn KV cache stress benchmark traces #1032
OCWC22 wants to merge 9 commits into SemiAnalysisAI:main from
Conversation
Pull request overview
Adds an ISB1 “KV cache stress / multi-turn replay” benchmarking surface (data + configs + runners + analysis utilities) to enable realistic long-context, high-prefix-overlap replay and offload-mode sweeps, while keeping it isolated from the existing experimental multiturn/kv-cache-tester lane.
Changes:
- Add committed ISB1 export bundles (including preview 500K/1M lanes) and supporting ISB1 dataset documentation.
- Add ISB1 KV-stress sweep workflow/config plus result summarization + gating utilities and tests.
- Add/extend runner + single-node benchmark scripts (vLLM/SGLang + TriAttention variants) and GMI helper scripts for running/collecting sweeps.
Reviewed changes
Copilot reviewed 147 out of 150 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| utils/verify_producer_sync.py | New utility to compare producer vs consumer export trees for selected ISB1 subtrees. |
| utils/test_verify_producer_sync.py | Tests for verify_producer_sync utility (pass + content mismatch). |
| utils/test_summarize_isb1.py | Tests for ISB1 operator summary output formatting/sections. |
| utils/test_process_result.py | Adds guards/tests ensuring ISB1 replay-style results don’t go through throughput processor. |
| utils/test_gate_isb1.py | Tests for ISB1 gating logic and strict failure behavior. |
| utils/process_result.py | Adds “fail fast” guards for ISB1 replay env/payload in throughput result processor. |
| runners/lib_single_node_script.sh | New helper to resolve benchmark script paths (runtime-aware for ISB1 replay). |
| runners/launch_h200-nb.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h200-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h200-cw.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-cw.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_h100-cr.sh | Uses new script resolver; expands env passthrough for ISB1 replay/kv-stress. |
| runners/launch_b200-nb.sh | Uses new script resolver; executes resolved benchmark script. |
| runners/launch_b200-dgxc.sh | Uses new script resolver; expands env passthrough for ISB1 replay/kv-stress. |
| runners/launch_b200-dgxc-slurm.sh | Uses new script resolver; executes resolved benchmark script; ensures cleanup. |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_h200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_qwen3.5_fp8_b200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_h200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_gptoss_fp4_b200_sglang.sh | Adds experimental trace-replay runner script (SGLang). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_h200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/scripts/trace_replay_dsr1_fp8_b200_vllm.sh | Adds experimental trace-replay runner script (vLLM). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_h200.sh | Adds experimental LMCache-enabled vLLM launcher (H200). |
| experimental/multiturn/vllm_benchmark/launch/lmcache_vllm_b200.sh | Adds experimental LMCache-enabled vLLM launcher (B200). |
| experimental/multiturn/vllm_benchmark/launch/README.md | Docs for experimental LMCache launch helpers. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/traces/.gitkeep | Placeholder for external trace assets directory. |
| experimental/multiturn/vllm_benchmark/kv-cache-tester/README.md | Placeholder README describing expected kv-cache-tester population. |
| experimental/multiturn/vllm_benchmark/aiperf_traces/generate_aiperf_traces.py | Script to generate synthetic AIPerf-style sessions for replay. |
| experimental/multiturn/vllm_benchmark/README.md | Docs describing experimental parity surface and links to ISB1 scripts. |
| experimental/multiturn/vllm_benchmark/.gitignore | Ignores generated artifacts in experimental multiturn bench area. |
| experimental/multiturn/README.md | Replaces older notes with scoped “experimental notes” guidance and pointers to ISB1 ground truth. |
| experimental/README.md | Updates experimental directory warning + pointers to ISB1 ground truth docs. |
| datasets/isb1/scripts/plot_pareto.py | Adds Pareto frontier computation + optional plotting (TTFT p99 vs throughput). |
| datasets/isb1/scripts/gpu_profile_collector.sh | Adds nvidia-smi polling helper for GPU utilization/power logging. |
| datasets/isb1/scripts/gmi_test_matrix.sh | Adds a curated “matrix” driver for running portable benchmarks. |
| datasets/isb1/scripts/gmi_kv_sweep.sh | Adds concurrency × offload-mode sweep driver for portable benchmarks. |
| datasets/isb1/scripts/gmi_full_suite.sh | Adds full-suite portable runner across models/engines/bands (with skips). |
| datasets/isb1/scripts/generate_qwen35_low_band_exports.py | Generates Qwen3.5-specific low-band export bundles by rewriting filtered cells. |
| datasets/isb1/scripts/collect_sweep_results.py | Aggregates sweep results from DB or JSON dir; computes cliffs/benefits. |
| datasets/isb1/scripts/analyze_benchmark_distributions.py | Analyzes token/turn distributions for ISB1 exports or kv-cache traces. |
| datasets/isb1/scripts/adapt_trace_replay_result.py | Adapts kv-cache trace replay outputs into ISB1 replay JSON schema. |
| datasets/isb1/exports/preview/long_context_500k/manifest_qwen3.5.json | Adds preview 500k manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/manifest.json | Adds preview 500k manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_qwen3.5_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__coding_gptoss_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_qwen3.5_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1__vllm.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/inferencex_trace_replay__chat_gptoss_xlc2_500k_preview_v1__sglang.json | Adds preview 500k export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_500k/README.md | Documents bounded 500k-class preview lanes and claim boundary. |
| datasets/isb1/exports/preview/long_context_1m/manifest.json | Adds preview 1m manifest (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1__vllm.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__coding_qwen3.5_ulc2_1m_preview_v1__sglang.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1__vllm.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/inferencex_trace_replay__chat_qwen3.5_ulc2_1m_preview_v1__sglang.json | Adds preview 1m export bundle (Git LFS pointer). |
| datasets/isb1/exports/preview/long_context_1m/README.md | Documents gated 1M preview lane and manual config boundary. |
| datasets/isb1/exports/extension_64k/vllm/code_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/code_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/chat_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/vllm/chat_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/code_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/code_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/chat_64k1k_qwen3.5.json | Adds extension 64k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_64k/sglang/chat_64k1k.json | Adds extension 64k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/code_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/code_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/chat_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/vllm/chat_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/code_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/code_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/chat_32k1k_qwen3.5.json | Adds extension 32k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_32k/sglang/chat_32k1k.json | Adds extension 32k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/code_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/code_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k_dsr1.json | Adds/updates extension 131k DSR1 bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/vllm/chat_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/code_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/code_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k_qwen3.5.json | Adds/updates extension 131k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k_dsr1.json | Adds/updates extension 131k DSR1 bundle (Git LFS pointer). |
| datasets/isb1/exports/extension_131k/sglang/chat_131k1k.json | Adds/updates extension 131k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/code_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/code_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/chat_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/vllm/chat_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/code_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/code_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/chat_8k1k_qwen3.5.json | Adds core 8k Qwen bundle (Git LFS pointer). |
| datasets/isb1/exports/core/sglang/chat_8k1k.json | Adds core 8k generic bundle (Git LFS pointer). |
| datasets/isb1/README.md | Adds ISB1 consumer-package README with coverage inventory and claim boundary. |
| datasets/isb1/GMI_EXECUTION_PLAN.md | Adds execution plan/runbook for external GMI KV-stress benchmarking. |
| datasets/isb1/COEXISTENCE_WITH_KV_CACHE_TESTER.md | Adds coexistence plan doc for ISB1 vs kv-cache-tester surfaces. |
| datasets/isb1/.gitattributes | Adds attributes for exports (linguist + EOL handling). |
| benchmarks/single_node/qwen3.5triattn_fp8_h200_vllm.sh | Adds TriAttention vLLM benchmark script (H200). |
| benchmarks/single_node/qwen3.5triattn_fp8_h100_vllm.sh | Adds TriAttention vLLM benchmark script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h200_vllm.sh | Adds/updates Qwen3.5 vLLM script (H200) with ISB1-aware prefix/offload behavior. |
| benchmarks/single_node/qwen3.5_fp8_h200_sglang.sh | Adds Qwen3.5 SGLang script (H200) with ISB1-aware radix/offload behavior. |
| benchmarks/single_node/qwen3.5_fp8_h100_vllm.sh | Adds Qwen3.5 vLLM script (H100). |
| benchmarks/single_node/qwen3.5_fp8_h100_sglang.sh | Adds Qwen3.5 SGLang script (H100). |
| benchmarks/single_node/qwen3.5_fp8_b200_vllm.sh | Adds Qwen3.5 vLLM script (B200). |
| benchmarks/single_node/qwen3.5_fp8_b200_sglang.sh | Adds Qwen3.5 SGLang script (B200). |
| benchmarks/single_node/gptosstriattn_fp4_h200_vllm.sh | Adds TriAttention vLLM benchmark script for GPT-OSS (H200). |
| benchmarks/single_node/gptosstriattn_fp4_h100_vllm.sh | Adds TriAttention vLLM benchmark script for GPT-OSS (H100). |
| benchmarks/single_node/gptoss_fp4_h200_sglang.sh | Adds GPT-OSS SGLang script (H200). |
| benchmarks/single_node/gptoss_fp4_h200.sh | Updates GPT-OSS H200 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_h100_sglang.sh | Adds GPT-OSS SGLang script (H100). |
| benchmarks/single_node/gptoss_fp4_h100.sh | Updates GPT-OSS H100 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/gptoss_fp4_b200_sglang.sh | Adds GPT-OSS SGLang script (B200). |
| benchmarks/single_node/gptoss_fp4_b200.sh | Updates GPT-OSS B200 script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1triattn_fp8_h200_vllm.sh | Adds TriAttention vLLM benchmark script for DSR1 (H200). |
| benchmarks/single_node/dsr1triattn_fp8_h100_vllm.sh | Adds TriAttention vLLM benchmark script for DSR1 (H100). |
| benchmarks/single_node/dsr1_fp8_h200_vllm.sh | Adds DSR1 vLLM script (H200). |
| benchmarks/single_node/dsr1_fp8_h200.sh | Updates DSR1 H200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp8_b200_vllm.sh | Adds DSR1 vLLM script (B200). |
| benchmarks/single_node/dsr1_fp8_b200.sh | Updates DSR1 B200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| benchmarks/single_node/dsr1_fp4_b200.sh | Updates DSR1 FP4 B200 SGLang script to be ISB1-aware and align to run_single_node_benchmark. |
| .gitignore | Adds ignores for macOS metadata + local prompt exports + .claude. |
| .github/workflows/run-isb1-kv-stress-sweep.yml | Adds workflow_dispatch sweep driver for ISB1 KV-stress matrix runs. |
| .github/workflows/collect-results.yml | Adds ISB1-specific summary + gating report generation and uploads. |
| .github/configs/isb1-qwen-1m-preview.yaml | Adds a manual-only gated config for 1M Qwen preview runs. |
| .github/configs/isb1-kv-stress.yaml | Adds dedicated KV-stress sweep config (separate from isb1-master). |
| .gitattributes | Tracks ISB1 export JSON under Git LFS. |
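The LFS tracking added in `.gitattributes` would typically look something like the fragment below. This is an illustrative sketch only — the actual patterns in the PR's `.gitattributes` are not shown in this summary, so the glob here is an assumption:

```
# Hypothetical entry — the real patterns in this PR may differ.
datasets/isb1/exports/**/*.json filter=lfs diff=lfs merge=lfs -text
```

With a pattern like this, matching JSON bundles are stored as small LFS pointer files in Git history (which is why the table rows above describe the exports as "Git LFS pointer" entries).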
Force-pushed from af64122 to 1b9b79c
Force-pushed from 1b9b79c to ef90b64
…races

Add ISB-1 (Inference Stress Benchmark) — a multi-turn, long-context KV cache stress testing dataset for InferenceX V3.

## What this adds

**35 synthetic multi-turn traces** across 7 context bands (8K → 1M+ tokens):
- 6 workload families: long_chat, coding, agent, rag, cache_stress, multimodal
- KV stress patterns: prefix reuse, offload cliff, compaction, reactivation, fanout
- Real conversation content with 60–95% prefix overlap (enables prefix cache testing)
- Context assets from 15KB to 6.6MB inlined into traces for honest token counts

**Export bundles** for vLLM + SGLang replay:
- extension_131k: DeepSeek-R1, GPT-OSS, Qwen 3.5 (H200/B200)
- preview/long_context_500k: Qwen 3.5 500K context stress test
- preview/long_context_1m: Qwen 3.5 1M context stress test

**10 KV stress sweep configs** (isb1-kv-stress-pr993.yaml):
- 3 models × 2 GPUs × 2 engines
- Sweep: 2→256 concurrent users × on/off/noprefix offload modes × 1800s

## Coexistence with kv-cache-tester

This dataset complements PR SemiAnalysisAI#993's kv-cache-tester (522 real Claude Code traces):
- kv-cache-tester: real workload distribution, natural performance profile
- ISB1: controlled KV stress patterns that force offload cliffs and cache pressure

No files in experimental/multiturn/ are modified. Separate config files, separate data directory (datasets/isb1/), shared replay infrastructure.

## Benchmark infrastructure

- benchmark_export_replay.py: replay harness with actual_context_len telemetry
- process_result_isb1.py: result aggregation with KV metrics
- Prometheus metrics: kv_cache_usage, prefix_cache_hits, kv_offload_bytes
- Pareto frontier: throughput vs p99 TTFT at each concurrency level
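The Pareto-frontier analysis mentioned above (throughput vs p99 TTFT at each concurrency level) can be sketched in a few lines. This is a minimal illustration of the frontier computation, not the `plot_pareto.py` implementation; the function name and tuple layout are assumptions:

```python
def pareto_frontier(points):
    """Return the subset of (throughput, ttft_p99) points that no other
    point dominates (i.e., no point has both higher throughput and lower
    p99 TTFT)."""
    # Walk points in descending throughput order; a point is on the
    # frontier iff its TTFT improves on the best TTFT seen so far.
    best_ttft = float("inf")
    frontier = []
    for thr, ttft in sorted(points, key=lambda p: -p[0]):
        if ttft < best_ttft:
            frontier.append((thr, ttft))
            best_ttft = ttft
    return sorted(frontier)

# One (throughput, p99 TTFT) sample per concurrency level (made-up numbers).
runs = [(100, 0.9), (80, 0.5), (120, 1.4), (60, 0.6), (90, 0.45)]
front = pareto_frontier(runs)
```

Everything off the frontier — here `(80, 0.5)` and `(60, 0.6)` — is strictly worse than some frontier point on both axes, which is what makes this a useful summary of a concurrency × offload-mode sweep.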
Force-pushed from ef90b64 to fbe9f79
- Keep only configs whose (runtime, hardware, model) triples exist in the export files — eliminates sweep generator failures
- Fix canonical-model-id to match export metadata (e.g., gpt_oss_120b not gptoss)
- Fix support-status to match export tiers (reviewed_preview vs unsupported)
- Remove configs for engines/GPUs not yet in exports (SGLang, Dynamo, TRT, Atom, AMD) — these need export metadata updates before they can be added back
- Add workload-type field required by sweep generator schema
- Remove disagg/multinode fields not in KV stress schema

Sweep generator now passes: exit code 0, produces valid matrix rows.
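The triple-validity check described in that commit amounts to a set intersection. The sketch below is a hypothetical shape for the configs and export metadata, not the actual sweep-generator code:

```python
def filter_valid_configs(configs, export_cells):
    """Keep only sweep configs whose (runtime, hardware, model) triple
    appears in the export metadata, so the generator never emits a row
    that cannot resolve to a real bundle cell."""
    valid = {(c["runtime"], c["hardware"], c["model"]) for c in export_cells}
    return [cfg for cfg in configs
            if (cfg["runtime"], cfg["hardware"], cfg["model"]) in valid]

# Hypothetical export metadata and candidate configs.
cells = [{"runtime": "vllm", "hardware": "H200", "model": "qwen3.5"}]
configs = [
    {"runtime": "vllm", "hardware": "H200", "model": "qwen3.5"},
    {"runtime": "trt", "hardware": "H200", "model": "qwen3.5"},  # not exported yet
]
kept = filter_valid_configs(configs, cells)
```

Dropping unmatched triples up front is what lets the generator finish with exit code 0 instead of failing on a config that points at a nonexistent export cell.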
Some good stuff in here. Will collab async on this one and take some stuff from this PR into experimental/agentic-benchmark MVP.
…mbos

Export metadata now includes all valid (runtime, hardware, model) triples from nvidia-master.yaml + amd-master.yaml:
- 8 runtimes: vllm, sglang, trt, atom, sglang-disagg, dynamo-*
- 9 GPU types: H100, H200, B200, B300, GB200, GB300, MI300X, MI325X, MI355X
- 6 models: DSR1, GPT-OSS, Qwen 3.5, GLM-5, Kimi K2.5, MiniMax M2.5

87 KV stress configs with correct canonical-model-id and support-status matching export metadata. Sweep generator passes (exit code 0). MI355X configs sweep to 512 concurrent users (288GB HBM advantage).
…prefix-aware replay

Final closure pass landing PR#1032 end-to-end for SLURM + InferenceX + kv-cache-tester across every (runtime, hardware, canonical-model) triple currently in the export metadata.

Sweep configs:
- Rename isb1-kv-stress-pr993.yaml -> isb1-kv-stress.yaml
- Rewrite isb1-master / isb1-triattn-preview / isb1-qwen-1m-preview: drop/demote dead stanzas, flatten paths (strip /vllm/ and /sglang/ subdirs and __vllm/__sglang suffixes), repoint qwen3.5 to the _qwen3.5 basename
- isb1-master shrinks 1723 -> 863 lines (50 -> 26 stanzas); the 1M preview drops the vllm stanza (sglang-only in reality)
- All produced rows resolve to real bundle cells at the declared tier

Manifests -> manifest_version 0.2.0 with single-bundle exports for preview/long_context_500k (gptoss + qwen3.5) and preview/long_context_1m.

Consumer replay (utils/bench_serving/benchmark_export_replay.py): hydrate v0.2.0 prefix-aware bundles — thin per-cell deltas join a shared workload prefix via prefix_ref, LRU-cached (max 8) across cells in the same bundle. Pre-0.2.0 bundles replay unchanged.

Producer-sync verifier (utils/verify_producer_sync.py): extend coverage to core + extension_32k + extension_64k; silently skip subtrees absent on both sides, report asymmetric ones.

Docs: COEXISTENCE_WITH_KV_CACHE_TESTER + both preview READMEs updated with flat paths, the new config name, and the sglang-only preview reality.

Tests: 262/262 pass across utils/ (107 sweep-config + new test_benchmark_export_replay.py for the prefix-aware consumer + test_verify_producer_sync.py for broadened verifier coverage).
Force-pushed from 57cbf1b to fa132a7
… clean support vocabulary

README.md:
- Remove dead links to docs removed in 5f6aba7 (COVERAGE_AUDIT, LONG_CONTEXT_TRUTH_MATRIX, SUPPORT_MATRIX, RUNBOOKs, INVESTIGATION)
- Replace stale 50-export-files count with post-flatten per-subtree inventory (23 bundles + 3 manifests = 26 total, consolidating framework-specific variants into flat single files)
- Add explicit five-class support-status vocabulary section
- Keep safe/unsafe claim boundary

COEXISTENCE_WITH_KV_CACHE_TESTER.md:
- Strip planning/negotiation sections (Recommended PR Structure and maintainer-request list) — not coexistence-technical
- Replace possessive references with PR-number references throughout (kv-cache-tester -> PR SemiAnalysisAI#993, ISB1 -> PR SemiAnalysisAI#1032)
- Update data-directory layout to show flat paths
- Update ISB1 workflow name to run-isb1-kv-stress-sweep.yml
- Add support-status vocabulary section

GMI_EXECUTION_PLAN.md:
- Prepend support-status framing (reviewed_preview, dataset_replay_verified, not live-serving certification)
- Fix stale nested paths to flat: extension_131k/vllm/ -> extension_131k/
- Fix preview bundle names: strip __vllm/__sglang suffixes
- Update final result-pipeline sentence to cite actual analyzer scripts
Merge-sweep closure summary
Head is now
What landed in this sweep
Copilot review: only inline comment was on
Producer / consumer sync status
Scope boundary
Understood on the async collab direction. Re the earlier comment about taking pieces into an
Flagging
…istries + hard gate

Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled support_status=supported for a lossy technique unless a registered quality benchmark has completed. Follow-up to PR SemiAnalysisAI#1032.

All new fields default to NULL (mechanism defaults to "baseline") so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first connect_db().

New files:
- utils/mechanism_eval.py — env-driven field catalog (14 fields), registry loaders, validation helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json — 9 registered mechanism/variant pairs covering baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash.
- datasets/isb1/registry/quality_eval_registry.json — 4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval, math_500.
- .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run completes (the gate blocks promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing: (1) any non-baseline mechanism_variant must resolve in the registry; (2) quality_eval_status in {pending, completed, failed, not_required}; (3) supported + compression mechanism ⇒ quality_eval_status == completed with a registered quality_eval_id; (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024), arXiv:2412.19437.
- eagle3: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.
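The four gate rules listed for the mechanism gate can be condensed into a single predicate along these lines. This is a simplified sketch of the described behavior, not the `gate_isb1.py` code; the field names follow the commit text, but the exact row schema and return convention are assumptions:

```python
LOSSY_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}

def gate_row(row, variant_registry, quality_registry):
    """Return a list of gate violations for one ISB1 result row (empty = pass)."""
    errors = []
    mech = row.get("mechanism", "baseline")
    status = row.get("quality_eval_status")
    # (1) any non-baseline mechanism_variant must resolve in the registry
    if mech != "baseline" and row.get("mechanism_variant") not in variant_registry:
        errors.append("unregistered mechanism_variant")
    # (2) quality_eval_status must use the fixed vocabulary
    if status not in (None, "pending", "completed", "failed", "not_required"):
        errors.append("invalid quality_eval_status")
    # (3) 'supported' + lossy mechanism requires a completed, registered quality eval
    if row.get("support_status") == "supported" and mech in LOSSY_MECHANISMS:
        if status != "completed" or row.get("quality_eval_id") not in quality_registry:
            errors.append("supported lossy row lacks completed registered quality eval")
    # (4) speculative decoding must report its draft model and acceptance rate
    if mech == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            errors.append("speculative_decoding row missing draft fields")
    return errors
```

Rule (3) is what "held at reviewed_preview until the RULER run completes" means in practice: the fp8-kv config cannot flip to `supported` until its row carries `quality_eval_status == completed` with a registered `quality_eval_id`.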
…istries + hard gate

Extends the ISB1 replay result schema with a backward-compatible set of optional fields so every row declares which optimization technique it exercises (baseline, kv_quantization, kv_compression, compressed_attention, speculative_decoding) and which quality benchmark backs any lossy-technique claim. A hard gate then prevents a row from being labeled support_status=supported for a lossy technique unless a registered quality benchmark has completed. Follow-up to PR SemiAnalysisAI#1032.

All new fields default to NULL (mechanism defaults to "baseline") so pre-existing rows, configs, and SQLite databases are unaffected until they opt into the mechanism_eval vocabulary. The database migration is idempotent; legacy schemas upgrade in place on first connect_db().

New files:
- utils/mechanism_eval.py — env-driven field catalog (14 fields), registry loaders, validation helpers, and the row_requires_completed_quality_eval predicate.
- datasets/isb1/registry/mechanism_variant_registry.json — 9 registered mechanism/variant pairs covering baseline, fp8_e4m3, turboquant_class, kvtc_class, triattention_class, mtp, eagle3, medusa, dflash.
- datasets/isb1/registry/quality_eval_registry.json — 4 registered quality benchmarks: ruler_v1, longbench_v2, humaneval, math_500.
- .github/configs/isb1-mechanism-baseline.yaml — DSR1 (H100) and Qwen3.5 (B200) baseline cells.
- .github/configs/isb1-mechanism-fp8-kv.yaml — same two cells with FP8 E4M3 KV quantization, wired to ruler_v1 and held at reviewed_preview until the RULER run completes (the gate blocks promotion to supported without it).
- .github/workflows/run-isb1-mechanism-eval.yml — dispatch workflow routing mechanism configs through benchmark-isb1-tmpl.
- utils/test_mechanism_eval.py (13 tests).
- utils/test_process_result_isb1_mechanism.py (3 subprocess tests).

Extended files:
- utils/process_result_isb1.py — emits 14 mechanism fields + a mechanism_eval_validation record attached to every processed row.
- utils/gate_isb1.py — new mechanism_compression_quality gate enforcing: (1) any non-baseline mechanism_variant must resolve in the registry; (2) quality_eval_status in {pending, completed, failed, not_required}; (3) supported + compression mechanism ⇒ quality_eval_status == completed with a registered quality_eval_id; (4) speculative_decoding ⇒ draft_model_id + speculative_acceptance_rate.
- datasets/isb1/scripts/isb1_results_db.py — 16 additive ALTER TABLE migrations plus matching SCHEMA_SQL, INSERT_COLUMNS, GROUPABLE_COLUMNS, and CLI ingest flags.
- utils/test_gate_isb1.py — 7 new mechanism-gate tests.

Full suite: 285 passed, 2 pre-existing warnings.

References — public literature the registries are grounded in:

KV cache quantization (mechanism: kv_quantization)
- fp8_e4m3: Micikevicius et al., "FP8 Formats for Deep Learning" (NVIDIA/Intel/Arm, 2022), arXiv:2209.05433. Defines the E4M3/E5M2 formats used by engine-native FP8 KV paths in vLLM and SGLang.
- turboquant_class: umbrella slot for Hadamard-rotated 4-bit KV schemes; Hooper et al., "KVQuant", 2024, arXiv:2401.18079, is a representative reference. Specific implementation citations travel with each submitted row via mechanism_notes.

KV cache compression (mechanism: kv_compression)
- kvtc_class: umbrella slot for tensor-codebook / product-quantization KV compressors. The class label reflects the architecture pattern; each submitted row cites its specific implementation.

Compressed attention (mechanism: compressed_attention)
- triattention_class: umbrella slot for sparse-/hybrid-attention variants that change the attention-computation surface rather than the stored KV format.

Speculative decoding (mechanism: speculative_decoding)
- mtp: Multi-Token Prediction head as used at scale in DeepSeek-V3 (DeepSeek-AI, 2024, arXiv:2412.19437).
- eagle3: EAGLE-family speculative decoding (Li et al., original EAGLE, 2024, arXiv:2401.15077; EAGLE-2 and EAGLE-3 are subsequent iterations of the same draft-model recipe).
- medusa: Cai et al., "Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads", 2024, arXiv:2401.10774.
- dflash: umbrella slot for DeepFlash-style draft stacks.

Quality benchmarks (quality_eval_registry.json)
- ruler_v1: Hsieh et al., "RULER: What's the Real Context Size of Your Long-Context Language Models?" (NVIDIA, 2024), arXiv:2404.06654. Primary long-context retrieval signal for KV quantization and compression at 32K–1M.
- longbench_v2: Bai et al., "LongBench v2: Towards Deeper Understanding and Reasoning on Realistic Long-context Multitasks" (THUDM, 2024), arXiv:2412.15204. Complements RULER for reasoning-heavy long-context workloads.
- humaneval: Chen et al., "Evaluating Large Language Models Trained on Code" (OpenAI Codex paper, 2021), arXiv:2107.03374.
- math_500: 500-problem subset of the MATH dataset (Hendrycks et al., "Measuring Mathematical Problem Solving With the MATH Dataset", 2021, arXiv:2103.03874). Detects chain-of-thought degradation from aggressive KV quantization — the specific failure mode the hard gate is designed to catch.
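To make the gate conditions above concrete, here is a minimal sketch of how the hard-gate predicate and check could look. This is an illustrative assumption, not the actual utils/gate_isb1.py code: the row field names follow the mechanism_eval vocabulary described above, but the function signatures, the `LOSSY_MECHANISMS` set, and the registry shape are hypothetical.

```python
# Hypothetical sketch of the lossy-technique hard gate; field names follow
# the mechanism_eval vocabulary, but function/registry shapes are assumed.
LOSSY_MECHANISMS = {"kv_quantization", "kv_compression", "compressed_attention"}

def row_requires_completed_quality_eval(row: dict) -> bool:
    """True when the row claims 'supported' for a lossy mechanism."""
    return (
        row.get("support_status") == "supported"
        and row.get("mechanism", "baseline") in LOSSY_MECHANISMS
    )

def gate_row(row: dict, quality_registry: set[str]) -> list[str]:
    """Return gate violations for one result row (empty list = passes)."""
    errors = []
    if row_requires_completed_quality_eval(row):
        if row.get("quality_eval_status") != "completed":
            errors.append("lossy mechanism marked supported without a completed quality eval")
        if row.get("quality_eval_id") not in quality_registry:
            errors.append("quality_eval_id is not a registered benchmark")
    if row.get("mechanism") == "speculative_decoding":
        if not row.get("draft_model_id") or row.get("speculative_acceptance_rate") is None:
            errors.append("speculative_decoding rows need draft_model_id and speculative_acceptance_rate")
    return errors
```

A row claiming `supported` for FP8 KV quantization would pass only once its `quality_eval_status` is `completed` and its `quality_eval_id` resolves in the registered benchmark set (e.g. ruler_v1).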
Trim this branch to ISB1 data exports + processing/replay contract files only.

Removed non-scope changes from this PR branch (workflows/configs, benchmark runners/scripts, GMI harness docs/scripts, experimental multiturn assets, and auxiliary ISB1 tooling), preserving them on fork-only bookmark branches:
- isb1/kv-stress-tooling
- isb1/agentic-benchmark-runners
- isb1/gmi-harness

This keeps upstream cherry-pick review focused on dataset exports and contract guards.
@cquil11 Thanks for the cherry-pick guidance (not merge-as-a-whole) — I trimmed this PR accordingly. New scope: ISB1 data + contract only (datasets/isb1 exports + README/LFS attrs, and replay/process_result ISB1 guard/tests). Fork-only preservation branches (OCWC22/InferenceX):
If this reduced slice looks good, could you review/cherry-pick from this branch first and we can follow with focused PRs from the bookmarks?
…ntic-benchmark

Moving off PR SemiAnalysisAI#1032 — per data-PR narrowing, per-GPU recipe cells do not belong in the ISB1 data-contribution diff. Parked here on the fork for possible follow-up contribution to experimental/agentic-benchmark when that branch exists upstream. No upstream PR opened from this branch.
…a+contract only
Second trim pass. Reverts 12 consortium-owned files to merge-base state
and removes 1 net-new per-GPU recipe:
- benchmarks/single_node/qwen3.5_{bf16,fp8}_mi{300x,325x,355x}.sh (reverted)
- benchmarks/single_node/qwen3.5_fp8_b300_mtp.sh (removed — preserved on
fork branch isb1/agentic-benchmark-runners)
- runners/launch_b300-nv.sh (reverted)
- .github/configs/{amd,nvidia}-master.yaml (reverted)
- .github/workflows/{benchmark-tmpl,pr-recipe-reminder}.yml (reverted)
- perf-changelog.yaml (reverted)
Rationale: per-GPU recipe cells and cross-cutting CI config are owned by
AMD/NVIDIA contributors, not by a data-contribution PR. Matches Cam's
cherry-pick-not-merge guidance and InferenceX consortium ownership model.
Remaining PR scope: datasets/isb1/** + utils/** (replay contract +
process_result ISB1 guard + tests) + top-level .gitattributes.
Second trim pass landed. Reverted or removed all cross-cutting / per-GPU-cell edits:
Remaining scope is strictly data + contract:
No consortium-owned files touched, no CI edits, no per-GPU recipe edits. Zero net deletions against main. Should be trivially cherry-pickable into
Summary
Add multi-turn, long-context KV cache stress testing traces for realistic inference benchmarking.
Why this matters
Current benchmarks use random data — no prefix caching, no multi-turn, no KV cache reuse. This adds realistic multi-turn traces that:
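To illustrate the structure such traces exploit, here is a small sketch of how multi-turn replay requests accumulate a shared prefix. The field layout and helper names are hypothetical and not the actual ISB1 export schema; the point is only that request *i* is a prefix of request *i+1*, which is what prefix caching and KV-cache reuse depend on.

```python
# Hypothetical multi-turn payload construction: each turn resends the whole
# conversation so far, so consecutive requests share a long common prefix
# that a KV cache (or offload tier) can reuse. Not the real ISB1 schema.
def build_turns(system_prompt: str, user_msgs: list[str]) -> list[str]:
    """Expand a conversation into per-turn request payloads with shared prefixes."""
    turns, history = [], system_prompt
    for msg in user_msgs:
        history = history + "\nUSER: " + msg
        turns.append(history)  # request i is a strict prefix of request i+1
        history = history + "\nASSISTANT: ..."
    return turns

def prefix_overlap(a: str, b: str) -> float:
    """Fraction of the shorter request already covered by the common prefix."""
    n = 0
    for x, y in zip(a, b):
        if x != y:
            break
        n += 1
    return n / min(len(a), len(b))
```

Random-data benchmarks have near-zero `prefix_overlap` between requests; multi-turn replay drives it close to 1.0 for consecutive turns of the same conversation.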
Sweep configuration
Each config produces a throughput vs p99 TTFT Pareto frontier across concurrency levels and offload modes.
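The Pareto frontier mentioned above can be sketched as follows: keep only the sweep points for which no other point achieves both higher throughput and lower p99 TTFT. The `(throughput, p99_ttft_ms)` tuple shape is an assumption for illustration, not the sweep's actual result format.

```python
# Illustrative Pareto-frontier extraction over sweep points of the form
# (throughput, p99_ttft_ms): maximize throughput, minimize p99 TTFT.
def pareto_frontier(points: list[tuple[float, float]]) -> list[tuple[float, float]]:
    """Return the points not dominated by any other point."""
    # Sort by ascending p99 TTFT, breaking ties by descending throughput.
    ordered = sorted(points, key=lambda p: (p[1], -p[0]))
    frontier, best_tp = [], float("-inf")
    for tp, ttft in ordered:
        if tp > best_tp:  # strictly better throughput at this latency or worse
            frontier.append((tp, ttft))
            best_tp = tp
    return frontier
```

Sweeping concurrency and offload modes then reduces to collecting one `(throughput, p99_ttft)` point per cell and plotting the surviving frontier.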
Context bands
Coexistence with kv-cache-tester (PR #993)
This complements kv-cache-tester's 522 real Claude Code traces:
No files in experimental/multiturn/ are modified. Separate directory (datasets/isb1/), separate configs.

Test plan
- generate_sweep_configs.py dry-run resolves all configs
- benchmark_export_replay.py