feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332
wu6u3tw wants to merge 6 commits into
MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅ |
Code Review
This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.
…review

Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures, not just FileNotFoundError/ValueError; all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
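A minimal sketch of the two parsing guards this commit describes. The helper names `audit_marker_from_text` and `snapshot_from_text` and the `"audit"` key are hypothetical stand-ins for illustration; the real `_audit_marker` and `_run_stats_from_dir` may be shaped differently.

```python
from __future__ import annotations

import json
from typing import Any


def audit_marker_from_text(text: str) -> str | None:
    """Return the audit marker, or None if the JSON is not shaped as expected."""
    try:
        data: Any = json.loads(text)
    except json.JSONDecodeError:
        return None
    # isinstance guard: a top-level list or string has no .get(), so calling
    # it unguarded would raise AttributeError.
    if not isinstance(data, dict):
        return None
    marker = data.get("audit")  # hypothetical key, for illustration only
    return marker if isinstance(marker, str) else None


def snapshot_from_text(text: str) -> dict[str, Any]:
    """Parse a snapshot, rejecting non-dict payloads with a clear ValueError."""
    try:
        data: Any = json.loads(text)
    except json.JSONDecodeError as exc:
        raise ValueError(f"snapshot is not valid JSON: {exc}") from exc
    if not isinstance(data, dict):
        raise ValueError(
            f"snapshot must be a JSON object, got {type(data).__name__}"
        )
    return data
```

The tolerant variant returns `None` because a missing marker is a normal case, while the strict variant raises because a malformed snapshot should stop verification.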
Update summary
All review feedback has been addressed. Here is what changed since the original submission:
Architecture (main concern)
Config shape
audit: "test04"
datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50 # reference phase query count (50–144)
  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25 # audit phase query count (25–50)
    audit_sample_index: 0
Robustness
Testing
Example config
|
9057190 to b547f1d
cdbae64 to eae1234
b190d21 to e0de06f
385630c to c1e48bf
|
All review feedback has been addressed. Here's a summary of what changed: Architecture Sample counts & index SingleStream Durations Robustness fixes (Gemini)
Cleanup
|
nvzhihanj left a comment
Review Council — first-principles design review
Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough
Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.
Review Council: Multi-AI Code Review (first-principles design review)
Reviewed by: Claude · Depth: thorough
Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led: what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code? 11 findings, all posted inline.
🔴 Re-design / Must-fix
🟡 Should-fix
🔵 Consider
Through-line: #1, #5, #6, #7 are all symptoms of the same root cause: TEST04 is bolted onto
Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.
1dfaf3c to 89179d5
History squashed to 3 commits
The branch was force-pushed (
Each commit independently passes
Naming note: this version names the test
6411c17 to bb12443
|
@viraatc @arekay-nv can you review it? The impl is already updated in this PR.
bb12443 to 5abd63b
|
Per offline discussion with @nvzhihanj, will modify the dump json file layout under |
Design plan (docs/compliance_audit_plan.md, incl. an ASCII program-flow diagram showing every decision gate and its exit code), the compliance-module entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example configs. All audit output nests under <report_dir>/audit/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic AuditTest framework (compliance/): AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry; OutputCachingAudit implements MLPerf TEST04 caching detection (reference vs fixed-sample phase) and fails if audit QPS exceeds reference QPS by > threshold.

run_audit (commands/audit.py) runs phases back-to-back, validates unpaced load + sample_index, refuses to certify an incomplete phase, and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json atomically under <report_dir>/audit/. Wired via the YAML audit: block and a generic SampleOrderSpec + SingleSampleOrder seam. RunStats.from_report reads the Report.qps attribute (matches the current Report API).

Also folds in incidental branch changes touching these files: metrics-aggregator --ready-file flag, service launcher ready-check timeout, and the aiohttp + msgpack==1.2.1 CVE bumps (msgpack clears GHSA-6v7p-g79w-8964).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
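For readers unfamiliar with the atomic-write pattern this commit mentions, here is a rough sketch of a temp-file, fsync, rename, directory-fsync sequence. `write_json_atomically` is a hypothetical name and this is not the PR's writer; it only illustrates the general technique on POSIX-like systems.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload so readers see either the old file or the complete new one."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # The temp file must live in the target directory so the rename stays on
    # one filesystem.
    fd, tmp_name = tempfile.mkstemp(
        dir=path.parent, prefix=path.name + ".", suffix=".tmp"
    )
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # push file contents to disk
        os.replace(tmp_name, path)  # atomic rename over the target
    except BaseException:
        # Best-effort cleanup of the temp file, then re-raise.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    # fsync the directory so the rename itself is durable (where supported).
    if hasattr(os, "O_DIRECTORY"):
        dir_fd = os.open(path.parent, os.O_RDONLY | os.O_DIRECTORY)
        try:
            os.fsync(dir_fd)
        finally:
            os.close(dir_fd)
```

A caller would pass something like `report_dir / "audit" / "audit_result.json"`; the point is that a crash mid-write cannot leave a truncated result file behind.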
Unit tests for verify_output_caching, plan_runs/verify, RunStats.from_report, the run_audit guards (load-pattern, incomplete phase, interrupt-skips-audit), SampleOrderSpec/SingleSampleOrder, and the atomic result writer; plus the end-to-end audit: flow asserting artifacts land under <report_dir>/audit/. Rejected-load-pattern parametrization is derived from the LoadPatternType enum so it stays correct regardless of which patterns exist on the base branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
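Deriving the rejected-pattern list from the enum, as this commit describes, might look like the sketch below. The `LoadPatternType` here is a stand-in with an assumed `POISSON` member, and `check_unpaced` raises `ValueError` where the repo raises `SetupError`.

```python
from enum import Enum


class LoadPatternType(Enum):
    """Stand-in enum; the real one lives in the repo and may have other members."""

    MAX_THROUGHPUT = "max_throughput"
    CONCURRENCY = "concurrency"
    POISSON = "poisson"  # assumed member, for illustration


ALLOWED = frozenset({LoadPatternType.MAX_THROUGHPUT, LoadPatternType.CONCURRENCY})

# Computed from the enum rather than hard-coded, so a pattern added later is
# covered automatically.
REJECTED = [pattern for pattern in LoadPatternType if pattern not in ALLOWED]


def check_unpaced(load_type: LoadPatternType) -> None:
    """Raise if the load pattern is paced (ValueError stands in for SetupError)."""
    if load_type not in ALLOWED:
        raise ValueError(
            "Compliance audit requires an unpaced load pattern. "
            f"Got: {load_type.value}"
        )
```

In a pytest suite, `REJECTED` could be passed straight to `pytest.mark.parametrize` so each rejected pattern gets its own test case.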
5abd63b to d9cdeda
| - name: wan22_vbench | ||
| path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl | ||
| type: "accuracy" | ||
| samples: 20 |
Just to make sure, we are only running 20 samples for singlestream accuracy? I don't see it being mentioned in https://github.com/mlcommons/inference/tree/master/text_to_video/wan-2.2-t2v-a14b
| default=0, | ||
| help="Identity to send in the readiness signal", | ||
| ) | ||
| parser.add_argument( |
What is this? Seems like an artifact that should be removed.
| if isinstance(result, AuditResult): | ||
| sys.exit(0 if result.passed else 1) |
This might actually be a valid change. @arekay-nv @nv-alicheng should we remove the if and put proper exit code for the benchmark results?
| logger.info(f"Partial results saved to {ctx.report_dir}") | ||
|
| if config.audit is not None: | ||
| from inference_endpoint.commands.audit import run_audit |
This is a small module so I don't think we should lazy import
| logger.warning("Benchmark interrupted by user") | ||
| # Salvage partial results (finally), then propagate: an interrupted | ||
| # run must not silently roll into the long compliance audit phases. | ||
| logger.warning("Benchmark interrupted by user; skipping audit") |
Doesn't seem necessary because not all runs have an audit.
| SetupError: Config invalid for audit (missing audit block, paced load, bad index). | ||
| ExecutionError: A phase benchmark run failed. | ||
| """ | ||
| from ..commands.benchmark.execute import ( |
Lazy import, please fix
| if load_type not in (LoadPatternType.MAX_THROUGHPUT, LoadPatternType.CONCURRENCY): | ||
| raise SetupError( | ||
| "Compliance audit requires an unpaced load pattern (max_throughput or concurrency). " | ||
| f"Got: {load_type.value}" | ||
| ) |
Don't think this is a valid assert. Poisson can also run TEST04
| for check_spec in specs: | ||
| idx = check_spec.sample_order.fixed_index | ||
| if idx is not None and not (0 <= idx < n_samples): | ||
| raise SetupError( | ||
| f"Audit phase '{check_spec.label}': sample_index={idx} " | ||
| f"is out of range [0, {n_samples}) for dataset with " | ||
| f"{n_samples} samples" | ||
| ) |
If you look at this chunk of code, it checks the index N^2 times. Not sure why that's needed, but it doesn't look right.
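One way to avoid the repeated check is to validate every spec's fixed index once, before the phase loop starts. This is only a sketch under assumptions: `PhaseSpec` is a hypothetical stand-in for the PR's `RunSpec`, `ValueError` stands in for `SetupError`, and the loop body is a placeholder for actually running a phase.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseSpec:
    """Hypothetical stand-in for RunSpec: a label plus an optional fixed index."""

    label: str
    fixed_index: int | None = None


def validate_fixed_indices(specs: list[PhaseSpec], n_samples: int) -> None:
    """Check each spec exactly once: O(N) total instead of once per phase."""
    for spec in specs:
        idx = spec.fixed_index
        if idx is not None and not (0 <= idx < n_samples):
            raise ValueError(
                f"Audit phase '{spec.label}': sample_index={idx} is out of "
                f"range [0, {n_samples}) for dataset with {n_samples} samples"
            )


def run_phases(specs: list[PhaseSpec], n_samples: int) -> list[str]:
    validate_fixed_indices(specs, n_samples)  # hoisted out of the loop
    executed = []
    for spec in specs:
        executed.append(spec.label)  # placeholder for running the phase
    return executed
```

Failing before any phase runs also means a bad index cannot waste a long reference phase.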
| n_requested = ( | ||
| spec.n_samples if spec.n_samples is not None else report.n_samples_issued | ||
| ) |
I see quite a few different places where n_samples and n_requested are read from spec and report. Do we want n_samples_issued to be equal to n_samples or to the dataset samples?
| MLPerf Inference TEST04 compliance test. | ||
|
| Pass criterion (MLCommons-faithful): | ||
| Each phase completed ≥ requested * (1 - threshold) |
What is this? Number of samples?
| return RunStats.from_report(self.report, self.n_requested) | ||
|
|
| class AuditTest(Protocol): |
@nv-alicheng to review if this is the right way to register audit tests
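For context on the registration question, a protocol-plus-registry pattern could be sketched roughly as follows. The method names `plan_runs` and `verify` appear in the PR's test list, but the signatures, the `register` decorator and `create_audit` are assumptions for illustration, not the PR's actual API.

```python
from __future__ import annotations

from typing import Callable, Protocol


class AuditTest(Protocol):
    """Structural interface: any class with these two methods qualifies."""

    def plan_runs(self) -> list[str]: ...

    def verify(self, qps_by_label: dict[str, float]) -> bool: ...


_REGISTRY: dict[str, Callable[[], AuditTest]] = {}


def register(test_id: str):
    """Class decorator that records a factory under test_id (hypothetical)."""

    def decorator(factory):
        if test_id in _REGISTRY:
            raise ValueError(f"audit test already registered: {test_id}")
        _REGISTRY[test_id] = factory
        return factory

    return decorator


@register("output_caching_test")
class OutputCachingSketch:
    threshold = 0.10

    def plan_runs(self) -> list[str]:
        return ["reference", "audit"]

    def verify(self, qps_by_label: dict[str, float]) -> bool:
        ref, audit = qps_by_label["reference"], qps_by_label["audit"]
        return audit < ref * (1 + self.threshold)


def create_audit(test_id: str) -> AuditTest:
    try:
        factory = _REGISTRY[test_id]
    except KeyError:
        raise ValueError(f"unknown audit test: {test_id}") from None
    return factory()
```

With this shape, adding another audit is one decorated class; the orchestrator only ever calls `create_audit`, `plan_runs` and `verify`.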
| @dataclass(frozen=True, slots=True) | ||
| class SampleOrderSpec: | ||
| """Generic sample-ordering selector consumed by create_sample_order. | ||
|
| fixed_index is None -> without-replacement (the normal default). | ||
| fixed_index set -> always issue that one fixed dataset index. | ||
| """ | ||
|
| fixed_index: int | None = None | ||
|
| @classmethod | ||
| def without_replacement(cls) -> SampleOrderSpec: | ||
| return cls(fixed_index=None) | ||
|
| @classmethod | ||
| def single(cls, index: int) -> SampleOrderSpec: | ||
| return cls(fixed_index=index) |
I thought sample order is part of the load pattern traits. @viraatc @nv-alicheng will this be a duplicate?
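To make the seam under discussion concrete, here is a simplified sketch of how a spec with an optional `fixed_index` could select between two orderings. The `create_sample_order` signature and the generator helpers are assumptions; the repo's factory and `SingleSampleOrder` class may differ.

```python
from __future__ import annotations

import random
from dataclasses import dataclass
from typing import Iterator


@dataclass(frozen=True)
class SampleOrderSpec:
    """Simplified stand-in: None means without-replacement, else a fixed index."""

    fixed_index: int | None = None


def _single(index: int) -> Iterator[int]:
    # Always issue the same dataset index (the TEST04 audit phase).
    while True:
        yield index


def _without_replacement(n_samples: int, rng: random.Random) -> Iterator[int]:
    # Each pass visits every index once, in a freshly shuffled order.
    while True:
        order = list(range(n_samples))
        rng.shuffle(order)
        yield from order


def create_sample_order(
    spec: SampleOrderSpec, n_samples: int, rng: random.Random
) -> Iterator[int]:
    if spec.fixed_index is not None:
        if not 0 <= spec.fixed_index < n_samples:
            raise ValueError(f"fixed_index {spec.fixed_index} out of range")
        return _single(spec.fixed_index)
    return _without_replacement(n_samples, rng)
```

The load generator only sees an iterator of indices, which is why no TEST04-specific knowledge has to reach it.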
| # `random` module, so an order constructed without an explicit rng can't | ||
| # couple its draws to unrelated global state. Reproducible runs pass a | ||
| # seeded rng (see create_sample_order). | ||
| self.rng = rng if rng is not None else random.Random() |
Wait, should this be touched in this way?
@nv-alicheng to help review
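The behavioral difference behind this question can be shown with two toy classes. Both are hypothetical stand-ins, not the repo's `SampleOrder`: one falls back to the global `random` module, the other to a private `random.Random()`.

```python
import random


class SharedStateOrder:
    """Falls back to the module-level RNG, so draws share global state."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random


class IsolatedOrder:
    """Falls back to a private generator that no other code can reseed."""

    def __init__(self, rng=None):
        self.rng = rng if rng is not None else random.Random()


def draws(order, n=3):
    return [order.rng.randrange(1000) for _ in range(n)]


# With the shared default, a global random.seed() elsewhere determines the draws.
random.seed(123)
first = draws(SharedStateOrder())
random.seed(123)
second = draws(SharedStateOrder())
coupled_to_global_seed = first == second

# With the isolated default, drawing leaves the global RNG state untouched.
state_before = random.getstate()
draws(IsolatedOrder())
global_state_untouched = random.getstate() == state_before
```

Either default is reproducible when the caller passes an explicitly seeded `random.Random`; the two only differ when no rng is supplied.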
| if ready_file is not None: | ||
| cmd += ["--ready-file", str(ready_file)] |
Why do we need this file?
I see the reason. Sounds like it should be a temp file instead of an arg-specified file.
…not run_benchmark

Addresses review feedback (no function-level imports). The lazy imports existed only to break an execute<->audit import cycle: run_benchmark imported run_audit, and run_audit imported setup/run/finalize from execute. Break the cycle so all imports are top-level:
- run_benchmark no longer dispatches the audit; it returns the run's report_dir.
- cli._run dispatches the audit after the main run: run_audit(config, report_dir / "audit") and maps PASS/FAIL to the exit code. (Interrupt still skips the audit: run_benchmark re-raises and cli never reaches the dispatch.)
- audit.py imports execute helpers at module top (one-way, no cycle); execute.py no longer imports compliance/audit at all.

Tests updated for the new dispatch point: the interrupt test targets cli._run, the audit e2e test calls run_benchmark + run_audit, and the incomplete-phase guard test patches the execute helpers as bound in commands.audit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…for TEST04

Per review (nvzhihanj on mlcommons#332): the SampleOrder.__init__ rng default (`random` module → per-instance random.Random()) is a pre-existing load-gen line unrelated to the output-caching audit, and the right default is a load-gen ownership call. Revert it here to keep this PR scoped to compliance; the global-RNG-sharing concern can be addressed in its own load-gen PR. Removes the accompanying TestDefaultRng test as well.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…/.ready marker

Per review (nvzhihanj on mlcommons#332): the readiness sentinel shouldn't be a CLI-arg-specified file. Remove the --ready-file argument and instead have the aggregator always touch <metrics_output_dir>/.ready once its signal handlers are registered. The signal-handling test polls that marker (still an exact "ready to receive signals" gate, replacing the flaky fixed sleep), with no test-only CLI surface on the production subprocess.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
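Polling for a marker file with a deadline, instead of sleeping a fixed amount, could look like the sketch below. `wait_for_marker` and its default timings are hypothetical; only the `.ready` marker idea comes from the commit message.

```python
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float = 10.0, poll_s: float = 0.05) -> bool:
    """Return True as soon as marker exists, or False once timeout_s elapses."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_s)
    # One last check so a marker created right at the deadline still counts.
    return marker.exists()
```

A test would call something like `wait_for_marker(metrics_output_dir / ".ready")` before sending a signal to the subprocess, and fail fast if it returns False.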
| ║ ╱──────────────╲ yes ╱─────────────────╲ ║ │ | ||
| ║ ╱ first phase? ╲───►╱ every spec's ╲──╫──no─────┘ | ||
| ║ ╲────────┬────────╱ ╲ sample_index in ╱ ║ (out of range) | ||
| ║ │ no ╲ range [0,N)? ╱ ║ | ||
| ║ │ ╲──────┬────────-╱ ║ | ||
| ║ │◄─────────────────────┘ yes ║ | ||
| ║ ▼ ║ |
can you elaborate on this part?
|
| ### Test matrix (LLM-relevant subset) | ||
|
| | Test | Detects | Category | Required for | |
Are these categories defined somewhere in this doc, or are they MLPerf terms?
| A single protocol covers **both** categories — orchestrators (must execute a | ||
| specially-configured run) and analyzers (pure post-run). An analyzer is just an audit whose | ||
| plan is a single normal run, so the orchestration loop never special-cases a category. |
Can we move this definition to the top, before we use it? It would help ease the narrative.
| class OutputCachingTestConfig(BaseModel): | ||
| model_config = ConfigDict(frozen=True, extra="forbid") | ||
| test: Literal[AuditTestId.OUTPUT_CACHING_TEST] | ||
| samples: int # reference phase count (required, ge=1) | ||
| audit_samples: int | None = None # audit phase count; None = equals `samples` | ||
| sample_index: int = 0 # MLPerf performance_issue_same_index | ||
| threshold: float = 0.10 # caching tolerance (MLPerf TEST04-specific) |
If we wanted to add two OutputCachingTest instances, would we duplicate the block or put a list in sample_index?
| ```yaml | ||
| # Full WAN 2.2 Offline submission: performance + VBench accuracy + TEST04 audit. | ||
| # One command runs all three under a single report_dir: | ||
| # inference-endpoint benchmark from-config \ | ||
| # examples/09_Wan22_VideoGen_Example/offline_wan22_submission.yaml | ||
| # | ||
| # Execution order (run_benchmark): | ||
| # 1. performance run — full 248-prompt dataset (the submission perf result) | ||
| # 2. accuracy scoring — VBench over the produced videos | ||
| # 3. audit (TEST04) — reference + fixed-sample phases (equal counts here), then result | ||
| # | ||
| # NOTE: the `audit:` block is implemented per docs/compliance_audit_plan.md | ||
| # (the `compliance/` module). The performance + accuracy portion mirrors | ||
| # offline_wan22_accuracy.yaml. | ||
|
| name: "submission-wan22-video-generation" | ||
| version: "1.0" | ||
| type: "submission" | ||
| benchmark_mode: "offline" # required for type: submission | ||
|
| model_params: | ||
| name: "wan22" | ||
| max_new_tokens: 1 # ignored by VideoGenAdapter; kept >0 for api_type debug swaps | ||
| streaming: "off" # WAN 2.2 uses non-streaming HTTP POST/response | ||
|
| datasets: | ||
| # Performance dataset drives request issuance (the submission perf run). | ||
| - name: wan22_perf | ||
| path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl | ||
| type: "performance" | ||
| samples: 248 | ||
|
| # Accuracy dataset reuses the same prompts; videos are scored VBench-style. | ||
| - name: wan22_vbench | ||
| path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl | ||
| type: "accuracy" | ||
| samples: 248 | ||
| accuracy_config: | ||
| eval_method: "vbench" | ||
| ground_truth: "prompt" # VBench input is (prompt, video), not a GT comparison | ||
| num_repeats: 1 | ||
|
| # TEST04 caching audit — additive post-step. Runs its OWN reference + fixed-sample | ||
| # phases at equal counts (the audit count may be lowered to shorten the phase). | ||
| audit: | ||
| test: "output_caching_test" | ||
| samples: 64 # reference phase count (subset of the 248 prompts) | ||
| audit_samples: 64 # audit (fixed-sample) phase count; lower (e.g. 32) to shorten the audit phase | ||
| sample_index: 3 # MLCommons audit.config performance_issue_same_index=3 | ||
| threshold: 0.10 # audit qps must stay < reference qps * (1 + threshold) | ||
|
| settings: | ||
| runtime: | ||
| # NOTE: runs are count-driven (n_samples_to_issue / audit.samples). min_duration_ms is | ||
| # NOT enforced as a duration floor by the current stop logic (counts take priority); | ||
| # MLCommons' 10-min minimum / AND-semantics is future work. Only max_duration_ms caps. | ||
| max_duration_ms: 14400000 # 4-hour ceiling | ||
| scheduler_random_seed: 42 | ||
| dataloader_random_seed: 42 | ||
| n_samples_to_issue: 248 # applies to the perf/accuracy run; audit uses audit.samples | ||
|
| load_pattern: | ||
| type: "max_throughput" | ||
|
| endpoint_config: | ||
| endpoints: | ||
| - "http://localhost:8000" | ||
| api_type: "videogen" | ||
| api_key: null | ||
|
| report_dir: logs/wan22_submission | ||
| ``` |
Let's not replicate the file here; it becomes hard to keep both in sync.
| # audit_result.json) nest under <report_dir>/audit/ so they don't | ||
| # intermingle with the main run's top-level output. | ||
| result = run_audit(config, report_dir / "audit") | ||
| sys.exit(0 if result.passed else 1) |
I don't think we should be calling sys.exit here. Is this needed?
| class RunSpec: | ||
| """Declarative description of one audit phase. | ||
|
| ``n_samples = None`` means "issue the benchmark's default count" (full | ||
| dataset / duration-driven) — it flows through to | ||
| ``RuntimeSettings.n_samples_to_issue`` unchanged. | ||
| """ | ||
|
| label: str | ||
| n_samples: int | None | ||
| sample_order: SampleOrderSpec | ||
|
|
| @dataclass(frozen=True, slots=True) | ||
| class RunStats: |
Can we rename these to AuditRunSpec and AuditRunStats so it's clear that they are related to the audit phase? Otherwise we can move them to a shared/central location.
This is a confusing folder name - probably audit or audit_test might be better. This resembles a test folder for unit tests of a component like the one where you added test_output_caching.py.
| 1. Each phase completed ≥ (1 - threshold) of its requested queries. | ||
| 2. audit_qps < ref_qps * (1 + threshold) |
So there are two constraints - one for the number of samples completed and one for the qps? Both use the same threshold but in different directions.
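The two constraints in the quoted docstring can be written out as a small check. This is a sketch, not the PR's `verify_output_caching`: `PhaseStats` and the returned reason strings are assumptions, while the two inequalities are taken from the docstring.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStats:
    """Hypothetical per-phase summary used only for this illustration."""

    n_requested: int
    n_completed: int
    qps: float


def verify_output_caching_sketch(
    ref: PhaseStats, audit: PhaseStats, threshold: float = 0.10
) -> tuple[bool, str]:
    # Constraint 1 (lower bound): each phase must complete at least
    # (1 - threshold) of its requested queries.
    for label, stats in (("reference", ref), ("audit", audit)):
        if stats.n_completed < stats.n_requested * (1 - threshold):
            return False, f"{label} phase incomplete"
    # Constraint 2 (upper bound): the fixed-sample phase must stay below
    # ref_qps * (1 + threshold); a larger speedup suggests cached outputs.
    if audit.qps >= ref.qps * (1 + threshold):
        return False, "audit phase too fast"
    return True, "ok"
```

With 64 requested queries and the default threshold, a phase needs at least 57.6 completions, and a 2.0 QPS reference allows the audit phase up to (but not including) 2.2 QPS.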
| threshold: float = Field( | ||
| 0.10, | ||
| gt=0, | ||
| lt=1, | ||
| description="Caching tolerance: audit_qps must stay < ref_qps * (1 + threshold)", | ||
| ) |
note - we are also using this for n_completed
Summary
Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an `audit:` block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.
TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream `compliance/TEST04/verify_performance.py`).
Design (the two axes)
- `SampleOrderSpec` (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a `RunSpec`. No test-specific knowledge leaks into the load generator.
- `AuditTest.verify(runs) -> AuditVerdict`, registered per test.
A generic orchestrator (`commands/audit.py::run_audit`) runs each `RunSpec` phase back-to-back via the existing `setup_benchmark`/`run_benchmark_async` path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.
Config shape
`AuditConfig` is a discriminated-union-ready sub-model on `BenchmarkConfig` (parallel to `AccuracyConfig`): no `DatasetType.AUDIT`, no audit fields polluting `Dataset`, no `test04` boolean in `RuntimeSettings`.
What's included
- `compliance/__init__.py`: `AuditTest` protocol + `RunSpec`/`RunStats`/`RunArtifacts` + registry
- `compliance/verdict.py`: `AuditVerdict` + atomic `write_verdict` (tmp → fsync → rename → fsync)
- `compliance/tests/test04.py`: `Test04Audit` + `verify_test04`
- `commands/audit.py`: generic `run_audit` orchestrator
- `config/schema.py`: `AuditTestId` + `Test04Config`/`AuditConfig` + `BenchmarkConfig.audit`
- `load_generator`: `SampleOrderSpec` + `SingleSampleOrder` + factory dispatch
- `docs/compliance_audit_plan.md`: the design plan
- `offline_wan22_submission.yaml`, `single_stream_wan22_submission.yaml`
Exit codes
`benchmark from-config` with an `audit:` block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (`InputValidationError` → 2, `SetupError` → 3, `ExecutionError` → 4). The on-disk `audit_verdict.json` is the durable record.
Testing
Unit + integration green; `pre-commit run --all-files` clean. The e2e test exercises the full `audit:` → `run_audit` → `AuditVerdict` flow for both max_throughput (offline) and concurrency=1 (single-stream).
🤖 Generated with Claude Code
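The exit-code scheme in the description can be read as a small mapping. The sketch below uses stand-in exception classes that only share names with the ones listed; the repo's real classes and standard handler may be structured differently, and `exit_code_for` is a hypothetical helper.

```python
from typing import Callable


class InputValidationError(Exception):
    """Stand-in for the repo's exception of the same name."""


class SetupError(Exception):
    """Stand-in for the repo's exception of the same name."""


class ExecutionError(Exception):
    """Stand-in for the repo's exception of the same name."""


EXIT_CODES = {InputValidationError: 2, SetupError: 3, ExecutionError: 4}


def exit_code_for(run: Callable[[], bool]) -> int:
    """Map an audit run to a process exit code: 0 PASS, 1 FAIL, 2-4 errors."""
    try:
        passed = run()
    except Exception as exc:
        # isinstance (rather than an exact type lookup) also covers subclasses.
        for exc_type, code in EXIT_CODES.items():
            if isinstance(exc, exc_type):
                return code
        raise  # unknown errors are not silently mapped to a code
    return 0 if passed else 1
```

Keeping the mapping in one function means only the CLI entry point needs to call `sys.exit`, for example `sys.exit(exit_code_for(run))`.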