feat(compliance): MLPerf TEST04 caching audit — extensible AuditTest framework#332

Open
wu6u3tw wants to merge 6 commits into mlcommons:main from wu6u3tw:feat/test04-compliance
@wu6u3tw wu6u3tw commented Jun 3, 2026

Collaborator

Summary

Adds an extensible MLPerf compliance-audit framework with TEST04 (caching detection) as the first test, driven by an audit: block in the benchmark YAML. This PR carries the full redesign: the approved design plan, the implementation, tests, and runnable WAN2.2 examples.

TEST04 issues one fixed sample for every query in an audit phase; if repeating an identical request makes the SUT meaningfully faster, it is serving from cache. Pass iff the audit run is at most 10% faster than the reference (matching upstream compliance/TEST04/verify_performance.py).
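
The pass rule above can be sketched as a small predicate. This is a minimal illustration rather than the PR's verify_test04: the function name and the guard on a non-positive reference are assumptions, and letting the exact boundary pass follows the "at most 10% faster" wording.

```python
def passes_caching_check(
    ref_qps: float, audit_qps: float, threshold: float = 0.10
) -> bool:
    """Return True when the fixed-sample (audit) run is not suspiciously fast.

    Hypothetical helper: a SUT that serves repeated requests from a cache
    would show audit_qps well above ref_qps.
    """
    if ref_qps <= 0:
        # Assumption: a reference run with no throughput cannot be compared.
        raise ValueError("reference QPS must be positive")
    # Pass iff the audit run is at most `threshold` faster than the reference.
    return audit_qps <= ref_qps * (1.0 + threshold)
```

With a reference of 100 QPS, an audit run at 105 QPS would pass and one at 120 QPS would fail under the default threshold.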

Design (the two axes)

  • Axis A — run modification: expressed as a generic typed SampleOrderSpec (WITHOUT_REPLACEMENT | SINGLE(index)) carried on a RunSpec. No test-specific knowledge leaks into the load generator.
  • Axis B — verification: a pure post-run check, AuditTest.verify(runs) -> AuditVerdict, registered per test.

A generic orchestrator (commands/audit.py::run_audit) runs each RunSpec phase back-to-back via the existing setup_benchmark / run_benchmark_async path, then verifies and writes the verdict. Adding TEST01/06/07/09 later is a new registry entry, not cross-cutting edits.
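
A rough sketch of how the two axes could fit together, written only to illustrate the shape described above. The fields on RunSpec, RunStats and AuditVerdict, the phase labels, and the dict-based registry are assumptions for this example, not the PR's actual definitions.

```python
from dataclasses import dataclass
from typing import Optional, Protocol


@dataclass(frozen=True)
class SampleOrderSpec:
    """Axis A: None means without-replacement, an int pins one dataset index."""

    fixed_index: Optional[int] = None


@dataclass(frozen=True)
class RunSpec:
    """One phase to execute (field names are illustrative assumptions)."""

    label: str
    n_samples: int
    sample_order: SampleOrderSpec = SampleOrderSpec()


@dataclass(frozen=True)
class RunStats:
    qps: float


@dataclass(frozen=True)
class AuditVerdict:
    passed: bool
    reason: str


class AuditTest(Protocol):
    """Axis B: plan the phases, then verify them after the runs finish."""

    def plan_runs(self) -> list[RunSpec]: ...

    def verify(self, runs: dict[str, RunStats]) -> AuditVerdict: ...


@dataclass(frozen=True)
class Test04Audit:
    samples: int
    sample_index: int
    threshold: float = 0.10

    def plan_runs(self) -> list[RunSpec]:
        # Reference phase uses normal ordering; audit phase repeats one sample.
        return [
            RunSpec("reference", self.samples),
            RunSpec("audit", self.samples, SampleOrderSpec(self.sample_index)),
        ]

    def verify(self, runs: dict[str, RunStats]) -> AuditVerdict:
        ref, audit = runs["reference"].qps, runs["audit"].qps
        passed = audit <= ref * (1.0 + self.threshold)
        return AuditVerdict(passed, f"audit={audit:.2f} qps, reference={ref:.2f} qps")


# Hypothetical registry: a later audit test would be one more entry here.
AUDIT_REGISTRY: dict[str, type] = {"test04": Test04Audit}
```

In this sketch the load generator only ever sees a SampleOrderSpec, so nothing about TEST04 itself has to reach it.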

Config shape

audit:
  test: test04
  samples: 64         # reference phase query count
  audit_samples: 64   # audit (fixed-sample) phase count
  sample_index: 3     # MLCommons performance_issue_same_index
  threshold: 0.10     # audit qps must stay < ref qps * (1 + threshold)

AuditConfig is a discriminated-union-ready sub-model on BenchmarkConfig (parallel to AccuracyConfig) — no DatasetType.AUDIT, no audit fields polluting Dataset, no test04 boolean in RuntimeSettings.

What's included

  • compliance/__init__.py: AuditTest protocol + RunSpec/RunStats/RunArtifacts + registry
  • compliance/verdict.py: AuditVerdict + atomic write_verdict (tmp → fsync → rename → fsync)
  • compliance/tests/test04.py: Test04Audit + verify_test04
  • commands/audit.py: generic run_audit orchestrator
  • config/schema.py: AuditTestId + Test04Config/AuditConfig + BenchmarkConfig.audit
  • load_generator: SampleOrderSpec + SingleSampleOrder + factory dispatch
  • Unit tests + e2e integration test (offline + single-stream) against the echo server
  • docs/compliance_audit_plan.md: the design plan
  • WAN2.2 submission examples: offline_wan22_submission.yaml, single_stream_wan22_submission.yaml
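
The "tmp → fsync → rename → fsync" sequence listed for write_verdict can be sketched as follows. This is a generic POSIX-oriented illustration, not the PR's code: the function name write_json_atomically and the JSON payload are assumptions, and the final directory fsync relies on being able to open a directory read-only, which is not available on Windows.

```python
import contextlib
import json
import os
import tempfile
from pathlib import Path


def write_json_atomically(path: Path, payload: dict) -> None:
    """Write payload so readers see the old file or the new one, never half of it."""
    path.parent.mkdir(parents=True, exist_ok=True)
    # tmp: create the temp file in the target directory so the rename
    # stays on one filesystem.
    fd, tmp_name = tempfile.mkstemp(dir=path.parent, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(payload, tmp, indent=2)
            tmp.flush()
            os.fsync(tmp.fileno())  # fsync: file contents reach disk
        os.replace(tmp_name, path)  # rename: atomic swap into place
    except BaseException:
        # Do not leave a stray temp file behind on failure.
        with contextlib.suppress(FileNotFoundError):
            os.unlink(tmp_name)
        raise
    # fsync: persist the directory entry created by the rename (POSIX only).
    dir_fd = os.open(path.parent, os.O_RDONLY)
    try:
        os.fsync(dir_fd)
    finally:
        os.close(dir_fd)
```

The second fsync matters because the rename itself is a directory update; without it a crash could lose the new name even though the file contents were flushed.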

Exit codes

benchmark from-config with an audit: block exits 0 (PASS) / 1 (FAIL); errors propagate via the standard handler using the repo-wide scheme (InputValidationError → 2, SetupError → 3, ExecutionError → 4). The on-disk audit_verdict.json is the durable record.
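
The exit-code scheme above could be expressed roughly like this. The three exception classes are local stand-ins for the repo's own types, and audit_exit_code is a hypothetical helper used only to show the mapping.

```python
from typing import Callable


class InputValidationError(Exception):
    """Stand-in for the repo's exception of the same name."""


class SetupError(Exception):
    """Stand-in for the repo's exception of the same name."""


class ExecutionError(Exception):
    """Stand-in for the repo's exception of the same name."""


# Error type -> process exit code, as described in the PR summary.
ERROR_EXIT_CODES = {
    InputValidationError: 2,
    SetupError: 3,
    ExecutionError: 4,
}


def audit_exit_code(run: Callable[[], bool]) -> int:
    """Run an audit callable that returns True on PASS and map it to an exit code."""
    try:
        passed = run()
    except tuple(ERROR_EXIT_CODES) as exc:
        # isinstance lookup so subclasses of the known errors map correctly too.
        for exc_type, code in ERROR_EXIT_CODES.items():
            if isinstance(exc, exc_type):
                return code
        raise
    return 0 if passed else 1
```

A caller at the CLI boundary would then pass the result to sys.exit once, keeping the exit decision in a single place.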

Testing

Unit + integration green; pre-commit run --all-files clean. The e2e test exercises the full audit: → run_audit → AuditVerdict flow for both max_throughput (offline) and concurrency=1 (single-stream).

🤖 Generated with Claude Code

@wu6u3tw wu6u3tw requested a review from a team June 3, 2026 20:53
@github-actions

github-actions Bot commented Jun 3, 2026


MLCommons CLA bot All contributors have signed the MLCommons CLA ✍️ ✅

@gemini-code-assist gemini-code-assist Bot left a comment

Contributor

Code Review

This pull request implements the MLPerf TEST04 compliance audit to detect result caching by repeatedly issuing a single fixed sample and comparing the throughput against a reference run. It introduces configuration options, validation guards, a SingleSampleOrder generator, and a compliance verification module with a CLI tool and tests. The review feedback focuses on improving the robustness of the compliance verifier, specifically by handling potential OSError exceptions during file writes, catching AttributeError when parsing non-dictionary JSON configurations, and gracefully handling malformed snapshot files during parsing.


Comment thread src/inference_endpoint/compliance/__main__.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
wu6u3tw added a commit to wu6u3tw/endpoints that referenced this pull request Jun 3, 2026
…review

Address gemini-code-assist review on PR mlcommons#332:
- CLI catches OSError (PermissionError etc.) and write_verdict failures,
  not just FileNotFoundError/ValueError — all map to exit 2.
- _audit_marker tolerates non-dict results.json (isinstance guards) instead
  of raising AttributeError.
- _run_stats_from_dir rejects a non-dict snapshot with a clear ValueError.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@wu6u3tw wu6u3tw requested review from arekay-nv and nv-alicheng June 3, 2026 22:19
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22_test04.yaml Outdated
wu6u3tw commented Jun 4, 2026

Collaborator Author

Update summary

All review feedback has been addressed. Here is what changed since the original submission:

Architecture (main concern)

  • audit: test04 now runs both phases in a single command — reference run then audit run back-to-back against the same endpoint, with automatic comparison and verdict output. No more 3-step manual workflow.

Config shape

  • type: audit dataset replaces the old settings.runtime.test04_sample_index and audit_n_samples runtime variables. Reference and audit sample counts are now independent and co-located with the dataset config — consistent with how type: accuracy datasets carry their own accuracy_config.
audit: "test04"

datasets:
  - name: wan22_prompts
    path: wan22_prompts.jsonl
    type: "performance"
    samples: 50          # reference phase query count (50–144)

  - name: wan22_audit
    path: wan22_prompts.jsonl
    type: "audit"
    samples: 25          # audit phase query count (25–50)
    audit_sample_index: 0

Robustness

  • Warning logged when audit: test04 is set but no type: audit dataset is present (previously silent fallback to index 0).
  • Phase failures (SetupError/ExecutionError) are caught and logged cleanly — no unhandled traceback, verdict not lost.
  • Report.from_snapshot wrapped in try/except in _run_stats_from_dir — malformed snapshots exit with code 2 instead of crashing.
  • Pre-flight audit_sample_index bounds check before dataset load.

Testing

  • New e2e integration test (test_audit_test04_two_phase_flow) exercises the full run_benchmark → two-phase flow against the echo server and asserts both phase subdirs are created and the flow completes gracefully.

Example config

  • Renamed offline_wan22_test04.yaml → wan22_audit_test04.yaml per review suggestion.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 9057190 to b547f1d Compare June 4, 2026 23:14
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from cdbae64 to eae1234 Compare June 4, 2026 23:40
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated
Comment thread tests/unit/compliance/test_audit_test04.py Outdated
Comment thread README.md Outdated
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 3 times, most recently from b190d21 to e0de06f Compare June 5, 2026 21:03
Comment thread examples/09_Wan22_VideoGen_Example/offline_wan22.yaml
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 385630c to c1e48bf Compare June 5, 2026 21:22
wu6u3tw commented Jun 5, 2026

Collaborator Author

All review feedback has been addressed. Here's a summary of what changed:

Architecture
audit: test04 now runs reference and audit phases in a single command back-to-back against the same endpoint — no more 3-step workflow, no endpoint-change risk. A single type: audit dataset entry drives both phases (carrying ref_samples, audit_samples, audit_sample_index).

Sample counts & index
ref_samples: 50, audit_samples: 25 — sized for WAN2.2 throughput. audit_sample_index: 3 — fixed per MLCommons audit.config (performance_issue_same_index=3 for WAN2.2).

SingleStream
Added wan22_single_stream_test04.yaml (concurrency=1, ref/audit samples=20 matching MLCommons min_query_count).

Durations
Perf configs: min=10min, max=4hr. Audit configs: min=10min, max=2hr. The 10-min minimum documents MLCommons compliance intent; counts take priority in the current session stop logic, with AND-semantics available as a future improvement.

Robustness fixes (Gemini)

  • write_verdict moved inside try-except in CLI
  • _audit_marker uses isinstance guards — no AttributeError possible
  • Report.from_snapshot wrapped in try/except (KeyError, TypeError) in _run_stats_from_dir

Cleanup

  • Test renamed to test_audit_test04.py
  • README.md removed from diff (rebased onto main)
  • Orphaned type: audit datasets in non-TEST04 configs now emit a warning; multiple audit datasets raise InputValidationError

Comment thread examples/09_Wan22_VideoGen_Example/wan22_audit_test04.yaml Outdated

@nvzhihanj nvzhihanj left a comment

Collaborator

Review Council — first-principles design review

Reviewed by: Claude (Codex review timed out on this 2046-line diff at xhigh reasoning) · Depth: thorough

Focus: design issues warranting re-design for a modular, extensible audit-test framework (TEST04 is the first of several). 11 findings; see the tiered summary comment. The ref_samples dead-write (#1) was independently verified against the source.

Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread src/inference_endpoint/compliance/__init__.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/config/runtime_settings.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
Comment thread tests/integration/commands/test_benchmark_command.py Outdated
Comment thread src/inference_endpoint/config/schema.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/compliance/test04.py Outdated
Comment thread src/inference_endpoint/commands/benchmark/execute.py Outdated
@nvzhihanj

Collaborator

Review Council — Multi-AI Code Review (first-principles design review)

Reviewed by: Claude · Depth: thorough
Codex review timed out on this 2046-line diff at xhigh reasoning (the load-gen + compliance surface is large); this pass is Claude-led. The one HIGH bug below was independently verified against the source.

Framing: TEST04 is the first MLPerf compliance/audit test and is meant to become a modular, extensible framework. The findings below are design-led — what would adding the next audit (TEST01/05) cost, and where does TEST04-specific knowledge leak into general-purpose code. 11 findings, all posted inline.

🔴 Re-design / Must-fix

| # | File | Line | Cat | Why it needs a re-design |
|---|------|------|-----|--------------------------|
| 1 | commands/benchmark/execute.py | 1151 | bug | ref_samples is a dead write. Dataset.samples is consumed nowhere; ref_config never sets n_samples_to_issue, so the reference phase runs duration-driven and ignores ref_samples while the audit phase honors audit_samples → the compared phases run mismatched counts. Set n_samples_to_issue=ref_samples. |
| 2 | compliance/__init__.py | 18 | design | No AuditTest abstraction. run_benchmark hardcodes if audit==TEST04; package exports only test04_*. Adding TEST01/05 = cross-cutting edits everywhere. Introduce an AuditTest protocol (plan_runs+verify) registered by AuditMode. |
| 3 | config/schema.py | 82 | design | DatasetType.AUDIT is a fake dataset type the loader ignores, carrying test params on the shared Dataset model, then converted to PERFORMANCE. Move params to a structured audit: block; drop the fake type. |
| 4 | config/runtime_settings.py | 90 | design | test04 boolean leaks into core load-gen. RuntimeSettings.test04/test04_sample_index + create_sample_order's if settings.test04. Use a generic sample-order strategy selector, not a per-test flag. |

🟡 Should-fix

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 5 | commands/benchmark/execute.py | 113 | design | _OVERRIDE_TEST04_SAMPLE_INDEX stringly-typed magic kwarg through **runtime_overrides; pass a typed run_spec instead. |
| 6 | commands/benchmark/execute.py | 1146 | design | Two-phase model_copy surgery is fragile (root cause of #1; ref phase also skips _validate_audit_test04). Use a declarative RunSpec + validate before any phase runs. |
| 7 | tests/integration/commands/test_benchmark_command.py | 209 | testing | _run_benchmark_test04 has no unit test; the one integration test asserts verdict OR error with min_duration_ms=0, the regime that hides bug #1. |
| 8 | config/schema.py | 666 | design | audit bare top-level enum; params scattered, threshold hardcoded. Use a structured compliance sub-config (like accuracy_config). |
| 9 | compliance/test04.py | 206 | design | QPS compared across phases with different counts/contents (upstream TEST04 uses the same query set); completion guard only protects the FAIL direction. Extends the existing fairness thread; compounded by #1. |

🔵 Consider

| # | File | Line | Cat | Summary |
|---|------|------|-----|---------|
| 10 | compliance/test04.py | 175 | design | verify_test04_dirs vs verify_test04_from_reports duplication; dir-swap guard in one path only. Collapse to one core + thin adapters. |
| 11 | commands/benchmark/execute.py | 446 | bug | audit_sample_index bound-checked vs requested counts, not the loaded dataset size, until phase 2; an out-of-range index wastes a full reference run. |

Through-line: #1, #5, #6, #7 are all symptoms of the same root cause — TEST04 is bolted onto run_benchmark via per-phase config surgery and untyped overrides instead of a first-class audit-test abstraction (#2). Fixing #2/#3/#4 (an AuditTest that emits typed RunSpecs + a generic ordering strategy) would dissolve most of the others structurally.

Dedup: none overlap existing inline comments except #9, which extends the maintainer's existing fairness thread with upstream-parity / guard-direction substance.

Comment thread examples/09_Wan22_VideoGen_Example/single_stream_wan22.yaml Outdated
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 1dfaf3c to 89179d5 Compare June 22, 2026 23:50
wu6u3tw commented Jun 22, 2026

Collaborator Author

History squashed to 3 commits

The branch was force-pushed (89179d5) to reorganize the feature into three focused commits. No content changed vs the prior tip — only the commit layout. Heads-up: earlier inline review comments may now show as outdated since they pointed at the previous commits.

| # | Commit | Summary |
|---|--------|---------|
| 1 | docs(compliance): output-caching audit (MLPerf TEST04) design + examples | Design plan (docs/compliance_audit_plan.md), the compliance entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example configs (perf + accuracy + audit in one from-config run). |
| 2 | feat(compliance): output-caching audit (MLPerf TEST04) implementation | Generic AuditTest framework (compliance/): protocol + RunSpec/RunStats/RunArtifacts + registry; OutputCachingAudit implements TEST04 caching detection (reference vs fixed-sample phase; fails if audit QPS > reference × (1 + threshold)). run_audit orchestrator runs phases back-to-back, validates unpaced load + sample_index, refuses to certify an incomplete phase, and atomically writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json. Wired via the YAML audit: block and a generic SampleOrderSpec/SingleSampleOrder seam. |
| 3 | test(compliance): unit + integration tests for the output-caching audit | Unit tests for the verify core, plan_runs/verify, RunStats.from_report, the run_audit guards (load-pattern allow-list, incomplete-phase abort, interrupt-skips-audit), SampleOrderSpec/SingleSampleOrder, and the atomic result writer; plus the end-to-end audit: flow (offline + single-stream). |

Each commit independently passes pre-commit run --all-files (ruff, ruff-format, mypy, prettier, template regen, license, uv.lock).

Naming note: this version names the test output_caching_test (AuditTestId.OUTPUT_CACHING_TEST, OutputCachingTestConfig, compliance/result.py); the artifact is verify_OUTPUT_CACHING_TEST.txt. "TEST04" is retained in prose/comments as the upstream MLPerf test this re-implements.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch 2 times, most recently from 6411c17 to bb12443 Compare June 23, 2026 00:07
wu6u3tw commented Jun 23, 2026

Collaborator Author

@viraatc @arekay-nv can you review this? The implementation is already updated in this PR.

@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from bb12443 to 5abd63b Compare June 23, 2026 01:18
wu6u3tw commented Jun 23, 2026

Collaborator Author

Per offline discussion with @nvzhihanj, I will modify the layout of the dumped JSON file under audit/output_caching_test.json.

wu6u3tw and others added 3 commits June 25, 2026 15:02
Design plan (docs/compliance_audit_plan.md, incl. an ASCII program-flow
diagram showing every decision gate and its exit code), the compliance-module
entry in AGENTS.md, and the WAN2.2 Offline/SingleStream submission example
configs. All audit output nests under <report_dir>/audit/.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Generic AuditTest framework (compliance/): AuditTest protocol +
RunSpec/RunStats/RunArtifacts + registry; OutputCachingAudit implements MLPerf
TEST04 caching detection — reference vs fixed-sample phase, fails if audit QPS
exceeds reference QPS by > threshold. run_audit (commands/audit.py) runs phases
back-to-back, validates unpaced load + sample_index, refuses to certify an
incomplete phase, and writes verify_OUTPUT_CACHING_TEST.txt + audit_result.json
atomically under <report_dir>/audit/. Wired via the YAML audit: block and a
generic SampleOrderSpec + SingleSampleOrder seam. RunStats.from_report reads the
Report.qps attribute (matches the current Report API).

Also folds in incidental branch changes touching these files: metrics-aggregator
--ready-file flag, service launcher ready-check timeout, and the aiohttp +
msgpack==1.2.1 CVE bumps (msgpack clears GHSA-6v7p-g79w-8964).

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Unit tests for verify_output_caching, plan_runs/verify, RunStats.from_report,
the run_audit guards (load-pattern, incomplete phase, interrupt-skips-audit),
SampleOrderSpec/SingleSampleOrder, and the atomic result writer; plus the
end-to-end audit: flow asserting artifacts land under <report_dir>/audit/.

Rejected-load-pattern parametrization is derived from the LoadPatternType enum
so it stays correct regardless of which patterns exist on the base branch.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
@wu6u3tw wu6u3tw force-pushed the feat/test04-compliance branch from 5abd63b to d9cdeda Compare June 25, 2026 22:03
  - name: wan22_vbench
    path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl
    type: "accuracy"
    samples: 20

Collaborator

Just to make sure, we are only running 20 samples for singlestream accuracy? I don't see it being mentioned in https://github.com/mlcommons/inference/tree/master/text_to_video/wan-2.2-t2v-a14b

default=0,
help="Identity to send in the readiness signal",
)
parser.add_argument(

Collaborator

What is this? It seems like an artifact that should be removed.

Collaborator

@arekay-nv to review

Comment on lines +57 to +58
if isinstance(result, AuditResult):
    sys.exit(0 if result.passed else 1)

Collaborator

This might actually be a valid change. @arekay-nv @nv-alicheng should we remove the if and put proper exit code for the benchmark results?

logger.info(f"Partial results saved to {ctx.report_dir}")

if config.audit is not None:
from inference_endpoint.commands.audit import run_audit

Collaborator

This is a small module so I don't think we should lazy import

logger.warning("Benchmark interrupted by user")
# Salvage partial results (finally), then propagate: an interrupted
# run must not silently roll into the long compliance audit phases.
logger.warning("Benchmark interrupted by user; skipping audit")

Collaborator

This doesn't seem necessary, because not all runs have an audit.

SetupError: Config invalid for audit (missing audit block, paced load, bad index).
ExecutionError: A phase benchmark run failed.
"""
from ..commands.benchmark.execute import (

Collaborator

Lazy import, please fix

Comment on lines +83 to +87
if load_type not in (LoadPatternType.MAX_THROUGHPUT, LoadPatternType.CONCURRENCY):
    raise SetupError(
        "Compliance audit requires an unpaced load pattern (max_throughput or concurrency). "
        f"Got: {load_type.value}"
    )

Collaborator

I don't think this is a valid assertion. Poisson can also run TEST04.

Comment on lines +112 to +119
for check_spec in specs:
    idx = check_spec.sample_order.fixed_index
    if idx is not None and not (0 <= idx < n_samples):
        raise SetupError(
            f"Audit phase '{check_spec.label}': sample_index={idx} "
            f"is out of range [0, {n_samples}) for dataset with "
            f"{n_samples} samples"
        )

Collaborator

If you look at this chunk of code, it checks the index N^2 times. I'm not sure why that's needed, but it doesn't look right.

Comment on lines +141 to +143
n_requested = (
    spec.n_samples if spec.n_samples is not None else report.n_samples_issued
)

Collaborator

I see quite a few different places where n_samples and n_requested are read from spec and report. Do we want n_samples_issued to be equal to n_samples or to the dataset samples?

MLPerf Inference TEST04 compliance test.

Pass criterion (MLCommons-faithful):
Each phase completed ≥ requested * (1 - threshold)

Collaborator

What is this? The number of samples?

return RunStats.from_report(self.report, self.n_requested)


class AuditTest(Protocol):

Collaborator

@nv-alicheng to review if this is the right way to register audit tests

Comment on lines +44 to +60
@dataclass(frozen=True, slots=True)
class SampleOrderSpec:
    """Generic sample-ordering selector consumed by create_sample_order.

    fixed_index is None -> without-replacement (the normal default).
    fixed_index set -> always issue that one fixed dataset index.
    """

    fixed_index: int | None = None

    @classmethod
    def without_replacement(cls) -> SampleOrderSpec:
        return cls(fixed_index=None)

    @classmethod
    def single(cls, index: int) -> SampleOrderSpec:
        return cls(fixed_index=index)

Collaborator

I thought sample order is part of the load pattern traits. @viraatc @nv-alicheng will this be a duplicate?
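
For context on what the selector quoted above would drive, here is a minimal sketch of the two orderings as plain generators. It is not the repo's SampleOrder class hierarchy or its create_sample_order: make_sample_order, its signature, and the helper first_n are assumptions made for illustration.

```python
import itertools
import random
from collections.abc import Iterator
from typing import Optional


def single_sample_order(index: int) -> Iterator[int]:
    """Yield the same dataset index forever (the fixed-sample audit phase)."""
    while True:
        yield index


def without_replacement_order(n_samples: int, rng: random.Random) -> Iterator[int]:
    """Yield every index once per pass, reshuffling between passes."""
    while True:
        indices = list(range(n_samples))
        rng.shuffle(indices)
        yield from indices


def make_sample_order(
    fixed_index: Optional[int], n_samples: int, rng: random.Random
) -> Iterator[int]:
    """Hypothetical factory: dispatch on the selector, validating eagerly."""
    if n_samples < 1:
        raise ValueError("dataset must contain at least one sample")
    if fixed_index is None:
        return without_replacement_order(n_samples, rng)
    if not 0 <= fixed_index < n_samples:
        raise ValueError(f"fixed_index={fixed_index} out of range [0, {n_samples})")
    return single_sample_order(fixed_index)


def first_n(order: Iterator[int], n: int) -> list[int]:
    """Take the first n indices from an (infinite) order."""
    return list(itertools.islice(order, n))
```

Because the factory is a regular function rather than a generator, an out-of-range index is rejected when the order is built, before any request is issued.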

# `random` module, so an order constructed without an explicit rng can't
# couple its draws to unrelated global state. Reproducible runs pass a
# seeded rng (see create_sample_order).
self.rng = rng if rng is not None else random.Random()

Collaborator

Wait, should this be touched in this way?
@nv-alicheng to help review

Comment on lines +68 to +69
if ready_file is not None:
    cmd += ["--ready-file", str(ready_file)]

Collaborator

Why do we need this file?

@nvzhihanj nvzhihanj Jun 25, 2026

Collaborator

I see the reason. It sounds like it should be a temp file instead of an arg-specified file.

wu6u3tw and others added 3 commits June 25, 2026 17:08
…not run_benchmark

Addresses review feedback (no function-level imports). The lazy imports existed
only to break an execute<->audit import cycle: run_benchmark imported run_audit,
and run_audit imported setup/run/finalize from execute.

Break the cycle so all imports are top-level:
- run_benchmark no longer dispatches the audit; it returns the run's report_dir.
- cli._run dispatches the audit after the main run:
  run_audit(config, report_dir / "audit") and maps PASS/FAIL to the exit code.
  (Interrupt still skips the audit: run_benchmark re-raises and cli never reaches
  the dispatch.)
- audit.py imports execute helpers at module top (one-way, no cycle); execute.py
  no longer imports compliance/audit at all.

Tests updated for the new dispatch point: the interrupt test targets cli._run,
the audit e2e test calls run_benchmark + run_audit, and the incomplete-phase
guard test patches the execute helpers as bound in commands.audit.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…for TEST04

Per review (nvzhihanj on mlcommons#332): the SampleOrder.__init__ rng default
(`random` module → per-instance random.Random()) is a pre-existing load-gen
line unrelated to the output-caching audit, and the right default is a
load-gen ownership call. Revert it here to keep this PR scoped to compliance;
the global-RNG-sharing concern can be addressed in its own load-gen PR.

Removes the accompanying TestDefaultRng test as well.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
…/.ready marker

Per review (nvzhihanj on mlcommons#332): the readiness sentinel shouldn't be a
CLI-arg-specified file. Remove the --ready-file argument and instead have the
aggregator always touch <metrics_output_dir>/.ready once its signal handlers
are registered. The signal-handling test polls that marker (still an exact
"ready to receive signals" gate, replacing the flaky fixed sleep), with no
test-only CLI surface on the production subprocess.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
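
The marker-polling idea in the commit message above (wait for a .ready file rather than sleeping a fixed time) might look like this on the test side. wait_for_marker and its default timings are hypothetical and are not taken from the PR.

```python
import time
from pathlib import Path


def wait_for_marker(marker: Path, timeout_s: float = 10.0, poll_s: float = 0.05) -> bool:
    """Poll until `marker` exists or the timeout elapses; return whether it appeared."""
    deadline = time.monotonic() + timeout_s
    while time.monotonic() < deadline:
        if marker.exists():
            return True
        time.sleep(poll_s)
    # One last check so a marker created during the final sleep still counts.
    return marker.exists()
```

A test would call this on the marker path under the metrics output directory before sending a signal to the subprocess, and fail fast if it returns False.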
Comment on lines +173 to +179
║ ╱──────────────╲ yes ╱─────────────────╲ ║ │
║ ╱ first phase? ╲───►╱ every spec's ╲──╫──no─────┘
║ ╲────────┬────────╱ ╲ sample_index in ╱ ║ (out of range)
║ │ no ╲ range [0,N)? ╱ ║
║ │ ╲──────┬────────-╱ ║
║ │◄─────────────────────┘ yes ║
║ ▼ ║

Collaborator

Can you elaborate on this part?


### Test matrix (LLM-relevant subset)

| Test | Detects | Category | Required for |

Collaborator

Are these categories defined somewhere in this doc, or are they MLPerf terms?

Comment on lines +224 to +226
A single protocol covers **both** categories — orchestrators (must execute a
specially-configured run) and analyzers (pure post-run). An analyzer is just an audit whose
plan is a single normal run, so the orchestration loop never special-cases a category.

Collaborator

Can we move this definition to the top, before we use it? It would help ease the narrative.

Comment on lines +275 to +281
class OutputCachingTestConfig(BaseModel):
    model_config = ConfigDict(frozen=True, extra="forbid")
    test: Literal[AuditTestId.OUTPUT_CACHING_TEST]
    samples: int # reference phase count (required, ge=1)
    audit_samples: int | None = None # audit phase count; None = equals `samples`
    sample_index: int = 0 # MLPerf performance_issue_same_index
    threshold: float = 0.10 # caching tolerance (MLPerf TEST04-specific)

Collaborator

If we wanted to add two OutputCachingTest instances, would we duplicate the block or put a list in sample_index?

Comment on lines +412 to +483
```yaml
# Full WAN 2.2 Offline submission: performance + VBench accuracy + TEST04 audit.
# One command runs all three under a single report_dir:
#   inference-endpoint benchmark from-config \
#     examples/09_Wan22_VideoGen_Example/offline_wan22_submission.yaml
#
# Execution order (run_benchmark):
#   1. performance run — full 248-prompt dataset (the submission perf result)
#   2. accuracy scoring — VBench over the produced videos
#   3. audit (TEST04) — reference + fixed-sample phases (equal counts here), then result
#
# NOTE: the `audit:` block is implemented per docs/compliance_audit_plan.md
# (the `compliance/` module). The performance + accuracy portion mirrors
# offline_wan22_accuracy.yaml.

name: "submission-wan22-video-generation"
version: "1.0"
type: "submission"
benchmark_mode: "offline" # required for type: submission

model_params:
  name: "wan22"
  max_new_tokens: 1 # ignored by VideoGenAdapter; kept >0 for api_type debug swaps
  streaming: "off" # WAN 2.2 uses non-streaming HTTP POST/response

datasets:
  # Performance dataset drives request issuance (the submission perf run).
  - name: wan22_perf
    path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl
    type: "performance"
    samples: 248

  # Accuracy dataset reuses the same prompts; videos are scored VBench-style.
  - name: wan22_vbench
    path: examples/09_Wan22_VideoGen_Example/wan22_prompts.jsonl
    type: "accuracy"
    samples: 248
    accuracy_config:
      eval_method: "vbench"
      ground_truth: "prompt" # VBench input is (prompt, video), not a GT comparison
      num_repeats: 1

# TEST04 caching audit — additive post-step. Runs its OWN reference + fixed-sample
# phases at equal counts (the audit count may be lowered to shorten the phase).
audit:
  test: "output_caching_test"
  samples: 64 # reference phase count (subset of the 248 prompts)
  audit_samples: 64 # audit (fixed-sample) phase count; lower (e.g. 32) to shorten the audit phase
  sample_index: 3 # MLCommons audit.config performance_issue_same_index=3
  threshold: 0.10 # audit qps must stay < reference qps * (1 + threshold)

settings:
  runtime:
    # NOTE: runs are count-driven (n_samples_to_issue / audit.samples). min_duration_ms is
    # NOT enforced as a duration floor by the current stop logic (counts take priority);
    # MLCommons' 10-min minimum / AND-semantics is future work. Only max_duration_ms caps.
    max_duration_ms: 14400000 # 4-hour ceiling
    scheduler_random_seed: 42
    dataloader_random_seed: 42
    n_samples_to_issue: 248 # applies to the perf/accuracy run; audit uses audit.samples

  load_pattern:
    type: "max_throughput"

endpoint_config:
  endpoints:
    - "http://localhost:8000"
  api_type: "videogen"
  api_key: null

report_dir: logs/wan22_submission
```

Collaborator

Let's not replicate the file here; it becomes hard to keep both in sync.

# audit_result.json) nest under <report_dir>/audit/ so they don't
# intermingle with the main run's top-level output.
result = run_audit(config, report_dir / "audit")
sys.exit(0 if result.passed else 1)

Collaborator

I don't think we should be calling `sys.exit` here. Is this needed?
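
One possible shape for this, sketched under the assumption that the CLI has a single outermost entry point: the command maps the verdict to an exit code and returns it, and only `main` terminates the process. `AuditResultSketch`, `audit_exit_code`, and `main` are hypothetical names, and `run_audit` is represented here by an injected callable rather than the PR's real function.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class AuditResultSketch:
    """Hypothetical stand-in for the value run_audit returns."""

    passed: bool


def audit_exit_code(run: Callable[[], AuditResultSketch]) -> int:
    """Run the audit and map the verdict to an exit code without exiting."""
    result = run()
    return 0 if result.passed else 1


def main(run: Callable[[], AuditResultSketch]) -> None:
    # Only the outermost entry point terminates the process.
    raise SystemExit(audit_exit_code(run))
```

Keeping the exit out of the command body lets tests and other callers inspect the verdict without catching `SystemExit`.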

Comment on lines +36 to +50
class RunSpec:
    """Declarative description of one audit phase.

    ``n_samples = None`` means "issue the benchmark's default count" (full
    dataset / duration-driven) — it flows through to
    ``RuntimeSettings.n_samples_to_issue`` unchanged.
    """

    label: str
    n_samples: int | None
    sample_order: SampleOrderSpec


@dataclass(frozen=True, slots=True)
class RunStats:

Collaborator

Can we rename these to AuditRunSpec and AuditRunStats so it's clear that they are related to the audit phase? Otherwise, we can move them to a shared/central location.

Collaborator

This is a confusing folder name; `audit` or `audit_test` would probably be better. The current name resembles a folder of unit tests for a component, like the one where you added test_output_caching.py.

Comment on lines +86 to +87
1. Each phase completed ≥ (1 - threshold) of its requested queries.
2. audit_qps < ref_qps * (1 + threshold)

Collaborator

So there are two constraints: one on the number of samples completed and one on the QPS? Both use the same threshold, but in different directions.
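
Read literally, the two checks in the docstring above can be sketched as follows. This is a minimal illustration assuming each phase exposes a requested count, a completed count, and a QPS; `PhaseStatsSketch` and `verify_output_caching` are hypothetical names, not the PR's `AuditTest.verify` implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PhaseStatsSketch:
    """Hypothetical per-phase summary used only for this illustration."""

    n_requested: int
    n_completed: int
    qps: float


def verify_output_caching(
    ref: PhaseStatsSketch,
    audit: PhaseStatsSketch,
    threshold: float = 0.10,
) -> tuple[bool, list[str]]:
    """Return (passed, reasons) for the two checks quoted from the docstring."""
    reasons: list[str] = []

    # Check 1: completion floor. The threshold is applied downward here.
    for label, phase in (("reference", ref), ("audit", audit)):
        floor = (1 - threshold) * phase.n_requested
        if phase.n_completed < floor:
            reasons.append(f"{label} phase completed {phase.n_completed} < {floor:.1f}")

    # Check 2: QPS ceiling. The same threshold is applied upward here.
    ceiling = ref.qps * (1 + threshold)
    if not audit.qps < ceiling:
        reasons.append(f"audit qps {audit.qps} >= {ceiling:.3f}")

    return (not reasons, reasons)
```

Written this way, the single `threshold` value plays two roles (a completion floor and a speedup ceiling), which is what the question above is pointing at.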

Comment on lines +139 to +144
threshold: float = Field(
    0.10,
    gt=0,
    lt=1,
    description="Caching tolerance: audit_qps must stay < ref_qps * (1 + threshold)",
)

Collaborator

Note: we are also using this threshold for the n_completed check.
