[AMD] feat: MiniMax M3 day-zero benchmark for MI325X by cquil11 · Pull Request #1748 · SemiAnalysisAI/InferenceX

cquil11 · 2026-06-13T23:48:29Z

Summary

add minimaxm3-fp8-mi325x-vllm for MiniMax M3 MXFP8 on MI325X
use vllm/vllm-openai-rocm:minimax-m3 and the official MI325X MXFP8 recipe shape
mirror the H200 non-MTP search space: TP4/TP8 latency, TP4/TP8 expert parallelism, and TP8 data-parallel attention across 1k1k and 8k1k
route Hugging Face cache to node-local /local-nvme/hf-hub-cache/ and runtime compiler caches to container-local /tmp
disable prefix caching for random-dataset benchmarks
mount /dev/kfd and /dev/dri explicitly for ROCm
use the default BF16 KV cache because FP8 KV corrupts MiniMax M3 generation on this MI325X/gfx942 image

Recipe Alignment

model: MiniMaxAI/MiniMax-M3-MXFP8
image: vllm/vllm-openai-rocm:minimax-m3
--block-size 128
--attention-backend TRITON_ATTN
--language-model-only
--no-enable-prefix-caching
MiniMax M3 tool/reasoning parsers with automatic tool choice
no MI355X-specific --enforce-eager workaround

Upstream reference: https://recipes.vllm.ai/MiniMaxAI/MiniMax-M3?hardware=mi325x&variant=mxfp8
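
The recipe flags listed above can be collected into one helper so the launch command stays readable. The following is a minimal sketch, not the PR's actual benchmark script: the function name `recipe_flags` is hypothetical, and the parser flag spellings are taken from the diff excerpt quoted in the Bugbot review. The command is printed rather than executed.

```shell
# Hypothetical helper: emit the recipe-aligned vLLM flags, one per line.
# Deliberately omits --kv-cache-dtype so vLLM keeps its default KV cache dtype.
recipe_flags() {
    cat <<'EOF'
--block-size 128
--attention-backend TRITON_ATTN
--language-model-only
--no-enable-prefix-caching
--tool-call-parser minimax_m3
--reasoning-parser minimax_m3
--enable-auto-tool-choice
EOF
}

# Print the assembled command instead of launching a server.
# The unquoted substitution is intentional: it splits the flags into words.
echo vllm serve MiniMaxAI/MiniMax-M3-MXFP8 $(recipe_flags)
```

Keeping the flags in one place makes it easier to compare this variant against the MI355X script, where the only intended differences are the header comment and the `--enforce-eager` workaround.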

Validation

Representative throughput smoke: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27482912444

vLLM started successfully on MI325X with the PR command
model downloaded through node-local /local-nvme/hf-hub-cache
all checkpoint shards loaded; CUDA graph capture completed
40 random 1k1k requests at TP4 / EP1 / concurrency 4 completed per runner
result processing, result upload, server-log upload, GPU-metrics upload, aggregation, and success-rate calculation succeeded

Targeted DPA accuracy validation: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27484953170

exact failed full-sweep point: TP1 x DP8 + EP, concurrency 512, 8k1k eval-only
server startup and all 1,319 GSM8K requests completed
GSM8K strict exact match: 0.9575
GSM8K flexible exact match: 0.9568
score validation, server-log upload, GPU-metrics upload, eval artifact upload, aggregation, and success-rate calculation succeeded

Failure diagnosis: --kv-cache-dtype fp8 produced deterministic repetitive/cross-prompt generation corruption and 1-2% GSM8K. On the same node, image, weights, and layouts, removing only FP8 KV restored correct generation with and without CUDA graphs. The PR therefore leaves KV cache at vLLM's default dtype.
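
The gap between the two outcomes (about 0.96 strict match with the default KV cache versus 1-2% with FP8 KV) is wide enough that a simple threshold gate would catch the regression. A minimal sketch, assuming a hypothetical `score_ok` helper and an illustrative 0.90 threshold; neither is taken from the PR, whose actual score-validation logic is not shown here.

```shell
# Hypothetical gate: succeed when score >= threshold.
# awk performs the floating-point comparison that POSIX test cannot.
score_ok() {
    awk -v s="$1" -v t="$2" 'BEGIN { exit !(s + 0 >= t + 0) }'
}

THRESHOLD=0.90   # illustrative value, not from the PR

# 0.9575 is the strict-match score reported above; 0.02 stands in for the
# 1-2% range observed with --kv-cache-dtype fp8.
if score_ok 0.9575 "$THRESHOLD"; then echo "default KV cache: pass"; fi
if ! score_ok 0.02 "$THRESHOLD"; then echo "fp8 KV cache: fail"; fi
```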

Additional validation:

shell syntax, YAML parsing, matrix generation, and git diff --check pass
matrix matches the H200-aligned 31-point search space
/enroot resolves to local NVMe on every healthy compute node
XDG_CACHE_HOME and TRITON_CACHE_DIR use per-job local paths, avoiding stale NFS compiler artifacts
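
The cache routing described in this PR can be sketched as a small helper that prints the relevant exports for one job. This is an illustration under assumptions: the helper name `cache_env`, the `/tmp/job-<id>` directory layout, and the use of the `HF_HUB_CACHE` environment variable to point at the node-local path are not confirmed by the PR, which only states where the caches end up.

```shell
# Hypothetical helper: print the cache-related exports for one job.
# The /tmp/job-<id> layout is an assumption; the PR only states that the
# compiler cache paths are per-job and local rather than on NFS.
cache_env() {
    job_id="$1"
    base="/tmp/job-${job_id}"
    echo "export HF_HUB_CACHE=/local-nvme/hf-hub-cache"
    echo "export XDG_CACHE_HOME=${base}/xdg-cache"
    echo "export TRITON_CACHE_DIR=${base}/triton-cache"
}

# Show what would be exported for a job with the placeholder ID 12345.
cache_env 12345
```

Scoping the compiler caches by job ID means a later job never reuses artifacts compiled by an earlier one, which is the stale-artifact problem the validation note refers to.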

Full PR sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27485135330

Changelog Integrity

perf-changelog.yaml matches current main byte-for-byte, followed only by this PR's entry at the tail.

github-actions · 2026-06-13T23:48:43Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

claude · 2026-06-13T23:49:04Z

Claude finished @cquil11's task in 2m 46s

Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

Read changed files and gather context
Validate master config and perf-changelog
Validate benchmark script code style and patterns
Validate launch script changes
Post review findings

LGTM — no blocking issues found.

Validation summary:

Config + perf-changelog both updated ✅
Perf-changelog entry appended at end of file ✅
Image vllm/vllm-openai-rocm:minimax-m3 is publicly accessible (Docker Hub) and matches MI355X ✅
Launch script has enroot import docker:// pattern for reproducibility ✅
Expert parallelism is correctly conditional on EP_SIZE / DP_ATTENTION ✅
Server launch command is properly formatted with separate lines ✅
MI325X search space is an exact match of the H200 non-MTP search space (verified both 1k1k and 8k1k) ✅
Benchmark script is functionally identical to the MI355X variant (minimaxm3_fp8_mi355x.sh), differing only in the header comment ✅

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path to /local_nvme/ and /dev/kfd+/dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive so shouldn't break existing jobs, and moving to node-local NVMe is a reasonable infra choice for faster model loading — just noting the scope.

claude · 2026-06-13T23:52:29Z

Claude finished @cquil11's task in 1m 10s

Review: PR #1748 — MiniMax M3 MI325X day-zero benchmark

Read changed files and gather context
Validate master config and perf-changelog
Validate container image accessibility
Validate benchmark script (code style, EP pattern, server formatting)
Validate launch script (enroot import, device mounts)
Validate model prefix
Post review findings

LGTM — no blocking issues found.

Validation summary:

Master config + perf-changelog: Both updated, perf-changelog entry correctly appended at EOF ✅
Model prefix: minimaxm3 matches existing entries across AMD and NVIDIA configs ✅
Container image: vllm/vllm-openai-rocm:minimax-m3 on Docker Hub — publicly accessible ✅
Enroot import: launch_mi325x-amds.sh has the enroot import docker:// pattern ✅
Expert parallelism: Correctly conditional on EP_SIZE / DP_ATTENTION with the standard pattern ✅
Server command formatting: Arguments on separate lines with \ continuations ✅
H200 search space alignment: MI325X search space is an exact match of the H200 non-MTP search space (both 1k1k and 8k1k) ✅
MI355X vs MI325X diff: Only differences are the header comment and removal of --enforce-eager (intentional per PR description — MI355X-specific workaround) ✅

Note (non-blocking): The launch_mi325x-amds.sh changes (HF cache path to /local_nvme/ and /dev/kfd+/dev/dri mounts) apply to all MI325X benchmarks, not just MiniMax M3. The device mounts are additive and the NVMe cache path is a reasonable infra choice — just noting the blast radius.
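
The EP_SIZE / DP_ATTENTION conditional that the review checks can be sketched as follows. This is not the script from the PR: the helper name `parallel_flags` is hypothetical, and the mapping for data-parallel attention (the TP value becomes the data-parallel width while each rank runs TP1) is an assumption inferred from the "TP1 x DP8 + EP" point in the PR description.

```shell
# Hypothetical helper: map (TP, EP_SIZE, DP_ATTENTION) to vLLM parallelism flags.
# Assumption: with DP attention, the TP value becomes the data-parallel width
# and each rank runs TP1, matching the "TP1 x DP8 + EP" point in the PR text.
parallel_flags() {
    tp="$1"; ep_size="$2"; dp_attention="$3"
    if [ "$dp_attention" = "true" ]; then
        flags="--tensor-parallel-size 1 --data-parallel-size $tp"
    else
        flags="--tensor-parallel-size $tp"
    fi
    # Expert parallelism is only requested when EP_SIZE is greater than 1.
    if [ "$ep_size" -gt 1 ]; then
        flags="$flags --enable-expert-parallel"
    fi
    echo "$flags"
}

parallel_flags 4 1 false   # TP4 latency point
parallel_flags 8 8 true    # TP8 with data-parallel attention
```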

github-actions · 2026-06-14T00:12:29Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483150384
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483150384

github-actions · 2026-06-14T00:30:38Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483150384
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483150384

github-actions · 2026-06-14T01:28:16Z

see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27483489611
see unofficial run visualizer at https://inferencex.semianalysis.com/evaluation?unofficialRun=27483489611

cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Reviewed by Cursor Bugbot for commit 5ec3e11.

cursor · 2026-06-14T01:50:03Z

+    --no-enable-prefix-caching \
+    --tool-call-parser minimax_m3 \
+    --reasoning-parser minimax_m3 \
+    --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 &


Missing FP8 KV cache flag

Medium Severity

The new MI325X vllm serve invocation omits --kv-cache-dtype fp8 even though the PR recipe alignment, changelog, and the existing minimaxm3_fp8_mi355x.sh baseline all specify FP8 KV cache. Without it, vLLM may use a non-FP8 KV default, skewing memory headroom and throughput versus the official MI325X MXFP8 recipe and other MiniMax M3 entries.

feat: add MiniMax M3 MI325X day-zero benchmark

c485637

github-project-automation Bot added this to InferenceMAX Board Jun 13, 2026

cquil11 marked this pull request as ready for review June 13, 2026 23:48

cquil11 requested a review from a team June 13, 2026 23:48

cquil11 requested review from 1am9trash, billishyahao, chunfangamd, seungrokj and yctseng0211 as code owners June 13, 2026 23:48

chore: link MiniMax M3 MI325X changelog

3119614

cquil11 marked this pull request as draft June 13, 2026 23:48

cquil11 marked this pull request as ready for review June 13, 2026 23:49

fix: align MI325X MiniMax M3 with upstream recipe

de3c5dc

cquil11 marked this pull request as draft June 13, 2026 23:50

cquil11 marked this pull request as ready for review June 13, 2026 23:51

fix(amd): disable MiniMax M3 prefix caching

8e1834a

cquil11 added the full-sweep-fail-fast label Jun 14, 2026

fix(amd): keep MI325X compiler caches local

78e5ba5

fix(amd): disable MiniMax M3 FP8 KV cache on MI325X

5ec3e11

cursor Bot reviewed Jun 14, 2026

cquil11 wants to merge 6 commits into main from codex/minimaxm3-mi325x-dayzero
