[DO NOT MERGE] [Klaud Cold] experimental: MiniMax-M3 MI325X conc 4/8 — apply vllm#45639 (AITER AR + Gemma-RMS fusion) #1772

Open
functionstackx wants to merge 3 commits into main from experimental/minimaxm3-mi325-arfusion-45639-dnm

functionstackx commented Jun 15, 2026

[DO NOT MERGE] — experimental hardware validation of an upstream WIP vLLM PR. Not for merge.

What

MI325X (gfx942) counterpart of #1770. Benchmarks MiniMax-M3 MXFP8 on MI325X, conc 4 and 8 (TP8), with vllm-project/vllm#45639 (AITER fused all-reduce + Gemma-RMSNorm) applied in-place to the shipped vllm/vllm-openai-rocm:minimax-m3 image.

How

benchmarks/single_node/fixed_seq_len/minimaxm3arf_fp8_mi325x.sh:

  • applies the vendored #45639 diff with patch -p1 (installing patch via apt if missing). The apply is idempotent (it proceeds if the diff is already applied) and hard-fails (exit 1) if the diff neither applies cleanly nor is already applied, which means the image has drifted from m3_release.
  • serves with the fusion enabled: VLLM_ROCM_USE_AITER=1 and --compilation-config '{"custom_ops": ["-minimax_gemma_rms_norm"], "pass_config": {"fuse_allreduce_rms": true}}'. The KV cache stays BF16 because gfx942 has no calibrated FP8 attention scales.
  • otherwise mirrors minimaxm3_fp8_mi325x.sh; carries a PROFILE=1 --profiler-config gate so the companion profiling PR reuses the recipe.
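
The serve step described in the list above can be sketched roughly as follows. This is a minimal sketch under stated assumptions, not the recipe itself: `MODEL` and `PROFILER_CONFIG` are hypothetical placeholders (the PR text shows neither the model path nor the profiler JSON), and the command is only printed here instead of being executed. The environment variable, the `--compilation-config` value, the TP size, and the `--profiler-config` flag name are taken from the description.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Hypothetical placeholders: the real model path and profiler JSON are not shown in the PR.
MODEL="${MODEL:-/path/to/MiniMax-M3}"
PROFILER_CONFIG="${PROFILER_CONFIG:-}"

# Value quoted in the PR description: disable the custom op so the fusion pass can match.
COMPILATION_CONFIG='{"custom_ops": ["-minimax_gemma_rms_norm"], "pass_config": {"fuse_allreduce_rms": true}}'

# Build the argument list in an array so the JSON survives quoting intact.
build_serve_args() {
  SERVE_ARGS=(serve "$MODEL" --tensor-parallel-size 8 --compilation-config "$COMPILATION_CONFIG")
  # PROFILE=1 gate: append the profiler flag only when profiling is requested.
  if [[ "${PROFILE:-0}" == "1" ]]; then
    SERVE_ARGS+=(--profiler-config "$PROFILER_CONFIG")
  fi
}

build_serve_args
export VLLM_ROCM_USE_AITER=1

# Print the command instead of running it; a real recipe would exec vllm here.
printf '%q ' vllm "${SERVE_ARGS[@]}"
echo
```

Keeping the arguments in an array means one recipe can serve both the benchmark and the profiling run without duplicating the command line.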

amd-master.yaml, entry minimaxm3arf-fp8-mi325x-vllm: a distinct model-prefix (minimaxm3arf) routes to the new recipe; conc 4 and 8, TP8. The production recipe and config are untouched.

Validation

bash -n ✓; YAML parses ✓; test-config → 2 jobs (TP8, conc 4 + 8, mi325x, minimaxm3arf_1k1k) ✓.

🤖 Generated with Claude Code


Note

Low Risk
Benchmark-only experimental config and recipe; no changes to production MiniMax-M3 paths. Runtime in-container patching is isolated to marked experimental jobs.

Overview
Adds an experimental, do-not-merge MI325X smoke path to validate vllm#45639 (AITER fused all-reduce + Gemma-RMSNorm for MiniMax-M3) on real gfx942 hardware before the upstream change ships in a rebuilt image.

A new minimaxm3arf-fp8-mi325x-vllm entry in amd-master.yaml uses model-prefix minimaxm3arf so jobs run minimaxm3arf_fp8_mi325x.sh instead of the production MI325X MiniMax-M3 recipe. The sweep is narrow: TP8 at conc 4/8 for 1k1k perf, plus an 8k1k conc-16 row so lm-eval can run under the existing eval policy.

The recipe vendors and applies the #45639 diff to the installed vLLM inside the minimax-m3 container (idempotent apply, hard-fail if the image no longer matches the patch base). Serving then turns on VLLM_ROCM_USE_AITER=1, fuse_allreduce_rms, and disables the minimax_gemma_rms_norm custom op so the fusion pass can match IR. DEBUG logging plus a post-startup grep “fusion-pass verdict” block records whether the pass registered and replaced patterns.
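
The post-startup verdict block could look something like the sketch below. The two match strings ("Replaced N patterns" and "fusion pass matches") come from the commit message in this PR; the function name `fusion_verdict`, the output wording, and the log path argument are illustrative assumptions, and the real vLLM log lines may be formatted differently.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Print a short verdict from the server log. Always returns 0 so the
# benchmark continues; the verdict is informational only.
fusion_verdict() {
  local log="$1" hits
  # grep exits non-zero when nothing matches or the file is missing,
  # so '|| true' keeps 'set -e' from aborting the job.
  hits="$(grep -E 'Replaced [0-9]+ patterns|fusion pass matches' "$log" 2>/dev/null || true)"
  echo "=== fusion-pass verdict ==="
  if [[ -n "$hits" ]]; then
    printf '%s\n' "$hits"
  else
    echo "no replacement lines found (the pass may not have registered)"
  fi
}

# Usage: pass the server log path as the first argument (path is a placeholder).
if [[ $# -ge 1 ]]; then
  fusion_verdict "$1"
fi
```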

perf-changelog.yaml documents the new config key. Production minimaxm3* configs are unchanged.

Reviewed by Cursor Bugbot for commit eb422fe.

functionstackx and others added 2 commits June 15, 2026 00:47
… applied

MI325X (gfx942) counterpart of #1770. Validate vllm-project/vllm#45639 ("[ROCm][M3]
Enable AITER AR + Gemma-RMS fusion for MiniMax-M3") on MI325X before an image
rebuild, by applying the PR diff in-place to the shipped minimax-m3 image.

- patches/vllm-45639-aiter-ar-gemma-rms.diff: vendored PR diff.
- minimaxm3arf_fp8_mi325x.sh: applies the diff (idempotent; HARD-FAILS if it
  neither applies nor is already applied), serves with VLLM_ROCM_USE_AITER=1 +
  --compilation-config (custom_ops -minimax_gemma_rms_norm, pass_config.fuse_allreduce_rms).
  BF16 KV (gfx942). Includes a PROFILE=1 --profiler-config gate so the same recipe
  serves the companion profiling PR.
- amd-master.yaml minimaxm3arf-fp8-mi325x-vllm: model-prefix minimaxm3arf routes
  to the new recipe; conc 4 and 8, TP8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

    echo "[vllm#45639] already applied to $VLLM_SP/vllm"
elif ( cd "$VLLM_SP" && patch -p1 --dry-run < "$PATCH_FILE" >/dev/null 2>&1 ); then
    ( cd "$VLLM_SP" && patch -p1 < "$PATCH_FILE" )
    echo "[vllm#45639] applied to $VLLM_SP/vllm"

Patch apply errors ignored

High Severity

After a successful patch --dry-run, the script runs patch -p1 without checking its exit status and always prints that vllm#45639 was applied. A failed apply still starts vllm serve, so the job can finish while benchmarking an image that never received the fusion patch.
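
One way to close this gap is to check the exit status of the real apply as well as the dry-run. The sketch below is an illustration, not the fix that landed: the function name `apply_patch_once` is hypothetical, and using a reverse dry-run (`patch -R -f --dry-run`) to detect an already-applied diff is an assumption, since the quoted hunk does not show how the recipe makes that check.

```bash
#!/usr/bin/env bash
set -euo pipefail

# Apply patch_file to tree at most once; return non-zero on any failure.
# patch_file should be an absolute path because each step cd's into tree first.
apply_patch_once() {
  local tree="$1" patch_file="$2"
  # Assumption: a clean reverse dry-run means the diff is already applied.
  # -f suppresses interactive prompts about reversed patches.
  if ( cd "$tree" && patch -p1 -R -f --dry-run < "$patch_file" >/dev/null 2>&1 ); then
    echo "[vllm#45639] already applied to $tree"
  elif ( cd "$tree" && patch -p1 --dry-run < "$patch_file" >/dev/null 2>&1 ); then
    # Check the real apply too: a passing dry-run does not guarantee success.
    if ! ( cd "$tree" && patch -p1 < "$patch_file" ); then
      echo "[vllm#45639] apply failed in $tree" >&2
      return 1
    fi
    echo "[vllm#45639] applied to $tree"
  else
    echo "[vllm#45639] neither applies nor is already applied" >&2
    return 1
  fi
}

# Usage: pass the tree and the patch file as arguments (both are placeholders here).
if [[ $# -ge 2 ]]; then
  apply_patch_once "$1" "$2" || exit 1
fi
```

With this shape the success message is printed only after `patch` itself reports success, so a failed apply stops the job before the server starts.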

Reviewed by Cursor Bugbot for commit 65be443.

…ity w/ #1770)

Mirror the #1770 updates onto the MI325X #45639 sweep:
- VLLM_LOGGING_LEVEL=DEBUG + a post-server-ready grep of the server log into the
  job log, printing the AITER AR+RMS fusion verdict (registration bail warnings
  => 0 patterns; "Replaced N patterns" / "fusion pass matches" => match count).
- 8k1k conc-16 TP8 row so mark_eval_entries marks an lm-eval (validate #45639
  fused-kernel correctness on gfx942); perf points stay at conc 4/8.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

cursor (bot) left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

There are 2 total unresolved issues (including 1 from previous review).

Reviewed by Cursor Bugbot for commit eb422fe.

- isl: 8192
  osl: 1024
  search-space:
  - { tp: 8, conc-list: [ 16 ] }

conc-list breaks full-sweep

Medium Severity

The new single-node fixed-seq-len rows use only conc-list, but generate_full_sweep reads conc-start and conc-end for single-node entries and will raise KeyError when it hits this config during an unfiltered amd-master full sweep.

Reviewed by Cursor Bugbot for commit eb422fe.
