[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X #1753
Oseltamivir wants to merge 18 commits into
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your single-node PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.
PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow
As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
|
Claude finished @Oseltamivir's task in 3m 50s —— View job Review of PR #1753
LGTM — no blocking issues found. Summary: This PR adds a fused CDNA3 MXFP8 MoE backend for MiniMax-M3 on MI300X via a runtime vLLM patch. I reviewed all four changed files and the 656-line Triton kernel patch in detail. Key findings:
|
functionstackx
left a comment
plz create upstream PR and have it reviewed before merging this patch
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27490966693 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491894470 |
|
Opened the requested upstream vLLM PR: vllm-project/vllm#45567. It is stacked on the active MiniMax M3 model branch/PR (#45381), includes the tested gfx94x MXFP8 kernel and benchmark, and passes all vLLM pre-commit hooks. The InferenceX patch has also been updated to the optimized tile selection and no longer uses split-K. |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27491936967 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495448403 |
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 2 total unresolved issues (including 1 from previous review).
Reviewed by Cursor Bugbot for commit c3cdc37. Configure here.
MXFP8_ORACLE="$VLLM_PACKAGE_ROOT/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py"
if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then
    patch --batch --forward -d "$VLLM_PACKAGE_ROOT" -p1 < "$MXFP8_PATCH"
fi
MTP script skips MXFP8 patch
Medium Severity
Runtime MXFP8 patching was added only to the non-MTP MI300X benchmark script. launch_mi300x-amds.sh runs minimaxm3_fp8_mi300x_mtp.sh for spec-decoding: mtp configs, so those jobs never apply minimaxm3_mi300x_mxfp8.patch despite the MTP script claiming it mirrors this recipe.
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495458775 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27495662629 |
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27502391149 |
# Conflicts: # perf-changelog.yaml
Co-authored-by: OpenAI Codex <codex@openai.com>
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27506382432 |
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> Co-authored-by: OpenAI Codex <codex@openai.com>
Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com> # Conflicts: # perf-changelog.yaml
|
Packed-scale follow-up is pushed in
Full validation sweep: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27511311644 |
|
@Oseltamivir 's AI agent, remember to have ur search space start at conc=1 like i am fixing it rn #1760 |
|
Official sweep update: the pushed 8K/1K TP8 c64 job passed at 1167.724 tok/s/GPU. That is +2.83% over run 27506382432 and +19.52% over the original BF16-emulation run 27489075807. Job: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27511311644/job/81312766431 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27511311644 |
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27516635008 |
Co-authored-by: OpenAI Codex <codex@openai.com> Signed-off-by: Oseltamivir <58582368+Oseltamivir@users.noreply.github.com>
# Conflicts: # perf-changelog.yaml
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27519117381 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522139191 |
|
see unofficial run visualizer at https://inferencex.semianalysis.com/inference?unofficialRun=27522193211 |


Summary
[ROCm] Add fused MiniMax M3 MXFP8 MoE for gfx94x vllm-project/vllm#45567 to the existing ROCm image
main TP8 and TP8+EP8 configurations and concurrency matrix
compressed W8A8
[expert, K/32, N] so the native kernel loads adjacent output-column scales contiguously
Why this improves the end-to-end path
The previous native kernel benchmark used random routing. Capturing and
replaying InferenceX routes exposed a production-specific memory-layout issue.
For TP8 GEMM1, K=6144, so K/32=192. The checkpoint layout [expert, N, K/32]
makes scale bytes for adjacent output columns 192 bytes apart at a fixed K
group. The new load-time layout [expert, K/32, N] makes those bytes contiguous.
Captured TP8 route latency:
The kernel gain is diluted end to end by attention, collectives, shared
experts, routing/alignment, and top-k reduction, but the serving curve still
improves materially.
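The layout change can be illustrated with a small NumPy sketch. This is not the code from the patch: the expert count `E` and output width `N` are illustrative values, and only K=6144 (so K/32=192) comes from the description above. It shows the byte distance between adjacent output columns at a fixed K group before and after repacking.

```python
import numpy as np

# E and N are illustrative assumptions; K=6144 is the TP8 GEMM1 value quoted above.
E, N, K = 2, 256, 6144
G = K // 32  # 192 scale groups along K

# Checkpoint layout [expert, N, K/32]: one scale byte per (expert, column, K group).
rng = np.random.default_rng(0)
ckpt_scales = rng.integers(0, 256, size=(E, N, G), dtype=np.uint8)

# Load-time repack to [expert, K/32, N], materialized contiguously.
packed_scales = np.ascontiguousarray(ckpt_scales.transpose(0, 2, 1))

# Byte distance between adjacent output columns at a fixed K group.
stride_before = ckpt_scales.strides[1]   # step along N in the checkpoint layout
stride_after = packed_scales.strides[2]  # step along N in the packed layout

print(stride_before, stride_after)
```

With one byte per scale, the first stride equals K/32 and the second equals 1, which is the contiguity the packed layout is meant to provide.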
Short-K GEMM2 specialization
The matched route corpus shows that the dominant TP decode GEMM2 is
N=6144, K=384 with BLOCK_M=16. A dedicated BLOCK_N=64, BLOCK_K=64, num_warps=2
configuration replaces the generic 32/128/1 configuration only for this
short-K gfx94x GEMM2 regime.
Paired same-GPU replay:
GEMM1, expert-parallel GEMM2, CDNA4, and larger routed batches keep their
previous configurations. Numerical error remains in the same native W8A8
envelope.
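A minimal sketch of how such a narrowly scoped override could be expressed follows. The function name, argument names, and the exact-match test on K=384 are hypothetical and are not taken from the patch; only the tile values and the list of excluded regimes come from the description above. Returning `None` stands for "keep the previous configuration".

```python
from typing import NamedTuple, Optional


class TileConfig(NamedTuple):
    block_n: int
    block_k: int
    num_warps: int


# Values quoted in the description for the short-K GEMM2 regime.
SHORT_K_GEMM2 = TileConfig(block_n=64, block_k=64, num_warps=2)


def pick_short_k_override(
    arch: str, gemm: str, k: int, block_m: int, expert_parallel: bool
) -> Optional[TileConfig]:
    """Return the dedicated tile config, or None to keep the existing one.

    Hypothetical helper: the real selection logic lives in the vLLM patch.
    """
    is_gfx94x = arch.startswith("gfx94")
    if (
        is_gfx94x
        and gemm == "gemm2"
        and not expert_parallel
        and k == 384          # assumption: match the observed short-K shape exactly
        and block_m == 16
    ):
        return SHORT_K_GEMM2
    return None
```

Keeping the override as a separate early check means GEMM1, expert-parallel GEMM2, CDNA4, and larger routed batches fall through untouched.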
Serving results
Reference runs:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27489075807
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27519117381
The final sweep completed with 29 successful jobs, no failures, 18 serving
result rows, and both accuracy-eval jobs passing. The TP8 and TP8+EP8
parallelism configurations are unchanged from main.
1K/1K
Concurrency 4 is on the BF16 fallback for 9,527 of 9,529 captured forwards,
so it does not measure the native kernel. The official point also had a
1.072 s p99 TTFT outlier versus 0.282 s in the baseline. Three exact-command
same-node repeats measured 80.01, 80.38, and 80.41 tok/s/GPU; the median is
+0.34% versus baseline.
8K/1K
Why not dequantize weights inside BF16 MMA
MiniMax M3 enters the MoE layer with BF16 hidden states, so native W8A8 does
launch an activation-quantization kernel before GEMM1. A Marlin-style W8A16
prototype was tested to avoid that launch by expanding compressed weights in
registers before BF16 MMA.
W8A16 avoids quantizing the M*K activation tensor but repeatedly converts
much larger N*K expert weight tiles. MI300X BF16 MFMA also consumes half as
many K elements per instruction as FP8 MFMA. It is more accurate, but 1.74x
slower on this production route.
Scale algebra
For each K=32 MXFP8 group:
The native kernel applies s_a * s_b after each K=32 partial dot, not after
the complete matmul. This is valid because the scales are constant inside
that group; applying one scale across multiple groups would be incorrect.
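The identity behind this can be checked numerically. Within one group the dequantized values are s_a * q_a and s_b * q_b, so the group's partial dot equals s_a * s_b times the dot of the raw payloads. The sketch below is an illustration under simplifying assumptions, not the kernel: small integers stand in for FP8 payloads, and scales are arbitrary powers of two (as E8M0 scales are), so the float64 arithmetic is exact.

```python
import numpy as np

rng = np.random.default_rng(0)
K, GROUP = 128, 32
n_groups = K // GROUP

# Stand-ins for quantized payloads; small integers keep the arithmetic exact.
q_a = rng.integers(-8, 9, size=K).astype(np.float64)
q_b = rng.integers(-8, 9, size=K).astype(np.float64)

# One power-of-two scale per K=32 group for each operand.
s_a = 2.0 ** rng.integers(-3, 4, size=n_groups)
s_b = 2.0 ** rng.integers(-3, 4, size=n_groups)

# Reference: dequantize every element, then take the full dot product.
a = q_a * np.repeat(s_a, GROUP)
b = q_b * np.repeat(s_b, GROUP)
reference = float(a @ b)

# Per-group scaling: scale each K=32 partial dot by s_a * s_b, then accumulate.
partials = (q_a.reshape(n_groups, GROUP) * q_b.reshape(n_groups, GROUP)).sum(axis=1)
per_group = float((s_a * s_b * partials).sum())

# Invalid shortcut: one scale pair applied to the whole unscaled dot.
# This generally differs from the reference whenever scales vary across groups.
single_scale = float(s_a[0] * s_b[0] * (q_a @ q_b))

print(reference, per_group, single_scale)
```

The per-group result matches the dequantize-then-dot reference, while the single-scale shortcut has no such guarantee.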
Scope
This PR changes only minimaxm3-fp8-mi300x-vllm. The separate
minimaxm3-fp8-mi300x-vllm-mtp EAGLE3 recipe and its sweep are intentionally
unchanged and were not part of this validation matrix.
The runtime patch is applied only by
benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh. MI355X does
not load it. Upstream scale packing is gated to gfx94x FNUZ, while
gfx950/MI355X uses OCP FP8.
MI355X sanity reference:
https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27452497472
completed successfully with 63 result rows and no failed benchmarks.
Validation
4a560dd8db67c270f5e2afb614558271b76f2294
patch fails or the backend marker is absent
python -m py_compile
python -m pytest utils/matrix_logic/ -q: 156 passed
bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
git diff --check
Note
Medium Risk
Runtime patching of vLLM MoE/quantization paths affects numerical behavior and serving correctness on MI300X only, but the change is large, experimental until upstream merges, and scoped to one benchmark recipe.
Overview
Adds a runtime vLLM patch for minimaxm3-fp8-mi300x-vllm: minimaxm3_fp8_mi300x.sh applies minimaxm3_mi300x_mxfp8.patch to the installed package before vllm serve, with idempotent apply and a hard fail if the gfx94x backend marker is missing.
The patch introduces a fused CDNA3 (gfx942) MXFP8 MoE path (E4M3FNUZ weights, in-kernel E8M0 scale products, packed scales as [expert, K/32, N], and MI300X Triton tile configs) plus a hybrid dispatch: BF16 experts for TP decode (≤8 tokens) and large prefill (≥832), BF16 emulation under expert parallelism, and native compressed W8A8 between those bands. For long context, every fifth layer can store BF16-only weights while others keep dual MXFP8/BF16 buffers.
Config comments and perf-changelog.yaml are updated to describe this hybrid recipe instead of pure BF16 emulation; the TP8 / TP8+EP8 sweep matrix is unchanged.
Reviewed by Cursor Bugbot for commit d1638a0.
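The hybrid dispatch summarized in the overview can be sketched as a small selection function. The enum, constant, and function names are hypothetical, and checking expert parallelism before the token bands is an assumption; only the thresholds (≤8 and ≥832 tokens) and the three paths come from the overview.

```python
from enum import Enum


class MoEPath(Enum):
    BF16_EXPERTS = "bf16_experts"
    BF16_EMULATION = "bf16_emulation"
    NATIVE_W8A8 = "native_w8a8"


# Token-count bands quoted in the overview.
DECODE_MAX_TOKENS = 8
PREFILL_MIN_TOKENS = 832


def choose_moe_path(num_tokens: int, expert_parallel: bool) -> MoEPath:
    """Pick an MoE backend for one forward pass (illustrative sketch only)."""
    if expert_parallel:
        # Expert parallelism uses BF16 emulation regardless of batch size
        # (assumed ordering; the overview does not state precedence).
        return MoEPath.BF16_EMULATION
    if num_tokens <= DECODE_MAX_TOKENS or num_tokens >= PREFILL_MIN_TOKENS:
        # Small TP decode batches and large prefill use BF16 experts.
        return MoEPath.BF16_EXPERTS
    # Everything between the two bands takes the native compressed W8A8 kernel.
    return MoEPath.NATIVE_W8A8
```

For example, `choose_moe_path(4, expert_parallel=False)` selects the BF16 experts path, which is consistent with the note that concurrency 4 runs almost entirely on the BF16 fallback.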