perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X by Oseltamivir · Pull Request #1782 · SemiAnalysisAI/InferenceX

Oseltamivir · 2026-06-15T19:13:24Z

Summary

- add a sparse local-route BF16 MoE path for the profiled MiniMax M3 TP8+EP8
  long-context shape on MI300X
- discard remote routes during expert alignment and size buffers for 16 local
  experts instead of 128 global experts
- use 16-row grouped-GEMM tiles that match the measured ~6.75 routes per local
  expert instead of the existing 64-row tile
- fuse split SwiGLU-OAI into the BF16 GEMM1 epilogue, eliminating the standalone
  activation kernel and the 2x-intermediate GEMM1 output
- retain the measured native/BF16 policy for short-context EP8

This PR is stacked on #1753 and contains only the incremental EP8 optimization.
It does not include the profiling branch, AITER allreduce/RMSNorm work,
temporary benchmark configuration, or perf-changelog.yaml changes.

Profile basis

The six-point MI300X profile found expert GEMM1+GEMM2 at 30.31 ms for 1k/c256
and 28.10 ms for 8k/c256. After collective fusion, expert GEMMs remained the
largest classified 8k/c256 phase at 28.79 ms across 114 calls.

At c256, MiniMax M3 has about 216 active tokens and top-k 4, or 864 routed rows
globally. EP8 owns 16 of 128 experts per rank, leaving about 108 local rows,
roughly 6.75 rows per local expert. The existing BF16 config uses a 64-row M
tile, so it can execute about 1,024 padded rows per rank for roughly 108 useful
rows. Global alignment also creates blocks for remote experts that do no useful
GEMM work.
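The padding arithmetic above can be reproduced with a short sketch. It assumes routes are spread uniformly across ranks and experts, which is how the ~6.75 figure is derived; the helper name `padded_rows` is illustrative and not part of the patch.

```python
import math

active_tokens = 216      # approximate active tokens at c256 (from the profile)
top_k = 4
num_experts = 128
ep_size = 8

global_rows = active_tokens * top_k           # routed rows across all ranks
local_experts = num_experts // ep_size        # experts owned by one EP rank
local_rows = global_rows / ep_size            # assumes uniform routing across ranks
rows_per_expert = local_rows / local_experts  # mean routes per local expert


def padded_rows(block_m: int) -> int:
    """Rows executed per rank if every local expert holds the mean route
    count and is padded up to a multiple of block_m."""
    blocks_per_expert = math.ceil(rows_per_expert / block_m)
    return local_experts * blocks_per_expert * block_m


for block_m in (64, 16):
    padded = padded_rows(block_m)
    print(f"BLOCK_SIZE_M={block_m}: {padded} padded rows, "
          f"{local_rows / padded:.1%} useful")
```

Under these assumptions the 64-row tile executes 1,024 padded rows per rank and the 16-row tile executes 256, both for about 108 useful rows.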

Profile report:
https://github.com/SemiAnalysisAI/InferenceX/blob/profiling/experimental/minimax_m3_mi300x_profile.md

First-principles changes

- Alignment remaps and retains only locally owned routes. Its allocation bound
  is based on 16 local experts, while the device counter remains authoritative.
- GEMM1 and GEMM2 use BLOCK_SIZE_M=16, matching the observed route density
  and reducing padded expert-row computation by up to 4x versus the 64-row
  tile.
- GEMM1 loads each activation tile once, computes gate and up projections, and
  applies split SwiGLU-OAI before storing. This halves its BF16 output traffic
  and removes a separate activation launch.
- GEMM2 applies router weights in FP32 as before.
- The existing expert-map-aware fused reduction sums only local weighted rows.
  It avoids direct atomic accumulation, which the profile identified as a poor
  fit for the c256 top-k-4 shape.
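The local-route alignment step can be illustrated with a host-side sketch. This is not the patch's kernel: the function name `align_local_routes` is hypothetical, expert ownership is assumed to be contiguous (rank r owns global experts 16r through 16r+15), and only non-empty experts are padded here, whereas the PR sizes its allocation bound for all 16 local experts.

```python
from collections import Counter


def align_local_routes(topk_ids, rank, experts_per_rank, block_m):
    """Return (local_routes, padded_rows) for one EP rank.

    topk_ids:     per-token lists of global expert IDs.
    local_routes: (token_index, local_expert_id) pairs owned by this rank.
    padded_rows:  rows a grouped GEMM would execute after padding each
                  non-empty local expert up to a multiple of block_m.
    """
    lo = rank * experts_per_rank          # assumed contiguous ownership
    hi = lo + experts_per_rank
    local_routes = [
        (tok, expert - lo)                # remap global ID to local ID
        for tok, experts in enumerate(topk_ids)
        for expert in experts
        if lo <= expert < hi              # discard routes owned by other ranks
    ]
    counts = Counter(local_id for _, local_id in local_routes)
    padded_rows = sum(-(-n // block_m) * block_m for n in counts.values())
    return local_routes, padded_rows


# Toy input: 3 tokens, top-k 4, rank 0 owns global experts 0..15.
topk_ids = [[0, 3, 40, 127], [3, 17, 64, 90], [15, 3, 5, 100]]
routes, padded16 = align_local_routes(topk_ids, rank=0, experts_per_rank=16, block_m=16)
_, padded64 = align_local_routes(topk_ids, rank=0, experts_per_rank=16, block_m=64)
print(len(routes), padded16, padded64)
```

In the toy input, 6 of the 12 routes are local to rank 0 and land on 4 distinct local experts, so the padded row count scales directly with the tile height.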

The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape. gfx95x and
other models/configurations are unchanged.
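The effect of fusing the activation into the GEMM1 epilogue can be checked with a small reference sketch. It assumes the clamped SwiGLU form used by gpt-oss style models, with alpha = 1.702 and a clamp limit of 7.0, and takes "split" to mean that the gate and up weights are stored as two separate halves. None of these details are stated in this PR, and the function names are illustrative.

```python
import math

ALPHA = 1.702   # assumed constant, not stated in the PR
LIMIT = 7.0     # assumed clamp limit, not stated in the PR


def swiglu_oai(gate, up):
    """Clamped SwiGLU: gate * sigmoid(ALPHA * gate) * (up + 1)."""
    gate = min(gate, LIMIT)
    up = max(-LIMIT, min(up, LIMIT))
    return gate * (1.0 / (1.0 + math.exp(-ALPHA * gate))) * (up + 1.0)


def dot(row, col):
    return sum(a * b for a, b in zip(row, col))


def gemm1_unfused(x, w_gate, w_up):
    """Store the full [gate | up] row (width 2N), then activate in a second pass."""
    wide = [[dot(r, c) for c in w_gate] + [dot(r, c) for c in w_up] for r in x]
    n = len(w_gate)
    out = [[swiglu_oai(row[j], row[n + j]) for j in range(n)] for row in wide]
    return wide, out


def gemm1_fused(x, w_gate, w_up):
    """Compute gate and up per output column and store only the activated value (width N)."""
    return [[swiglu_oai(dot(r, g), dot(r, u)) for g, u in zip(w_gate, w_up)] for r in x]


# Toy shapes: 2 rows, K = 2, N = 3. Weights are given as lists of columns.
x = [[1.0, 2.0], [0.5, -1.0]]
w_gate = [[0.1, 0.2], [0.3, -0.4], [1.0, 1.0]]
w_up = [[0.5, 0.5], [-0.2, 0.1], [0.0, 2.0]]

wide, reference = gemm1_unfused(x, w_gate, w_up)
fused = gemm1_fused(x, w_gate, w_up)
```

Both paths produce the same values, but the fused path never materializes the width-2N intermediate, which is the source of the halved output traffic and the removed activation launch.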

Validation

Static and local validation:

- python -m pytest utils/matrix_logic/ -q: 156 passed
- bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
- runtime patch dry-runs and applies cleanly to the pinned image source
- patched vLLM source passes Ruff, formatting, compileall, and
  git diff --check
- upstream branch includes local-route GEMM, alignment-allocation, and
  expert-map reduction correctness tests

MI300X serving validation is pending infrastructure recovery. The exact six-job
matrix (c1/c16/c256 for 1k1k and 8k1k) was dispatched four times, but every
attempt failed before GPU allocation because the Slurm controller was
unreachable:

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626

github-actions · 2026-06-15T19:13:41Z

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Oseltamivir added 4 commits June 15, 2026 09:43

perf(vllm): optimize MiniMax M3 MXFP8 EP routes

28e3f75

fix(vllm): exclude tests from runtime patch

8279f50

perf(vllm): keep MiniMax M3 EP weights compressed

b25eff5

perf(vllm): fuse MiniMax M3 BF16 EP experts

16c596a

github-project-automation Bot added this to InferenceMAX Board Jun 15, 2026

Oseltamivir wants to merge 4 commits into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8
