perf(vllm): fuse MiniMax M3 BF16 EP experts on MI300X #1782

Draft
Oseltamivir wants to merge 4 commits into feat/m3-mi300x-mxfp8 from codex/minimax-m3-mi300x-ep-mxfp8


@Oseltamivir Oseltamivir commented Jun 15, 2026

Collaborator

Summary

  • add a sparse local-route BF16 MoE path for the profiled MiniMax M3 TP8+EP8
    long-context shape on MI300X
  • discard remote routes during expert alignment and size buffers for 16 local
    experts instead of 128 global experts
  • use 16-row grouped-GEMM tiles that match the measured ~6.75 routes per local
    expert instead of the existing 64-row tile
  • fuse split SwiGLU-OAI into the BF16 GEMM1 epilogue, eliminating the standalone
    activation kernel and the 2x-intermediate GEMM1 output
  • retain the measured native/BF16 policy for short-context EP8
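The epilogue fusion in the fourth bullet can be illustrated with a small NumPy reference. This is a sketch under stated assumptions, not the PR's kernel: the function names are hypothetical, the math runs in float64 rather than BF16, and the activation formula (clamp, `alpha = 1.702`, `limit = 7.0`, `(up + 1) * gate * sigmoid(alpha * gate)`) is the gpt-oss-style SwiGLU-OAI definition, which is assumed here to be what "split SwiGLU-OAI" refers to with gate and up stored as separate halves.

```python
import numpy as np

ALPHA = 1.702  # assumed SwiGLU-OAI constant (not stated in the PR)
LIMIT = 7.0    # assumed clamp limit (not stated in the PR)


def swiglu_oai(gate, up, alpha=ALPHA, limit=LIMIT):
    """Assumed gpt-oss-style activation on separate gate and up halves."""
    gate = np.minimum(gate, limit)
    up = np.clip(up, -limit, limit)
    return (up + 1.0) * gate / (1.0 + np.exp(-alpha * gate))


def gemm1_unfused(x, w_gate, w_up):
    """Store the full 2x-intermediate GEMM1 output, then activate separately."""
    inter = w_gate.shape[1]
    h = np.concatenate([x @ w_gate, x @ w_up], axis=-1)  # (rows, 2 * inter)
    return swiglu_oai(h[:, :inter], h[:, inter:]), h.shape


def gemm1_fused(x, w_gate, w_up):
    """Compute both projections from one activation tile and store (rows, inter)."""
    out = swiglu_oai(x @ w_gate, x @ w_up)
    return out, out.shape


rng = np.random.default_rng(0)
x = rng.standard_normal((16, 32))       # one 16-row tile, hidden size 32
w_gate = rng.standard_normal((32, 8))   # intermediate size 8
w_up = rng.standard_normal((32, 8))

ref, stored_unfused = gemm1_unfused(x, w_gate, w_up)
out, stored_fused = gemm1_fused(x, w_gate, w_up)
```

Both paths produce the same values, but the fused path writes only the `(rows, inter)` result, which is the halved GEMM1 output traffic described above.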

This PR is stacked on #1753 and contains only the incremental EP8 optimization.
It does not include the profiling branch, AITER allreduce/RMSNorm work,
temporary benchmark configuration, or perf-changelog.yaml changes.

Profile basis

The six-point MI300X profile found expert GEMM1+GEMM2 at 30.31 ms for 1k/c256
and 28.10 ms for 8k/c256. After collective fusion, expert GEMMs remained the
largest classified 8k/c256 phase at 28.79 ms across 114 calls.

At c256, MiniMax M3 has about 216 active tokens and top-k 4, or 864 routed rows
globally. EP8 owns 16 of 128 experts per rank, leaving about 108 local rows,
roughly 6.75 rows per local expert. The existing BF16 config uses a 64-row M
tile, so each rank can execute about 1,024 padded rows (16 local experts, one
64-row tile each) for roughly 108 useful rows. Global alignment also creates
blocks for remote experts that do no useful GEMM work.
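The arithmetic above can be reproduced with a few lines of Python. This is a back-of-the-envelope sketch, not profiler output: it assumes routes are spread uniformly so that every local expert receives at least one route and needs exactly one tile, and the variable and function names are illustrative only.

```python
import math

active_tokens = 216    # approximate active tokens at c256 (from the profile)
top_k = 4
ep_ranks = 8
global_experts = 128

global_rows = active_tokens * top_k              # routed rows across all ranks
local_experts = global_experts // ep_ranks       # experts owned per rank
local_rows = global_rows / ep_ranks              # useful rows per rank
rows_per_expert = local_rows / local_experts     # average route density


def padded_rows(block_m):
    """Rows executed per rank when each local expert is padded to full tiles."""
    tiles_per_expert = math.ceil(rows_per_expert / block_m)
    return local_experts * tiles_per_expert * block_m


rows_64 = padded_rows(64)   # existing 64-row tile
rows_16 = padded_rows(16)   # 16-row tile
print(global_rows, local_rows, rows_per_expert, rows_64, rows_16)
```

Under the uniform-routing assumption, the 64-row tile executes 1,024 rows per rank and a 16-row tile executes 256, both for about 108 useful rows.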

Profile report:
https://github.com/SemiAnalysisAI/InferenceX/blob/profiling/experimental/minimax_m3_mi300x_profile.md

First-principles changes

  1. Alignment remaps and retains only locally owned routes. Its allocation bound
    is based on 16 local experts, while the device counter remains authoritative.
  2. GEMM1 and GEMM2 use BLOCK_SIZE_M=16, matching the observed route density
    and reducing padded expert-row computation by up to 4x versus the 64-row
    tile.
  3. GEMM1 loads each activation tile once, computes gate and up projections, and
    applies split SwiGLU-OAI before storing. This halves its BF16 output traffic
    and removes a separate activation launch.
  4. GEMM2 applies router weights in FP32 as before.
  5. The existing expert-map-aware fused reduction sums only local weighted rows.
    It avoids direct atomic accumulation, which the profile identified as a poor
    fit for the c256 top-k-4 shape.
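Item 1 can be sketched as a host-side NumPy reference. This is an illustration under assumptions rather than the PR's implementation: `align_local_routes`, its arguments, and the `-1` padding marker are hypothetical, and the real alignment runs on the device, where the device counter rather than the allocation bound decides how many blocks are valid.

```python
import numpy as np


def align_local_routes(topk_ids, expert_map, num_local_experts, block_m):
    """Group locally owned routes by local expert, padded to block_m rows.

    topk_ids:   (tokens, top_k) global expert ids chosen by the router
    expert_map: (num_global_experts,) local expert id, or -1 if remote
    Returns (sorted_routes, padded_counts); sorted_routes holds flat route
    indices into topk_ids, with -1 marking padding rows.
    """
    flat = topk_ids.reshape(-1)
    local = expert_map[flat]              # -1 wherever the expert is remote
    keep = np.nonzero(local >= 0)[0]      # discard remote routes up front
    counts = np.bincount(local[keep], minlength=num_local_experts)
    padded = -(-counts // block_m) * block_m   # round up to a tile multiple
    starts = np.concatenate(([0], np.cumsum(padded)[:-1]))
    sorted_routes = np.full(int(padded.sum()), -1, dtype=np.int64)
    for e in range(num_local_experts):
        rows = keep[local[keep] == e]
        sorted_routes[starts[e]:starts[e] + rows.size] = rows
    return sorted_routes, padded


# Toy shape: 8 global experts, this rank owns experts 4 and 5.
expert_map = np.full(8, -1)
expert_map[4], expert_map[5] = 0, 1
topk_ids = np.array([[0, 5], [4, 5], [5, 7]])

sorted_routes, padded = align_local_routes(
    topk_ids, expert_map, num_local_experts=2, block_m=2
)
```

In this toy case two of the six routes target remote experts and never receive a block, so the buffer is sized from the two local experts only.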

The path is gated to the exact gfx94x MiniMax M3 EP8 BF16 shape. gfx95x and
other models/configurations are unchanged.

Validation

Static and local validation:

  • python -m pytest utils/matrix_logic/ -q: 156 passed
  • bash -n benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi300x.sh
  • runtime patch dry-runs and applies cleanly to the pinned image source
  • patched vLLM source passes Ruff, formatting, compileall, and
    git diff --check
  • upstream branch includes local-route GEMM, alignment-allocation, and
    expert-map reduction correctness tests

MI300X serving validation is pending infrastructure recovery. The exact six-job
matrix (c1/c16/c256 for 1k1k and 8k1k) was dispatched four times, but every
attempt failed before GPU allocation because the Slurm controller was
unreachable:

https://github.com/SemiAnalysisAI/InferenceX/actions/runs/27569397626

@github-actions

Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your single node PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, generally, PR authors should request a review & get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.
