[Experimental][DNM till upstream PR merges][AMD] perf: hybrid MXFP8 MoE for MiniMax M3 on MI300X #1753
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -1,10 +1,12 @@ | ||
| #!/usr/bin/env bash | ||
| | ||
| # MiniMax-M3 MXFP8 MI300X (gfx942) single-node vLLM recipe. | ||
| - # Reuses the dedicated ROCm image and the MI355X serving shape. Block size 128 | ||
| - # is mandatory for MSA sparse attention. Keep the default BF16 KV cache on | ||
| - # gfx942: the checkpoint has no calibrated q/prob scales for ROCm FP8 | ||
| - # attention, and vLLM's fallback scale of 1.0 corrupts model accuracy. | ||
| + # Reuses the dedicated ROCm image and applies the checked-in hybrid gfx94x | ||
| + # MXFP8 MoE patch before starting vLLM. Block size 128 is mandatory for MSA | ||
| + # sparse attention. Keep the default BF16 KV cache on gfx942: the checkpoint | ||
| + # has no calibrated q/prob scales for ROCm FP8 attention, and vLLM's fallback | ||
| + # scale of 1.0 corrupts model accuracy. | ||
| + # Target image vLLM revision: 4a560dd8db67c270f5e2afb614558271b76f2294. | ||
| | ||
| source "$(dirname "$0")/../../benchmark_lib.sh" | ||
| | ||
| @@ -24,6 +26,28 @@ if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
| | ||
| + VLLM_PACKAGE_ROOT="$( | ||
| + python - <<'PY' | ||
| + from pathlib import Path | ||
| + | ||
| + import vllm | ||
| + | ||
| + print(Path(vllm.__file__).resolve().parent.parent) | ||
| + PY | ||
| + )" | ||
| + MXFP8_PATCH="$(dirname "$0")/minimaxm3_mi300x_mxfp8.patch" | ||
| + MXFP8_ORACLE="$VLLM_PACKAGE_ROOT/vllm/model_executor/layers/fused_moe/oracle/mxfp8.py" | ||
| + if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then | ||
| + if ! patch --batch --forward -d "$VLLM_PACKAGE_ROOT" -p1 < "$MXFP8_PATCH"; then | ||
| + echo "Failed to apply the MI300X MXFP8 patch" >&2 | ||
| + exit 1 | ||
| + fi | ||
| + fi | ||
| + if ! grep -q "Using fused CDNA3 (gfx94x)" "$MXFP8_ORACLE"; then | ||
| + echo "MI300X MXFP8 backend marker is missing after patching" >&2 | ||
| + exit 1 | ||
| + fi | ||
| + | ||
Review comment: MTP script skips MXFP8 patch (Medium Severity). Runtime MXFP8 patching was added only to the non-MTP MI300X benchmark script. Reviewed by Cursor Bugbot for commit c3cdc37.
| if [[ "$MODEL" != /* ]]; then hf download "$MODEL"; fi | ||
| | ||
| if [ -n "$ROCR_VISIBLE_DEVICES" ]; then | ||
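
The patch-then-verify guard in the second hunk, together with the review note that the MTP script lacks it, suggests factoring the logic into a shared function. The following is a minimal sketch of that idea, not code from this PR: `apply_patch_once` is a hypothetical name, the demo tree, demo patch, and marker string are made up so the sketch runs without vLLM installed, and it assumes GNU `patch` (for `--batch` and `--forward`, the same options the diff uses).

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): apply a patch only when a marker
# string is absent, then confirm that the marker is present afterwards.
apply_patch_once() {
    local root="$1" patch_file="$2" marker_file="$3" marker="$4"
    if ! grep -qF -- "$marker" "$marker_file" 2>/dev/null; then
        if ! patch --batch --forward -d "$root" -p1 < "$patch_file" >/dev/null; then
            echo "Failed to apply $patch_file" >&2
            return 1
        fi
    fi
    if ! grep -qF -- "$marker" "$marker_file"; then
        echo "Marker is missing after patching $marker_file" >&2
        return 1
    fi
}

# Demo on a throwaway tree; paths and contents are illustrative only.
demo_root="$(mktemp -d)"
mkdir -p "$demo_root/pkg"
printf 'backend = "default"\n' > "$demo_root/pkg/oracle.py"
cat > "$demo_root/demo.patch" <<'EOF'
--- a/pkg/oracle.py
+++ b/pkg/oracle.py
@@ -1 +1,2 @@
 backend = "default"
+# Using fused demo backend
EOF

apply_patch_once "$demo_root" "$demo_root/demo.patch" \
    "$demo_root/pkg/oracle.py" "Using fused demo backend"
first_status=$?
# The second call finds the marker and skips `patch` entirely.
apply_patch_once "$demo_root" "$demo_root/demo.patch" \
    "$demo_root/pkg/oracle.py" "Using fused demo backend"
second_status=$?
marker_count="$(grep -cF "Using fused demo backend" "$demo_root/pkg/oracle.py")"
echo "first=$first_status second=$second_status markers=$marker_count"
```

If a helper like this lived in a file that both benchmark scripts source (for example alongside `benchmark_lib.sh`, which is an assumption about layout), the MTP and non-MTP scripts could each call it with the same patch path and marker. Returning a status instead of calling `exit` leaves the abort decision to the caller.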