Add minimax M3 MXFP8 MI355X vLLM EAGLE3 (related PR for upstreaming patch https://github.com/vllm-project/vllm/pull/45546) #1745
Merged
6 commits, +255 −0
Commits
0bbf8b5 minimaxm3-fp8-mi355x-vllm-mtp: day-zero MiniMax-M3 EAGLE3 MI355X recipe (functionstackx)
6f4d927 minimaxm3-fp8-mi355x-vllm-mtp: runtime-patch EAGLE3 to test on MI355X (functionstackx)
651824f perf-changelog: fill in PR link for minimaxm3-fp8-mi355x-vllm-mtp eag… (functionstackx)
0822923 perf-changelog: reset PR link for mi355x eagle3 test (fresh PR) (functionstackx)
149e11e perf-changelog: fill in PR link for mi355x eagle3 test (#1745) (functionstackx)
5d1ddae Merge branch 'main' into feat/minimax-m3-mi355-eagle3 (functionstackx)
208 changes: 208 additions & 0 deletions
benchmarks/single_node/fixed_seq_len/minimaxm3_fp8_mi355x_mtp.sh
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,208 @@ | ||
| #!/usr/bin/env bash | ||
| | ||
| # MiniMax-M3 MXFP8 MI355X (gfx950) single-node vLLM recipe with EAGLE3 | ||
| # speculative decoding — the spec-decoding=mtp variant of | ||
| # minimaxm3_fp8_mi355x.sh. Adds the Inferact/MiniMax-M3-EAGLE3 draft head via | ||
| # --speculative-config with 3 speculative tokens. Everything else mirrors the | ||
| # non-MTP recipe: MXFP8 from TP=4 on gfx950, mandatory --block-size 128, | ||
| # --language-model-only for the text-only benchmark, FP8 KV cache, | ||
| # --attention-backend TRITON_ATTN, and --enforce-eager. | ||
| # | ||
| # Unlike the CUDA recipes, the drafter needs no attention_backend override: | ||
| # the FlashInfer "page size 128 requires GQA/MQA" limitation that forced | ||
| # FLASH_ATTN for the EAGLE3 MHA head on Blackwell is FlashInfer/CUDA-specific. | ||
| # Here the whole server runs on TRITON_ATTN (set globally below), which serves | ||
| # the MHA draft fine. | ||
| # | ||
| # [AI generated draft test] The shipped vllm/vllm-openai-rocm:minimax-m3 image | ||
| # does NOT implement SupportsEagle3 on the AMD MiniMax-M3 model, so EAGLE3 | ||
| # engine init fails with "Model does not support EAGLE3 interface but | ||
| # aux_hidden_state_outputs was requested". This recipe applies that fix | ||
| # (functionstackx/vllm#1 — ported from nvidia/model.py) in-place to the | ||
| # installed vllm before serving, so we can validate EAGLE3 on real MI355X | ||
| # hardware ahead of an image rebuild. The patch is idempotent and fails the | ||
| # job loudly if the installed amd/model.py has drifted from the expected base. | ||
| | ||
| source "$(dirname "$0")/../../benchmark_lib.sh" | ||
| | ||
| check_env_vars \ | ||
| MODEL \ | ||
| TP \ | ||
| EP_SIZE \ | ||
| DP_ATTENTION \ | ||
| CONC \ | ||
| ISL \ | ||
| OSL \ | ||
| MAX_MODEL_LEN \ | ||
| RANDOM_RANGE_RATIO \ | ||
| RESULT_FILENAME | ||
| | ||
| DRAFT_MODEL="Inferact/MiniMax-M3-EAGLE3" | ||
| | ||
| if [[ -n "$SLURM_JOB_ID" ]]; then | ||
| echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME" | ||
| fi | ||
| | ||
| # MODEL stays a bare HF id on the mi355x single-node runner (weights are | ||
| # pre-staged in the mounted NFS HF cache, so this is a fast cache hit). The | ||
| # EAGLE3 draft is not staged; fetch it into the same cache. | ||
| if [[ "$MODEL" != /* ]]; then | ||
| hf download "$MODEL" | ||
| hf download "$DRAFT_MODEL" | ||
| fi | ||
| | ||
| if [ -n "$ROCR_VISIBLE_DEVICES" ]; then | ||
| export HIP_VISIBLE_DEVICES="$ROCR_VISIBLE_DEVICES" | ||
| fi | ||
| | ||
| SERVER_LOG=/workspace/server.log | ||
| export VLLM_ENGINE_READY_TIMEOUT_S=3600 | ||
| | ||
| if [ "${EVAL_ONLY}" = "true" ]; then | ||
| setup_eval_context | ||
| fi | ||
| | ||
| PARALLEL_ARGS=(--tensor-parallel-size "$TP") | ||
| if [ "${DP_ATTENTION}" = "true" ]; then | ||
| PARALLEL_ARGS=( | ||
| --tensor-parallel-size 1 | ||
| --data-parallel-size "$TP" | ||
| --enable-expert-parallel | ||
| ) | ||
| elif [ "$EP_SIZE" -gt 1 ]; then | ||
| PARALLEL_ARGS+=(--enable-expert-parallel) | ||
| fi | ||
| | ||
| # use 3 speculative tokens for all configs for now | ||
| NUM_SPEC_TOKENS=3 | ||
| | ||
| # [AI generated draft test] Patch the installed AMD MiniMax-M3 model to add the | ||
| # SupportsEagle3 interface (functionstackx/vllm#1). Mirrors nvidia/model.py: | ||
| # adds EagleModelMixin to the inner model + aux-hidden-state emission, and | ||
| # SupportsEagle3 to the two outer classes. Idempotent; hard-fails if the | ||
| # installed file has drifted from the expected base (so we never silently run | ||
| # unpatched and mislabel the result). | ||
| python3 - <<'PYEOF' || { echo "EAGLE3 in-place patch failed" >&2; exit 1; } | ||
| import ast, importlib.util, pathlib, sys | ||
| | ||
| spec = importlib.util.find_spec("vllm") | ||
| root = pathlib.Path(spec.submodule_search_locations[0]) | ||
| target = root / "models" / "minimax_m3" / "amd" / "model.py" | ||
| src = target.read_text() | ||
| | ||
| if "EagleModelMixin" in src and "class MiniMaxM3Model(nn.Module, EagleModelMixin):" in src: | ||
| print(f"[eagle3-patch] already applied: {target}") | ||
| sys.exit(0) | ||
| | ||
| edits = [ | ||
| ( | ||
| "from vllm.model_executor.models.interfaces import (\n" | ||
| " MultiModalEmbeddings,\n" | ||
| " SupportsMultiModal,\n" | ||
| ")", | ||
| "from vllm.model_executor.models.interfaces import (\n" | ||
| " EagleModelMixin,\n" | ||
| " MultiModalEmbeddings,\n" | ||
| " SupportsEagle3,\n" | ||
| " SupportsMultiModal,\n" | ||
| ")", | ||
| ), | ||
| ( | ||
| "class MiniMaxM3Model(nn.Module):", | ||
| "class MiniMaxM3Model(nn.Module, EagleModelMixin):", | ||
| ), | ||
| ( | ||
| " inputs_embeds: torch.Tensor | None = None,\n" | ||
| " ) -> torch.Tensor:\n" | ||
| " if inputs_embeds is not None:", | ||
| " inputs_embeds: torch.Tensor | None = None,\n" | ||
| " ) -> torch.Tensor | tuple[torch.Tensor, list[torch.Tensor]]:\n" | ||
| " if inputs_embeds is not None:", | ||
| ), | ||
| ( | ||
| " residual = None\n\n" | ||
| " for layer in self.layers[self.start_layer : self.end_layer]:\n" | ||
| " hidden_states, residual = layer(positions, hidden_states, residual)\n\n" | ||
| " hidden_states, _ = self.norm(hidden_states, residual)\n" | ||
| " return hidden_states", | ||
| " residual = None\n\n" | ||
| " # EAGLE3 is not yet compatible with pipeline parallel\n" | ||
| " aux_hidden_states = self._maybe_add_hidden_state([], 0, hidden_states, residual)\n" | ||
| " for idx, layer in enumerate(self.layers[self.start_layer : self.end_layer]):\n" | ||
| " hidden_states, residual = layer(positions, hidden_states, residual)\n" | ||
| " self._maybe_add_hidden_state(\n" | ||
| " aux_hidden_states, idx + 1, hidden_states, residual\n" | ||
| " )\n\n" | ||
| " hidden_states, _ = self.norm(hidden_states, residual)\n\n" | ||
| " if len(aux_hidden_states) > 0:\n" | ||
| " return hidden_states, aux_hidden_states\n" | ||
| " return hidden_states", | ||
| ), | ||
| ( | ||
| "class MiniMaxM3SparseForCausalLM(nn.Module):", | ||
| "class MiniMaxM3SparseForCausalLM(nn.Module, SupportsEagle3):", | ||
| ), | ||
| ( | ||
| "class MiniMaxM3SparseForConditionalGeneration(nn.Module, SupportsMultiModal):", | ||
| "class MiniMaxM3SparseForConditionalGeneration(nn.Module, SupportsMultiModal, SupportsEagle3):", | ||
| ), | ||
| ] | ||
| | ||
| for old, new in edits: | ||
| count = src.count(old) | ||
| if count != 1: | ||
| sys.exit( | ||
| f"[eagle3-patch] anchor matched {count} times (expected 1); " | ||
| f"installed {target} has drifted from the expected base — aborting" | ||
| ) | ||
| src = src.replace(old, new) | ||
| | ||
| ast.parse(src) | ||
| target.write_text(src) | ||
| print(f"[eagle3-patch] applied EAGLE3 support to {target}") | ||
| PYEOF | ||
| | ||
| start_gpu_monitor | ||
| | ||
| set -x | ||
| vllm serve "$MODEL" --port "$PORT" \ | ||
| "${PARALLEL_ARGS[@]}" \ | ||
| --block-size 128 \ | ||
| --language-model-only \ | ||
| --max-model-len "$MAX_MODEL_LEN" \ | ||
| --kv-cache-dtype fp8 \ | ||
| --attention-backend TRITON_ATTN \ | ||
| --enforce-eager \ | ||
| --speculative-config "{\"method\": \"eagle3\", \"model\": \"$DRAFT_MODEL\", \"num_speculative_tokens\": $NUM_SPEC_TOKENS}" \ | ||
| --tool-call-parser minimax_m3 \ | ||
| --reasoning-parser minimax_m3 \ | ||
| --enable-auto-tool-choice > "$SERVER_LOG" 2>&1 & | ||
| | ||
| SERVER_PID=$! | ||
| wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID" | ||
| | ||
| pip install -q datasets pandas | ||
| | ||
| # Spec-decode acceptance rate degrades on raw random tokens; route prompts | ||
| # through the chat template as the other MTP recipes do. | ||
| run_benchmark_serving \ | ||
| --model "$MODEL" \ | ||
| --port "$PORT" \ | ||
| --backend vllm \ | ||
| --input-len "$ISL" \ | ||
| --output-len "$OSL" \ | ||
| --random-range-ratio "$RANDOM_RANGE_RATIO" \ | ||
| --num-prompts "$((CONC * 10))" \ | ||
| --max-concurrency "$CONC" \ | ||
| --result-filename "$RESULT_FILENAME" \ | ||
| --result-dir /workspace/ \ | ||
| --trust-remote-code \ | ||
| --use-chat-template | ||
| | ||
| if [ "${RUN_EVAL}" = "true" ]; then | ||
| run_eval --framework lm-eval --port "$PORT" | ||
| append_lm_eval_summary | ||
| fi | ||
| | ||
| stop_gpu_monitor | ||
| set +x | ||
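
The in-place patch in this recipe rests on three guards: an "already applied" check, an exactly-one-match rule for every anchor, and an `ast.parse` of the result before anything is written back. The sketch below isolates that pattern on a toy source string so it can be exercised without a vLLM install. The function name `apply_anchored_edits`, its `applied_marker` argument, and the toy `Model`/`Mixin` classes are illustrative assumptions, not names from the recipe or from vLLM.

```python
import ast


def apply_anchored_edits(src, edits, applied_marker):
    """Return (new_src, changed) after applying (old, new) string edits.

    Hypothetical helper mirroring the recipe's heredoc: it is a no-op when
    the marker is already present, and it refuses to patch unless every
    anchor occurs exactly once in the source.
    """
    if applied_marker in src:
        return src, False  # already patched: leave the source untouched
    for old, new in edits:
        count = src.count(old)
        if count != 1:
            raise ValueError(
                f"anchor matched {count} times (expected 1): {old!r}"
            )
        src = src.replace(old, new)
    ast.parse(src)  # raises SyntaxError if the edits broke the module
    return src, True


# Toy stand-in for the installed module (hypothetical content).
BASE = "class Model(Base):\n    pass\n"
EDITS = [("class Model(Base):", "class Model(Base, Mixin):")]
MARKER = "class Model(Base, Mixin):"

patched, changed = apply_anchored_edits(BASE, EDITS, MARKER)
again, changed_again = apply_anchored_edits(patched, EDITS, MARKER)
```

Requiring exactly one match per anchor is what turns upstream drift into a hard failure: if the base file changes, an anchor matches zero times (or several) and the job stops instead of benchmarking an unpatched model under an EAGLE3 label.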
Draft download lacks NFS retry
Medium Severity
The recipe fetches the unstaged `Inferact/MiniMax-M3-EAGLE3` draft with a single `hf download` into the shared NFS HF cache, while the MI355X launcher mounts that cache for MiniMax-M3 runs. Parallel matrix jobs can contend on hub lock files the same way as other MiniMax MTP recipes on network storage, but this script omits the retry loop those siblings use for the draft.
Reviewed by Cursor Bugbot for commit 149e11e.
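
One way to picture the retry the comment asks for is a bounded loop around the same `hf download` command, with a growing pause between attempts so contending jobs have time to release hub lock files. This is only a sketch: the sibling recipes' actual loop is not shown in this PR, so the attempt count, the delay, the linear backoff, and the name `download_with_retry` are all assumptions. The `run` and `sleep` parameters exist only so the function can be exercised without touching the network.

```python
import subprocess
import time


def download_with_retry(repo_id, attempts=5, base_delay_s=10.0,
                        run=subprocess.run, sleep=time.sleep):
    """Run `hf download <repo_id>` until it succeeds or attempts run out.

    Returns the 1-based attempt number that succeeded. The defaults are
    placeholders, not values taken from the sibling recipes.
    """
    for attempt in range(1, attempts + 1):
        result = run(["hf", "download", repo_id])
        if result.returncode == 0:
            return attempt
        if attempt < attempts:
            # Linear backoff: wait longer after each failed attempt.
            sleep(base_delay_s * attempt)
    raise RuntimeError(
        f"hf download {repo_id} failed after {attempts} attempts"
    )
```

In the recipe this would stand in for the bare draft fetch, for example `download_with_retry("Inferact/MiniMax-M3-EAGLE3")`. An equivalent loop written directly in bash would avoid the extra Python hop.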