19 changes: 9 additions & 10 deletions .github/configs/nvidia-master.yaml
@@ -3537,21 +3537,20 @@ minimaxm2.5-fp4-b200-vllm:
 - isl: 1024
   osl: 1024
   search-space:
-    - { tp: 1, conc-start: 4, conc-end: 4 }
-    - { tp: 2, conc-start: 4, conc-end: 512 }
-    - { tp: 2, ep: 2, conc-start: 128, conc-end: 256 }
-    - { tp: 2, ep: 2, dp-attn: true, conc-start: 512, conc-end: 512 }
-    - { tp: 4, conc-start: 4, conc-end: 512 }
-    - { tp: 4, ep: 4, conc-start: 32, conc-end: 128 }
-    - { tp: 8, conc-start: 4, conc-end: 4 }
+    - { tp: 1, conc-start: 4, conc-end: 16 }
+    - { tp: 2, conc-start: 16, conc-end: 16 }
+    - { tp: 2, ep: 2, conc-start: 128, conc-end: 128 }
+    - { tp: 2, ep: 2, dp-attn: true, conc-start: 256, conc-end: 1024 }
+    - { tp: 4, conc-start: 4, conc-end: 16 }
+    - { tp: 4, ep: 4, conc-start: 64, conc-end: 128 }
+    - { tp: 8, conc-start: 4, conc-end: 8 }
 - isl: 8192
   osl: 1024
   search-space:
     - { tp: 1, conc-start: 4, conc-end: 32 }
-    - { tp: 1, conc-start: 256, conc-end: 512 }
-    - { tp: 2, conc-start: 4, conc-end: 512 }
+    - { tp: 1, conc-start: 256, conc-end: 256 }
     - { tp: 2, ep: 2, conc-start: 128, conc-end: 512 }
-    - { tp: 4, conc-start: 4, conc-end: 512 }
+    - { tp: 4, conc-start: 4, conc-end: 8 }
     - { tp: 8, conc-start: 4, conc-end: 4 }

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/MiniMax/MiniMax-M2.html
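Each search-space entry above describes a concurrency sweep from conc-start to conc-end. A hypothetical sketch of how a harness might expand one entry into individual runs, assuming power-of-two stepping (inferred from the endpoints in the config; the actual benchmark runner and the name `expand` are not from the repo):

```python
# Hypothetical helper (not from the repo): expand one search-space entry into
# per-concurrency benchmark runs, assuming the sweep doubles from conc-start
# to conc-end (inferred from the power-of-two endpoints in the config).
def expand(entry):
    runs = []
    conc = entry["conc-start"]
    while conc <= entry["conc-end"]:
        runs.append({
            "tp": entry.get("tp", 1),                # tensor-parallel size
            "ep": entry.get("ep", 1),                # expert-parallel size
            "dp_attn": entry.get("dp-attn", False),  # data-parallel attention
            "conc": conc,                            # request concurrency
        })
        conc *= 2
    return runs

# One of the new 1024/1024 entries expands to three runs: conc 256, 512, 1024.
runs = expand({"tp": 2, "ep": 2, "dp-attn": True, "conc-start": 256, "conc-end": 1024})
print([r["conc"] for r in runs])  # [256, 512, 1024]
```

Under this reading, the PR narrows several sweeps (e.g. tp: 2 from 4-512 down to a single point at 16) rather than adding new parallelism shapes.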
2 changes: 2 additions & 0 deletions benchmarks/single_node/minimaxm2.5_fp4_b200.sh
@@ -25,6 +25,8 @@ hf download "$MODEL"
 SERVER_LOG=/workspace/server.log
 PORT=${PORT:-8888}
 
+export VLLM_FLOAT32_MATMUL_PRECISION=high
+
Comment on lines 27 to +28
🔴 This PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to minimaxm2.5_fp4_b200.sh but omits the same change from its B300 counterpart (minimaxm2.5_fp4_b300.sh), which explicitly states it reuses the B200 recipe as-is. Please add the same export to benchmarks/single_node/minimaxm2.5_fp4_b300.sh after the PORT assignment, before the DP_ATTENTION conditional block.

Extended reasoning...

What the bug is and how it manifests

The PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b200.sh (line 27) as a performance optimization for NVFP4 matmul operations on B200 hardware. However, the structurally identical B300 script (benchmarks/single_node/minimaxm2.5_fp4_b300.sh) does not receive the same update. After this PR merges, the two scripts diverge in a way that contradicts the B300 script's own documented intent.
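If, as the name suggests, vLLM forwards VLLM_FLOAT32_MATMUL_PRECISION to PyTorch's global matmul-precision knob (an assumption about vLLM internals, not shown in the diff), the setting corresponds to the stock PyTorch API:

```python
# Assumption: vLLM maps VLLM_FLOAT32_MATMUL_PRECISION onto this PyTorch setting.
# "highest" keeps full-FP32 matmuls; "high" lets kernels use TF32 (or bf16x3)
# internally, which is faster on tensor-core GPUs at slightly reduced precision.
import torch

torch.set_float32_matmul_precision("high")
print(torch.get_float32_matmul_precision())  # "high"
```

Because the knob is process-global, setting it on one hardware target but not the other changes every float32 matmul the server issues, which is exactly why the two scripts diverge.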

The specific code path that triggers it

The B300 script (lines 3-5) includes an explicit comment: "At the time of submission ... this script reuses the existing MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available." Both scripts use the same Docker image (vllm/vllm-openai:v0.19.0-cu130), the same NVFP4 model, and identical vllm serve arguments. Before this PR they were functionally equivalent; after this PR, B200 benchmarks run with VLLM_FLOAT32_MATMUL_PRECISION=high set and B300 benchmarks do not.

Why existing code does not prevent it

There is no shared configuration or template mechanism that would automatically propagate the new env var to the B300 script. Each script is a standalone shell file, so the author must manually mirror changes. The B300 comment documents the intent to stay in sync with B200, but there is no enforcement.

What the impact would be

B200 and B300 benchmarks for the same MiniMax-M2.5 NVFP4 model will run under different PyTorch matmul precision settings, making the results incomparable. If VLLM_FLOAT32_MATMUL_PRECISION=high improves throughput or latency on B200 (motivating the change), the B300 numbers will be artificially lower than they should be, understating B300 performance relative to B200.

How to fix it

Add export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b300.sh immediately after line 30 (PORT=${PORT:-8888}), mirroring the placement in the B200 script. The perf-changelog.yaml entry should also reference the minimaxm2.5-fp4-b300-vllm config-key alongside minimaxm2.5-fp4-b200-vllm.
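A minimal sketch of the suggested B300 edit, assuming the script mirrors the B200 layout around the PORT assignment (hypothetical excerpt, not the full script):

```shell
# Hypothetical excerpt of benchmarks/single_node/minimaxm2.5_fp4_b300.sh after
# the suggested fix; placement mirrors the B200 script: between the PORT
# assignment and the DP_ATTENTION conditional block.
SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

export VLLM_FLOAT32_MATMUL_PRECISION=high
```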

Step-by-step proof

  1. Before this PR, both B200 and B300 scripts set SERVER_LOG and PORT, then jump directly to the DP_ATTENTION conditional block with no env var in between.
  2. This PR inserts export VLLM_FLOAT32_MATMUL_PRECISION=high between the PORT line and the DP_ATTENTION check in the B200 script only.
  3. After the PR, grep VLLM_FLOAT32_MATMUL_PRECISION benchmarks/single_node/minimaxm2.5_fp4_b200.sh returns a match; the same grep on the B300 script returns nothing.
  4. A benchmark job launched against a B300 node using minimaxm2.5_fp4_b300.sh will therefore start the vLLM server without the high-precision matmul flag, while an equivalent B200 job benefits from it, which is inconsistent with the B300 script's stated goal of reusing the B200 recipe.
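The grep check in step 3 can be reproduced locally against stand-in files (the real scripts live under benchmarks/single_node/; these temp copies only mimic the relevant lines):

```shell
# Stand-ins for the two scripts: only the B200 copy carries the new export.
mkdir -p /tmp/ix-demo
printf 'PORT=${PORT:-8888}\nexport VLLM_FLOAT32_MATMUL_PRECISION=high\n' > /tmp/ix-demo/b200.sh
printf 'PORT=${PORT:-8888}\n' > /tmp/ix-demo/b300.sh

# grep -l prints only the matching filename, demonstrating the divergence.
grep -l VLLM_FLOAT32_MATMUL_PRECISION /tmp/ix-demo/*.sh
```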


 if [ "${DP_ATTENTION}" = "true" ]; then
   PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
 elif [ "$EP_SIZE" -gt 1 ]; then
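The conditional above selects parallelism flags from the search-space fields. A runnable sketch for one entry (tp: 2, ep: 2, dp-attn: true); the elif and else bodies are assumptions, since the diff cuts off after the elif line:

```shell
# Sketch of the flag-selection logic for one search-space entry.
# The elif/else bodies are assumed; the diff truncates after the elif line.
TP=2
EP_SIZE=2
DP_ATTENTION=true

if [ "${DP_ATTENTION}" = "true" ]; then
  # dp-attn: run attention data-parallel, experts expert-parallel
  PARALLEL_ARGS="--tensor-parallel-size=1 --data-parallel-size=$TP --enable-expert-parallel"
elif [ "$EP_SIZE" -gt 1 ]; then
  # assumed branch: expert parallelism layered on tensor parallelism
  PARALLEL_ARGS="--tensor-parallel-size=$TP --enable-expert-parallel"
else
  # assumed branch: plain tensor parallelism
  PARALLEL_ARGS="--tensor-parallel-size=$TP"
fi

echo "$PARALLEL_ARGS"
```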
6 changes: 6 additions & 0 deletions perf-changelog.yaml
@@ -1638,3 +1638,9 @@
   description:
     - "Add kv-cache-dtype fp8, max-cudagraph-capture-size 2048, max-num-batched-tokens, and stream-interval 20 to server launch args"
   pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1047
+
+- config-keys:
+    - minimaxm2.5-fp4-b200-vllm
+  description:
+    - "Add VLLM_FLOAT32_MATMUL_PRECISION=high"
+  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1069
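Per the review comment's last point, if the B300 script receives the same export, the changelog entry would plausibly list both config keys (a sketch of the amended entry, not the merged change):

```yaml
- config-keys:
    - minimaxm2.5-fp4-b200-vllm
    - minimaxm2.5-fp4-b300-vllm
  description:
    - "Add VLLM_FLOAT32_MATMUL_PRECISION=high"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1069
```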