[WIP][NV] update minimaxm2.5 fp4 b200 vllm flag #1069
hshrivastava-droid wants to merge 5 commits into main from
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
> `export VLLM_FLOAT32_MATMUL_PRECISION=high`
🔴 This PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to minimaxm2.5_fp4_b200.sh but omits the same change from its B300 counterpart (minimaxm2.5_fp4_b300.sh), which explicitly states it reuses the B200 recipe as-is. Please add the same export to benchmarks/single_node/minimaxm2.5_fp4_b300.sh after the PORT assignment, before the DP_ATTENTION conditional block.
Extended reasoning
What the bug is and how it manifests
The PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b200.sh (line 27) as a performance optimization for NVFP4 matmul operations on B200 hardware. However, the structurally identical B300 script (benchmarks/single_node/minimaxm2.5_fp4_b300.sh) does not receive the same update. After this PR merges, the two scripts diverge in a way that contradicts the B300 script's own documented intent.
The specific code path that triggers it
The B300 script (lines 3-5) includes an explicit comment: "At the time of submission ... this script reuses the existing MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available." Both scripts use the same Docker image (vllm/vllm-openai:v0.19.0-cu130), the same NVFP4 model, and identical vllm serve arguments. Before this PR they were functionally equivalent; after this PR, B200 benchmarks run with VLLM_FLOAT32_MATMUL_PRECISION=high set and B300 benchmarks do not.
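Per the review comment, the change places the flag between the `PORT` assignment and the DP_ATTENTION conditional in the B200 script. A minimal sketch of that section after this PR (the surrounding lines are illustrative, not the script's exact contents):

```shell
# Illustrative sketch of minimaxm2.5_fp4_b200.sh after this PR.
# SERVER_LOG/PORT defaults are assumptions; only the export line is from the PR.
SERVER_LOG=${SERVER_LOG:-server.log}
PORT=${PORT:-8888}

# New in this PR: raise float32 matmul precision for the NVFP4 workload
export VLLM_FLOAT32_MATMUL_PRECISION=high

echo "precision flag: ${VLLM_FLOAT32_MATMUL_PRECISION}"
```

The B300 script has no equivalent export, which is the divergence the review flags.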
Why existing code does not prevent it
There is no shared configuration or template mechanism that would automatically propagate the new env var to the B300 script. Each script is a standalone shell file, so the author must manually mirror changes. The B300 comment documents the intent to stay in sync with B200, but there is no enforcement.
What the impact would be
B200 and B300 benchmarks for the same MiniMax-M2.5 NVFP4 model will run under different PyTorch matmul precision settings, making the results incomparable. If VLLM_FLOAT32_MATMUL_PRECISION=high improves throughput or latency on B200 (motivating the change), the B300 numbers will be artificially lower than they should be, understating B300 performance relative to B200.
How to fix it
Add export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b300.sh immediately after line 30 (PORT=${PORT:-8888}), mirroring the placement in the B200 script. The perf-changelog.yaml entry should also reference the minimaxm2.5-fp4-b300-vllm config-key alongside minimaxm2.5-fp4-b200-vllm.
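One way to apply the fix mechanically is a `sed` append after the `PORT` assignment. This sketch operates on a temporary stand-in file so it is self-contained; against the real repo you would target `benchmarks/single_node/minimaxm2.5_fp4_b300.sh`, and the `PORT=` pattern is an assumption about that script's exact contents:

```shell
# Create a stand-in copy of the B300 script's relevant section.
script=$(mktemp)
cat > "$script" <<'EOF'
SERVER_LOG=${SERVER_LOG:-server.log}
PORT=${PORT:-8888}
EOF

# Insert the export immediately after the PORT assignment, mirroring B200
# (GNU sed one-line append syntax).
sed -i '/^PORT=/a export VLLM_FLOAT32_MATMUL_PRECISION=high' "$script"

grep -n VLLM_FLOAT32_MATMUL_PRECISION "$script"
```

The `grep -n` at the end confirms the export landed on the line after the `PORT` assignment.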
Step-by-step proof
- Before this PR, both the B200 and B300 scripts set `SERVER_LOG` and `PORT`, then jump directly to the DP_ATTENTION conditional block with no env var in between.
- This PR inserts `export VLLM_FLOAT32_MATMUL_PRECISION=high` between the PORT line and the DP_ATTENTION check in the B200 script only.
- After the PR, `grep VLLM_FLOAT32_MATMUL_PRECISION benchmarks/single_node/minimaxm2.5_fp4_b200.sh` returns a match; the same grep on the B300 script returns nothing.
- A benchmark job launched against a B300 node using `minimaxm2.5_fp4_b300.sh` will therefore start the vLLM server without the high-precision matmul flag, while an equivalent B200 job benefits from it -- inconsistent with the B300 script's stated goal of reusing the B200 recipe.
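The grep-based check above can be scripted as a consistency test. This demo creates stand-in copies of the two scripts (their contents here are illustrative) and then greps both for the flag:

```shell
# Self-contained demo: stand-in copies of the B200/B300 scripts in a temp dir.
dir=$(mktemp -d)
printf 'PORT=${PORT:-8888}\nexport VLLM_FLOAT32_MATMUL_PRECISION=high\n' \
  > "$dir/minimaxm2.5_fp4_b200.sh"
printf 'PORT=${PORT:-8888}\n' > "$dir/minimaxm2.5_fp4_b300.sh"

# Report flag presence per hardware variant.
for f in "$dir"/minimaxm2.5_fp4_*.sh; do
  if grep -q VLLM_FLOAT32_MATMUL_PRECISION "$f"; then
    echo "$(basename "$f"): flag present"
  else
    echo "$(basename "$f"): flag MISSING"
  fi
done
```

Against the real repo, pointing the loop at `benchmarks/single_node/` would flag the divergence this review describes until the B300 script is updated.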