
[WIP][NV] update minimaxm2.5 fp4 b200 vllm flag#1069

Open
hshrivastava-droid wants to merge 5 commits into main from minimaxm2.5_fp4_b200-v2

Conversation

@hshrivastava-droid
Collaborator

No description provided.

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines 27 to +28

export VLLM_FLOAT32_MATMUL_PRECISION=high
Contributor


🔴 This PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to minimaxm2.5_fp4_b200.sh but omits the same change from its B300 counterpart (minimaxm2.5_fp4_b300.sh), which explicitly states it reuses the B200 recipe as-is. Please add the same export to benchmarks/single_node/minimaxm2.5_fp4_b300.sh after the PORT assignment, before the DP_ATTENTION conditional block.

Extended reasoning...

What the bug is and how it manifests

The PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b200.sh (line 27) as a performance optimization for NVFP4 matmul operations on B200 hardware. However, the structurally identical B300 script (benchmarks/single_node/minimaxm2.5_fp4_b300.sh) does not receive the same update. After this PR merges, the two scripts diverge in a way that contradicts the B300 script's own documented intent.

The specific code path that triggers it

The B300 script (lines 3-5) includes an explicit comment: "At the time of submission ... this script reuses the existing MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available." Both scripts use the same Docker image (vllm/vllm-openai:v0.19.0-cu130), the same NVFP4 model, and identical vllm serve arguments. Before this PR they were functionally equivalent; after this PR, B200 benchmarks run with VLLM_FLOAT32_MATMUL_PRECISION=high set and B300 benchmarks do not.

Why existing code does not prevent it

There is no shared configuration or template mechanism that would automatically propagate the new env var to the B300 script. Each script is a standalone shell file, so the author must manually mirror changes. The B300 comment documents the intent to stay in sync with B200, but there is no enforcement.

What the impact would be

B200 and B300 benchmarks for the same MiniMax-M2.5 NVFP4 model will run under different PyTorch matmul precision settings, making the results incomparable. If VLLM_FLOAT32_MATMUL_PRECISION=high improves throughput or latency on B200 (motivating the change), the B300 numbers will be artificially lower than they should be, understating B300 performance relative to B200.

How to fix it

Add export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b300.sh immediately after line 30 (PORT=${PORT:-8888}), mirroring the placement in the B200 script. The perf-changelog.yaml entry should also reference the minimaxm2.5-fp4-b300-vllm config-key alongside minimaxm2.5-fp4-b200-vllm.
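The suggested placement can be sketched as follows. The `SERVER_LOG` default and the body of the `DP_ATTENTION` conditional are placeholders paraphrased from the surrounding description, not copied from the actual B300 script; only the exported flag mirrors the B200 change.

```shell
#!/bin/sh
# Sketch of the relevant section of minimaxm2.5_fp4_b300.sh after the fix.
# SERVER_LOG default and the DP_ATTENTION block body are assumptions.
SERVER_LOG=${SERVER_LOG:-server.log}
PORT=${PORT:-8888}

# Mirror the B200 recipe: run fp32 matmuls at high precision.
export VLLM_FLOAT32_MATMUL_PRECISION=high

if [ "${DP_ATTENTION:-0}" = "1" ]; then
    echo "launching with data-parallel attention"
fi

echo "VLLM_FLOAT32_MATMUL_PRECISION=${VLLM_FLOAT32_MATMUL_PRECISION} (port ${PORT})"
```

The key point is that the export lands between the `PORT` assignment and the `DP_ATTENTION` check, matching the B200 script, so the env var is in the server's environment regardless of which branch of the conditional runs.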

Step-by-step proof

  1. Before this PR, both B200 and B300 scripts set SERVER_LOG and PORT, then jump directly to the DP_ATTENTION conditional block with no env var in between.
  2. This PR inserts export VLLM_FLOAT32_MATMUL_PRECISION=high between the PORT line and the DP_ATTENTION check in the B200 script only.
  3. After the PR, grep VLLM_FLOAT32_MATMUL_PRECISION benchmarks/single_node/minimaxm2.5_fp4_b200.sh returns a match; the same grep on the B300 script returns nothing.
  4. A benchmark job launched against a B300 node using minimaxm2.5_fp4_b300.sh will therefore start the vLLM server without the high-precision matmul flag, while an equivalent B200 job benefits from it, which is inconsistent with the B300 script's stated goal of reusing the B200 recipe.
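The grep check from step 3 can be reproduced against stand-in files; the real files under benchmarks/single_node are assumed to look roughly like these two-line stubs.

```shell
#!/bin/sh
# Stand-in copies of the two scripts: the B200 stub carries the new
# export, the B300 stub does not.
b200=$(mktemp); b300=$(mktemp)
printf 'PORT=${PORT:-8888}\nexport VLLM_FLOAT32_MATMUL_PRECISION=high\n' > "$b200"
printf 'PORT=${PORT:-8888}\n' > "$b300"

# grep -c prints the number of matching lines; it exits non-zero when
# there are no matches, hence the `|| true` on the B300 file.
hits_b200=$(grep -c VLLM_FLOAT32_MATMUL_PRECISION "$b200")
hits_b300=$(grep -c VLLM_FLOAT32_MATMUL_PRECISION "$b300" || true)

echo "b200 matches: $hits_b200, b300 matches: $hits_b300"
rm -f "$b200" "$b300"
```

A non-zero count for the B200 file and a zero count for the B300 file is exactly the divergence this review flags.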

