[WIP][NV] update minimaxm2.5 fp4 b200 vllm flag #1069
hshrivastava-droid wants to merge 5 commits into main from
Conversation
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Much of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.
> `export VLLM_FLOAT32_MATMUL_PRECISION=high`
🔴 This PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to minimaxm2.5_fp4_b200.sh but omits the same change from its B300 counterpart (minimaxm2.5_fp4_b300.sh), which explicitly states it reuses the B200 recipe as-is. Please add the same export to benchmarks/single_node/minimaxm2.5_fp4_b300.sh after the PORT assignment, before the DP_ATTENTION conditional block.
Extended reasoning
What the bug is and how it manifests
The PR adds export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b200.sh (line 27) as a performance optimization for NVFP4 matmul operations on B200 hardware. However, the structurally identical B300 script (benchmarks/single_node/minimaxm2.5_fp4_b300.sh) does not receive the same update. After this PR merges, the two scripts diverge in a way that contradicts the B300 script's own documented intent.
The specific code path that triggers it
The B300 script (lines 3-5) includes an explicit comment: "At the time of submission ... this script reuses the existing MiniMax-M2.5 FP4 B200 vLLM recipe as-is until B300-specific tuning is available." Both scripts use the same Docker image (vllm/vllm-openai:v0.19.0-cu130), the same NVFP4 model, and identical vllm serve arguments. Before this PR they were functionally equivalent; after this PR, B200 benchmarks run with VLLM_FLOAT32_MATMUL_PRECISION=high set and B300 benchmarks do not.
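Per the review comment, the change places the flag between the `PORT` assignment and the DP_ATTENTION conditional in the B200 script. A minimal sketch of that section after this PR (the surrounding lines are illustrative, not the script's exact contents):

```shell
# Illustrative sketch of minimaxm2.5_fp4_b200.sh after this PR.
# SERVER_LOG/PORT defaults are assumptions; only the export line is from the PR.
SERVER_LOG=${SERVER_LOG:-server.log}
PORT=${PORT:-8888}

# New in this PR: raise float32 matmul precision for the NVFP4 workload
export VLLM_FLOAT32_MATMUL_PRECISION=high

echo "precision flag: ${VLLM_FLOAT32_MATMUL_PRECISION}"
```

The B300 script has no equivalent export, which is the divergence the review flags.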
Why existing code does not prevent it
There is no shared configuration or template mechanism that would automatically propagate the new env var to the B300 script. Each script is a standalone shell file, so the author must manually mirror changes. The B300 comment documents the intent to stay in sync with B200, but there is no enforcement.
What the impact would be
B200 and B300 benchmarks for the same MiniMax-M2.5 NVFP4 model will run under different PyTorch matmul precision settings, making the results incomparable. If VLLM_FLOAT32_MATMUL_PRECISION=high improves throughput or latency on B200 (motivating the change), the B300 numbers will be artificially lower than they should be, understating B300 performance relative to B200.
How to fix it
Add export VLLM_FLOAT32_MATMUL_PRECISION=high to benchmarks/single_node/minimaxm2.5_fp4_b300.sh immediately after line 30 (PORT=${PORT:-8888}), mirroring the placement in the B200 script. The perf-changelog.yaml entry should also reference the minimaxm2.5-fp4-b300-vllm config-key alongside minimaxm2.5-fp4-b200-vllm.
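One way to apply the fix mechanically is a `sed` append after the `PORT` assignment. This sketch operates on a temporary stand-in file so it is self-contained; against the real repo you would target `benchmarks/single_node/minimaxm2.5_fp4_b300.sh`, and the `PORT=` pattern is an assumption about that script's exact contents:

```shell
# Create a stand-in copy of the B300 script's relevant section.
script=$(mktemp)
cat > "$script" <<'EOF'
SERVER_LOG=${SERVER_LOG:-server.log}
PORT=${PORT:-8888}
EOF

# Insert the export immediately after the PORT assignment, mirroring B200
# (GNU sed one-line append syntax).
sed -i '/^PORT=/a export VLLM_FLOAT32_MATMUL_PRECISION=high' "$script"

grep -n VLLM_FLOAT32_MATMUL_PRECISION "$script"
```

The `grep -n` at the end confirms the export landed on the line after the `PORT` assignment.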
Step-by-step proof
- Before this PR, both the B200 and B300 scripts set `SERVER_LOG` and `PORT`, then jump directly to the DP_ATTENTION conditional block with no env var in between.
- This PR inserts `export VLLM_FLOAT32_MATMUL_PRECISION=high` between the PORT line and the DP_ATTENTION check in the B200 script only.
- After the PR, `grep VLLM_FLOAT32_MATMUL_PRECISION benchmarks/single_node/minimaxm2.5_fp4_b200.sh` returns a match; the same grep on the B300 script returns nothing.
- A benchmark job launched against a B300 node using `minimaxm2.5_fp4_b300.sh` will therefore start the vLLM server without the high-precision matmul flag, while an equivalent B200 job benefits from it -- inconsistent with the B300 script's stated goal of reusing the B200 recipe.
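The grep-based check above can be scripted as a consistency test. This demo creates stand-in copies of the two scripts (their contents here are illustrative) and then greps both for the flag:

```shell
# Self-contained demo: stand-in copies of the B200/B300 scripts in a temp dir.
dir=$(mktemp -d)
printf 'PORT=${PORT:-8888}\nexport VLLM_FLOAT32_MATMUL_PRECISION=high\n' \
  > "$dir/minimaxm2.5_fp4_b200.sh"
printf 'PORT=${PORT:-8888}\n' > "$dir/minimaxm2.5_fp4_b300.sh"

# Report flag presence per hardware variant.
for f in "$dir"/minimaxm2.5_fp4_*.sh; do
  if grep -q VLLM_FLOAT32_MATMUL_PRECISION "$f"; then
    echo "$(basename "$f"): flag present"
  else
    echo "$(basename "$f"): flag MISSING"
  fi
done
```

Against the real repo, pointing the loop at `benchmarks/single_node/` would flag the divergence this review describes until the B300 script is updated.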