
Add B300 config: kimi-k2.5-fp4-vllm#1100

Merged
functionstackx merged 2 commits into main from reopen-1056-kimik2.5-fp4-b300
Apr 20, 2026

Conversation

Collaborator

@cquil11 cquil11 commented Apr 20, 2026

Warning

This PR is a reopen of #1056, which was reverted in #1099 due to an error with the first PR. The contents and description are otherwise identical to the original.

Summary

  • Add kimik2.5-fp4-b300-vllm benchmark config and the corresponding benchmarks/single_node/kimik2.5_fp4_b300.sh launch script
  • At the time of submission, the vLLM Kimi-K2.5 recipes page does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 FP4 (NVFP4) B200 vLLM recipe as-is until B300-specific tuning is available
  • Image: vllm/vllm-openai:v0.17.0 (same as B200), runner: b300, same TP/EP/concurrency search-space as B200
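A hedged sketch of what the new config block might look like; the key names here are assumptions inferred from the summary above (the actual schema follows the existing kimik2.5-fp4-b200-vllm entry in .github/configs/nvidia-master.yaml):

```yaml
# Hypothetical sketch only; real field names and the TP/EP/concurrency
# search-space mirror the existing kimik2.5-fp4-b200-vllm block.
kimik2.5-fp4-b300-vllm:
  image: vllm/vllm-openai:v0.17.0                      # same image as B200
  runner: b300                                         # only substantive change from B200
  script: benchmarks/single_node/kimik2.5_fp4_b300.sh  # new launch script
```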

Test plan

  • CI config validation passes
  • Run kimik2.5-fp4-b300-vllm single-node benchmark on a B300 node and confirm server starts, benchmark completes, and result file is produced

🤖 Generated with Claude Code

@cquil11 cquil11 requested a review from a team April 20, 2026 13:47
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR to those docs first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Contributor

@claude claude bot left a comment


LGTM — straightforward B300 config addition following established patterns; the stale pr-link in the changelog entry is a minor nit already flagged inline.

Extended reasoning...

Overview

This PR adds a new benchmark configuration kimik2.5-fp4-b300-vllm to three files: .github/configs/nvidia-master.yaml (new config block), benchmarks/single_node/kimik2.5_fp4_b300.sh (new launch script), and perf-changelog.yaml (new changelog entry). It is a direct re-addition of the change from PR #1056, which was reverted by PR #1099.

Security risks

None. These are benchmark configuration and shell script changes with no auth, crypto, or permission-sensitive code paths. The shell script follows the same pattern as all other single-node benchmark scripts in the repo.

Level of scrutiny

Low. The config block mirrors the existing kimik2.5-fp4-b200-vllm entry with only the runner changed to b300 and the image bumped to v0.19.0-cu130. The launch script is a copy of the B200 script with the same vLLM serve invocation. Comparable B300 entries for MiniMax-M2.5 FP4/FP8 (PRs #1054, #1055) used the identical approach.

Other factors

The only bug found is a [Nit]: the perf-changelog.yaml entry carries pr-link: .../pull/1056 (the reverted PR) instead of pull/1100. This is a cosmetic tracking issue that doesn't affect CI, benchmarks, or functionality. The inline comment already captures this.

Comment thread on perf-changelog.yaml:
- "Add Kimi-K2.5 FP4 (NVFP4) B300 vLLM benchmark"
- "Image: vllm/vllm-openai:v0.19.0-cu130"
- "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 FP4 B200 vLLM recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056
Contributor


🟡 The perf-changelog.yaml entry for kimik2.5-fp4-b300-vllm has pr-link pointing to PR #1056, which was explicitly reverted by PR #1099. Since this PR (#1100) is the one that will actually land the change, the pr-link should reference https://github.com/SemiAnalysisAI/InferenceX/pull/1100 instead.

Extended reasoning...

What the bug is and how it manifests

The new perf-changelog.yaml entry for kimik2.5-fp4-b300-vllm (added in this PR) sets pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056. PR #1056 was the original submission of this B300 config, but it was subsequently reverted by PR #1099 due to an error. The PR description for #1100 explicitly states: "This PR is a reopen of #1056, which was reverted in #1099 due to an error with the first PR." This means PR #1056 exists on GitHub in a reverted/superseded state, while PR #1100 is the change that will actually merge the configuration into main.

The specific code path that triggers it

In the diff, the new changelog block reads:

pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056

This was carried over from the original PR #1056 submission without being updated to reflect the current PR number (#1100).

Why existing code does not prevent it

The changelog is a manually maintained YAML file with no automated validation that cross-checks pr-link values against the actual PR being merged. There is no CI check to enforce that the link matches the current PR. The error is a straightforward copy-paste oversight when reopening the PR.
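Such a cross-check would be easy to automate; a hypothetical CI helper (check_pr_links is not an existing repo utility) might look like:

```python
import re

def check_pr_links(changelog_text: str, current_pr: int) -> list[str]:
    """Return pr-link values that do not point at the PR being merged.

    Hypothetical sketch of a validation step the repo currently lacks;
    it scans raw changelog text for pr-link fields and flags any link
    whose trailing path segment is not /pull/<current_pr>.
    """
    links = re.findall(r"pr-link:\s*(\S+)", changelog_text)
    return [link for link in links
            if not link.rstrip("/").endswith(f"/pull/{current_pr}")]

entry = "pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056"
# Flags the stale link when validated against PR 1100.
print(check_pr_links(entry, 1100))
```

In CI this could run against only the lines added in the diff, so pre-existing entries pointing at their own (older) PRs are not flagged.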

What the impact would be

When anyone later looks up the changelog entry for kimik2.5-fp4-b300-vllm to understand when and how it was introduced, they will follow the pr-link and land on PR #1056, which GitHub shows as reverted. This is misleading — it appears the feature was reverted rather than merged. The actual merging PR (#1100) would not be linked anywhere in the changelog, making historical tracking inaccurate.

How to fix it

Change the pr-link in the new changelog entry from pull/1056 to pull/1100.
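As a changelog fragment, the one-line fix (other fields of the entry unchanged):

```yaml
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1100
```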

Step-by-step proof

  1. PR #1056 ("Add B300 config: kimi-k2.5-fp4-vllm") was opened and merged, adding the kimik2.5-fp4-b300-vllm config.
  2. PR #1099 (Revert "Add B300 config: kimi-k2.5-fp4-vllm (#1056)" [skip-sweep]) explicitly reverted PR #1056 (commit: "Revert 'Add B300 config: kimi-k2.5-fp4-vllm (#1056)'").
  3. PR #1100 (this PR) reopens the same change; its description confirms: "This PR is a reopen of #1056, which was reverted in #1099."
  4. The new changelog entry in this PR's diff sets pr-link to /pull/1056 (the reverted PR) instead of /pull/1100 (this PR).
  5. Comparable B300 entries in the same file correctly reference their own PR numbers: minimaxm2.5-fp8-b300-vllm -> #1054, minimaxm2.5-fp4-b300-vllm -> #1055, dsr1-fp8-b300-sglang-mtp -> #1059.
  6. The pr-link should therefore be updated to https://github.com/SemiAnalysisAI/InferenceX/pull/1100.

@xinli-sw

Hi @cquil11 , I think we could do two things to get better perf

  1. B300 could benefit from more parallel configs; we can run some more and report back
  2. We should also include changes from https://github.com/SemiAnalysisAI/InferenceX/pull/1047/changes

Depending on how fast you want the data, we can either have #1047 as a fast follow of this PR or just wait until that's ready. Thanks!

@cquil11
Collaborator Author

cquil11 commented Apr 20, 2026

@xinli-sw No real rush, we can wait for y'all.

@functionstackx
Contributor

hi @xinli-sw it would be great if https://recipes.vllm.ai/ had B300-specific configs so that ppl can know what flags to use specifically for B300

@xinli-sw

yup for sure, I think it's mostly just having better parallel configs to make use of B300's higher mem size, will update shortly

- Add kimik2.5-fp4-b300-vllm benchmark config and the corresponding benchmarks/single_node/kimik2.5_fp4_b300.sh launch script
- At the time of submission, the vLLM Kimi-K2.5 recipes page does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 FP4 (NVFP4) B200 vLLM recipe as-is until B300-specific tuning is available
- Image: vllm/vllm-openai:v0.17.0 (same as B200), runner: b300, same TP/EP/concurrency search-space as B200
@functionstackx functionstackx force-pushed the reopen-1056-kimik2.5-fp4-b300 branch from fc0ae0d to 3f24821 on April 20, 2026 at 22:33
@functionstackx
Contributor

@claude look at the fix in #1104 for b200, add it to this b300 PR too

Some attention backends require the KV-cache block count to be a multiple
of 128 tokens. With block_size=32 that means block_num must be divisible
by 4. The previous ISL + OSL + 200 padding did not guarantee this
alignment (e.g. ISL=8192, OSL=1024 rounds to 295 blocks, not divisible
by 4), triggering:

  RuntimeError: Expected block_num % (128 / block_size) == 0,
  got block_num=295 and block_size=32

Bumping the headroom to +256 (any multiple of 128) keeps the context
length aligned. Mirrors the fix from #1104.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 20, 2026

Claude finished @functionstackx's task in 2m 14s


Applying B200 fix from PR #1104 to this B300 PR

Result

The fix from PR #1104 (bump max-model-len headroom from +200 to +256 for block-size alignment) is already included in this branch via commit 69a2584:

| File | Status |
| --- | --- |
| benchmarks/benchmark_lib.sh (setup_eval_context) | +256 |
| benchmarks/single_node/gptoss_fp4_b200.sh | +256 |
| benchmarks/single_node/gptoss_fp4_h200.sh | +256 |
| utils/matrix_logic/generate_sweep_configs.py (4 sites) | +256 |
| utils/matrix_logic/test_generate_sweep_configs.py | +256 |

The B300 kimi script (kimik2.5_fp4_b300.sh) itself doesn't need a direct change because it receives MAX_MODEL_LEN from the environment variable set by generate_sweep_configs.py, which already uses the +256 headroom. No additional changes needed.

@functionstackx functionstackx merged commit f34a83b into main Apr 20, 2026
25 checks passed
@functionstackx functionstackx deleted the reopen-1056-kimik2.5-fp4-b300 branch April 20, 2026 23:22