Skip to content

Add B300 config: kimi-k2.5-int4-vllm#1071

Open
cquil11 wants to merge 2 commits intomainfrom
reopen-1057-kimik2.5-int4-b300
Open

Add B300 config: kimi-k2.5-int4-vllm#1071
cquil11 wants to merge 2 commits intomainfrom
reopen-1057-kimik2.5-int4-b300

Conversation

@cquil11
Copy link
Copy Markdown
Collaborator

@cquil11 cquil11 commented Apr 17, 2026

Warning

This PR is a reopen of #1057, which was reverted in #1070 due to an error with the first PR. The contents and description are otherwise identical to the original.

Summary

  • Add kimik2.5-int4-b300-vllm benchmark config and the corresponding benchmarks/single_node/kimik2.5_int4_b300.sh launch script
  • At the time of submission, the vLLM Kimi-K2.5 recipes page does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available
  • Image: vllm/vllm-openai:v0.15.1 (same as B200), runner: b300, same TP=8 and concurrency 4-64 search-space as B200

Test plan

  • CI config validation passes
  • Run kimik2.5-int4-b300-vllm single-node benchmark on a B300 node and confirm server starts, benchmark completes, and result file is produced

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

2 similar comments
@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

@github-actions
Copy link
Copy Markdown
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes is similar to the official vLLM recipes and/or the SGLang cookbook

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread perf-changelog.yaml
Comment on lines +1473 to 1481
- kimik2.5-int4-b300-vllm
description:
- "Add Kimi-K2.5 INT4 B300 vLLM benchmark"
- "Image: vllm/vllm-openai:v0.19.0-cu130"
- "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1057

- config-keys:
- gptoss-fp4-mi300x-vllm
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The new kimik2.5-int4-b300-vllm entry in perf-changelog.yaml has two documentation issues: (1) pr-link points to the reverted PR #1057 instead of the current PR #1071, and (2) the entry was inserted before the existing gptoss-fp4-mi300x-vllm entry rather than appended at the very end of the file, violating the AGENTS.md ordering convention. Both can be fixed by moving the new entry to the bottom of the file and updating the pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/1071.

Extended reasoning...

Issue 1: Wrong PR link

The new kimik2.5-int4-b300-vllm entry carries pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1057. PR #1057 was the original submission that was subsequently reverted entirely by PR #1070. The current PR, #1071, is explicitly described as a reopen of #1057 with otherwise identical contents. Since PR #1057 never landed in main (it was reverted), the canonical PR that actually introduces this config is #1071. The pr-link field is used as historical documentation linking a config to the PR that merged it, so pointing it at a reverted PR is incorrect — anyone following the link will reach a reverted (and confusing) PR rather than the one that introduced the change.

Issue 2: Entry inserted in the wrong position

AGENTS.md line 159 states: "The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend." The diff shows the new kimik2.5-int4-b300-vllm block was inserted between the kimik2.5-fp4-b300-vllm entry (PR #1056) and the gptoss-fp4-mi300x-vllm entry (PR #1053). After the PR merges, gptoss-fp4-mi300x-vllm remains the last entry in the file, while the newer kimik2.5-int4-b300-vllm entry sits above it — a clear ordering inversion.

Why existing code doesn't prevent this

There is no automated enforcement of the append-only rule or the pr-link value in the CI pipeline. The AGENTS.md instruction is a human convention. When the author copied the entry from PR #1057, both the pr-link value and the insertion position were carried over as-is, bypassing the update.

Impact

The impact is limited to documentation accuracy. The wrong pr-link means anyone using the changelog to trace config history will land on the reverted PR, potentially causing confusion about whether the config is actually live. The ordering violation makes the changelog harder to read chronologically and sets a precedent for future out-of-order insertions.

How to fix

Move the entire kimik2.5-int4-b300-vllm block to the very bottom of perf-changelog.yaml (after the gptoss-fp4-mi300x-vllm entry) and update its pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/1071.

Step-by-step proof

  1. The last entry before this PR was gptoss-fp4-mi300x-vllm (PR [AMD][MI300X] Expand GPT-OSS FP4 TP=1 concurrency from 64 to 256 #1053).
  2. The diff shows the new block inserted at the position before gptoss-fp4-mi300x-vllm — confirmed by the diff context showing - config-keys: [gptoss-fp4-mi300x-vllm]… appearing immediately after the added block.
  3. After the merge, tailing perf-changelog.yaml shows the order: …kimik2.5-fp4-b300-vllmkimik2.5-int4-b300-vllmgptoss-fp4-mi300x-vllm, so PR [AMD][MI300X] Expand GPT-OSS FP4 TP=1 concurrency from 64 to 256 #1053's entry is still last despite being older.
  4. The pr-link value in the new entry is pull/1057, which maps to the reverted PR, not the current one (pull/1071).

Comment on lines +2010 to +2014
# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
# does not have a B300-specific recipe, so this config reuses the existing
# Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available.
kimik2.5-int4-b300-vllm:
image: vllm/vllm-openai:v0.19.0-cu130
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🟡 The PR description explicitly states "Image: vllm/vllm-openai:v0.15.1 (same as B200)", but the actual B300 config uses vllm/vllm-openai:v0.19.0-cu130 — a different image. The image choice for B300 is intentional and correct (cu130 is required for Blackwell hardware), but the PR description is factually inaccurate; the perf-changelog.yaml already correctly records v0.19.0-cu130, so updating the PR description to match would remove the inconsistency.

Extended reasoning...

What the bug is and how it manifests

The PR description says: "Image: vllm/vllm-openai:v0.15.1 (same as B200)", but the actual config in nvidia-master.yaml sets image: vllm/vllm-openai:v0.19.0-cu130. The B200 config (kimik2.5-int4-b200-vllm) uses vllm/vllm-openai:v0.15.1. These are two different images with different vLLM versions and different CUDA base layers.

The specific locations

  1. .github/configs/nvidia-master.yaml lines 2010–2014: the kimik2.5-int4-b300-vllm entry carries image: vllm/vllm-openai:v0.19.0-cu130.
  2. The PR description body literally says v0.15.1 (same as B200).
  3. benchmarks/single_node/kimik2.5_int4_b300.sh header comment says the script reuses the B200 recipe "as-is".
  4. perf-changelog.yaml (correctly) records the image as v0.19.0-cu130, contradicting the PR description.

Why existing code doesn't prevent it

No automated check compares the image name stated in a PR description against the YAML config, so the factual error in the PR description goes undetected.

Addressing the refutation

The refutation correctly notes that the "as-is" language in the inline comments refers to the serving parameters (TP=8, concurrency 4–64, GPU memory utilization, vLLM flags), not the Docker image tag, and that all other B300 configs follow the same pattern of bumping to a cu130 image while keeping the rest of the recipe unchanged. This interpretation is reasonable for the code comments. However, the PR description goes further: it explicitly names a specific image version (v0.15.1) and asserts "same as B200", which is unambiguously wrong regardless of interpretation. The perf-changelog entry in the same PR already records the correct image (v0.19.0-cu130), confirming the config itself is intentional and correct.

Impact

The actual runtime behavior is correct — the B300 job will use v0.19.0-cu130 as required by Blackwell architecture. The only harm is documentation inaccuracy: anyone reading the PR description to understand what changed will be told the wrong image version, potentially causing confusion when trying to reproduce results or audit the change history.

How to fix

Update the PR description to read: "Image: vllm/vllm-openai:v0.19.0-cu130 (cu130 required for B300; B200 uses v0.15.1)." Optionally, tighten the "as-is" comment to clarify that the serving parameters are reused as-is while the image is bumped to the standard B300 cu130 image.

Step-by-step proof

  1. Look up kimik2.5-int4-b200-vllm in nvidia-master.yaml: image: vllm/vllm-openai:v0.15.1.
  2. Look up kimik2.5-int4-b300-vllm in the same file (added by this PR): image: vllm/vllm-openai:v0.19.0-cu130.
  3. Open the PR description: "Image: vllm/vllm-openai:v0.15.1 (same as B200)".
  4. v0.15.1 ≠ v0.19.0-cu130 — the PR description is factually wrong.
  5. The perf-changelog.yaml entry added in the same PR correctly records "Image: vllm/vllm-openai:v0.19.0-cu130", confirming the author knows the actual image but did not update the PR description.

@xinli-sw
Copy link
Copy Markdown

Hi @cquil11 , there were some recent fixes that went in to support this combo, we are waiting on a vLLM release. If you don't mind we can let you know when the release is ready

@cquil11
Copy link
Copy Markdown
Collaborator Author

cquil11 commented Apr 20, 2026

@xinli-sw Can you have a look at this one that I re-opened as well, please? #1100

cquil11 and others added 2 commits April 20, 2026 19:25
- Add kimik2.5-int4-b300-vllm benchmark config and the corresponding benchmarks/single_node/kimik2.5_int4_b300.sh launch script
- At the time of submission, the vLLM Kimi-K2.5 recipes page does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available
- Image: vllm/vllm-openai:v0.15.1 (same as B200), runner: b300, same TP=8 and concurrency 4-64 search-space as B200
The B300 script pins vllm/vllm-openai:v0.19.0-cu130, which removed the
--disable-log-requests CLI flag. Passing it causes the server to exit
immediately:

  vllm: error: unrecognized arguments: --disable-log-requests
  Server died before becoming healthy. Exiting.

Logging is off by default in newer vLLM releases, so removing the flag
is sufficient.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the reopen-1057-kimik2.5-int4-b300 branch from 0f04bb7 to ce47b76 Compare April 20, 2026 23:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: No status

Development

Successfully merging this pull request may close these issues.

3 participants