21 changes: 21 additions & 0 deletions .github/configs/nvidia-master.yaml
@@ -2199,6 +2199,27 @@ kimik2.5-int4-b200-vllm:
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64 }

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
# does not have a B300-specific recipe, so this config reuses the existing
# Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available.
kimik2.5-int4-b300-vllm:
  image: vllm/vllm-openai:v0.19.0-cu130
Comment on lines +2202 to +2206 (Contributor):
🟡 The PR description explicitly states "Image: vllm/vllm-openai:v0.15.1 (same as B200)", but the actual B300 config uses vllm/vllm-openai:v0.19.0-cu130, a different image. The image choice for B300 is intentional and correct (cu130 is required for Blackwell hardware), but the PR description is factually inaccurate. The perf-changelog.yaml already records v0.19.0-cu130 correctly, so updating the PR description to match would remove the inconsistency.

Extended reasoning...

What the bug is and how it manifests

The PR description says: "Image: vllm/vllm-openai:v0.15.1 (same as B200)", but the actual config in nvidia-master.yaml sets image: vllm/vllm-openai:v0.19.0-cu130. The B200 config (kimik2.5-int4-b200-vllm) uses vllm/vllm-openai:v0.15.1. These are two different images with different vLLM versions and different CUDA base layers.

The specific locations

  1. .github/configs/nvidia-master.yaml lines 2010–2014: the kimik2.5-int4-b300-vllm entry carries image: vllm/vllm-openai:v0.19.0-cu130.
  2. The PR description body literally says v0.15.1 (same as B200).
  3. benchmarks/single_node/kimik2.5_int4_b300.sh header comment says the script reuses the B200 recipe "as-is".
  4. perf-changelog.yaml (correctly) records the image as v0.19.0-cu130, contradicting the PR description.

Why existing code doesn't prevent it

No automated check compares the image name stated in a PR description against the YAML config, so the factual error in the PR description goes undetected.
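As a sketch of what such a check could look like, the snippet below compares the image tag claimed in a PR description against the parsed YAML config. The function name, the `Image:` regex convention, and the plain-dict config shape are assumptions for illustration, not an existing CI hook in this repository:

```python
import re


def check_pr_description_image(description, configs, config_key):
    """Compare the image tag claimed in a PR description against the
    parsed config mapping. Returns (ok, message).

    Hypothetical helper: assumes the description states the image as
    'Image: <tag>' and that `configs` is the YAML file loaded as a dict.
    """
    actual = configs[config_key]["image"]
    match = re.search(r"Image:\s*(\S+)", description)
    if match is None:
        # Description makes no image claim; nothing to verify.
        return True, "no image stated in description"
    claimed = match.group(1)
    if claimed != actual:
        return False, f"description says {claimed}, config uses {actual}"
    return True, "image tags match"


configs = {"kimik2.5-int4-b300-vllm": {"image": "vllm/vllm-openai:v0.19.0-cu130"}}
ok, msg = check_pr_description_image(
    "Image: vllm/vllm-openai:v0.15.1 (same as B200)",
    configs,
    "kimik2.5-int4-b300-vllm",
)
print(msg)  # reports the v0.15.1 vs v0.19.0-cu130 mismatch
```

Run against the values in this PR, the check would have flagged exactly the inconsistency described above.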

Addressing the refutation

The refutation correctly notes that the "as-is" language in the inline comments refers to the serving parameters (TP=8, concurrency 4–64, GPU memory utilization, vLLM flags), not the Docker image tag, and that all other B300 configs follow the same pattern of bumping to a cu130 image while keeping the rest of the recipe unchanged. This interpretation is reasonable for the code comments. However, the PR description goes further: it explicitly names a specific image version (v0.15.1) and asserts "same as B200", which is unambiguously wrong regardless of interpretation. The perf-changelog entry in the same PR already records the correct image (v0.19.0-cu130), confirming the config itself is intentional and correct.

Impact

The actual runtime behavior is correct — the B300 job will use v0.19.0-cu130 as required by Blackwell architecture. The only harm is documentation inaccuracy: anyone reading the PR description to understand what changed will be told the wrong image version, potentially causing confusion when trying to reproduce results or audit the change history.

How to fix

Update the PR description to read: "Image: vllm/vllm-openai:v0.19.0-cu130 (cu130 required for B300; B200 uses v0.15.1)." Optionally, tighten the "as-is" comment to clarify that the serving parameters are reused as-is while the image is bumped to the standard B300 cu130 image.

Step-by-step proof

  1. Look up kimik2.5-int4-b200-vllm in nvidia-master.yaml: image: vllm/vllm-openai:v0.15.1.
  2. Look up kimik2.5-int4-b300-vllm in the same file (added by this PR): image: vllm/vllm-openai:v0.19.0-cu130.
  3. Open the PR description: "Image: vllm/vllm-openai:v0.15.1 (same as B200)".
  4. v0.15.1 ≠ v0.19.0-cu130 — the PR description is factually wrong.
  5. The perf-changelog.yaml entry added in the same PR correctly records "Image: vllm/vllm-openai:v0.19.0-cu130", confirming the author knows the actual image but did not update the PR description.

  model: moonshotai/Kimi-K2.5
  model-prefix: kimik2.5
  runner: b300
  precision: int4
  framework: vllm
  multinode: false
  seq-len-configs:
    - isl: 1024
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64 }
    - isl: 8192
      osl: 1024
      search-space:
        - { tp: 8, conc-start: 4, conc-end: 64 }

kimik2.5-int4-h200-vllm:
  image: vllm/vllm-openai:v0.16.0
  model: moonshotai/Kimi-K2.5
80 changes: 80 additions & 0 deletions benchmarks/single_node/kimik2.5_int4_b300.sh
@@ -0,0 +1,80 @@
#!/usr/bin/env bash

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
# does not have a B300-specific recipe, so this script reuses the existing
# Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
  MODEL \
  TP \
  CONC \
  ISL \
  OSL \
  MAX_MODEL_LEN \
  RANDOM_RANGE_RATIO \
  RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
  echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

export PYTHONNOUSERSITE=1
export VLLM_USE_FLASHINFER_MOE_INT4=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
  setup_eval_context
  MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve "$MODEL" --host 0.0.0.0 --port "$PORT" \
  --gpu-memory-utilization 0.95 \
  --tensor-parallel-size "$TP" \
  --max-model-len "$MAX_MODEL_LEN" \
  --max-num-seqs "$CONC" \
  --reasoning-parser kimi_k2 \
  --tool-call-parser kimi_k2 \
  --compilation_config.pass_config.fuse_allreduce_rms true \
  --trust-remote-code \
  --no-enable-prefix-caching > "$SERVER_LOG" 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
  --model "$MODEL" \
  --port "$PORT" \
  --backend vllm \
  --input-len "$ISL" \
  --output-len "$OSL" \
  --random-range-ratio "$RANDOM_RANGE_RATIO" \
  --num-prompts $(( CONC * 10 )) \
  --max-concurrency "$CONC" \
  --result-filename "$RESULT_FILENAME" \
  --result-dir /workspace/ \
  --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
  run_eval --framework lm-eval --port "$PORT"
  append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -1479,6 +1479,14 @@
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 FP4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056

- config-keys:
    - kimik2.5-int4-b300-vllm
  description:
    - "Add Kimi-K2.5 INT4 B300 vLLM benchmark"
    - "Image: vllm/vllm-openai:v0.19.0-cu130"
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1057

- config-keys:
    - gptoss-fp4-mi300x-vllm
Comment on lines +1483 to 1491 (Contributor):

🟡 The new kimik2.5-int4-b300-vllm entry in perf-changelog.yaml has two documentation issues: (1) pr-link points to the reverted PR #1057 instead of the current PR #1071, and (2) the entry was inserted before the existing gptoss-fp4-mi300x-vllm entry rather than appended at the very end of the file, violating the AGENTS.md ordering convention. Both can be fixed by moving the new entry to the bottom of the file and updating the pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/1071.

Extended reasoning...

Issue 1: Wrong PR link

The new kimik2.5-int4-b300-vllm entry carries pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1057. PR #1057 was the original submission that was subsequently reverted entirely by PR #1070. The current PR, #1071, is explicitly described as a reopen of #1057 with otherwise identical contents. Since PR #1057 never landed in main (it was reverted), the canonical PR that actually introduces this config is #1071. The pr-link field is used as historical documentation linking a config to the PR that merged it, so pointing it at a reverted PR is incorrect — anyone following the link will reach a reverted (and confusing) PR rather than the one that introduced the change.

Issue 2: Entry inserted in the wrong position

AGENTS.md line 159 states: "The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend." The diff shows the new kimik2.5-int4-b300-vllm block was inserted between the kimik2.5-fp4-b300-vllm entry (PR #1056) and the gptoss-fp4-mi300x-vllm entry (PR #1053). After the PR merges, gptoss-fp4-mi300x-vllm remains the last entry in the file, while the newer kimik2.5-int4-b300-vllm entry sits above it — a clear ordering inversion.

Why existing code doesn't prevent this

There is no automated enforcement of the append-only rule or the pr-link value in the CI pipeline. The AGENTS.md instruction is a human convention. When the author copied the entry from PR #1057, both the pr-link value and the insertion position were carried over as-is, bypassing the update.
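A lightweight CI check along the following lines could enforce the append-only convention. This is a hypothetical sketch, not existing tooling; it uses PR numbers as a chronology proxy, which is only a heuristic (PR numbers reflect when a PR was opened, not when it merged):

```python
def check_append_only(entries):
    """Flag changelog entries whose pr-link number is lower than an
    earlier entry's, i.e. likely inserted out of chronological order.

    `entries` is the perf-changelog.yaml list loaded as Python dicts.
    """
    problems = []
    prev_num, prev_key = 0, None
    for entry in entries:
        # Extract the PR number from the trailing path segment of pr-link.
        num = int(entry["pr-link"].rsplit("/", 1)[-1])
        if num < prev_num:
            problems.append(
                f"{entry['config-keys'][0]} (PR #{num}) appears after "
                f"{prev_key} (PR #{prev_num})"
            )
        else:
            prev_num, prev_key = num, entry["config-keys"][0]
    return problems


# The tail of perf-changelog.yaml as it would look after this PR merges.
entries = [
    {"config-keys": ["kimik2.5-fp4-b300-vllm"],
     "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1056"},
    {"config-keys": ["kimik2.5-int4-b300-vllm"],
     "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1057"},
    {"config-keys": ["gptoss-fp4-mi300x-vllm"],
     "pr-link": "https://github.com/SemiAnalysisAI/InferenceX/pull/1053"},
]
for problem in check_append_only(entries):
    print(problem)  # reports the ordering inversion around the gptoss entry
```

On the post-merge ordering described above, the check reports exactly one inversion: the older PR #1053 entry trailing the newly inserted #1057 entry.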

Impact

The impact is limited to documentation accuracy. The wrong pr-link means anyone using the changelog to trace config history will land on the reverted PR, potentially causing confusion about whether the config is actually live. The ordering violation makes the changelog harder to read chronologically and sets a precedent for future out-of-order insertions.

How to fix

Move the entire kimik2.5-int4-b300-vllm block to the very bottom of perf-changelog.yaml (after the gptoss-fp4-mi300x-vllm entry) and update its pr-link to https://github.com/SemiAnalysisAI/InferenceX/pull/1071.
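Concretely, the relocated entry at the end of the file would look like this (assuming no other entries land after gptoss-fp4-mi300x-vllm in the meantime):

```yaml
# Appended after the gptoss-fp4-mi300x-vllm entry, per the AGENTS.md
# append-only convention; pr-link now points at the reopened PR.
- config-keys:
    - kimik2.5-int4-b300-vllm
  description:
    - "Add Kimi-K2.5 INT4 B300 vLLM benchmark"
    - "Image: vllm/vllm-openai:v0.19.0-cu130"
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1071
```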

Step-by-step proof

  1. The last entry before this PR was gptoss-fp4-mi300x-vllm (PR #1053, "[AMD][MI300X] Expand GPT-OSS FP4 TP=1 concurrency from 64 to 256").
  2. The diff shows the new block inserted immediately before gptoss-fp4-mi300x-vllm, confirmed by the diff context: the line `- config-keys: [gptoss-fp4-mi300x-vllm]…` appears immediately after the added block.
  3. After the merge, tailing perf-changelog.yaml shows the order …kimik2.5-fp4-b300-vllm, kimik2.5-int4-b300-vllm, gptoss-fp4-mi300x-vllm, so PR #1053's entry is still last despite being older.
  4. The pr-link value in the new entry is pull/1057, which maps to the reverted PR, not the current one (pull/1071).

description: