[vllm broken - waiting for 0.20] Add B300 config: kimi-k2.5-int4-vllm #1071

base: main
Changes from all commits
New file added by this PR (`@@ -0,0 +1,80 @@`):

```bash
#!/usr/bin/env bash

# NOTE: At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html
# does not have a B300-specific recipe, so this script reuses the existing
# Kimi-K2.5 INT4 B200 vLLM recipe as-is until B300-specific tuning is available.

source "$(dirname "$0")/../benchmark_lib.sh"

check_env_vars \
    MODEL \
    TP \
    CONC \
    ISL \
    OSL \
    MAX_MODEL_LEN \
    RANDOM_RANGE_RATIO \
    RESULT_FILENAME

if [[ -n "$SLURM_JOB_ID" ]]; then
    echo "JOB $SLURM_JOB_ID running on $SLURMD_NODENAME"
fi

hf download "$MODEL"

nvidia-smi

export PYTHONNOUSERSITE=1
export VLLM_USE_FLASHINFER_MOE_INT4=1

SERVER_LOG=/workspace/server.log
PORT=${PORT:-8888}

if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
vllm serve $MODEL --host 0.0.0.0 --port $PORT \
    --gpu-memory-utilization 0.95 \
    --tensor-parallel-size $TP \
    --max-model-len $MAX_MODEL_LEN \
    --max-num-seqs $CONC \
    --reasoning-parser kimi_k2 \
    --tool-call-parser kimi_k2 \
    --compilation_config.pass_config.fuse_allreduce_rms true \
    --trust-remote-code \
    --no-enable-prefix-caching > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

pip install -q datasets pandas

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts $(( CONC * 10 )) \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --trust-remote-code

# After throughput, run evaluation only if RUN_EVAL is true
if [ "${RUN_EVAL}" = "true" ]; then
    run_eval --framework lm-eval --port "$PORT"
    append_lm_eval_summary
fi

# Stop GPU monitoring
stop_gpu_monitor
set +x
```
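The script leans on helpers from `benchmark_lib.sh`, which is not part of this diff. As a rough sketch of what the `check_env_vars` guard might look like (hypothetical implementation; only the function name comes from the script above, everything else is an assumption):

```bash
# Hypothetical sketch of the check_env_vars helper sourced from
# benchmark_lib.sh: fail fast, listing every required variable that is
# unset or empty, instead of failing later mid-benchmark.
check_env_vars() {
    local var missing=0
    for var in "$@"; do
        # ${!var} is bash indirect expansion: the value of the variable
        # whose name is stored in $var.
        if [[ -z "${!var}" ]]; then
            echo "ERROR: required environment variable ${var} is not set" >&2
            missing=1
        fi
    done
    if (( missing )); then
        exit 1
    fi
}
```

Reporting all missing variables before exiting (rather than stopping at the first) makes a misconfigured Slurm submission cheap to fix in one pass.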
Change to perf-changelog.yaml (`@@ -1479,6 +1479,14 @@`):

```yaml
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 FP4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1056

- config-keys:
    - kimik2.5-int4-b300-vllm
  description:
    - "Add Kimi-K2.5 INT4 B300 vLLM benchmark"
    - "Image: vllm/vllm-openai:v0.19.0-cu130"
    - "At the time of submission, https://docs.vllm.ai/projects/recipes/en/latest/moonshotai/Kimi-K2.5.html does not have a B300-specific recipe, so this reuses the existing Kimi-K2.5 INT4 B200 vLLM recipe as-is"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1057

- config-keys:
    - gptoss-fp4-mi300x-vllm
```
**Comment on lines +1483 to 1491** (Contributor)

🟡 Two issues with the new changelog entry.

**Issue 1: Wrong PR link.** The new entry's `pr-link` points to https://github.com/SemiAnalysisAI/InferenceX/pull/1057, but this PR is #1071.

**Issue 2: Entry inserted in the wrong position.** AGENTS.md line 159 states: "The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend." The diff shows the new entry inserted before the existing `gptoss-fp4-mi300x-vllm` entry instead of appended at the end of the file.

**Why existing code doesn't prevent this.** There is no automated enforcement of the append-only rule or of the correctness of `pr-link` values.

**Impact.** Limited to documentation accuracy: the wrong `pr-link` misattributes the change, and the mid-file insertion breaks the changelog's chronological ordering.

**How to fix.** Move the entire new entry to the end of perf-changelog.yaml and change its `pr-link` to this PR.
🟡 The PR description explicitly states "Image: `vllm/vllm-openai:v0.15.1` (same as B200)", but the actual B300 config uses `vllm/vllm-openai:v0.19.0-cu130`, a different image. The image choice for B300 is intentional and correct (cu130 is required for Blackwell hardware), but the PR description is factually inaccurate; the perf-changelog.yaml already correctly records v0.19.0-cu130, so updating the PR description to match would remove the inconsistency.

**What the bug is and how it manifests**

The PR description says: "Image: `vllm/vllm-openai:v0.15.1` (same as B200)", but the actual config in nvidia-master.yaml sets `image: vllm/vllm-openai:v0.19.0-cu130`. The B200 config (`kimik2.5-int4-b200-vllm`) uses `vllm/vllm-openai:v0.15.1`. These are two different images with different vLLM versions and different CUDA base layers.

**The specific locations**

- `.github/configs/nvidia-master.yaml` lines 2010–2014: the `kimik2.5-int4-b300-vllm` entry carries `image: vllm/vllm-openai:v0.19.0-cu130`.
- The PR description claims `v0.15.1` (same as B200).
- The `benchmarks/single_node/kimik2.5_int4_b300.sh` header comment says the script reuses the B200 recipe "as-is".
- `perf-changelog.yaml` (correctly) records the image as `v0.19.0-cu130`, contradicting the PR description.

**Why existing code doesn't prevent it**

No automated check compares the image name stated in a PR description against the YAML config, so the factual error in the PR description goes undetected.

**Addressing the refutation**

The refutation correctly notes that the "as-is" language in the inline comments refers to the serving parameters (TP=8, concurrency 4–64, GPU memory utilization, vLLM flags), not the Docker image tag, and that all other B300 configs follow the same pattern of bumping to a cu130 image while keeping the rest of the recipe unchanged. This interpretation is reasonable for the code comments. However, the PR description goes further: it explicitly names a specific image version (`v0.15.1`) and asserts "same as B200", which is unambiguously wrong regardless of interpretation. The perf-changelog entry in the same PR already records the correct image (`v0.19.0-cu130`), confirming the config itself is intentional and correct.

**Impact**

The actual runtime behavior is correct: the B300 job will use `v0.19.0-cu130` as required by the Blackwell architecture. The only harm is documentation inaccuracy: anyone reading the PR description to understand what changed will be told the wrong image version, potentially causing confusion when trying to reproduce results or audit the change history.

**How to fix**

Update the PR description to read: "Image: `vllm/vllm-openai:v0.19.0-cu130` (cu130 required for B300; B200 uses `v0.15.1`)." Optionally, tighten the "as-is" comment to clarify that the serving parameters are reused as-is while the image is bumped to the standard B300 cu130 image.

**Step-by-step proof**

1. `kimik2.5-int4-b200-vllm` in nvidia-master.yaml: `image: vllm/vllm-openai:v0.15.1`.
2. `kimik2.5-int4-b300-vllm` in the same file (added by this PR): `image: vllm/vllm-openai:v0.19.0-cu130`.
3. The PR description claims "Image: `vllm/vllm-openai:v0.15.1` (same as B200)".
4. `v0.15.1` ≠ `v0.19.0-cu130`, so the PR description is factually wrong.
5. The perf-changelog.yaml entry added in the same PR correctly records `"Image: vllm/vllm-openai:v0.19.0-cu130"`, confirming the author knows the actual image but did not update the PR description.
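This class of description/config drift could also be caught mechanically. As a rough sketch (no such check exists in the repo; the helper name and the flat key-then-`image:` YAML layout assumed here are illustrative, not the real structure of nvidia-master.yaml):

```bash
# Hypothetical helper: print the image tag recorded for a given config key
# in a YAML file, so it can be compared against the version quoted in the
# PR description. Assumes the "image:" line follows the config-key line.
get_config_image() {
    local file="$1" key="$2"
    awk -v key="$key" '
        index($0, key) { found = 1 }            # literal substring match on the key
        found && /image:/ { print $2; exit }    # first image: line after it
    ' "$file"
}
```

A CI step could then assert, for example, that `get_config_image .github/configs/nvidia-master.yaml kimik2.5-int4-b300-vllm` matches the image named in the PR body before the changelog entry is accepted.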