Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (NVIDIA srt-slurm PR #69)#1137
Oseltamivir wants to merge 2 commits into main
Conversation
New config dsv4-fp4-gb200-sglang covers SGLang aggregated serving on GB200 (TP=8 across 2 nodes) with two recipes from NVIDIA srt-slurm PR #69: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, and agg-2n-nomtp for conc 128-1024. Adds new framework value "sglang" (no Dynamo frontend) and the dsv4 model-prefix to runners/launch_gb200-nv.sh. For this framework the runner clones https://github.com/YAMY1234/srt-slurm-nv (PR #69 source fork), pinned at commit da535e87 for reproducibility, instead of NVIDIA/srt-slurm. Also exposes the dsv4-grace-blackwell container alias in srtslurm.yaml so the upstream recipes resolve cleanly. Image: lmsysorg/sglang:deepseek-v4-grace-blackwell Model: deepseek-ai/DeepSeek-V4-Pro at /mnt/lustre01/users/sa-shared/DeepSeek-V4-Pro
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- isl: 1024
  osl: 1024
  search-space:
    # Low-latency: TP=8 + EAGLE 3/4 speculative decoding (smaller batches,
    # better TPOT). Recipe targets the low-conc end of the curve.
    - conc-list: [1, 2, 4, 8, 16, 32, 64]
      prefill:
        num-worker: 1
        tp: 8
        ep: 1
        dp-attn: false
      additional-settings:
        # https://github.com/NVIDIA/srt-slurm/pull/69/files#diff-recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml
        - "CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml"
```
🔴 The low-latency search-space entry (conc 1-64) references agg-2n-low-latency.yaml, which this PR's own description and the adjacent YAML comment describe as using EAGLE 3/4 speculative decoding, but the entry omits spec-decoding: "mtp" and therefore defaults to "none". Because SPEC_DECODING is plumbed through the runner into the result metadata (utils/process_result.py line 53), every EAGLE run from this config will be mislabeled as non-speculative in downstream dashboards, and it will be indistinguishable from the agg-2n-nomtp entry in any aggregation that keys on spec-decoding. Fix: add spec-decoding: "mtp" to the conc 1-64 entry at lines 7452-7465, matching the convention used by every other EAGLE/MTP config in this file.
Extended reasoning...
What the bug is
The new dsv4-fp4-gb200-sglang config in .github/configs/nvidia-master.yaml has two search-space entries for the 1k/1k seq-len. The first entry (conc [1, 2, 4, 8, 16, 32, 64]) points at recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml. This PR's own description says agg-2n-low-latency.yaml — TP=8 + EAGLE 3/4 speculative decoding, the adjacent YAML comment at line 7450 says Low-latency: TP=8 + EAGLE 3/4 speculative decoding, and the perf-changelog entry says agg-2n-low-latency (EAGLE 3/4 spec decoding). Despite this, the search-space entry does not set spec-decoding: "mtp".
Why the default is wrong here
Per utils/matrix_logic/validation.py:227-228, MultiNodeSearchSpaceEntry.spec_decoding is declared as Literal["mtp", "draft_model", "none"] with default="none". So the low-latency EAGLE entry silently becomes spec_decoding="none" when the matrix is generated. Every other EAGLE/MTP recipe in this same file explicitly sets spec-decoding: "mtp" (the *-trt-mtp, *-sglang-mtp, *-dynamo-sglang-mtp entries; the changelog's own description of the other MTP PRs in this file confirms this is the established convention).
Impact
utils/process_result.py is the result-metadata writer run for every benchmark point. At module load (line 27) it requires SPEC_DECODING as an env var, and at line 53 it writes 'spec_decoding': spec_decoding into the result JSON that is uploaded for dashboards:
```python
data = {
    ...
    'framework': framework,
    'precision': precision,
    'spec_decoding': spec_decoding,
    ...
}
```

With spec-decoding missing from the YAML, the generated matrix entry carries spec_decoding="none", the job's SPEC_DECODING env var is set to "none", and all seven low-latency EAGLE points (conc 1, 2, 4, 8, 16, 32, 64) get written into result metadata as non-speculative. Downstream dashboards and aggregations that key on spec_decoding will silently fold the EAGLE points in with the non-MTP entry (agg-2n-nomtp.yaml, conc 128-1024), producing a combined sweep that looks like a pure non-MTP config, so you lose the ability to see the EAGLE speedup at low concurrency.
The multi-node eval-grouping logic in utils/matrix_logic/generate_sweep_configs.py:106-114 also keys on spec-decoding, but the grouping targets only 8k1k (see target_isl, target_osl = seq_len_stoi["8k1k"] at line 51) and this config is 1k1k, so eval selection is not directly affected. The metadata-correctness impact on result labeling remains, which is the primary defect.
Proof (step-by-step)
1. `generate_sweep_configs.py` expands the config. For the conc=16 point in the first search-space entry, it calls into the validated `MultiNodeSearchSpaceEntry` model. Because `spec-decoding` is absent, Pydantic applies the default `"none"` (validation.py:227-228).
2. The entry is emitted into the matrix with `spec-decoding: "none"`.
3. The matrix row drives a GitHub Actions job; the runner sets `SPEC_DECODING=none` in the environment alongside `FRAMEWORK=sglang`, `MODEL_PREFIX=dsv4`, `CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`.
4. `srtctl apply` runs the `agg-2n-low-latency.yaml` recipe, which actually does use EAGLE 3/4 speculative decoding (per the YAMY1234/srt-slurm-nv fork linked in the diff).
5. Post-benchmark, `utils/process_result.py` reads `SPEC_DECODING="none"` and writes `{"spec_decoding": "none", ...}` into the per-run result JSON that is uploaded for dashboards.
6. The EAGLE run is now indistinguishable from a non-speculative run in downstream aggregations.
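The silent defaulting can be demonstrated with a stdlib-only stand-in for the Pydantic model. This is a sketch under stated assumptions: the function name and dict shapes here are illustrative, mirroring the described behavior of `MultiNodeSearchSpaceEntry` (validation.py:227-228), not the real model.

```python
# Minimal stand-in for the spec-decoding defaulting in validation.py.
# The real model is Pydantic with Literal["mtp", "draft_model", "none"],
# default="none"; this mirrors only that defaulting behavior.
def expand_entry(raw: dict) -> dict:
    allowed = {"mtp", "draft_model", "none"}
    spec = raw.get("spec-decoding", "none")  # absent key -> silent "none"
    if spec not in allowed:
        raise ValueError(f"invalid spec-decoding: {spec}")
    return {"conc-list": raw["conc-list"], "spec-decoding": spec}

# The low-latency EAGLE entry omits spec-decoding, so it is mislabeled:
low_latency = expand_entry({"conc-list": [1, 2, 4, 8, 16, 32, 64]})
assert low_latency["spec-decoding"] == "none"

# With the proposed fix the label is explicit:
fixed = expand_entry({"conc-list": [1, 2, 4, 8, 16, 32, 64],
                      "spec-decoding": "mtp"})
assert fixed["spec-decoding"] == "mtp"
```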
Fix
Add spec-decoding: "mtp" to the first search-space entry:
```yaml
- conc-list: [1, 2, 4, 8, 16, 32, 64]
  spec-decoding: "mtp"
  prefill:
    num-worker: 1
    ...
```

The second (throughput) entry correctly leaves it at the default, since agg-2n-nomtp.yaml has no MTP.
```yaml
- config-keys:
    - dsv4-fp4-gb200-sglang
  description:
    - "Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (1k/1k, TP=8, 2 nodes)"
    - "Recipes from YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm PR #69)"
    - "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell"
    - "Two recipes: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, agg-2n-nomtp for conc 128-1024"
    - "Runner script clones the YAMY1234 fork pinned at commit da535e87 instead of NVIDIA/srt-slurm"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
```
🔴 The new dsv4-fp4-gb200-sglang changelog entry was prepended to the top of perf-changelog.yaml (lines 1-9), but AGENTS.md line 160 explicitly requires new entries to be appended to the END of the file ("oldest at the top, newest at the bottom"). All other recent entries (#1120, #1040, #1043, etc.) correctly live at the bottom, so prepending here inverts chronological ordering for this one entry. Please move the 9-line block from the top of the file to the end.
Extended reasoning...
What the bug is
AGENTS.md line 160 states the rule plainly:
The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend.
The diff for perf-changelog.yaml in this PR is a single hunk @@ -1,3 +1,13 @@ that inserts the new dsv4-fp4-gb200-sglang entry at the very top of the file, before the pre-existing dsr1-fp8-h100-dynamo-trt/dsr1-fp8-h100-dynamo-sglang entry. That directly violates the documented rule.
Why existing code/convention doesn't prevent it
The rule is a human-enforced convention documented in AGENTS.md — there is no lint or CI check that validates ordering, so the only guard is reviewer/author attention. Every other recent PR in the repo (e.g. #1120 evals trigger, #1040 qwen atom, #1043 glm5.1 atom, #1098, #1106, #1094) lands its entry at the bottom, which is how new readers correctly infer the chronology.
Impact
Anyone scanning perf-changelog.yaml top-down to understand the evolution of configs will now see the DSV4-Pro GB200 SGLang entry (a brand-new April 2026 submission) as if it were the oldest change, predating things like 70b-fp8-*-vllm (PR #95) and gptoss-fp4-*-trt (PR #110). That misleads humans, and any future tooling that parses the file chronologically (e.g. "what changed since last quarter") would get the wrong answer.
How to fix
Move the 9-line block currently at lines 1-9 of perf-changelog.yaml to the end of the file (after the last glm5.1-fp4-mi355x-atom entry from PR #1043). The content of the entry itself is fine — only its position needs to change.
Step-by-step proof
1. Open `AGENTS.md` line 160 → confirms the rule: "appended to the END of the file — never insert in the middle or prepend."
2. Open the PR diff for `perf-changelog.yaml` → the single hunk is `@@ -1,3 +1,13 @@`, meaning the 10 added lines (9 entry + 1 blank separator) sit at the very beginning of the file. The first pre-existing entry `- config-keys: [dsr1-fp8-h100-dynamo-trt, ...]` is pushed from line 1 down to line 11.
3. Compare to PR #1120 (the most recent "evals trigger" entry in the file) and PR #1043 (`glm5.1-fp4-mi355x-atom`): both sit at the bottom of the modified file, per the rule.
4. Therefore this PR prepends while all sibling PRs appended, violating the explicit documented convention.
NVIDIA srt-slurm PR #69 recipes set `slurm.partition: gb200` (or gb300), which doesn't exist on our cluster — sbatch rejects the submission with "invalid partition specified". srtctl's config.slurm.partition takes precedence over the SLURM_PARTITION env var, so we rewrite both names to $SLURM_PARTITION in all cloned recipe YAMLs immediately after checkout.
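The post-checkout rewrite described above might look like the following sketch. The recipe path and YAML shape are illustrative assumptions (a temp file stands in for the cloned repo); the point is that `config.slurm.partition` must be rewritten in the files, since it takes precedence over the `SLURM_PARTITION` env var.

```shell
#!/usr/bin/env sh
# Sketch: rewrite upstream partition names (gb200/gb300) to our cluster's
# $SLURM_PARTITION immediately after cloning the recipes. Paths below are
# illustrative, not the runner's actual code.
set -eu
SLURM_PARTITION="${SLURM_PARTITION:-batch}"
tmpdir="$(mktemp -d)"
cat > "$tmpdir/agg-2n-low-latency.yaml" <<'EOF'
slurm:
  partition: gb200
EOF
# Rewrite both upstream partition names to the local partition.
sed -i -e "s/partition: gb200/partition: ${SLURM_PARTITION}/" \
       -e "s/partition: gb300/partition: ${SLURM_PARTITION}/" \
       "$tmpdir"/*.yaml
grep "partition: ${SLURM_PARTITION}" "$tmpdir/agg-2n-low-latency.yaml"
```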
Summary
- New config `dsv4-fp4-gb200-sglang` for SGLang aggregated serving of DeepSeek-V4-Pro on GB200 (TP=8 across 2 nodes)
- Recipes come from `YAMY1234/srt-slurm-nv:dsv4-pro-recipes`, derived from the official SGLang DeepSeek-V4 cookbook
- `agg-2n-low-latency.yaml`: TP=8 + EAGLE 3/4 speculative decoding (low-conc, optimized for TPOT)
- `agg-2n-nomtp.yaml`: TP=8, no MTP (mid-conc throughput)

How it differs from PR #1129 (vLLM disagg)
The two PRs are complementary, not competing — they cover different points on the perf surface.
Implementation notes
- New framework value `sglang` (not `dynamo-sglang`), meaning direct SGLang aggregated serving with no Dynamo frontend
- The runner clones `YAMY1234/srt-slurm-nv` pinned at commit `da535e87` instead of `NVIDIA/srt-slurm`. When PR #69 merges upstream we can switch back to `NVIDIA/srt-slurm` at that point
- Adds `dsv4-grace-blackwell` to the srtslurm.yaml `containers` section so the upstream recipes resolve their image alias
- `model.path: dsv4-pro` alias maps to `/mnt/lustre01/users/sa-shared/DeepSeek-V4-Pro` via the launch script

Sweep matrix
11 total benchmark points covering interactivity → throughput.
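As a sanity check on that count, the two conc-lists from the config enumerate as follows (recipe names are from the PR; the tuple representation is just for illustration):

```python
# 7 low-latency concurrencies (EAGLE recipe) + 4 throughput concurrencies
# (no-MTP recipe) = 11 benchmark points, matching the claim above.
low_latency_conc = [1, 2, 4, 8, 16, 32, 64]   # agg-2n-low-latency
throughput_conc = [128, 256, 512, 1024]       # agg-2n-nomtp

points = [("agg-2n-low-latency", c) for c in low_latency_conc]
points += [("agg-2n-nomtp", c) for c in throughput_conc]
assert len(points) == 11
```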
Test plan
- `generate_sweep_configs.py test-config --config-keys dsv4-fp4-gb200-sglang` expands to 2 entries with correct CONFIG_FILE references
- `bash -n runners/launch_gb200-nv.sh` passes
- Config validation (`load_config_files`) passes
- `srtctl apply` finds `recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`