Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (NVIDIA srt-slurm PR #69)#1137
Oseltamivir wants to merge 2 commits into main
Conversation
New config dsv4-fp4-gb200-sglang covers SGLang aggregated serving on GB200 (TP=8 across 2 nodes) with two recipes from NVIDIA srt-slurm PR #69: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, and agg-2n-nomtp for conc 128-1024. Adds new framework value "sglang" (no Dynamo frontend) and the dsv4 model-prefix to runners/launch_gb200-nv.sh. For this framework the runner clones https://github.com/YAMY1234/srt-slurm-nv (PR #69 source fork), pinned at commit da535e87 for reproducibility, instead of NVIDIA/srt-slurm. Also exposes the dsv4-grace-blackwell container alias in srtslurm.yaml so the upstream recipes resolve cleanly. Image: lmsysorg/sglang:deepseek-v4-grace-blackwell Model: deepseek-ai/DeepSeek-V4-Pro at /mnt/lustre01/users/sa-shared/DeepSeek-V4-Pro
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook. If they are not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you. PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
```yaml
- isl: 1024
  osl: 1024
  search-space:
    # Low-latency: TP=8 + EAGLE 3/4 speculative decoding (smaller batches,
    # better TPOT). Recipe targets the low-conc end of the curve.
    - conc-list: [1, 2, 4, 8, 16, 32, 64]
      prefill:
        num-worker: 1
        tp: 8
        ep: 1
        dp-attn: false
      additional-settings:
        # https://github.com/NVIDIA/srt-slurm/pull/69/files#diff-recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml
        - "CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml"
```
🔴 The low-latency search-space entry (conc 1-64) references agg-2n-low-latency.yaml, which this PR's own description and the adjacent YAML comment describe as using EAGLE 3/4 speculative decoding, but the entry omits spec-decoding: "mtp" and therefore defaults to "none". Because SPEC_DECODING is plumbed through the runner into the result metadata (utils/process_result.py line 53), every EAGLE run from this config will be mislabeled as non-speculative in downstream dashboards, and it will be indistinguishable from the agg-2n-nomtp entry in any aggregation that keys on spec-decoding. Fix: add spec-decoding: "mtp" to the conc 1-64 entry at lines 7452-7465, matching the convention used by every other EAGLE/MTP config in this file.
Extended reasoning...
What the bug is
The new dsv4-fp4-gb200-sglang config in .github/configs/nvidia-master.yaml has two search-space entries for the 1k/1k seq-len. The first entry (conc [1, 2, 4, 8, 16, 32, 64]) points at recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml. This PR's own description says agg-2n-low-latency.yaml — TP=8 + EAGLE 3/4 speculative decoding, the adjacent YAML comment at line 7450 says Low-latency: TP=8 + EAGLE 3/4 speculative decoding, and the perf-changelog entry says agg-2n-low-latency (EAGLE 3/4 spec decoding). Despite this, the search-space entry does not set spec-decoding: "mtp".
Why the default is wrong here
Per utils/matrix_logic/validation.py:227-228, MultiNodeSearchSpaceEntry.spec_decoding is declared as Literal["mtp", "draft_model", "none"] with default="none". So the low-latency EAGLE entry silently becomes spec_decoding="none" when the matrix is generated. Every other EAGLE/MTP recipe in this same file explicitly sets spec-decoding: "mtp" (the *-trt-mtp, *-sglang-mtp, *-dynamo-sglang-mtp entries; the changelog's own description of the other MTP PRs in this file confirms this is the established convention).
Impact
utils/process_result.py is the result-metadata writer run for every benchmark point. At module load (line 27) it requires SPEC_DECODING as an env var, and at line 53 it writes 'spec_decoding': spec_decoding into the result JSON that is uploaded for dashboards:
```python
data = {
    ...
    'framework': framework,
    'precision': precision,
    'spec_decoding': spec_decoding,
    ...
}
```

With spec-decoding missing from the YAML, the generated matrix entry carries spec_decoding="none", the job's SPEC_DECODING env var is set to "none", and all seven low-latency EAGLE points (conc 1, 2, 4, 8, 16, 32, 64) get written into result metadata as non-speculative. Downstream dashboards and aggregations that key on spec_decoding will silently fold the EAGLE points in with the non-MTP entry (agg-2n-nomtp.yaml, conc 128-1024), producing a combined sweep that looks like a pure non-MTP config, so you lose the ability to see the EAGLE speedup at low concurrency.
The multi-node eval-grouping logic in utils/matrix_logic/generate_sweep_configs.py:106-114 also keys on spec-decoding, but the grouping targets only 8k1k (see target_isl, target_osl = seq_len_stoi["8k1k"] at line 51) and this config is 1k1k, so eval selection is not directly affected. The metadata-correctness impact on result labeling remains, which is the primary defect.
Proof (step-by-step)
1. `generate_sweep_configs.py` expands the config. For the conc=16 point in the first search-space entry, it calls into the validated `MultiNodeSearchSpaceEntry` model. Because `spec-decoding` is absent, Pydantic applies the default `"none"` (validation.py:227-228).
2. The entry is emitted into the matrix with `spec-decoding: "none"`.
3. The matrix row drives a GitHub Actions job; the runner sets `SPEC_DECODING=none` in the environment alongside `FRAMEWORK=sglang`, `MODEL_PREFIX=dsv4`, `CONFIG_FILE=recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`.
4. `srtctl apply` runs the `agg-2n-low-latency.yaml` recipe, which actually does use EAGLE 3/4 speculative decoding (per the YAMY1234/srt-slurm-nv fork linked in the diff).
5. Post-benchmark, `utils/process_result.py` reads `SPEC_DECODING="none"` and writes `{"spec_decoding": "none", ...}` into the per-run result JSON that is uploaded for dashboards.
6. The EAGLE run is now indistinguishable from a non-speculative run in downstream aggregations.
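The silent defaulting can be demonstrated with a stdlib-only stand-in for the Pydantic model. This is a sketch under stated assumptions: the function name and dict shapes here are illustrative, mirroring the described behavior of `MultiNodeSearchSpaceEntry` (validation.py:227-228), not the real model.

```python
# Minimal stand-in for the spec-decoding defaulting in validation.py.
# The real model is Pydantic with Literal["mtp", "draft_model", "none"],
# default="none"; this mirrors only that defaulting behavior.
def expand_entry(raw: dict) -> dict:
    allowed = {"mtp", "draft_model", "none"}
    spec = raw.get("spec-decoding", "none")  # absent key -> silent "none"
    if spec not in allowed:
        raise ValueError(f"invalid spec-decoding: {spec}")
    return {"conc-list": raw["conc-list"], "spec-decoding": spec}

# The low-latency EAGLE entry omits spec-decoding, so it is mislabeled:
low_latency = expand_entry({"conc-list": [1, 2, 4, 8, 16, 32, 64]})
assert low_latency["spec-decoding"] == "none"

# With the proposed fix the label is explicit:
fixed = expand_entry({"conc-list": [1, 2, 4, 8, 16, 32, 64],
                      "spec-decoding": "mtp"})
assert fixed["spec-decoding"] == "mtp"
```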
Fix
Add spec-decoding: "mtp" to the first search-space entry:
```yaml
- conc-list: [1, 2, 4, 8, 16, 32, 64]
  spec-decoding: "mtp"
  prefill:
    num-worker: 1
    ...
```

The second (throughput) entry correctly leaves it at the default, since agg-2n-nomtp.yaml has no MTP.
```yaml
- config-keys:
    - dsv4-fp4-gb200-sglang
  description:
    - "Add DeepSeek-V4-Pro SGLang aggregated GB200 benchmarks (1k/1k, TP=8, 2 nodes)"
    - "Recipes from YAMY1234/srt-slurm-nv:dsv4-pro-recipes (NVIDIA srt-slurm PR #69)"
    - "Image: lmsysorg/sglang:deepseek-v4-grace-blackwell"
    - "Two recipes: agg-2n-low-latency (EAGLE 3/4 spec decoding) for conc 1-64, agg-2n-nomtp for conc 128-1024"
    - "Runner script clones the YAMY1234 fork pinned at commit da535e87 instead of NVIDIA/srt-slurm"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
```
🔴 The new dsv4-fp4-gb200-sglang changelog entry was prepended to the top of perf-changelog.yaml (lines 1-9), but AGENTS.md line 160 explicitly requires new entries to be appended to the END of the file ("oldest at the top, newest at the bottom"). All other recent entries (#1120, #1040, #1043, etc.) correctly live at the bottom, so prepending here inverts chronological ordering for this one entry. Please move the 9-line block from the top of the file to the end.
Extended reasoning...
What the bug is
AGENTS.md line 160 states the rule plainly:
The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend.
The diff for perf-changelog.yaml in this PR is a single hunk @@ -1,3 +1,13 @@ that inserts the new dsv4-fp4-gb200-sglang entry at the very top of the file, before the pre-existing dsr1-fp8-h100-dynamo-trt/dsr1-fp8-h100-dynamo-sglang entry. That directly violates the documented rule.
Why existing code/convention doesn't prevent it
The rule is a human-enforced convention documented in AGENTS.md — there is no lint or CI check that validates ordering, so the only guard is reviewer/author attention. Every other recent PR in the repo (e.g. #1120 evals trigger, #1040 qwen atom, #1043 glm5.1 atom, #1098, #1106, #1094) lands its entry at the bottom, which is how new readers correctly infer the chronology.
Impact
Anyone scanning perf-changelog.yaml top-down to understand the evolution of configs will now see the DSV4-Pro GB200 SGLang entry (a brand-new April 2026 submission) as if it were the oldest change, predating things like 70b-fp8-*-vllm (PR #95) and gptoss-fp4-*-trt (PR #110). That misleads humans, and any future tooling that parses the file chronologically (e.g. "what changed since last quarter") would get the wrong answer.
How to fix
Move the 9-line block currently at lines 1-9 of perf-changelog.yaml to the end of the file (after the last glm5.1-fp4-mi355x-atom entry from PR #1043). The content of the entry itself is fine — only its position needs to change.
Step-by-step proof
1. Open `AGENTS.md` line 160 → confirms the rule: "appended to the END of the file — never insert in the middle or prepend."
2. Open the PR diff for `perf-changelog.yaml` → the single hunk is `@@ -1,3 +1,13 @@`, meaning the 10 added lines (9 entry + 1 blank separator) sit at the very beginning of the file. The first pre-existing entry `- config-keys: [dsr1-fp8-h100-dynamo-trt, ...]` is pushed from line 1 down to line 11.
3. Compare to PR #1120 (the most recent "evals trigger" entry in the file) and PR #1043 (`glm5.1-fp4-mi355x-atom`): both sit at the bottom of the modified file, per the rule.
4. Therefore this PR prepends while all sibling PRs appended, violating the explicit documented convention.
NVIDIA srt-slurm PR #69 recipes set `slurm.partition: gb200` (or gb300), which doesn't exist on our cluster — sbatch rejects the submission with "invalid partition specified". srtctl's config.slurm.partition takes precedence over the SLURM_PARTITION env var, so we rewrite both names to $SLURM_PARTITION in all cloned recipe YAMLs immediately after checkout.
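The post-checkout rewrite described above might look like the following sketch. The recipe path and YAML shape are illustrative assumptions (a temp file stands in for the cloned repo); the point is that `config.slurm.partition` must be rewritten in the files, since it takes precedence over the `SLURM_PARTITION` env var.

```shell
#!/usr/bin/env sh
# Sketch: rewrite upstream partition names (gb200/gb300) to our cluster's
# $SLURM_PARTITION immediately after cloning the recipes. Paths below are
# illustrative, not the runner's actual code.
set -eu
SLURM_PARTITION="${SLURM_PARTITION:-batch}"
tmpdir="$(mktemp -d)"
cat > "$tmpdir/agg-2n-low-latency.yaml" <<'EOF'
slurm:
  partition: gb200
EOF
# Rewrite both upstream partition names to the local partition.
sed -i -e "s/partition: gb200/partition: ${SLURM_PARTITION}/" \
       -e "s/partition: gb300/partition: ${SLURM_PARTITION}/" \
       "$tmpdir"/*.yaml
grep "partition: ${SLURM_PARTITION}" "$tmpdir/agg-2n-low-latency.yaml"
```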
Summary
- New config `dsv4-fp4-gb200-sglang` for SGLang aggregated serving of DeepSeek-V4-Pro on GB200 (TP=8 across 2 nodes)
- Recipes come from `YAMY1234/srt-slurm-nv:dsv4-pro-recipes`, derived from the official SGLang DeepSeek-V4 cookbook
- `agg-2n-low-latency.yaml`: TP=8 + EAGLE 3/4 speculative decoding (low-conc, optimized for TPOT)
- `agg-2n-nomtp.yaml`: TP=8, no MTP (mid-conc throughput)

How it differs from PR #1129 (vLLM disagg)
The two PRs are complementary, not competing — they cover different points on the perf surface.
Implementation notes
- New framework value `sglang` (not `dynamo-sglang`), meaning direct SGLang aggregated serving with no Dynamo frontend
- The runner clones `YAMY1234/srt-slurm-nv` pinned at commit `da535e87` instead of `NVIDIA/srt-slurm`. When PR #69 merges upstream we can switch back to `NVIDIA/srt-slurm` at that point
- Adds `dsv4-grace-blackwell` to the srtslurm.yaml `containers` section so the upstream recipes resolve their image alias
- `model.path: dsv4-pro` alias maps to `/mnt/lustre01/users/sa-shared/DeepSeek-V4-Pro` via the launch script

Sweep matrix
11 total benchmark points covering interactivity → throughput.
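As a sanity check on that count, the two conc-lists from the config enumerate as follows (recipe names are from the PR; the tuple representation is just for illustration):

```python
# 7 low-latency concurrencies (EAGLE recipe) + 4 throughput concurrencies
# (no-MTP recipe) = 11 benchmark points, matching the claim above.
low_latency_conc = [1, 2, 4, 8, 16, 32, 64]   # agg-2n-low-latency
throughput_conc = [128, 256, 512, 1024]       # agg-2n-nomtp

points = [("agg-2n-low-latency", c) for c in low_latency_conc]
points += [("agg-2n-nomtp", c) for c in throughput_conc]
assert len(points) == 11
```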
Test plan
- `generate_sweep_configs.py test-config --config-keys dsv4-fp4-gb200-sglang` expands to 2 entries with correct CONFIG_FILE references
- `bash -n runners/launch_gb200-nv.sh` passes
- Config validation (`load_config_files`) passes
- `srtctl apply` finds `recipes/gb200-fp4/1k1k-dsv4/agg-2n-low-latency.yaml`