Adds the DeepSeek-V4-Flash-FP8 H200 SGLang recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4. Server launch follows the cookbook command (--tp N, --moe-runner-backend flashinfer_mxfp4, --chunked-prefill-size 4096, --disable-flashinfer-autotune, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0, SGLANG_DSV4_FP4_EXPERTS=0). Speculative decoding omitted and --disable-radix-cache added for the no-spec / no-prefix-cache baseline. Also applies the same /workspace mount workaround to the H200 runners (launch_h200-cw.sh and launch_h200-nb.sh): the deepseek-v4-hopper image installs sglang editable under /workspace/sglang/python, which our bind-mount would mask. Mount at /ix for this image only. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook. If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work! Thank you.

PR authors are responsible for ensuring that all GitHub Action jobs fully pass after merging. Failures are often just flakes, and re-running the failed jobs will fix them. If you re-run failed jobs, you are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get a PR approval from the respective companies' CODEOWNERS before requesting a review from core maintainers. If additional help is needed, PR authors can reach out to core maintainers over Slack.
# TODO(Cam): lmsysorg/sglang:deepseek-v4-hopper installs sglang editable at
# /workspace/sglang/python (prior sglang tags used /sgl-workspace/sglang), so
# the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and
# breaks `import sglang`. Mount this one image at /ix instead; drop the
# conditional once the image stops installing editable under /workspace.
if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then
    CONTAINER_MOUNT_DIR=/ix
else
    CONTAINER_MOUNT_DIR=/workspace
fi
🔴 The /ix mount workaround is applied to launch_h200-cw.sh and launch_h200-nb.sh but not to runners/launch_h200-dgxc-slurm.sh. Per .github/configs/runners.yaml, 14 of the 18 h200 pool runners are h200-dgxc-slurm_*, so the new dsv4-fp8-h200-sglang config (declared as runner: h200) will most often be scheduled onto the unfixed launcher, where /workspace bind-mount will mask /workspace/sglang/python and import sglang will fail — the exact failure this PR is trying to prevent. Apply the same conditional CONTAINER_MOUNT_DIR=/ix logic to the single-node else-branch (lines 289-295) of runners/launch_h200-dgxc-slurm.sh.
Extended reasoning...
The bug
This PR adds a conditional /ix mount for the lmsysorg/sglang:deepseek-v4-hopper image to two of the three H200 launchers (launch_h200-cw.sh, launch_h200-nb.sh) but leaves the third — runners/launch_h200-dgxc-slurm.sh — unpatched. Its single-node else-branch still hardcodes:
--container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE
--container-workdir=/workspace/
(see runners/launch_h200-dgxc-slurm.sh lines 291 and 293, inside the else branch that starts at line 262).
Why it matters: the majority of h200 runners hit the unfixed launcher
From .github/configs/runners.yaml the h200 pool (lines 29-47) is:
- 2× h200-cw_* (patched by this PR)
- 2× h200-nb_* (patched by this PR)
- 14× h200-dgxc-slurm_* (not patched)
That is 14/18 ≈ 78% of the pool. The new dsv4-fp8-h200-sglang entry in .github/configs/nvidia-master.yaml declares runner: h200 (not h200-multinode), so it is schedulable onto any of these 18 runners.
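The pool share quoted above can be reproduced with a few lines of shell arithmetic. This is only a sketch: the three counts are the ones stated in the review, and the variable names are made up for illustration.

```shell
# Runner counts as stated in the review (from .github/configs/runners.yaml).
cw=2
nb=2
slurm=14

total=$((cw + nb + slurm))

# Integer percentage, rounded to nearest by adding half the divisor first.
pct=$(( (slurm * 100 + total / 2) / total ))

echo "unpatched share: ${slurm}/${total} = ${pct}%"
```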
How it triggers
.github/workflows/benchmark-tmpl.yml:154 selects the launcher via bash ./runners/launch_${RUNNER_NAME%%_*}.sh. So a runner labeled h200-dgxc-slurm_7 executes runners/launch_h200-dgxc-slurm.sh. The new config is single-node (multinode: false), so the workflow takes the else branch (single-node path, line 262 onward) which hardcodes /workspace.
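To make the launcher selection concrete, the sketch below applies the same `${RUNNER_NAME%%_*}` expansion to two example labels. The function name `launcher_for` and the label `h200-cw_0` are illustrative assumptions; only the expansion itself and the `h200-dgxc-slurm_7` example come from the review.

```shell
# Hypothetical helper: maps a runner label to its launcher script using the
# ${RUNNER_NAME%%_*} expansion quoted from benchmark-tmpl.yml.
launcher_for() {
  runner_name="$1"
  # %%_* removes the longest suffix that starts with an underscore,
  # i.e. everything from the first "_" onward.
  echo "./runners/launch_${runner_name%%_*}.sh"
}

launcher_for "h200-dgxc-slurm_7"   # label used in the review
launcher_for "h200-cw_0"           # illustrative label, assumed
```

Because the suffix after the underscore is discarded, every `h200-dgxc-slurm_*` runner resolves to the same launcher script.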
Step-by-step proof
- GitHub Actions dispatches the dsv4-fp8-h200-sglang job with runner: h200.
- The scheduler picks one of the 18 pool runners; 14 of 18 are h200-dgxc-slurm_N.
- benchmark-tmpl.yml invokes runners/launch_h200-dgxc-slurm.sh.
- IS_MULTINODE is not true (config declares multinode: false), so execution enters the else branch at line 262.
- srun runs with --container-mounts=$GITHUB_WORKSPACE:/workspace/ and --container-workdir=/workspace/ (lines 291, 293).
- The lmsysorg/sglang:deepseek-v4-hopper image installs the editable sglang at /workspace/sglang/python, but the bind-mount has masked that path with $GITHUB_WORKSPACE contents.
- benchmarks/single_node/dsv4_fp8_h200.sh runs python3 -m sglang.launch_server ..., which errors with ModuleNotFoundError: No module named 'sglang' (the exact failure the PR's own TODO comment is guarding against).
Why existing code doesn't prevent it
The two launchers that were patched added a conditional CONTAINER_MOUNT_DIR at their top-level; the dgxc-slurm variant has no such conditional, and still literally writes /workspace in both the --container-mounts and --container-workdir flags of the single-node srun. Nothing else in the launcher rewrites these paths based on image name.
How to fix
Apply the same two-step fix the PR already made to the other launchers, to the single-node branch of runners/launch_h200-dgxc-slurm.sh:
- Near the top of the file (or inside the else-branch before the srun), add:

      if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then
          CONTAINER_MOUNT_DIR=/ix
      else
          CONTAINER_MOUNT_DIR=/workspace
      fi

- Change lines 291 and 293 from /workspace/ to $CONTAINER_MOUNT_DIR (matching the pattern already used in launch_h200-cw.sh:53,55 and launch_h200-nb.sh:26,29).
This is a purely mechanical fix that mirrors the existing two-launcher patch and resolves the failure on the majority of the h200 pool.
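A minimal sketch of what the patched single-node path could look like, assuming the launcher keeps the two srun flags quoted earlier. The helper name `container_mount_dir` and the fallback values for `IMAGE` and `GITHUB_WORKSPACE` are hypothetical; the real launcher receives both from the workflow, and its other mounts and flags are omitted here. The image test is written with `case` rather than `[[ ]]` so it also runs under plain sh; the glob match is the same.

```shell
# Hypothetical helper wrapping the PR's image-name conditional.
container_mount_dir() {
  case "$1" in
    *deepseek-v4-hopper*) echo /ix ;;         # image installs sglang under /workspace
    *)                    echo /workspace ;;  # default mount point
  esac
}

# Illustrative fallbacks only (assumed); the workflow normally sets both.
IMAGE="${IMAGE:-lmsysorg/sglang:deepseek-v4-hopper}"
GITHUB_WORKSPACE="${GITHUB_WORKSPACE:-/tmp/gh-workspace}"

CONTAINER_MOUNT_DIR="$(container_mount_dir "$IMAGE")"

# Only the two affected srun flags are built here.
MOUNTS_FLAG="--container-mounts=${GITHUB_WORKSPACE}:${CONTAINER_MOUNT_DIR}"
WORKDIR_FLAG="--container-workdir=${CONTAINER_MOUNT_DIR}"

printf '%s\n' "$MOUNTS_FLAG" "$WORKDIR_FLAG"
```

Keeping the decision in one function means the mount and workdir flags cannot drift apart, which is the failure mode of the hardcoded /workspace/ literals.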
- config-keys:
    - dsv4-fp8-h200-sglang
  description:
    - "Add DeepSeek-V4-Flash-FP8 single-node H200 SGLang benchmark (TP4)"
    - "Container: lmsysorg/sglang:deepseek-v4-hopper"
    - "Model: sgl-project/DeepSeek-V4-Flash-FP8"
    - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
🟡 The new dsv4-fp8-h200-sglang entry was prepended at lines 1-9 of perf-changelog.yaml, but AGENTS.md requires new entries to be appended to the END of the file. Please move this entry to the bottom of the file, alongside the other recent entries (e.g., #1043, #1120).
Extended reasoning...
What the bug is
AGENTS.md (line 160) contains an explicit, unambiguous rule for perf-changelog.yaml:
The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend.
This PR's diff header @@ -1,3 +1,13 @@ shows that the new dsv4-fp8-h200-sglang entry was inserted at the top of perf-changelog.yaml (lines 1-9), immediately before the previous first entry (dsr1-fp8-h100-dynamo-trt / dsr1-fp8-h100-dynamo-sglang). That directly violates the documented convention.
Why existing code doesn't prevent it
perf-changelog.yaml is a plain YAML sequence, so order is stylistic/documentary rather than functional — process_changelog.py will still pick up the entry no matter where it sits. There is no lint or CI check that enforces the append-only convention; it relies on the rule in AGENTS.md.
Why this is the right interpretation (convention is still active)
Scanning the end of the modified perf-changelog.yaml, the most recent entries are all properly appended at the bottom:
- PR #1040 ([AMD/ROCM] atom qwen fp8/fp8_mtp3 on mi355x): at the end
- PR #1043 ([AMD/ROCM] atom glm5.1 fp4 on mi355x): at the end
- PR #1120 (trigger H100 multinode evals): near the end
- PRs #1106, #1107, #1068, #1069 ([NV] minimaxm2.5 fp8/fp4 b300 and b200 vllm updates): all clustered near the end
So the convention is still being actively followed by other contributors. The prepend in this PR is an outlier.
Proof (step-by-step)
- Open AGENTS.md at line 160: the rule says entries "MUST be appended to the END of the file — never insert in the middle or prepend."
- Open the PR diff for perf-changelog.yaml: the hunk header is @@ -1,3 +1,13 @@, meaning the 10 new lines start at line 1 of the new file, i.e. the top.
- Look at the current tail of perf-changelog.yaml: the newest pre-existing entry (PR #1043, glm5.1-fp4-mi355x-atom) sits there, confirming the append convention is still in force.
- Therefore this PR prepends rather than appends, in direct contradiction of AGENTS.md.
How to fix
Move the new entry block (the 10 lines starting with - config-keys: / dsv4-fp8-h200-sglang / description: / pr-link:) from lines 1-9 to the end of perf-changelog.yaml, after the glm5.1-fp4-mi355x-atom entry (PR #1043). Also update the pr-link: from TBD to the actual PR URL (.../pull/1136) while you're in there.
Impact
Functionally harmless — process_changelog.py will still process the entry correctly. But it's a documented-convention violation that makes the "newest at the bottom" ordering no longer reliable for readers or tooling that assumes chronological order (e.g. quick tail inspections). Hence nit severity.
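Since nothing currently enforces the append-only rule, a small check along the following lines could catch a prepended entry. This is a sketch, not part of the repository: the function name `key_in_last_entry` is hypothetical, and it assumes every changelog entry starts with a line beginning `- config-keys:`, as in the hunk shown in this PR.

```shell
# Hypothetical check: succeeds only if KEY is listed in the LAST top-level
# entry of FILE. Assumes each entry begins with a "- config-keys:" line.
key_in_last_entry() {
  file="$1"
  key="$2"

  # Line number of the last entry header in the file.
  start="$(grep -n '^- config-keys:' "$file" | tail -n 1 | cut -d: -f1)"
  [ -n "$start" ] || return 1

  # From that header to end of file, look for a list item equal to KEY.
  tail -n "+${start}" "$file" | awk -v k="$key" '
    {
      sub(/^[ \t]*- /, "")   # strip leading list marker
      sub(/[ \t]+$/, "")     # strip trailing whitespace
      if ($0 == k) found = 1
    }
    END { exit !found }
  '
}

# Example call, guarded so the snippet is safe to run outside the repo.
if [ -f perf-changelog.yaml ]; then
  key_in_last_entry perf-changelog.yaml dsv4-fp8-h200-sglang \
    || echo "dsv4-fp8-h200-sglang is not in the last changelog entry"
fi
```

The comparison is an exact string match on the list item, so a key that merely shares a prefix with the requested one does not pass.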
Summary
- Add dsv4-fp8-h200-sglang to .github/configs/nvidia-master.yaml using lmsysorg/sglang:deepseek-v4-hopper and sgl-project/DeepSeek-V4-Flash-FP8
- Add benchmarks/single_node/dsv4_fp8_h200.sh following the DeepSeek-V4-Flash-FP8 H200 SGLang recipe with prefix caching (--disable-radix-cache) and speculative decoding both disabled
- Apply the /workspace/sglang editable-install workaround to launch_h200-cw.sh and launch_h200-nb.sh: mount at /ix when the image is deepseek-v4-hopper, else /workspace (matches the treatment of deepseek-v4-blackwell in the B200 runner)
- perf-changelog.yaml entry

Test plan
- […] full-sweep-enabled label) produces results for 1k/1k (conc 4-64) and 8k/1k (conc 4-32) at tp=4
- […] /ix mount

🤖 Generated with Claude Code