Add dsv4-fp8-h200-sglang single-node config #1136

Closed
cquil11 wants to merge 1 commit into main from chore/dsv4-sgl-h200-fp8

cquil11 (Collaborator) commented Apr 24, 2026

Summary

  • Adds dsv4-fp8-h200-sglang to .github/configs/nvidia-master.yaml using lmsysorg/sglang:deepseek-v4-hopper and sgl-project/DeepSeek-V4-Flash-FP8
  • Adds benchmarks/single_node/dsv4_fp8_h200.sh following the DeepSeek-V4-Flash-FP8 H200 SGLang recipe with prefix caching (--disable-radix-cache) and speculative decoding both disabled
  • Applies the /workspace/sglang editable-install workaround to launch_h200-cw.sh and launch_h200-nb.sh: mount at /ix when the image is deepseek-v4-hopper, else /workspace (matches the treatment of deepseek-v4-blackwell in the B200 runner)
  • Adds a perf-changelog.yaml entry

Test plan

  • Full sweep (full-sweep-enabled label) produces results for 1k/1k (conc 4-64) and 8k/1k (conc 4-32) at tp=4
  • Server launch logs show sglang module resolves correctly under the /ix mount

🤖 Generated with Claude Code

Adds the DeepSeek-V4-Flash-FP8 H200 SGLang recipe from
https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4.
Server launch follows the cookbook command (--tp N,
--moe-runner-backend flashinfer_mxfp4, --chunked-prefill-size 4096,
--disable-flashinfer-autotune, SGLANG_JIT_DEEPGEMM_PRECOMPILE=0,
SGLANG_DSV4_FP4_EXPERTS=0). Speculative decoding omitted and
--disable-radix-cache added for the no-spec / no-prefix-cache baseline.

Also applies the same /workspace mount workaround to the H200 runners
(launch_h200-cw.sh and launch_h200-nb.sh): the deepseek-v4-hopper
image installs sglang editable under /workspace/sglang/python, which
our bind-mount would mask. Mount at /ix for this image only.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

@github-actions (Contributor)

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please create a PR first before we can merge your PR into the master branch. Let's ensure that the documentation is first class such that the entire ML community can benefit from your hard work! Thank you

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes and simply re-running the failed jobs will fix it. If re-running failed jobs is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

As a rule of thumb, PR authors should request a review and get approval from the respective companies' CODEOWNERS before requesting a review from core maintainers.

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment thread runners/launch_h200-cw.sh
Comment on lines +14 to +23
# TODO(Cam): lmsysorg/sglang:deepseek-v4-hopper installs sglang editable at
# /workspace/sglang/python (prior sglang tags used /sgl-workspace/sglang), so
# the default $GITHUB_WORKSPACE:/workspace/ bind-mount masks the install and
# breaks `import sglang`. Mount this one image at /ix instead; drop the
# conditional once the image stops installing editable under /workspace.
if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then
    CONTAINER_MOUNT_DIR=/ix
else
    CONTAINER_MOUNT_DIR=/workspace
fi
Contributor

🔴 The /ix mount workaround is applied to launch_h200-cw.sh and launch_h200-nb.sh but not to runners/launch_h200-dgxc-slurm.sh. Per .github/configs/runners.yaml, 14 of the 18 h200 pool runners are h200-dgxc-slurm_*, so the new dsv4-fp8-h200-sglang config (declared as runner: h200) will most often be scheduled onto the unfixed launcher. There, the /workspace bind-mount will mask /workspace/sglang/python and import sglang will fail, which is the exact failure this PR is trying to prevent. Apply the same conditional CONTAINER_MOUNT_DIR=/ix logic to the single-node else-branch (lines 289-295) of runners/launch_h200-dgxc-slurm.sh.

Extended reasoning...

The bug

This PR adds a conditional /ix mount for the lmsysorg/sglang:deepseek-v4-hopper image to two of the three H200 launchers (launch_h200-cw.sh, launch_h200-nb.sh) but leaves the third — runners/launch_h200-dgxc-slurm.sh — unpatched. Its single-node else-branch still hardcodes:

--container-mounts=$GITHUB_WORKSPACE:/workspace/,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE
--container-workdir=/workspace/

(see runners/launch_h200-dgxc-slurm.sh lines 291 and 293, inside the else branch that starts at line 262).

Why it matters: the majority of h200 runners hit the unfixed launcher

From .github/configs/runners.yaml the h200 pool (lines 29-47) is:

  • h200-cw_*: patched by this PR
  • h200-nb_*: patched by this PR
  • 14× h200-dgxc-slurm_*: not patched

That is 14/18 ≈ 78% of the pool. The new dsv4-fp8-h200-sglang entry in .github/configs/nvidia-master.yaml declares runner: h200 (not h200-multinode), so it is schedulable onto any of these 18 runners.

How it triggers

.github/workflows/benchmark-tmpl.yml:154 selects the launcher via bash ./runners/launch_${RUNNER_NAME%%_*}.sh. So a runner labeled h200-dgxc-slurm_7 executes runners/launch_h200-dgxc-slurm.sh. The new config is single-node (multinode: false), so the workflow takes the else branch (single-node path, line 262 onward) which hardcodes /workspace.
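
To make the launcher selection concrete, here is a minimal sketch of the ${RUNNER_NAME%%_*} expansion. The launcher_for helper and the h200-cw_0 runner name are hypothetical and exist only for illustration; the workflow inlines the expansion rather than calling a function.

```shell
# Hypothetical helper: map a runner name to its launcher script path using
# the same %%_* parameter expansion quoted from benchmark-tmpl.yml.
launcher_for() {
    # %%_* strips the longest suffix that starts at the first underscore,
    # so "h200-dgxc-slurm_7" becomes "h200-dgxc-slurm".
    printf './runners/launch_%s.sh\n' "${1%%_*}"
}

launcher_for "h200-dgxc-slurm_7"   # prints ./runners/launch_h200-dgxc-slurm.sh
launcher_for "h200-cw_0"           # prints ./runners/launch_h200-cw.sh
```

Because only the part before the first underscore is used, every h200-dgxc-slurm_N runner resolves to the same unpatched script.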

Step-by-step proof

  1. GitHub Actions dispatches the dsv4-fp8-h200-sglang job with runner: h200.
  2. The scheduler picks one of the 18 pool runners; 14 of 18 are h200-dgxc-slurm_N.
  3. benchmark-tmpl.yml invokes runners/launch_h200-dgxc-slurm.sh.
  4. IS_MULTINODE is not true (config declares multinode: false), so execution enters the else branch at line 262.
  5. srun runs with --container-mounts=$GITHUB_WORKSPACE:/workspace/ and --container-workdir=/workspace/ (lines 291, 293).
  6. The lmsysorg/sglang:deepseek-v4-hopper image installs the editable sglang at /workspace/sglang/python, but the bind-mount has masked that path with $GITHUB_WORKSPACE contents.
  7. benchmarks/single_node/dsv4_fp8_h200.sh runs python3 -m sglang.launch_server ..., which errors with ModuleNotFoundError: No module named 'sglang' (the exact failure the PR's own TODO comment is guarding against).
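
Steps 6 and 7 could be confirmed from inside the container with a small diagnostic along these lines. This is a sketch, not something taken from the PR: module_origin is a hypothetical helper, and it assumes python3 is on PATH in the image.

```shell
# Hypothetical diagnostic: print the file Python would load a module from,
# or fail with a message if the module cannot be found.
module_origin() {
    python3 - "$1" <<'PY'
import importlib.util
import sys

# find_spec returns None for a missing top-level module instead of raising.
spec = importlib.util.find_spec(sys.argv[1])
if spec is None:
    sys.exit(f"module {sys.argv[1]!r} not found")
print(spec.origin)
PY
}

module_origin sglang || echo "sglang is masked or not installed"
```

With the /workspace mount in place the call would be expected to fail; under the /ix mount it should print a path below /workspace/sglang/python.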

Why existing code doesn't prevent it

The two launchers that were patched added a conditional CONTAINER_MOUNT_DIR at their top-level; the dgxc-slurm variant has no such conditional, and still literally writes /workspace in both the --container-mounts and --container-workdir flags of the single-node srun. Nothing else in the launcher rewrites these paths based on image name.

How to fix

Apply the same two-step fix the PR already made to the other launchers, to the single-node branch of runners/launch_h200-dgxc-slurm.sh:

  1. Near the top of the file (or inside the else-branch before the srun), add:
    if [[ "$IMAGE" == *deepseek-v4-hopper* ]]; then
        CONTAINER_MOUNT_DIR=/ix
    else
        CONTAINER_MOUNT_DIR=/workspace
    fi
  2. Change lines 291 and 293 from /workspace/ to $CONTAINER_MOUNT_DIR (matching the pattern already used in launch_h200-cw.sh:53,55 and launch_h200-nb.sh:26,29).

This is a purely mechanical fix that mirrors the existing two-launcher patch and resolves the failure on the majority of the h200 pool.
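
As a rough illustration of the two steps combined, the flags passed to srun could be built as below. The helper names mount_dir_for and container_flags and the example paths are assumptions made for this sketch; the real launcher sets the variable inline and passes the flags directly.

```shell
# Sketch only: mount_dir_for and container_flags are hypothetical helpers.
mount_dir_for() {
    # Same decision as the [[ "$IMAGE" == *deepseek-v4-hopper* ]] test,
    # written with case so it also runs under plain sh.
    case "$1" in
        *deepseek-v4-hopper*) echo /ix ;;
        *) echo /workspace ;;
    esac
}

container_flags() {
    # $1 = image name. GITHUB_WORKSPACE, HF_HUB_CACHE_MOUNT and HF_HUB_CACHE
    # are read from the environment, as in the existing srun call.
    mount_dir=$(mount_dir_for "$1")
    printf '%s\n' \
        "--container-mounts=$GITHUB_WORKSPACE:$mount_dir,$HF_HUB_CACHE_MOUNT:$HF_HUB_CACHE" \
        "--container-workdir=$mount_dir"
}

# Example values, assumed for illustration only.
GITHUB_WORKSPACE=/home/runner/work/repo
HF_HUB_CACHE_MOUNT=/mnt/hf_cache
HF_HUB_CACHE=/hf_cache
container_flags "lmsysorg/sglang:deepseek-v4-hopper"
```

For the deepseek-v4-hopper image both flags point at /ix; any other image keeps /workspace, so existing configs are unaffected.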

Comment thread perf-changelog.yaml
Comment on lines +1 to +9
- config-keys:
    - dsv4-fp8-h200-sglang
  description:
    - "Add DeepSeek-V4-Flash-FP8 single-node H200 SGLang benchmark (TP4)"
    - "Container: lmsysorg/sglang:deepseek-v4-hopper"
    - "Model: sgl-project/DeepSeek-V4-Flash-FP8"
    - "Recipe from https://docs.sglang.io/cookbook/autoregressive/DeepSeek/DeepSeek-V4"
    - "Prefix caching and speculative decoding disabled for baseline numbers"
  pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/TBD
Contributor

🟡 The new dsv4-fp8-h200-sglang entry was prepended at lines 1-9 of perf-changelog.yaml, but AGENTS.md requires new entries to be appended to the END of the file. Please move this entry to the bottom of the file, alongside the other recent entries (e.g., #1043, #1120).

Extended reasoning...

What the bug is

AGENTS.md (line 160) contains an explicit, unambiguous rule for perf-changelog.yaml:

The file is read in chronological order: oldest at the top, newest at the bottom. New entries MUST be appended to the END of the file — never insert in the middle or prepend.

This PR's diff header @@ -1,3 +1,13 @@ shows that the new dsv4-fp8-h200-sglang entry was inserted at the top of perf-changelog.yaml (lines 1-9), immediately before the previous first entry (dsr1-fp8-h100-dynamo-trt / dsr1-fp8-h100-dynamo-sglang). That directly violates the documented convention.

Why existing code doesn't prevent it

perf-changelog.yaml is a plain YAML sequence, so order is stylistic/documentary rather than functional — process_changelog.py will still pick up the entry no matter where it sits. There is no lint or CI check that enforces the append-only convention; it relies on the rule in AGENTS.md.

Why this is the right interpretation (convention is still active)

Scanning the end of the modified perf-changelog.yaml, the most recent entries are all properly appended at the bottom.

So the convention is still being actively followed by other contributors. The prepend in this PR is an outlier.

Proof (step-by-step)

  1. Open AGENTS.md at line 160: the rule says entries "MUST be appended to the END of the file — never insert in the middle or prepend."
  2. Open the PR diff for perf-changelog.yaml: the hunk header is @@ -1,3 +1,13 @@, meaning the 10 new lines start at line 1 of the new file — i.e. the top.
  3. Look at the current tail of perf-changelog.yaml: the newest pre-existing entry (PR #1043, "[AMD/ROCM] atom glm5.1 fp4 on mi355x", glm5.1-fp4-mi355x-atom) sits there, confirming the append convention is still in force.
  4. Therefore this PR prepends rather than appends, in direct contradiction of AGENTS.md.

How to fix

Move the new entry block (the 10 lines starting with - config-keys: / dsv4-fp8-h200-sglang / description: / pr-link:) from lines 1-9 to the end of perf-changelog.yaml, after the glm5.1-fp4-mi355x-atom entry (PR #1043). Also update the pr-link: from TBD to the actual PR URL (.../pull/1136) while you're in there.

Impact

Functionally harmless — process_changelog.py will still process the entry correctly. But it's a documented-convention violation that makes the "newest at the bottom" ordering no longer reliable for readers or tooling that assumes chronological order (e.g. quick tail inspections). Hence nit severity.
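
Since nothing currently enforces the convention, a check along the following lines could catch a prepended entry. It is only a sketch: last_entry_has_key is a hypothetical helper, and it assumes each entry begins with a "- config-keys:" line at column 0 and lists its keys as "- key" items. The YAML indentation in the sample file is likewise an assumption.

```shell
# Hypothetical check: succeed only if the LAST entry in the changelog
# lists the given config key.
last_entry_has_key() {
    # $1 = changelog path, $2 = config key
    awk -v key="$2" '
        /^- config-keys:/ { found = 0 }       # a new entry starts: reset
        {
            line = $0
            sub(/^[ \t]*-[ \t]*/, "", line)   # drop the list marker
            sub(/[ \t]+$/, "", line)          # drop trailing whitespace
            if (line == key) found = 1        # exact match on the list item
        }
        END { exit(found ? 0 : 1) }
    ' "$1"
}

# Sample file reproducing the situation in this PR: new entry placed first.
tmp=$(mktemp)
cat > "$tmp" <<'EOF'
- config-keys:
    - dsv4-fp8-h200-sglang
  description:
    - "new entry, placed first"
- config-keys:
    - glm5.1-fp4-mi355x-atom
  description:
    - "older entry"
EOF

if last_entry_has_key "$tmp" dsv4-fp8-h200-sglang; then
    echo "ok: entry is last"
else
    echo "entry is not at the end of the file"
fi
rm -f "$tmp"
```

On the sample above the check reports that the entry is not at the end of the file; after moving the block to the bottom it would pass.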
