
[sglang broken] Add MI355X config: qwen3.5-fp4-sglang-mtp#1078

Open
functionstackx wants to merge 2 commits into main from claude/add-qwen3.5-fp4-mi355x-mtp

Conversation

@functionstackx
Contributor

Summary

  • Adds qwen3.5-fp4-mi355x-sglang-mtp config mirroring the existing qwen3.5-fp4-mi355x-sglang non-MTP recipe, plus a new benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh launch script.
  • Adds EAGLE speculative decoding flags on top of the non-MTP script: --speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4.
  • Search space rows carry spec-decoding: mtp so the MI355X runner picks up the _mtp.sh variant.
  • Adds a perf-changelog.yaml entry (PR link placeholder; update after merge per AGENTS.md).
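The MTP variant's delta over the non-MTP recipe is just the four EAGLE flags listed above; a minimal sketch (`SPEC_FLAGS` is a hypothetical helper array for illustration, not a name used in the actual script):

```shell
# EAGLE speculative decoding flags added on top of the non-MTP
# sglang.launch_server invocation. SPEC_FLAGS is illustrative only.
SPEC_FLAGS=(
  --speculative-algorithm EAGLE
  --speculative-num-steps 3
  --speculative-eagle-topk 1
  --speculative-num-draft-tokens 4
)
# In the launch script these are appended to the server command, i.e.
#   python3 -m sglang.launch_server ... "${SPEC_FLAGS[@]}"
printf '%s\n' "${SPEC_FLAGS[@]}"
```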

Test plan

  • YAML parses for both master config and perf-changelog.
  • bash -n benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh — bash syntax OK.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 20 entries with spec-decoding=mtp (same sweep shape as the non-MTP config).
  • CI sweep passes on MI355X.

🤖 Generated with Claude Code

@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipes are similar to the official vLLM recipes and/or the SGLang cookbook.

If they are not, please create a PR there first before we can merge your PR into the master branch. Let's ensure that the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that after merging, all GitHub Action jobs fully pass. A lot of the time, failures are just flakes, and simply re-running the failed jobs will fix them. If re-running failed jobs is attempted, PR authors are responsible for ensuring they pass. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +53 to +63
run_benchmark_serving \
--model "$MODEL" \
--port "$PORT" \
--backend vllm \
--input-len "$ISL" \
--output-len "$OSL" \
--random-range-ratio "$RANDOM_RANGE_RATIO" \
--num-prompts "$((CONC * 10))" \
--max-concurrency "$CONC" \
--result-filename "$RESULT_FILENAME" \
--result-dir /workspace/
Contributor


🔴 The new qwen3.5_fp4_mi355x_mtp.sh script omits --use-chat-template from its run_benchmark_serving call (lines 53–63), which causes EAGLE speculative decoding acceptance rates to be artificially inflated. Every other MTP benchmark script in the repository includes this flag, and the perf-changelog (PR #647) explicitly documents: 'Without this arg, MTP acceptance rates are artificially high.' Add --use-chat-template to the run_benchmark_serving call to match all sister MTP scripts.

Extended reasoning...

What the bug is: The run_benchmark_serving call in benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh (lines 53–63) does not include --use-chat-template. This flag tells the benchmark client to format prompts through the model's chat template before sending them to the server, matching how the model is actually used in production.

The specific code path: During benchmarking, run_benchmark_serving generates synthetic requests and sends them to the SGLang server. Without --use-chat-template, raw/unformatted text is sent directly. The EAGLE draft model predicts tokens conditioned on the prefix context; when the prefix is raw text rather than properly formatted chat turns, the draft model's token predictions align with the training distribution of unformatted text, yielding artificially high acceptance rates that do not reflect real-world performance.

Why existing code doesn't prevent it: The flag is purely opt-in on the benchmark client side. The SGLang server starts successfully with the EAGLE speculative decoding flags regardless, so there is no runtime error or warning—the benchmark runs to completion and produces results that look plausible but are inflated.

Impact: The reported MTP acceptance rates and resulting throughput numbers will be higher than what users see in production deployments where chat-formatted prompts are the norm. This overstates the benefit of EAGLE speculative decoding for Qwen3.5 FP4 on MI355X.

How to fix it: Add --use-chat-template to the run_benchmark_serving call in benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh, consistent with all sister MTP scripts: qwen3.5_fp8_b200_mtp.sh (line 91), qwen3.5_fp8_b300_mtp.sh (line 77), qwen3.5_fp8_h200_mtp.sh (line 82), dsr1_fp8_b200_mtp.sh (line 113), and others.

Step-by-step proof: (1) The non-MTP script qwen3.5_fp4_mi355x.sh is the copy-source; it does not include --use-chat-template (non-MTP benchmarks don't require it as critically). (2) This new MTP script was created by copying that non-MTP script and adding EAGLE flags, but --use-chat-template was not added. (3) At benchmark time: synthetic prompts are sent as raw text → the EAGLE draft model predicts well because raw-text sequences are easier to continue without the special chat-format tokens → acceptance rate is reported as, say, 0.85 instead of a realistic 0.60. (4) The reported speedup (e.g., 2.1× vs ~1.6×) overstates the real-world benefit, misleading readers of the benchmark results.
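Assuming the reviewer's diagnosis is correct, the fix is a one-line addition to the benchmark call. A hedged sketch of the corrected invocation (arguments copied from the diff above with `--use-chat-template` appended; the `run_benchmark_serving` stub and variable values are placeholders so the sketch is self-contained — the real script sources the repo's benchmark library instead):

```shell
# Stub standing in for the helper from the repo's benchmark library.
run_benchmark_serving() { printf '%s\n' "$@"; }

# Placeholder values; the real script derives these from its config.
MODEL=demo-model PORT=8000 ISL=1024 OSL=1024 RANDOM_RANGE_RATIO=1.0
CONC=8 RESULT_FILENAME=result.json

run_benchmark_serving \
    --model "$MODEL" \
    --port "$PORT" \
    --backend vllm \
    --input-len "$ISL" \
    --output-len "$OSL" \
    --random-range-ratio "$RANDOM_RANGE_RATIO" \
    --num-prompts "$((CONC * 10))" \
    --max-concurrency "$CONC" \
    --result-filename "$RESULT_FILENAME" \
    --result-dir /workspace/ \
    --use-chat-template
```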

@functionstackx functionstackx marked this pull request as draft April 18, 2026 00:54
@functionstackx functionstackx changed the title Add MI355X config: qwen3.5-fp4-sglang-mtp [sglang broken] Add MI355X config: qwen3.5-fp4-sglang-mtp Apr 18, 2026
@functionstackx
Contributor Author

awaiting debug from @chunfangamd & @HaiShaw's team sgl-project/sglang#23113

@functionstackx functionstackx marked this pull request as ready for review April 18, 2026 01:16
functionstackx and others added 2 commits April 17, 2026 21:16
Mirrors the existing qwen3.5-fp4-mi355x-sglang non-MTP recipe and adds
EAGLE speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
via the standard spec-decoding=mtp suffix.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@functionstackx functionstackx force-pushed the claude/add-qwen3.5-fp4-mi355x-mtp branch from 6e06e9a to f728cdf Compare April 18, 2026 01:16
Comment on lines +26 to +45
if [ "${EVAL_ONLY}" = "true" ]; then
setup_eval_context
fi

# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
python3 -m sglang.launch_server --model-path=$MODEL --trust-remote-code \
--host=0.0.0.0 --port=$PORT \
--tensor-parallel-size=$TP \
--attention-backend aiter \
--mem-fraction-static $MEM_FRAC_STATIC \
--model-loader-extra-config '{"enable_multithread_load": true}' \
--watchdog-timeout 1200 \
--disable-radix-cache \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
Contributor


🔴 When EVAL_ONLY=true, qwen3.5_fp4_mi355x_mtp.sh calls setup_eval_context (lines 26–27) but never wires the resulting EVAL_MAX_MODEL_LEN into --context-length on the server launch command (lines 34–45). As a result, the server allocates KV cache for the model's full native context window (~131K tokens) instead of the eval-appropriate size. For an EAGLE MTP script already under memory pressure (FP4 plus speculative decoding), this unnecessary over-allocation can cause OOM during eval runs. Fix: add CONTEXT_LENGTH=$EVAL_MAX_MODEL_LEN after setup_eval_context and pass --context-length $CONTEXT_LENGTH to sglang.launch_server, matching every other MTP script in the repo.

Extended reasoning...

What the bug is and how it manifests

In benchmarks/single_node/qwen3.5_fp4_mi355x_mtp.sh (lines 26-27), setup_eval_context() is called when EVAL_ONLY=true. That function (benchmark_lib.sh line 640-643) computes EVAL_MAX_MODEL_LEN and exports it, with an explicit comment: "Scripts then wire $EVAL_MAX_MODEL_LEN into whichever server variable they need." The new script never performs that wiring — the sglang.launch_server invocation on lines 34-45 has no --context-length argument at all.

The specific code path and why existing code does not prevent it

When EVAL_ONLY=true the server is started fresh by this script. Without --context-length, SGLang uses the model-config default, which for Qwen3.5-397B-A17B is ~131K tokens. The KV cache is allocated proportionally to that full window even when the eval tasks only require a small fraction of it. The EAGLE speculative decoding flags (--speculative-num-steps 3, --speculative-num-draft-tokens 4) already add a draft-model memory overhead on top of the main model. Combined with FP4 quantization under --mem-fraction-static 0.8, the total memory pressure during eval is significantly higher than in the non-MTP parent script, making the unnecessary KV cache over-allocation an OOM risk that the parent did not face.

Addressing the refutation

One verifier argues that compute_eval_context_length() caps EVAL_MAX_MODEL_LEN at the model's native max, so the server launched without --context-length (which also defaults to native max) would always fit eval prompts — no truncation occurs. This is technically correct for eval correctness. However, correctness and resource efficiency are distinct concerns. The server does not know at startup that eval prompts are short; it eagerly allocates KV cache pages for the full 131K window. In a memory-constrained EAGLE MTP configuration this unnecessary allocation is the trigger for OOM. The non-MTP parent script is not a valid precedent here: it runs without speculative decoding draft buffers, so its memory budget is materially different. The benchmark_lib.sh documentation explicitly designates wiring EVAL_MAX_MODEL_LEN as a required step for scripts that call setup_eval_context.

What impact this has

When CI runs this script with EVAL_ONLY=true on MI355X hardware, the server may OOM and crash before eval begins, producing a silent failure or a misleading error. At minimum, KV cache is over-allocated relative to what eval requires, reducing available memory for concurrent eval requests and potentially degrading throughput metrics.

How to fix it

After the setup_eval_context call, set CONTEXT_LENGTH=$EVAL_MAX_MODEL_LEN, then pass --context-length $CONTEXT_LENGTH to the sglang.launch_server invocation. This matches the pattern in every other MTP benchmark script: qwen3.5_fp8_b200_mtp.sh (L51-65), qwen3.5_fp8_b300_mtp.sh (L22-27), dsr1_fp8_b200_mtp.sh (L50-53).

Step-by-step proof

  1. EVAL_ONLY=true is set; setup_eval_context runs and exports EVAL_MAX_MODEL_LEN (e.g., 8192 for an 8k eval task).
  2. The server launches without --context-length → SGLang reads the model config and defaults to 131072.
  3. SGLang pre-allocates KV cache blocks covering 131072 positions × num_heads × head_dim × FP8 bytes. With EAGLE draft buffers on top, total VRAM usage exceeds what was profiled for the benchmark config.
  4. The server process hits an OOM signal (or HIP out-of-memory) and exits.
  5. wait_for_server_ready eventually times out, and the eval does not run — or runs against a crashed server and reports errors.
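The wiring the reviewer proposes can be sketched as follows (setup_eval_context here is a stub standing in for the helper from benchmark_lib.sh, with 8192 as an illustrative eval length; the real script would source the library and keep its existing launch flags):

```shell
# Stub for the benchmark_lib.sh helper; the value is illustrative only.
setup_eval_context() { export EVAL_MAX_MODEL_LEN=8192; }

EVAL_ONLY=true
if [ "${EVAL_ONLY}" = "true" ]; then
    setup_eval_context
    CONTEXT_LENGTH=$EVAL_MAX_MODEL_LEN   # wire eval length into the server arg
fi

# ${CONTEXT_LENGTH:+...} expands to the flag only when CONTEXT_LENGTH is set,
# so non-eval runs keep the model's native context window.
echo python3 -m sglang.launch_server ${CONTEXT_LENGTH:+--context-length $CONTEXT_LENGTH}
```

With EVAL_ONLY=true this prints the launch command with `--context-length 8192`; with it unset, the flag is omitted entirely.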

@functionstackx
Contributor Author

lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260417 continues to be broken

sgl-project/sglang#23113 (comment)

