
[SGLang broken] Add MI355X config: glm5-fp4-sglang-mtp #1091

Draft
functionstackx wants to merge 1 commit into main from claude/add-glm5-fp4-mi355x-mtp

Conversation

@functionstackx
Contributor

Summary

  • Adds glm5-fp4-mi355x-sglang-mtp config + new benchmarks/single_node/glm5_fp4_mi355x_mtp.sh launch script.
  • Image: lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260417 — already ships transformers with glm_moe_dsa support, so no pip install -U transformers is needed (unlike glm5-fp8-mi355x-sglang).
  • Model: amd/GLM-5-MXFP4.
  • Launch flags per the request: --trust-remote-code, --tp $TP, --chunked-prefill-size 131072, --disable-radix-cache, --mem-fraction-static 0.85, --model-loader-extra-config '{"enable_multithread_load": true}', --watchdog-timeout 1200, --reasoning-parser glm45, --tool-call-parser glm47, plus EAGLE spec-decoding (--speculative-algorithm EAGLE --speculative-num-steps 3 --speculative-eagle-topk 1 --speculative-num-draft-tokens 4) behind SGLANG_ENABLE_SPEC_V2=1.
  • Client passes --use-chat-template per AGENTS.md for MTP.
  • Search-space: { tp: 8, conc-start: 4, conc-end: 64, spec-decoding: mtp } for 1k1k and 8k1k.
  • perf-changelog.yaml diff is append-only.
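Put together, the launch invocation described by the bullets above would look roughly like this. This is a sketch, not the script itself: the `--model-path` flag name, the `TP` default, and the `SERVER_LOG` variable are assumptions based on the flags listed and the sibling scripts.

```shell
#!/usr/bin/env bash
# Sketch of the server launch described in the summary (assumed variable names).
TP=${TP:-8}
SERVER_LOG=${SERVER_LOG:-server.log}

SGLANG_ENABLE_SPEC_V2=1 python3 -m sglang.launch_server \
  --model-path amd/GLM-5-MXFP4 \
  --trust-remote-code \
  --tp "$TP" \
  --chunked-prefill-size 131072 \
  --disable-radix-cache \
  --mem-fraction-static 0.85 \
  --model-loader-extra-config '{"enable_multithread_load": true}' \
  --watchdog-timeout 1200 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --speculative-algorithm EAGLE \
  --speculative-num-steps 3 \
  --speculative-eagle-topk 1 \
  --speculative-num-draft-tokens 4 \
  > "$SERVER_LOG" 2>&1 &
```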

Test plan

  • YAML parses for both master config and perf-changelog.
  • bash -n benchmarks/single_node/glm5_fp4_mi355x_mtp.sh — bash syntax OK.
  • git diff perf-changelog.yaml shows only additions.
  • python3 utils/matrix_logic/generate_sweep_configs.py full-sweep --config-files .github/configs/amd-master.yaml — emits 10 entries (2 ISL/OSL × 5 concurrencies) with spec-decoding=mtp.
  • CI sweep passes on MI355X.
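The 10-entry figure from the sweep check can be sanity-checked by expanding the search space by hand. A minimal sketch, assuming the concurrency sweep doubles from conc-start=4 to conc-end=64 (which matches the "5 concurrencies" count; the actual generator logic lives in utils/matrix_logic/generate_sweep_configs.py):

```shell
# Why full-sweep emits 10 entries: 2 ISL/OSL pairs x 5 doubling concurrencies.
count=0
for islosl in 1k1k 8k1k; do
  conc=4
  while [ "$conc" -le 64 ]; do
    echo "isl_osl=$islosl tp=8 concurrency=$conc spec_decoding=mtp"
    count=$((count + 1))
    conc=$((conc * 2))
  done
done
echo "total entries: $count"
```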

🤖 Generated with Claude Code

New GLM-5 MXFP4 MI355X SGLang MTP benchmark. Uses the
lmsysorg/sglang-rocm:v0.5.10rc0-rocm700-mi35x-20260417 image against
the amd/GLM-5-MXFP4 model, with TP=8 conc 4-64 for 1k1k and 8k1k.

Unlike glm5-fp8-mi355x-sglang, the image already ships transformers
with glm_moe_dsa support, so no pip install -U transformers is needed
at benchmark time.

Launch flags per the request: --trust-remote-code, --tp $TP,
--chunked-prefill-size 131072, --disable-radix-cache,
--mem-fraction-static 0.85, --model-loader-extra-config
'{"enable_multithread_load": true}', --watchdog-timeout 1200,
--reasoning-parser glm45, --tool-call-parser glm47, plus EAGLE
speculative decoding (num-steps=3, eagle-topk=1, num-draft-tokens=4)
behind SGLANG_ENABLE_SPEC_V2=1. Client passes --use-chat-template as
required by AGENTS.md for MTP.
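A hypothetical shape for the corresponding search-space entry in the master config; the key names below are guesses reconstructed from the summary, not the actual schema of .github/configs/amd-master.yaml:

```yaml
# Sketch only: actual key names and nesting may differ.
glm5-fp4-mi355x-sglang-mtp:
  tp: 8
  conc-start: 4
  conc-end: 64
  spec-decoding: mtp
  isl-osl:
    - 1k1k
    - 8k1k
```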

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

Thanks for the contribution! For vLLM & SGLang, please ensure that your recipe is similar to the official vLLM recipes and/or the SGLang cookbook.

If it is not, please open a documentation PR first before we can merge your PR into the master branch. Let's ensure the documentation is first class so that the entire ML community can benefit from your hard work. Thank you!

PR authors are responsible for ensuring that all GitHub Actions jobs fully pass after merging. Failures are often just flakes, and simply re-running the failed jobs will fix them; if a re-run is attempted, PR authors are responsible for ensuring it passes. See GitHub's docs on re-running failed jobs: https://docs.github.com/en/actions/how-tos/manage-workflow-runs/re-run-workflows-and-jobs#re-running-failed-jobs-in-a-workflow

If additional help is needed, PR authors can reach out to core maintainers over Slack.

Comment on lines +43 to +56
--watchdog-timeout 1200 \
--reasoning-parser glm45 \
--tool-call-parser glm47 \
--speculative-algorithm EAGLE \
--speculative-num-steps 3 \
--speculative-eagle-topk 1 \
--speculative-num-draft-tokens 4 \
$EVAL_CONTEXT_ARGS > $SERVER_LOG 2>&1 &

SERVER_PID=$!

# Wait for server to be ready
wait_for_server_ready --port "$PORT" --server-log "$SERVER_LOG" --server-pid "$SERVER_PID"

Contributor


🔴 The new glm5_fp4_mi355x_mtp.sh launch script is missing the --nsa-prefill-backend tilelang --nsa-decode-backend tilelang flags required by GLM-5's Native Sparse Attention architecture. Every other GLM-5 MI355X script includes these flags; without them SGLang falls back to a standard attention kernel that does not implement NSA correctly, producing silently degraded or incorrect benchmark results.

Extended reasoning...

What the bug is and how it manifests

GLM-5 uses Native Sparse Attention (NSA) as a core architectural component. SGLang requires explicit backend flags to enable the correct NSA attention kernel; without them it silently falls back to a standard dense-attention kernel, which implements a fundamentally different computation pattern and will produce incorrect attention outputs or severely degraded throughput/latency numbers.

The specific code path that triggers it

In benchmarks/single_node/glm5_fp4_mi355x_mtp.sh (lines 37–55), the python3 -m sglang.launch_server invocation includes all the expected GLM-5 flags (reasoning parser, tool-call parser, EAGLE speculative decoding, etc.) but is entirely missing --nsa-prefill-backend tilelang and --nsa-decode-backend tilelang.

Why existing code doesn't prevent it

These flags are not validated or defaulted by the framework; their absence simply causes SGLang to select its default (non-NSA) attention backend. No warning or error is emitted at startup, so the model appears to load and run normally while silently computing incorrect attention.
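This class of omission can be caught mechanically. A hypothetical check, using the fact that `grep -L` prints the names of files containing no match, so any GLM-5 launch script with no `--nsa-` flag at all gets flagged (demonstrated here on stand-in files; in the repo, run it against benchmarks/single_node/glm5_*.sh):

```shell
# Stand-in files mimicking the repo's launch scripts.
mkdir -p /tmp/nsa_demo
printf -- '--nsa-prefill-backend tilelang\n' > /tmp/nsa_demo/glm5_fp8_mi355x_mtp.sh
printf -- '--speculative-algorithm EAGLE\n'  > /tmp/nsa_demo/glm5_fp4_mi355x_mtp.sh

# grep -L lists files with zero matches, i.e. scripts missing any NSA flag.
missing=$(grep -L -e '--nsa-' /tmp/nsa_demo/glm5_*.sh)
echo "$missing"
```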

Concrete proof via comparison

  • glm5_fp8_mi355x.sh (lines 51–52): includes --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  • glm5_fp8_mi355x_mtp.sh (lines 52–53): includes --nsa-prefill-backend tilelang --nsa-decode-backend tilelang
  • glm5_fp4_b200_sglang_mtp.sh / B300 variants: include --attention-backend nsa --nsa-decode-backend trtllm --nsa-prefill-backend trtllm (trtllm backend for NVIDIA)
  • glm5_fp4_mi355x_mtp.sh (this PR): no NSA flags at all

Every GLM-5 script on every platform specifies NSA backends; this is the single exception.

Impact

Benchmark results collected with this script will reflect standard attention performance rather than NSA performance. Since NSA is a key differentiator of GLM-5, the published numbers will be misleading at best and incorrect at worst. Correctness of generated output may also be affected.

How to fix

Add --nsa-prefill-backend tilelang --nsa-decode-backend tilelang to the sglang.launch_server invocation, consistent with the existing glm5_fp8_mi355x_mtp.sh script (the direct MI355X FP8 MTP counterpart).
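Concretely, the fix is a two-line addition to the launch invocation quoted above; placement mirrors glm5_fp8_mi355x_mtp.sh as described, with the surrounding lines reproduced from the quoted snippet:

```shell
  --watchdog-timeout 1200 \
  --reasoning-parser glm45 \
  --tool-call-parser glm47 \
  --nsa-prefill-backend tilelang \
  --nsa-decode-backend tilelang \
  --speculative-algorithm EAGLE \
```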

@functionstackx functionstackx changed the title Add MI355X config: glm5-fp4-sglang-mtp [SGLang broken] Add MI355X config: glm5-fp4-sglang-mtp Apr 18, 2026
@functionstackx functionstackx marked this pull request as draft April 18, 2026 22:54
@functionstackx
Contributor Author

sgl-project/sglang#23142

