42 changes: 24 additions & 18 deletions .github/configs/amd-master.yaml
@@ -2257,15 +2257,8 @@ dsv4-fp4-mi355x-vllm-mtp:
search-space:
- { tp: 8, conc-start: 4, conc-end: 512, spec-decoding: mtp }

# Day-0 single-sequence marker for DeepSeek-V4 on ATOM (ROCm/ATOM#650).
# PR1 of the ATOM DSv4 series still uses torch sparse-attention fallbacks
# that OOM once warmup/prefill batches multiple requests; keep CONC=1 until
# the AITER sparse-attention kernel / multi-request path lands upstream.
# --enforce-eager and ATOM_USE_TRITON_MOE=1 are required on gfx950. Image is
# the standard atom0.1.2.post MI355X base (matching qwen3.5-fp8-mi355x-atom);
# the DSv4 PR is overlaid at runtime by dsv4_fp4_mi355x_atom.sh at a pinned SHA.
dsv4-fp4-mi355x-atom:
image: rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3
image: rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4_20260612
model: deepseek-ai/DeepSeek-V4-Pro
model-prefix: dsv4
runner: mi355x
@@ -2274,16 +2267,29 @@ dsv4-fp4-mi355x-atom:
multinode: false
scenarios:
fixed-seq-len:
- isl: 1024
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 64 }
- { tp: 8, ep: 1, dp-attn: true, conc-start: 64, conc-end: 1024 }
- isl: 8192
osl: 1024
search-space:
- { tp: 8, ep: 1, conc-start: 1, conc-end: 64 }
- { tp: 8, ep: 1, dp-attn: true, conc-start: 64, conc-end: 512 }
#- isl: 1024
# osl: 1024
# search-space:
# # conc4-64, TP8
# # conc128-512, DPA
# # conc1024-8192, DPA TBO
# - { tp: 8, ep: 1, conc-start: 1, conc-end: 64 }
# - { tp: 8, ep: 1, dp-attn: true, conc-start: 64, conc-end: 8192 }
#- isl: 8192
# osl: 1024
# search-space:
# # conc4-64, TP8
# # conc128, DPA
# # conc256-4096, DPA TBO
# - { tp: 8, ep: 1, conc-start: 4, conc-end: 64 }
# - { tp: 8, ep: 1, dp-attn: true, conc-start: 128, conc-end: 4096 }
- isl: 8192
osl: 1024
search-space:
# conc4-64, TP8
# conc128, DPA
# conc256-4096, DPA TBO
- { tp: 8, ep: 1, dp-attn: true, conc-start: 1024, conc-end: 1024 }

Search space not enabled

High Severity

The dsv4-fp4-mi355x-atom fixed-seq-len matrix no longer matches what the PR documents: the ISL=1024 scenario is fully commented out, and ISL=8192 runs only a single DPA point at conc=1024 instead of the documented TP8 (conc 4–64) and DPA (conc 128–1024) sweeps.

Reviewed by Cursor Bugbot for commit c3b3289.


dsv4-fp4-mi355x-atom-mtp:
image: rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.3
28 changes: 24 additions & 4 deletions benchmarks/single_node/fixed_seq_len/dsv4_fp4_mi355x_atom.sh
@@ -22,31 +22,51 @@ echo "TP: $TP, CONC: $CONC, ISL: $ISL, OSL: $OSL, EP_SIZE: $EP_SIZE, DP_ATTENTIO
SERVER_LOG=/workspace/server.log

PARALLEL_ARGS=(-tp "$TP") #TP
CUDAGRAPH_SIZES='[1,2,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256]'
if [ "$DP_ATTENTION" = "true" ]; then
if [ "$EP_SIZE" -gt 1 ]; then #DP+EP
PARALLEL_ARGS=(-tp "$TP" --enable-expert-parallel --enable-dp-attention )
else #DP+TP
PARALLEL_ARGS=(-tp "$TP" --enable-dp-attention )
else #DPA+TP
#DPA+TP+TBO
if [ "$ISL" -eq 1024 ] && [ "$OSL" -eq 1024 ] && [ "$CONC" -ge 1024 ]; then
PARALLEL_ARGS=(-tp "$TP" --enable-dp-attention --enable-tbo)
CUDAGRAPH_SIZES='[1,2,4,8,12,16,20,24,28,32,36,40,44,48,52,56,60,64,68,72,76,80,84,88,92,96,100,104,108,112,116,120,124,128,132,136,140,144,148,152,156,160,164,168,172,176,180,184,188,192,196,200,204,208,212,216,220,224,228,232,236,240,244,248,252,256,512,1024]'
elif [ "$ISL" -eq 8192 ] && [ "$OSL" -eq 1024 ] && [ "$CONC" -ge 256 ]; then
PARALLEL_ARGS=(-tp "$TP" --enable-dp-attention --enable-tbo)
else

CUDA graph sizes omit high concurrency

Medium Severity

For ISL=8192 and OSL=1024, --enable-tbo turns on when CONC≥256, but CUDAGRAPH_SIZES stays capped at 256 while the server uses --max-num-seqs equal to CONC (e.g. 1024 in the active YAML). The 1024/1024 TBO branch extends capture sizes through 1024; this path does not.

Reviewed by Cursor Bugbot for commit 7ffa976.
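
One possible way to address this, sketched under assumptions: derive the capture list from the concurrency instead of hard-coding it per branch. `build_cudagraph_sizes` is a hypothetical helper that does not exist in the script. Doubling above 256 mirrors the `512,1024` tail of the 1024/1024 TBO branch; whether ATOM handles captures at those sizes well for ISL=8192 is not confirmed here.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): build a capture-size list that
# reaches max_seqs. The base list mirrors the script: 1, 2, 4, then
# multiples of 4 from 8 through 256.
build_cudagraph_sizes() {
    local max_seqs="$1"
    local sizes=(1 2 4)
    local n

    for ((n = 8; n <= 256; n += 4)); do
        sizes+=("$n")
    done

    # Assumption: above 256, grow by doubling (512, 1024, ...), as the
    # 1024/1024 TBO branch does by hand.
    for ((n = 512; n <= max_seqs; n *= 2)); do
        sizes+=("$n")
    done

    # If max_seqs is not a power of two, add it so the list still covers it.
    local last="${sizes[${#sizes[@]}-1]}"
    if ((last < max_seqs)); then
        sizes+=("$max_seqs")
    fi

    # Join with commas and wrap in brackets, matching the script's format.
    local IFS=,
    printf '[%s]\n' "${sizes[*]}"
}
```

Calling `CUDAGRAPH_SIZES=$(build_cudagraph_sizes "$CONC")` before the server launch would give both TBO branches a list that reaches the `--max-num-seqs` value.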

PARALLEL_ARGS=(-tp "$TP" --enable-dp-attention )
fi
fi
fi

BENCHMARK_MAX_MODEL_LEN="$MAX_MODEL_LEN"

if [ "${EVAL_ONLY}" = "true" ]; then
EVAL_MAX_MODEL_LEN=$(compute_eval_context_length "$MODEL" "$BENCHMARK_MAX_MODEL_LEN")
export EVAL_MAX_MODEL_LEN
SERVE_MAX_MODEL_LEN="$EVAL_MAX_MODEL_LEN"
else
SERVE_MAX_MODEL_LEN="$BENCHMARK_MAX_MODEL_LEN"
fi
# Start GPU monitoring (power, temperature, clocks every second)
start_gpu_monitor

set -x
export ATOM_DISABLE_MMAP=true
export AITER_BF16_FP8_MOE_BOUND=0
export ATOM_MOE_GU_ITLV=1
# TODO: add --no-enable_chunked_prefill, when dsv4 prefix caching is supported
#https://github.com/ROCm/ATOM/commit/7df93a181da4d3c3250c2441c7d5e2745a03d0cd#diff-61b1ba0b8b74523530d2d5cdc739d4f3a23a43bedf69015a5235844d46e9373bL1127
python3 -m atom.entrypoints.openai_server \
--model $MODEL \
--server-port $PORT \
"${PARALLEL_ARGS[@]}" \
--kv_cache_dtype fp8 \
--trust-remote-code \
--gpu-memory-utilization 0.85 \
--no-enable_prefix_caching \
--cudagraph-capture-sizes "${CUDAGRAPH_SIZES}" \
--max-num-seqs ${CONC} \
> $SERVER_LOG 2>&1 &
#--max-model-len "$SERVE_MAX_MODEL_LEN" \

Eval context not wired

Medium Severity

This commit adds SERVE_MAX_MODEL_LEN / compute_eval_context_length handling for EVAL_ONLY, but the server launch leaves --max-model-len commented out, so the ATOM server never receives the computed context length.

Reviewed by Cursor Bugbot for commit c3b3289.
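
A minimal sketch of one way to wire the value through, assuming the ATOM server accepts `--max-model-len` as the commented-out line suggests. `build_len_args` and `LEN_ARGS` are hypothetical names, not part of the script. Collecting the flag in an array lets the launch command stay unchanged when no length was computed.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): populate LEN_ARGS with
# --max-model-len only when a serve length is available.
build_len_args() {
    local serve_len="$1"
    LEN_ARGS=()
    if [ -n "$serve_len" ]; then
        LEN_ARGS=(--max-model-len "$serve_len")
    fi
}
```

The launch could then call `build_len_args "$SERVE_MAX_MODEL_LEN"` and add `"${LEN_ARGS[@]}"` next to `"${PARALLEL_ARGS[@]}"`, so that eval-only runs pass the length computed by `compute_eval_context_length`.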


SERVER_PID=$!

8 changes: 8 additions & 0 deletions perf-changelog.yaml
@@ -3600,3 +3600,11 @@
- "MI355x DSR1-FP4: Include TP4 configurations for 8k1k"
- "Expand the TP sweep (included TP=4) for 8k/1k configuration for conc=4 to 64"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1692

- config-keys:
- dsv4-fp4-mi355x-atom
description:
- "Update image to rocm/atom:rocm7.2.4_ubuntu24.04_py3.12_pytorch_release_2.10.0_atom0.1.4_20260612"
- "Enable --enable-tbo (Two Batch Overlap) on top of DPA+TP8 at high concurrency: ISL=1024/OSL=1024 at CONC>=1024, ISL=8192/OSL=1024 at CONC>=256"
- "Update ISL=8192 search-space: TP8-only from conc=4-64, DPA from conc=128-1024 (previously conc=1-64 and DPA conc=64-512)"
pr-link: https://github.com/SemiAnalysisAI/InferenceX/pull/1717