[AMD][MI300X] Extend GPT-OSS FP4 TP=8 search to conc=1 (extends interactivity frontier to ~249 tps/user)#1092
Conversation
Previously the TP=8 sweep started at conc=4. At conc=1, TP=8 reaches
~249 tokens/sec/user, extending the interactivity Pareto frontier
beyond the prior best (~234 tps/user at TP=8 conc=4 → 219 tps/gpu).
Measured on a single MI300X node, vllm/vllm-openai-rocm:v0.17.0,
ROCM_AITER_UNIFIED_ATTN, fuse_rope_kvcache, inductor graph partition,
1k/1k random workload (range-ratio 0.8):
conc=1 (new):
Total TPS : 493
Output TPS : 247
TPS/GPU : 61.7
Mean ITL : 4.02 ms (= 248.9 tps/user)
TTFT : 33.6 ms
Completed : 64/64
This single conc=1 point widens the right end of the interactivity
frontier without requiring any code or config changes outside the
search-space bound. Existing benchmark script handles conc=1 directly.
The 8k/1k row is updated symmetrically.
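The derived numbers in the table follow from simple arithmetic (assuming the TP=8 group spans all 8 GPUs of the node); a quick sanity check, with the small gaps versus the reported values explained by rounding of the raw measurements:

```python
# Sanity-check the derived metrics from the conc=1 run above.
# Assumes TP=8 means all 8 GPUs of the MI300X node serve the request.

mean_itl_ms = 4.02   # mean inter-token latency, ms (as reported)
total_tps   = 493    # total throughput, input + output tokens/sec
num_gpus    = 8

# Per-user decode rate: one token every mean_itl_ms, i.e. 1000 / ITL.
tps_per_user = 1000.0 / mean_itl_ms
print(f"tps/user ≈ {tps_per_user:.1f}")  # ≈ 248.8 (reported: ~249)

# Per-GPU throughput: total TPS spread across the TP=8 group.
tps_per_gpu = total_tps / num_gpus
print(f"tps/GPU  ≈ {tps_per_gpu:.1f}")   # ≈ 61.6 (reported: 61.7)
```

At conc=1 the output TPS (247) and the ITL-derived per-user rate (~248.8) nearly coincide, as expected; TTFT accounts for the small difference.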
@claude /sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm

Claude finished @chunfangamd's task in 0s. I'll analyze this and get back to you.

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm

@seungrokj Kicking off a sweep. Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24657932304

@claude @Klaud-Cold update perf-changelog.yaml

@functionstackx @cquil11 can you plz merge this?

@ramineroane @seungrokj at the new interactivity range, h100 beats mi300 (~250 vs ~210 tok/s/user). any plans for improvement?
functionstackx
left a comment
approved. feel free to merge


Summary
Extends the TP=8 concurrency search for gptoss-fp4-mi300x-vllm from [4..16] to [1..16], adding a single new low-concurrency point that pushes the interactivity Pareto frontier rightward.

Motivation
The current best 1k/1k single-user point on the InferenceX gpt-oss-120b vLLM FP4 MI300X frontier is ~234 tokens/sec/user (TP=8 conc=4 → 219 TPS/GPU). The latency-optimal regime (smallest M, all GPUs, no batching) was not being measured.
Adding conc=1 captures that regime and gives the dashboard a true low-latency endpoint for users prioritizing interactive single-user use cases (chat, copilot, agentic).
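The change itself is just the search-bound edit. A hypothetical sketch of the relevant config fragment (the real key names and schema of amd-master.yaml are not shown in this PR and may differ):

```yaml
# .github/configs/amd-master.yaml — illustrative fragment only
gptoss-fp4-mi300x-vllm:
  tp: 8
  # was: concurrency_search: [4, 16]
  concurrency_search: [1, 16]  # lower bound extended to capture the latency-optimal conc=1 point
```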
Measured result
Single MI300X node, image
vllm/vllm-openai-rocm:v0.17.0. Bench harness: benchmarks/single_node/gptoss_fp4_mi300x.sh (no script change needed; it already accepts conc=1). Server flags (unchanged from existing config):
Bench params:
--random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.8 --num-prompts 64 --max-concurrency 1 --num-warmups 1 --ignore-eos

Pareto positioning
Net effect: a new datapoint at the high-interactivity tail; no existing points are displaced.
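The "no existing points are displaced" claim can be checked mechanically: a point stays on the (tps/user, tps/GPU) frontier iff no other point is at least as good on both axes and strictly better on one. A minimal sketch with the two points quoted above (hypothetical helper, not part of the repo):

```python
def pareto_frontier(points):
    """Keep points not dominated on (tps_user, tps_gpu); higher is better on both axes."""
    def dominated(p, q):
        # q dominates p: q is >= p on both axes and differs somewhere
        # (given both >=, differing implies strictly better on at least one axis)
        return q[0] >= p[0] and q[1] >= p[1] and q != p
    return [p for p in points if not any(dominated(p, q) for q in points)]

points = [
    (234.0, 219.0),  # prior best interactivity point: TP=8 conc=4
    (248.9, 61.7),   # new point: TP=8 conc=1
]
print(pareto_frontier(points))  # both survive: conc=1 trades tps/GPU for tps/user
```

Neither point dominates the other, so the new conc=1 point joins the frontier at its high-interactivity end while the conc=4 point remains.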
Impact on CI
Only the concurrency search bound in .github/configs/amd-master.yaml changes; no benchmark script or workflow changes.

cc @seungrokj @functionstackx @chunfangamd