[AMD][MI300X] Extend GPT-OSS FP4 TP=8 search to conc=1 (extends interactivity frontier to ~249 tps/user)#1092

Merged
seungrokj merged 4 commits into SemiAnalysisAI:main from ramineroane:gptoss-fp4-mi300x-tp8-low-latency
Apr 21, 2026
Conversation

@ramineroane
Collaborator

Summary

Extends the TP=8 concurrency search for gptoss-fp4-mi300x-vllm from [4..16] to [1..16], adding a single new low-concurrency point that pushes the interactivity Pareto frontier rightward.

Motivation

The current best 1k/1k single-user point on the InferenceX gpt-oss-120b vLLM FP4 MI300X frontier is ~234 tokens/sec/user (TP=8 conc=4 → 219 TPS/GPU). The latency-optimal regime (smallest M, all GPUs, no batching) was not being measured.

Adding conc=1 captures that regime and gives the dashboard a true low-latency endpoint for users prioritizing interactive single-user use cases (chat, copilot, agentic).

Measured result

Single MI300X node, image vllm/vllm-openai-rocm:v0.17.0. Bench harness: benchmarks/single_node/gptoss_fp4_mi300x.sh (no script change needed — it already accepts conc=1).

Server flags (unchanged from existing config):

--attention-backend ROCM_AITER_UNIFIED_ATTN
-cc.pass_config.fuse_rope_kvcache=True
-cc.use_inductor_graph_partition=True
--tensor-parallel-size=8
--gpu-memory-utilization 0.95
--max-model-len 2248
--block-size=64
--no-enable-prefix-caching

Bench params: --random-input-len 1024 --random-output-len 1024 --random-range-ratio 0.8 --num-prompts 64 --max-concurrency 1 --num-warmups 1 --ignore-eos.

Metric            Value
Total TPS         493
Output TPS        247
TPS/GPU           61.7
Mean ITL          4.02 ms
Tokens/sec/user   248.9
Mean TTFT         33.6 ms
Completed         64/64
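As a sanity check, the reported figures are internally consistent, assuming tokens/sec/user ≈ 1000 / mean ITL (ms) and TPS/GPU = total TPS / TP size (these relationships are assumptions about how the harness derives its metrics, not taken from its code):

```shell
# Cross-check the reported metrics from the table above.
awk 'BEGIN {
  itl_ms = 4.02; total_tps = 493; tp = 8
  printf "tps/user=%.1f tps/gpu=%.3f\n", 1000 / itl_ms, total_tps / tp
}'
```

This lands within rounding of the reported 248.9 tokens/sec/user and 61.7 TPS/GPU.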

Pareto positioning

Frontier point                                      tps/user   tps/gpu
TP=8 conc=4 (existing best in interactivity dir.)   234.5      219
TP=8 conc=1 (this PR)                               248.9      62

Net effect: a new datapoint at the high-interactivity tail; no existing points are displaced.
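For scale, the interactivity gain comes at a per-node throughput cost; the same tps/gpu figures expressed as node totals (tps/gpu × 8 GPUs, a back-of-the-envelope view rather than harness output):

```shell
# Node-level totals for the two frontier points (figures from the table above).
awk 'BEGIN {
  printf "conc=4: 234.5 tps/user, %d total tps/node\n", 219 * 8
  printf "conc=1: 248.9 tps/user, %d total tps/node\n", 62 * 8
}'
```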

Impact on CI

  • Adds a single conc point to the existing TP=8 sweep for gptoss-fp4-mi300x-vllm.
  • 8k/1k row updated symmetrically.
  • No script, image, or harness changes. Only .github/configs/amd-master.yaml.
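Since the only file touched is .github/configs/amd-master.yaml, the diff is confined to the sweep bounds. A hedged sketch of what the edit might look like (the key names and structure below are assumptions for illustration, not the repo's actual schema):

```yaml
# Hypothetical shape of the gptoss-fp4-mi300x-vllm entry in amd-master.yaml;
# only the lower bound of the concurrency sweep changes.
gptoss-fp4-mi300x-vllm:
  tensor_parallel: 8
  # before: concurrency: [4, 8, 16]
  concurrency: [1, 4, 8, 16]   # conc=1 added by this PR
```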

cc @seungrokj @functionstackx @chunfangamd

Previously the TP=8 sweep started at conc=4. At conc=1, TP=8 reaches
~249 tokens/sec/user, extending the interactivity Pareto frontier
beyond the prior best (~234 tps/user at TP=8 conc=4 → 219 tps/gpu).

Measured on a single MI300X node, vllm/vllm-openai-rocm:v0.17.0,
ROCM_AITER_UNIFIED_ATTN, fuse_rope_kvcache, inductor graph partition,
1k/1k random workload (range-ratio 0.8):

  conc=1 (new):
    Total TPS  : 493
    Output TPS : 247
    TPS/GPU    : 61.7
    Mean ITL   : 4.02 ms (= 248.9 tokens/sec/user)
    TTFT       : 33.6 ms
    Completed  : 64/64

This single conc=1 point widens the right end of the interactivity
frontier without requiring any code or config changes outside the
search-space bound. Existing benchmark script handles conc=1 directly.

The 8k/1k row is updated symmetrically.
Contributor

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

Collaborator

@chunfangamd chunfangamd left a comment


lgtm

@chunfangamd
Collaborator

@claude /sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm

@Klaud-Cold
Collaborator

Klaud-Cold commented Apr 19, 2026

Claude finished @chunfangamd's task in 0s


I'll analyze this and get back to you.

@seungrokj
Collaborator

/sweep test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm

@seungrokj seungrokj added the AMD label Apr 20, 2026
@github-actions
Contributor

@seungrokj Kicking off a sweep.

Run: https://github.com/SemiAnalysisAI/InferenceX/actions/runs/24657932304
Command: test-config --config-files .github/configs/amd-master.yaml --config-keys gptoss-fp4-mi300x-vllm
Pinned ref: 699643e
Approval: not required (trusted collaborator).

@seungrokj
Collaborator

@claude @Klaud-Cold update perf-changelog.yaml

@seungrokj
Collaborator

seungrokj commented Apr 21, 2026

[screenshot attached: Screenshot_20260421_101219_Chrome]

Collaborator

@seungrokj seungrokj left a comment


Lgtm

@seungrokj
Collaborator

@functionstackx @cquil11 can you please merge this?

@functionstackx
Contributor

@ramineroane @seungrokj at the new interactivity range of 210 to 250 tok/s/user, H100 beats MI300. Any plans for improvement?

[image attached]

Contributor

@functionstackx functionstackx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approved. feel free to merge

@seungrokj seungrokj merged commit 5261b0a into SemiAnalysisAI:main Apr 21, 2026
14 checks passed
