[TRTLLM-12669][refactor] Remove allow_advanced_sampling and capture dual CUDA graphs by zhaoyangwang-nvidia · Pull Request #14745 · NVIDIA/TensorRT-LLM

zhaoyangwang-nvidia · 2026-05-29T08:59:59Z

Replace static config flag with auto-detected per-step uses_advanced_sampling based on actual sampling params. Include this in CUDA graph key so we lazily capture two graph variants (argmax fast-path vs advanced sampling kernel) and dispatch by replaying the right one.

@coderabbitai summary

Description

Removed allow_advanced_sampling config flag from DecodingBaseConfig.
Replaced with auto-detected per-step is_all_greedy_sample based on
actual temperature/top_k/top_p of requests in the batch.
Included this in the CUDA graph key so two graph variants are lazily
captured (argmax fast-path vs advanced sampling kernel) and dispatched
at replay time based on batch composition.
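The per-step detection described above could look roughly like the sketch below. It is illustrative only and not code from this PR: the `SamplingParams` dataclass is a hypothetical stand-in, and the rule for what counts as greedy (no sampling parameters set, `top_k == 1`, or `temperature == 0`) is an assumption, since the description does not spell out the exact condition.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class SamplingParams:
    """Hypothetical stand-in for per-request sampling params (not the TensorRT-LLM class)."""
    temperature: Optional[float] = None
    top_k: Optional[int] = None
    top_p: Optional[float] = None


def is_greedy(params: SamplingParams) -> bool:
    # Assumed rule, not taken from the PR: a request is greedy when it picks
    # the single most likely token, i.e. top_k == 1, temperature == 0, or no
    # sampling parameter is set at all.
    if params.top_k == 1:
        return True
    if params.temperature is not None and params.temperature == 0:
        return True
    return (params.temperature is None and params.top_k is None
            and params.top_p is None)


def is_all_greedy_sample(batch: Sequence[SamplingParams]) -> bool:
    # One non-greedy request is enough to require the advanced sampling path
    # for the whole batch.
    return all(is_greedy(p) for p in batch)
```

Because the result depends only on host-side request parameters, it can be computed before any GPU work is launched for the step.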

Test Coverage

PR Checklist

Please review the following before submitting your PR:

PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.
PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.
Test cases are provided for new code paths (see test instructions)
If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.
Any new dependencies have been scanned for license and vulnerabilities
CODEOWNERS updated if ownership changes
Documentation updated as needed
Update tava architecture diagram if there is a significant design change in PR.
The reviewers assigned automatically/manually are appropriate for the PR.
Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…tected dual-graph dispatch

Remove the static `allow_advanced_sampling` config flag and replace it with a per-step auto-detected `is_all_greedy_sample` boolean on SpecMetadata. The flag is computed in `populate_sampling_params_for_one_model` from the actual temperature/top_k/top_p of every request in the batch.

`is_all_greedy_sample` is included in the CUDA graph key so we lazily capture two graph variants (argmax fast-path vs advanced sampling kernel) and dispatch by replaying the right one based on the current batch composition. Both variants stay CUDA-graph-compatible because the dispatch is a host-side decision outside the captured region.

Additional optimizations for the all-greedy batch (the common default):

- Populate skips per-token list building and 6 H->D copies entirely.
- Rejection sampling is bypassed (argmax is equivalent for all-greedy) in both linear and dynamic-tree paths.
- _compute_and_store_draft_probs is skipped, saving a softmax pass and draft-probs copy.

Signed-off-by: ZhaoyangWang <zhaoyangw@nvidia.com>
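To illustrate the lazy dual-graph idea in the commit message, here is a minimal sketch in plain Python with no CUDA involved. Everything in it is hypothetical: `LazyGraphCache`, `fake_capture`, and the `(batch_size, is_all_greedy_sample)` key layout are assumptions for illustration, and the real graph key in TensorRT-LLM likely carries more fields.

```python
from typing import Callable, Dict, Tuple

# Assumed key layout for illustration only.
GraphKey = Tuple[int, bool]


class LazyGraphCache:
    """Hypothetical cache that captures a graph variant on first use."""

    def __init__(self, capture: Callable[[int, bool], Callable[[], str]]):
        self._capture = capture
        self._graphs: Dict[GraphKey, Callable[[], str]] = {}
        self.capture_count = 0

    def replay(self, batch_size: int, is_all_greedy_sample: bool) -> str:
        # The variant is chosen on the host, before replay, so each captured
        # graph contains a single fixed code path.
        key = (batch_size, is_all_greedy_sample)
        graph = self._graphs.get(key)
        if graph is None:
            graph = self._capture(batch_size, is_all_greedy_sample)
            self._graphs[key] = graph
            self.capture_count += 1
        return graph()


def fake_capture(batch_size: int, is_all_greedy_sample: bool) -> Callable[[], str]:
    # Stand-in for real graph capture: returns a callable naming the variant.
    variant = "argmax" if is_all_greedy_sample else "advanced_sampling"
    return lambda: f"{variant}[bs={batch_size}]"
```

A workload that only ever sees all-greedy batches would capture just the argmax variant under this scheme; the advanced sampling variant is captured the first time a batch contains a non-greedy request.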

zhaoyangwang-nvidia · 2026-05-29T12:39:04Z

/bot run

zhaoyangwang-nvidia · 2026-05-29T12:39:27Z

Hi @mikeiovine please help to review this PR, thanks~

tensorrt-cicd · 2026-05-29T12:44:41Z

PR_Github #51043 [ run ] triggered by Bot. Commit: d237690 Link to invocation

tensorrt-cicd · 2026-05-29T17:20:11Z

PR_Github #51043 [ run ] completed with state SUCCESS. Commit: d237690
/LLM/main/L0_MergeRequest_PR pipeline #40490 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

CI Agent Failure Analysis

Link to invocation

github-actions Bot assigned zhaoyangwang-nvidia May 29, 2026

zhaoyangwang-nvidia force-pushed the TRTLLM-12669-remove-allow-advanced-sampling branch from 903b453 to d237690 Compare May 29, 2026 10:05

zhaoyangwang-nvidia marked this pull request as ready for review May 29, 2026 10:10

zhaoyangwang-nvidia requested review from a team as code owners May 29, 2026 10:10

zhaoyangwang-nvidia requested review from nv-guomingz, sunnyqgg, syuoni, venkywonka and zhenhuaw-me May 29, 2026 10:10

zhaoyangwang-nvidia wants to merge 1 commit into NVIDIA:main from zhaoyangwang-nvidia:TRTLLM-12669-remove-allow-advanced-sampling


2 participants
