[None][test] Remove duplicate test cases in llm_perf_core file #14749

Open
yufeiwu-nv wants to merge 10 commits into NVIDIA:main from yufeiwu-nv:model_dict

Collaborator

@yufeiwu-nv yufeiwu-nv commented May 29, 2026

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>

Summary by CodeRabbit

  • Tests
    • Expanded performance test coverage with new model configurations for streaming and throughput scenarios.
    • Enhanced speculative decoding support in performance benchmarking.
    • Added new model variants to performance test matrix across multiple GPU tiers.
    • Improved remote code trust handling for newly supported models.

Description

Also add nemotron_3_super_120b_nvfp4 serve test cases

Test Coverage

PR Checklist

Please review the following before submitting your PR:

  • PR description clearly explains what and why. If using CodeRabbit's summary, please make sure it makes sense.

  • PR Follows TRT-LLM CODING GUIDELINES to the best of your knowledge.

  • Test cases are provided for new code paths (see test instructions)

  • If PR introduces API changes, an appropriate PR label is added - either api-compatible or api-breaking. For api-breaking, include BREAKING in the PR title.

  • Any new dependencies have been scanned for license and vulnerabilities

  • CODEOWNERS updated if ownership changes

  • Documentation updated as needed

  • Update tava architecture diagram if there is a significant design change in PR.

  • The reviewers assigned automatically/manually are appropriate for the PR.

  • Please check this after reviewing the above items as appropriate for this PR.

GitHub Bot Help

To see a list of available CI bot commands, please comment /bot help.

…ean up waives.txt

Removed outdated model paths and unnecessary entries from MODEL_PATH_DICT in test_perf.py. Updated waives.txt to reflect the removal of tests that are no longer applicable, improving clarity and maintainability.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
These 7 waivers referenced perf tests (bart_large_cnn, bert_large,
flan_t5_base/large/xl/xxl, mbart_large_50_many_to_one_mmt) that no
longer appear in any test-db yaml on main. Drop them to keep the
cleanup consistent with the 5 sibling waivers (roberta_base, t5_*)
that were already removed in this PR.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Drop the 4 perf waivers that the PR originally added — author confirmed
the underlying nvbugs (5150255 / 5304388 / 6130334) are no longer
necessary to waive.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Included additional models "nemotron_nano_12b_v2", "phi_4_multimodal_instruct", "phi_4_multimodal_instruct_fp4", and "phi_4_multimodal_instruct_fp8" to the TRUST_REMOTE_CODE_MODELS dictionary to enhance testing coverage.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
@yufeiwu-nv yufeiwu-nv requested review from a team as code owners May 29, 2026 13:16
…dels in pytorch_model_config.py and update test_perf.py to include new spec-decoding models. Added configurations for streaming and throughput variants, ensuring better performance tuning. Adjusted test conditions in llm_perf_core.yml to reflect new model tests and conditions for GPU capabilities.

Signed-off-by: yufeiwu-nv <230315618+yufeiwu-nv@users.noreply.github.com>
Contributor

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

The PR updates performance testing infrastructure for the TensorRT-LLM framework by splitting nemotron model configurations into streaming and throughput variants, updating the perf test harness to handle spec-decoding models correctly, and expanding the performance test matrix with new models and GPU tier constraints across multiple hardware configurations.

Changes

Performance Configuration and Testing

| Layer / File(s) | Summary |
| --- | --- |
| Model streaming and throughput configuration split (tests/integration/defs/perf/pytorch_model_config.py) | Splits nemotron_3_super_120b_nvfp4 and nemotron_3_super_120b_nvfp4_mtp into streaming/serve patterns with enable_attention_dp=False and smaller batch size (8), and throughput patterns with enable_attention_dp=True and larger batch size (256), with MTP variant supporting speculative decoding in throughput mode. |
| Serve client and model trust configuration (tests/integration/defs/perf/test_perf.py) | Adds nemotron_nano_12b_v2, phi_4_multimodal_instruct and variants to TRUST_REMOTE_CODE_MODELS, introduces SPEC_DEC_MODELS constant aggregating spec-decoding models, and updates serve-client command construction to conditionally append --ignore-eos only for non-spec-decoding models. |
| Performance test matrix GPU tier updates (tests/integration/test_lists/qa/llm_perf_core.yml) | Reorganizes the LLM performance test matrix with updated GPU tier headers, adjusted compute capability constraints, expanded model coverage (qwen3.5 variants, llama_v3.3 configs, deepseek_r1), increased system GPU count requirements for RTX-6000D/Server tier, and new nemotron_3_super_120b_nvfp4 serve-based test entries across multiple GPU tiers. |
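
The conditional `--ignore-eos` handling summarized for test_perf.py can be pictured with a minimal sketch. Only the `SPEC_DEC_MODELS` name and the `--ignore-eos` flag come from the summary; the set's contents, the `build_serve_client_args` helper, and the placeholder base arguments are assumptions made for illustration and are not the PR's actual code.

```python
# Hypothetical stand-in for the SPEC_DEC_MODELS constant named in the summary.
# The membership shown here is an assumption based on the MTP variant using
# speculative decoding.
SPEC_DEC_MODELS = frozenset({"nemotron_3_super_120b_nvfp4_mtp"})


def build_serve_client_args(model_name, base_args):
    """Return client arguments, adding --ignore-eos for non-spec-decoding models.

    `base_args` is a placeholder list; the real harness builds its own command.
    """
    args = list(base_args)  # copy so the caller's list is not mutated
    if model_name not in SPEC_DEC_MODELS:
        args.append("--ignore-eos")
    return args


plain = build_serve_client_args("nemotron_3_super_120b_nvfp4", ["client"])
spec = build_serve_client_args("nemotron_3_super_120b_nvfp4_mtp", ["client"])
print(plain)
print(spec)
```

Keeping the spec-decoding model names in one constant means the command builder needs a single membership test rather than per-model branches.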

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

Possibly related PRs

  • NVIDIA/TensorRT-LLM#14570: Both PRs modify tests/integration/defs/perf/test_perf.py to expand TRUST_REMOTE_CODE_MODELS for nemotron_nano_12b_v2 and phi_4_multimodal_instruct (FP4/FP8), aligning the perf harness' trust_remote behavior for these new models.

Suggested reviewers

  • StanleySun639
  • LarryXFly
  • ruodil
  • niukuo
🚥 Pre-merge checks | ✅ 2 | ❌ 3

❌ Failed checks (3 warnings)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Title check | ⚠️ Warning | The title describes removing duplicate test cases in llm_perf_core, but the raw summary shows the changes actually add new test cases, update configurations, and reorganize GPU test groupings rather than simply removing duplicates. | Update the PR title to accurately reflect that the changes reorganize GPU test groupings and add new model/perf test cases (like qwen3.5 variants and nemotron_3_super_120b_nvfp4-serve) in addition to removing old ones. |
| Description check | ⚠️ Warning | The PR description is mostly empty placeholders from the template. The author checked the final PR Checklist box but provided no actual description content, test coverage explanation, or narrative justification for the changes. | Fill in the Description section explaining what duplicate test cases were removed and why, and provide specific Test Coverage details for the changes made to pytorch_model_config.py, test_perf.py, and llm_perf_core.yml. |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 50.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |
✅ Passed checks (2 passed)
| Check name | Status | Explanation |
| --- | --- | --- |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |


Contributor

@coderabbitai coderabbitai Bot left a comment

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@tests/integration/defs/perf/pytorch_model_config.py`:
- Around line 517-568: The throughput entries' pattern strings
('nemotron_3_super_120b_nvfp4-' and 'nemotron_3_super_120b_nvfp4_mtp') are too
broad and accidentally match streaming/low-latency labels, causing their
'config' (e.g., enable_attention_dp, cuda_graph_config.max_batch_size) to
override streaming variants; fix by making the patterns non-overlapping (for
example rename to a distinct suffix like
'nemotron_3_super_120b_nvfp4_throughput' and
'nemotron_3_super_120b_nvfp4_mtp_throughput' or use more specific anchors) so
the throughput entries in the 'patterns' lists no longer match streaming serve
labels and won't overwrite the streaming configs.

In `@tests/integration/test_lists/qa/llm_perf_core.yml`:
- Line 414: The list entry
"perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8]#max_throughput"
is malformed because the trailing "#max_throughput" is being treated as part of
the scalar; fix it by separating the comment from the scalar (e.g., add a space
before the #) or by quoting the entire test id string so the "#" is preserved
correctly as a comment marker or literal, updating the entry in
tests/integration/test_lists/qa/llm_perf_core.yml where that test id appears.
- Around line 251-254: The QA perf entries for
nemotron_3_super_120b_nvfp4-serve-pytorch-float4 (and the other missing QA cases
qwen3.5_9b, qwen3.5_27b, qwen3.5_122b_a10b,
deepseek_r1_0528_fp4-bench-pytorch-streaming-float4) that appear in
tests/integration/test_lists/qa/llm_perf_core.yml are not present in the
authoritative CI test-db files under
tests/integration/test_lists/test-db/l0_perf*.yml; add equivalent entries to
those l0_perf*.yml files so the CI DB includes the perf cases referenced (e.g.,
perf/test_perf.py::test_perf[nemotron_3_super_120b_nvfp4-serve-pytorch-float4-...],
perf/test_perf.py::test_perf[qwen3.5_9b-...],
perf/test_perf.py::test_perf[qwen3.5_27b-...],
perf/test_perf.py::test_perf[qwen3.5_122b_a10b-...], and
perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-...])
ensuring the exact test identifiers, markers (min_latency / max_throughput) and
parameter strings are copied so CI will discover and run the same cases.
ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: da3dc35b-eff7-4168-87e5-b47b897eff22

📥 Commits

Reviewing files that changed from the base of the PR and between c7683f2 and e944483.

📒 Files selected for processing (3)
  • tests/integration/defs/perf/pytorch_model_config.py
  • tests/integration/defs/perf/test_perf.py
  • tests/integration/test_lists/qa/llm_perf_core.yml

Comment on lines +517 to +568
# Nemotron-3-Super-120B-NVFP4 (throughput variant, aligned with curated yaml)
# Non-streaming cases use attention DP and larger cuda_graph batch for throughput.
{
    'patterns': ['nemotron_3_super_120b_nvfp4-'],
    'config': {
        'max_seq_len': 1048576,
        'enable_chunked_prefill': True,
        'enable_attention_dp': True,
        'stream_interval': 1,
        'moe_config': {
            'backend': 'CUTLASS',
        },
        'cuda_graph_config': {
            'enable_padding': True,
            'max_batch_size': 256,
        },
        'kv_cache_config': {
            'enable_block_reuse': False,
            'mamba_ssm_cache_dtype': 'float16',
            'mamba_ssm_stochastic_rounding': True,
            'mamba_ssm_philox_rounds': 5,
        },
    }
},
# Nemotron-3-Super-120B-NVFP4_MTP (throughput variant with MTP spec decoding)
{
    'patterns': ['nemotron_3_super_120b_nvfp4_mtp'],
    'config': {
        'max_seq_len': 1048576,
        'enable_chunked_prefill': True,
        'enable_attention_dp': True,
        'stream_interval': 1,
        'moe_config': {
            'backend': 'CUTLASS',
        },
        'cuda_graph_config': {
            'enable_padding': True,
            'max_batch_size': 256,
        },
        'kv_cache_config': {
            'enable_block_reuse': False,
            'mamba_ssm_cache_dtype': 'float16',
            'mamba_ssm_stochastic_rounding': True,
            'mamba_ssm_philox_rounds': 5,
        },
        'speculative_config': {
            'decoding_type': 'MTP',
            'num_nextn_predict_layers': 3,
            'allow_advanced_sampling': True,
        },
    }
},
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Throughput patterns currently override streaming variants unintentionally.

'nemotron_3_super_120b_nvfp4-' and 'nemotron_3_super_120b_nvfp4_mtp' also match the streaming serve labels, so these later entries overwrite the streaming low-latency config (e.g., enable_attention_dp and max_batch_size).

Suggested fix (make throughput patterns non-overlapping)
-            'patterns': ['nemotron_3_super_120b_nvfp4-'],
+            'patterns': [
+                'nemotron_3_super_120b_nvfp4-bench-pytorch-',
+                'nemotron_3_super_120b_nvfp4-serve-pytorch-float',
+            ],
@@
-            'patterns': ['nemotron_3_super_120b_nvfp4_mtp'],
+            'patterns': [
+                'nemotron_3_super_120b_nvfp4_mtp-bench-pytorch-',
+                'nemotron_3_super_120b_nvfp4_mtp-serve-pytorch-float',
+            ],
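
The overlap described in this comment can be reproduced with a small sketch. It assumes the harness matches patterns as substrings of the test label and lets later entries overwrite earlier ones, which is what the comment implies but is not confirmed here. The `resolve_config` helper, the flattened config keys, the streaming pattern, and the example label are all hypothetical.

```python
def resolve_config(label, entries):
    """Merge configs of all entries whose pattern is a substring of the label."""
    merged = {}
    for entry in entries:
        if any(pattern in label for pattern in entry["patterns"]):
            merged.update(entry["config"])  # later entries overwrite earlier ones
    return merged


# Hypothetical streaming entry; its pattern is an assumption for illustration.
STREAMING = {
    "patterns": ["nemotron_3_super_120b_nvfp4-serve-pytorch-streaming"],
    "config": {"enable_attention_dp": False, "max_batch_size": 8},
}
# Broad throughput pattern from the PR.
BROAD = {
    "patterns": ["nemotron_3_super_120b_nvfp4-"],
    "config": {"enable_attention_dp": True, "max_batch_size": 256},
}
# Narrower throughput patterns from the suggested fix.
NARROW = {
    "patterns": [
        "nemotron_3_super_120b_nvfp4-bench-pytorch-",
        "nemotron_3_super_120b_nvfp4-serve-pytorch-float",
    ],
    "config": {"enable_attention_dp": True, "max_batch_size": 256},
}

# Hypothetical streaming serve label.
label = "nemotron_3_super_120b_nvfp4-serve-pytorch-streaming-float4-maxbs:512"
print(resolve_config(label, [STREAMING, BROAD]))   # throughput values win
print(resolve_config(label, [STREAMING, NARROW]))  # streaming values survive
```

Under these assumptions the broad pattern silently replaces the streaming settings, while the narrower patterns leave them untouched and still match a non-streaming serve label.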
Comment on lines +251 to +254
#nemotron_3_super_120b_nvfp4 (Hybrid MoE+SSM+Attn FP4 76G, 4-GPU ep=4 tp=4, throughput config)
- perf/test_perf.py::test_perf[nemotron_3_super_120b_nvfp4-serve-pytorch-float4-maxbs:512-maxnt:2048-kv_frac:0.8-input_output_len:1024,1024-reqs:5-con:1-ep:4-tp:4-gpus:4] #min_latency
- perf/test_perf.py::test_perf[nemotron_3_super_120b_nvfp4-serve-pytorch-float4-maxbs:512-maxnt:2048-kv_frac:0.8-input_output_len:1024,1024-reqs:160-con:32-ep:4-tp:4-gpus:4]
- perf/test_perf.py::test_perf[nemotron_3_super_120b_nvfp4-serve-pytorch-float4-maxbs:512-maxnt:2048-kv_frac:0.8-input_output_len:1024,1024-reqs:640-con:128-ep:4-tp:4-gpus:4] #max_throughput
⚠️ Potential issue | 🔴 Critical | ⚡ Quick win

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

# Locate perf test-db files
fd -i 'l0.*perf.*\.yml' tests/integration/test_lists/test-db

# Check representative newly added cases from this PR
rg -nS \
  "nemotron_3_super_120b_nvfp4-serve-pytorch|qwen3\.5_9b|qwen3\.5_27b|qwen3\.5_122b_a10b|deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220" \
  tests/integration/test_lists/test-db

Repository: NVIDIA/TensorRT-LLM

Length of output: 2424


Add the new QA perf entries to the authoritative CI test-db (tests/integration/test_lists/test-db/l0_perf*.yml).

  • The tests/integration/test_lists/qa/llm_perf_core.yml perf cases for nemotron_3_super_120b_nvfp4-serve-pytorch (lines 251-254) are missing from tests/integration/test_lists/test-db/l0_perf*.yml.
  • The other referenced entries (lines 30-37, 64-70, 164-170) are also missing from tests/integration/test_lists/test-db/l0_perf*.yml (no matches for qwen3.5_9b, qwen3.5_27b, qwen3.5_122b_a10b, deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220).
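
The same cross-check can be sketched in Python for readers who prefer it to the shell script. This is a simplified illustration: it assumes both files list tests as `- <test id>` lines with optional ` #marker` comments, and the `extract_test_ids` and `missing_from_db` helpers and the shortened test ids are hypothetical.

```python
def extract_test_ids(text):
    """Collect test ids from '- <id>' list lines, dropping trailing ' #...' comments."""
    ids = set()
    for line in text.splitlines():
        line = line.strip()
        if not line.startswith("- "):
            continue  # skip headers, conditions, and full-line comments
        entry = line[2:].split(" #", 1)[0].strip()
        if entry:
            ids.add(entry)
    return ids


def missing_from_db(qa_text, db_text):
    """Return test ids present in the QA list but absent from the test-db text."""
    return sorted(extract_test_ids(qa_text) - extract_test_ids(db_text))


# Shortened, hypothetical ids used only to show the comparison.
qa_text = """
- perf/test_perf.py::test_perf[model_a-bench] #min_latency
- perf/test_perf.py::test_perf[model_b-serve]
"""
db_text = """
- perf/test_perf.py::test_perf[model_a-bench]
"""
print(missing_from_db(qa_text, db_text))
```

Real test-db files may attach extra suffixes or conditions to entries, so an exact-match comparison like this is a starting point, not a replacement for checking the files directly.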
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-pytorch-bfloat16-input_output_len:1000,2000-tp:2-gpus:2]
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-pytorch-bfloat16-maxbs:1-input_output_len:1000,1000-reqs:10-con:1-tp:2-gpus:2] #min_latency
- perf/test_perf.py::test_perf[llama_v3.3_nemotron_super_49b-bench-pytorch-bfloat16-input_output_len:1000,1000-con:250-tp:2-gpus:2] #max_throughput
- perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8]#max_throughput
⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Fix malformed test entry (# is currently part of the test name).

...gpus:8]#max_throughput is parsed as a single scalar because YAML treats # as a comment marker only when it is preceded by whitespace, so the comment text becomes part of the test id.

Minimal fix
-  - perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8]#max_throughput
+  - perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8] #max_throughput
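
The whitespace rule behind this fix can be illustrated with a short sketch. It is a deliberately simplified model of how YAML separates a plain (unquoted) scalar from a trailing comment, where `#` starts a comment only at the start of the text or after whitespace. It ignores quoting and other YAML syntax, and the `split_scalar_and_comment` helper is hypothetical, not a real parser API.

```python
import re


def split_scalar_and_comment(text):
    """Split an unquoted value into (scalar, comment) using the whitespace-before-# rule."""
    match = re.search(r"(?:^|\s)#", text)
    if match is None:
        return text.strip(), None  # no comment: any '#' stays inside the scalar
    hash_index = text.index("#", match.start())
    return text[:match.start()].strip(), text[hash_index + 1:].strip()


# Shortened test id; the full id behaves the same way.
print(split_scalar_and_comment("test_perf[...-gpus:8]#max_throughput"))
print(split_scalar_and_comment("test_perf[...-gpus:8] #max_throughput"))
```

With no space before `#`, the marker text stays inside the test id; adding one space turns it into a comment, which is the change proposed in the diff.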
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
- perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8]#max_throughput
- perf/test_perf.py::test_perf[deepseek_r1_0528_fp4-bench-pytorch-streaming-float4-maxbs:512-maxnt:5220-input_output_len:4000,2000-reqs:512-ep:8-tp:8-gpus:8] #max_throughput