[None][feat] Add AutoDeploy support for StepFun Step-3.7-Flash by bmarimuthu-nv · Pull Request #14759 · NVIDIA/TensorRT-LLM

bmarimuthu-nv · 2026-05-29T19:36:25Z

Summary

Adds AutoDeploy support for the StepFun Step-3.7-Flash text decoder (model_type step3p7 wrapping text config step3p5). Step-3.7-Flash is a 198B-param sparse-MoE vision-language model; this PR onboards the text generation path as a prefill-only AutoDeploy custom model.

Architecture highlights captured

45 decoder layers: 3 dense SwiGLU MLP + 42 MoE (288 routed experts top-8 + dense shared expert)
Grouped-query attention with mixed full / sliding-window layer types and different head counts per type (64 Q-heads full, 96 sliding; 8 KV heads; head_dim 128)
Head-wise attention gate (g_proj, sigmoid per head)
Per-head QK RMSNorm; per-layer-type partial RoPE (0.5 + llama3 scaling on full-attention layers, 1.0 on sliding); per-layer rope_theta
Gemma-style (1+weight) RMSNorm convention; MTP layers (45-47) skipped
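
The Gemma-style convention scales the normalized activations by (1 + weight) instead of weight. The following is a minimal sketch, in plain Python with lists standing in for tensors, of why a load-time rewrite of the weight lets a standard RMSNorm reproduce that convention. The function names and the eps value are illustrative assumptions, not the PR's code.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Standard RMSNorm: x / rms(x) * weight."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * w for v, w in zip(x, weight)]


def gemma_rmsnorm(x, weight, eps=1e-6):
    """Gemma-style RMSNorm: x / rms(x) * (1 + weight)."""
    rms = math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v / rms * (1.0 + w) for v, w in zip(x, weight)]


def absorb_one_plus_weight(weight):
    """Hypothetical load-time rewrite: store (1 + weight) once, up front."""
    return [1.0 + w for w in weight]


x = [0.5, -1.0, 2.0, 0.25]
checkpoint_weight = [0.1, -0.2, 0.0, 0.3]  # as stored in a Gemma-style checkpoint

reference = gemma_rmsnorm(x, checkpoint_weight)
absorbed = rmsnorm(x, absorb_one_plus_weight(checkpoint_weight))

# Both paths agree, so the runtime graph only needs the standard op.
assert all(math.isclose(a, b) for a, b in zip(reference, absorbed))
```

Doing the rewrite at load time keeps the extra addition out of the exported graph.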

Implementation notes

Uses AutoDeploy canonical ops only: torch_rmsnorm, torch_attention, torch_rope_with_explicit_cos_sin, torch_moe.
Three load-state-dict hooks: absorb (1+weight) norm convention, split stacked moe.{gate,up,down}_proj.weight [E,...] into per-expert Linears, and expand the head-gate weight [num_heads, hidden] -> [num_heads*head_dim, hidden].
Head-gate sharding: the native gate output dim (num_heads) is smaller than head_dim, which breaks AutoDeploy's head-aligned TP heuristic (num_heads // head_dim = 0). g_proj is therefore sized [num_heads*head_dim, hidden], with each per-head gate row repeated across that head's head_dim slots and applied in the flattened attention space, so it shards exactly like q_proj. The result is numerically identical to the native per-head gate.
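
To illustrate the head-gate expansion described above, here is a small sketch in plain Python (lists stand in for tensors, and the sizes are toy values). It repeats each gate row head_dim times and checks that gating the flattened attention output matches the native per-head gate. The helper names are hypothetical and this is not the PR's hook.

```python
import math
import random


def sigmoid(v):
    return 1.0 / (1.0 + math.exp(-v))


def matvec(w, x):
    """w: [out, in] as nested lists, x: [in] -> [out]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def expand_head_gate(w_gate, head_dim):
    """[num_heads, hidden] -> [num_heads * head_dim, hidden]; each row repeated head_dim times."""
    return [list(row) for row in w_gate for _ in range(head_dim)]


random.seed(0)
num_heads, head_dim, hidden = 4, 8, 16  # toy sizes, not the real model's
x = [random.uniform(-1, 1) for _ in range(hidden)]
w_gate = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(num_heads)]
attn_out = [[random.uniform(-1, 1) for _ in range(head_dim)] for _ in range(num_heads)]

# Native form: one sigmoid gate per head, broadcast over that head's head_dim values.
gate = [sigmoid(v) for v in matvec(w_gate, x)]
native = [a * gate[h] for h in range(num_heads) for a in attn_out[h]]

# Expanded form: one gate value per flattened attention slot.
w_expanded = expand_head_gate(w_gate, head_dim)
gate_expanded = [sigmoid(v) for v in matvec(w_expanded, x)]
flat_attn = [a for head in attn_out for a in head]
expanded = [a * g for a, g in zip(flat_attn, gate_expanded)]

assert len(w_expanded) == num_heads * head_dim
assert all(math.isclose(a, b) for a, b in zip(native, expanded))
```

The expanded weight costs head_dim times more gate parameters, but its output dimension now lines up with q_proj, which is the property the sharding argument relies on.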

Known limitations

Step-3.7-Flash-FP8 ships F8_E4M3 expert weights but no weight-scale tensors, so the block-FP8 transform cannot load it; this PR validates on the bf16 stepfun-ai/Step-3.7-Flash (identical architecture). NVFP4 variant untested.
The SwiGLU activation clamp on the routed experts of the last two MoE layers is not applied, because torch_moe has no clamp parameter; the clamp only guards against large numerical values. It is applied on the dense/shared-expert path.
Vision tower not deployed (text path only).

Files

tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py (new), __init__.py (register)
tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py (new)
examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml (new) + models.yaml entries (bf16/FP8/NVFP4)
examples/auto_deploy/cookbooks/step_3.7_flash_trtllm_cookbook.ipynb (new)
docs/source/models/supported-models.md (matrix rows)

Reproduce — end-to-end run

python examples/auto_deploy/build_and_run_ad.py \
  --model stepfun-ai/Step-3.7-Flash \
  --args.yaml-extra examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml

Run on 8x H100 (world_size 8, bf16). Full 45-layer run completed all transforms in ~185s, 10/10 prompts coherent. Complete prompt -> generated-text pairs (verbatim):

P0 — "How big is the universe?"

<think>: Got it, let's tackle the question "How big is the universe?" First, I need to start by clarifying the difference between the observable universe and the entire universe, right? Because a lot of people mix those up. First, observable universe: that's the part we can see, limited by the speed of light and the age of the universe. Let's get the numbers right. The age is about 13.8 billion years, so naive calculation would be 13.8 billion light

P1 — "In simple words and a single sentence, explain the concept of gravity:"

<think>: Got it, let's tackle this. First, need simple, single sentence, explain gravity... Gravity is the invisible force that pulls all objects with mass toward each other, which is why things fall to the ground instead of floating off into space, and why planets orbit the sun...

P2 — "How to fix slicing in golf?"

<think>: Got it, let's tackle this question: how to fix a slice in golf. First, I need to remember what a slice actually is, right? It's when the ball curves sharply right (for right-handed golfers, left for lefties) because of side spin, caused by an open clubface at impact combined with an outside-in swing path...

P3 — "Where is the capital of Iceland?"

<think>: Hmm, the user is asking about the capital of Iceland. This is a straightforward factual question with a clear answer. Reykjavík is the capital, so I'll state that directly and add a few key details to make the response more informative—its location, population, and cultural significance...

P4 — "What are the three laws of thermodynamics?"

<think>: We need to provide a concise but thorough answer... 1. Zeroth law: defines thermal equilibrium and temperature... 2. First law: conservation of energy (energy cannot be created or destroyed, only transferred or changed in form; ΔU = Q - W or ΔU = Q + W

P5 — "Summarize the plot of Romeo and Juliet in two sentences:"

<think>: Got it, let's tackle this. First sentence needs to set up the core conflict and the meet-cute... Romeo, a Montague, sneaks into a Capulet feast and falls for Juliet, a Capulet, even though their families have a bitter, long-standing feud... they secretly get married the next day... "Set in the warring Italian city of Verona

P6 — "Write a Python function that checks if a number is prime."

<think>: Got it, let's see. The user wants a Python function to check if a number is prime. First, I need to remember what a prime number is: a number greater than 1 that has no divisors other than 1 and itself... handle edge cases. Like numbers less than 2 aren't prime...

P7 — "Explain the difference between a compiler and an interpreter:"

<think>: Got it, let's tackle explaining the difference between compilers and interpreters... Both translate high-level code into machine code that the CPU can run... a compiler translates the entire source code all at once into machine code (or sometimes an intermediate bytecode

P8 — "What causes the northern lights?"

<think>: We are asked: "What causes the northern lights?" This is a straightforward question about the aurora borealis... the main points: solar wind, Earth's magnetic field, interaction with atmosphere, resulting in photons...

P9 — "What are the health benefits of drinking green tea?"

<think>: We need to answer: "What are the health benefits of drinking green tea?"... components like catechins, EGCG, antioxidants... cardiovascular health, weight management, brain function, cancer risk reduction, anti-inflammatory, blood sugar control, dental health...

A reduced 5-layer run (--args.model-kwargs.text_config.num_hidden_layers=5) was used first to validate build/export/sharding/KV-cache; it passes the sharding stage in ~96s (garbled output expected for the truncated model).

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py -v

Result on this commit (CPU, fp32):

test_mlp_equivalence PASSED
test_mlp_clamped_equivalence PASSED
test_attention_full_equivalence PASSED
test_attention_sliding_equivalence PASSED
test_moe_block_equivalence PASSED
test_decoder_layer_moe_equivalence PASSED
test_decoder_layer_dense_sliding_equivalence PASSED
test_full_model_equivalence PASSED
test_export PASSED
========================= 9 passed in 3.01s =========================

🤖 Generated with Claude Code

Summary by CodeRabbit

Release Notes

New Features
- Added full support for StepFun Step-3.7-Flash model with multiple quantization variants (FP8, NVFP4)
- Model available for TensorRT-LLM deployment via AutoDeploy with optimized configurations
Documentation
- Added comprehensive Jupyter notebook cookbook with deployment instructions, environment setup, and OpenAI-compatible client examples
Tests
- Added comprehensive unit tests validating model accuracy and Torch export functionality

coderabbitai · 2026-05-29T19:43:14Z

Walkthrough

This PR introduces comprehensive support for StepFun Step-3.7-Flash text model within TensorRT-LLM. The changes include a complete AutoDeploy PyTorch implementation with custom state-dict hooks for checkpoint adaptation, YAML configuration and model registry entries, extensive equivalence testing against a reference implementation, and a deployment cookbook with usage examples.

Changes

StepFun Step-3.7-Flash Integration

| Layer / File(s) | Summary |
| --- | --- |
| Core AutoDeploy Model Implementation `tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py`, `tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py` | Implements Step-3.7-Flash text decoder with state-dict pre-hooks for RMSNorm convention adjustment, head-wise gate weight expansion, and MoE expert weight splitting. Includes AD-operator-backed RMSNorm, partial-rotation RoPE with optional llama3 scaling, GQA attention with per-head Q/K normalization and sigmoid gate, SwiGLU MLP with optional clamping, routed MoE with sigmoid routing, and decoder layers supporting both dense and MoE+shared-expert configurations. |
| Model Configuration and Registry `examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml`, `examples/auto_deploy/model_registry/models.yaml`, `docs/source/models/supported-models.md` | Registers Step-3.7-Flash variants (base, FP8, NVFP4) in the AutoDeploy model registry with YAML configuration specifying TensorRT-LLM runtime, CUDA graphs, FlashInfer attention backend, operational limits, chunked prefill, and BF16 KV cache settings. Updates supported models documentation with feature support matrix entries. |
| Comprehensive Equivalence Testing `tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py` | Provides faithful reference PyTorch implementation and validates AutoDeploy operators at multiple granularities: block-level tests for MLP (dense/clamped), attention (full/sliding), and MoE; decoder-layer tests for MoE and sliding variants; full causal-LM equivalence with state-dict hook/load behavior; and Torch export validation with dynamic shapes and RMSE closeness checks. |
| Deployment Documentation and Examples `examples/auto_deploy/cookbooks/step_3.7_flash_trtllm_cookbook.ipynb` | Jupyter notebook demonstrating Step-3.7-Flash deployment, including GPU prerequisites, pip/torch/openai setup, server launch via trtllm-serve, OpenAI-compatible client configuration, standard and streaming chat completion examples, recommended evaluation parameters, and reference links to TensorRT-LLM and AutoDeploy documentation. |

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Suggested reviewers

galagam
xinhe-nv
liji-nv
MrGeva

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
| --- | --- | --- | --- |
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 24.24% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (4 passed)

| Check name | Status | Explanation |
| --- | --- | --- |
| Title check | ✅ Passed | The title clearly and specifically describes the main change: adding AutoDeploy support for the StepFun Step-3.7-Flash model. |
| Description check | ✅ Passed | The PR description is comprehensive and well-structured, covering summary, architecture highlights, implementation details, known limitations, files changed, reproduction steps, and unit test results. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |

coderabbitai

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/source/models/supported-models.md`:
- Line 53: The docs list the model class as Step3p7ForConditionalGeneration but
the actual registered class is Step3p7ForCausalLM; update the table entries
(both occurrences) to use the implemented architecture name Step3p7ForCausalLM
and ensure the markdown link/text matches the registered model identifier
`stepfun-ai/Step-3.7-Flash`; verify any references to the old name
(Step3p7ForConditionalGeneration) are replaced so the docs reflect the real API
surface.

In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py`:
- Line 185: Replace the runtime assertions with explicit exceptions: instead of
using assert rope_type == "llama3", raise a ValueError (or TypeError if testing
type) with the same message; do the same for the other assert checks noted
(around the code using rope_type at the other two locations referenced). Locate
the assertion lines (the rope_type assertion in modeling_step3p7.py and the two
other assert usages around the referenced blocks) and change them to raise
ValueError("Step-3.7 only supports llama3 rope-scaling, got {rope_type!r}") (or
raise TypeError(...) if the check is for type), preserving the original message
text. Ensure tests/consumers handle these exceptions instead of relying on
AssertionError.
- Line 527: The class-level attribute _no_split_modules is defined as a mutable
list which can be accidentally mutated and triggers RUF012; change it to an
immutable tuple containing "Step3p7DecoderLayer" (i.e., replace the list literal
with a tuple literal) so the attribute is immutable and cannot be modified at
runtime; update any code that expects list-specific methods to handle a tuple or
convert to list locally if mutation is required, referencing the
_no_split_modules attribute and the Step3p7DecoderLayer identifier.
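
The two modeling_step3p7.py findings above can be sketched as follows. This is a hedged illustration of the suggested pattern, not the PR's code: the function name validate_rope_type and the class name Step3p7PreTrainedModelSketch are hypothetical, and only the error message and the Step3p7DecoderLayer identifier come from the review comment.

```python
def validate_rope_type(rope_type):
    """Explicit exception instead of `assert`.

    Assertions are stripped when Python runs with -O, so a check that must
    always run is better expressed as a raised exception.
    """
    if rope_type != "llama3":
        raise ValueError(
            f"Step-3.7 only supports llama3 rope-scaling, got {rope_type!r}"
        )


class Step3p7PreTrainedModelSketch:
    """Hypothetical stand-in for the model base class."""

    # A tuple instead of a list: the class-level attribute can no longer be
    # mutated in place, which is what Ruff's RUF012 warns about.
    _no_split_modules = ("Step3p7DecoderLayer",)


validate_rope_type("llama3")  # passes silently

try:
    validate_rope_type("yarn")
except ValueError as exc:
    print(exc)  # Step-3.7 only supports llama3 rope-scaling, got 'yarn'
```

Callers that previously caught AssertionError would need to catch ValueError instead, as the review comment notes.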

ℹ️ Review info

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 58c67cdd-9066-4aa2-acfe-66ba61582ede

📥 Commits

Reviewing files that changed from the base of the PR and between ebbbec4 and b96e0ad.

📒 Files selected for processing (7)

docs/source/models/supported-models.md
examples/auto_deploy/cookbooks/step_3.7_flash_trtllm_cookbook.ipynb
examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml
examples/auto_deploy/model_registry/models.yaml
tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py
tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py

Onboard the text decoder of stepfun-ai/Step-3.7-Flash (model_type step3p7 / step3p5) to the AutoDeploy backend as a prefill-only custom model with tensor-parallel sharding-IR hints.

Architecture: 45 layers (3 dense + 42 MoE; 288 routed experts top-8 plus a dense shared expert), grouped-query attention with mixed full/sliding-window layer types and per-type head counts (64 full / 96 sliding, 8 KV heads), a head-wise attention gate, per-head QK RMSNorm, and per-layer-type partial RoPE (llama3 scaling on full-attention layers).

Uses AutoDeploy canonical ops (torch_rmsnorm, torch_attention, torch_rope_with_explicit_cos_sin, torch_moe) with load hooks for the (1+weight) RMSNorm convention and stacked->per-expert MoE weights.

Sharding: every shardable projection carries explicit sharding-IR hints (torch_linear_simple tp_mode/layer_type, auto_deploy.view tp_scaled_dim, all_reduce after rowwise projections and at the MoE routed+shared merge point); the head-wise gate is a per-head column shard, and the routed MoE uses torch_moe(layer_type="moe"). The registry config disables the legacy heuristic sharding and enables apply_sharding_hints.

Adds hierarchical equivalence unit tests, a runnable model-specific sharding-IR equivalence test (parametrized over tp-only/ep-only/tep/attn-dp), a bundled config class for standalone/offline-test construction, a self-contained model-registry config + entries, a deployment cookbook, and supported-models matrix rows.

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv · 2026-05-30T01:59:56Z

[AGENT] Updated this PR: rebased onto upstream/main and ported the model to the hint-driven sharding-IR path (the heuristic sharder is no longer used for new models).

What changed since the previous revision

Sharding-IR port of modeling_step3p7.py: every shardable projection now uses torch.ops.auto_deploy.torch_linear_simple (tp_mode/layer_type), head reshapes use auto_deploy.view (tp_scaled_dim), rowwise outputs are followed by auto_deploy.all_reduce, and routed MoE uses torch_moe(layer_type="moe") with a single all_reduce at the routed+shared merge point. The head-wise gate is a per-head column shard. The registry config disables detect_sharding/sharding_transform_executor and enables apply_sharding_hints.
Added a runnable, model-specific sharding-IR equivalence test tests/unittest/auto_deploy/multigpu/transformations/library/test_step3p7_sharding_ir.py (parametrized over tp-only/ep-only/tep/attn-dp).
Added a bundled Step3p7Config (small TP-friendly defaults) so the sharding-IR harness can build a tiny instance; real runs still use the trust_remote_code config.
Deterministic-seed fixture added to the unit tests.
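
The pattern of a column-sharded projection feeding a row-sharded projection, followed by an all_reduce, can be illustrated without any GPU. The sketch below simulates two ranks in plain Python and checks that summing the per-rank partial outputs reproduces the unsharded result. It does not use the auto_deploy ops named above; all_reduce_sum and the other helpers are hypothetical stand-ins, and the sizes are toy values.

```python
import math
import random


def matvec(w, x):
    """w: [out, in] as nested lists, x: [in] -> [out]."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def silu(v):
    return [e / (1.0 + math.exp(-e)) for e in v]


def all_reduce_sum(partials):
    """Stand-in for a collective all_reduce: elementwise sum across ranks."""
    return [sum(vals) for vals in zip(*partials)]


random.seed(0)
hidden, inter, world_size = 6, 8, 2
x = [random.uniform(-1, 1) for _ in range(hidden)]
w_up = [[random.uniform(-1, 1) for _ in range(hidden)] for _ in range(inter)]    # [inter, hidden]
w_down = [[random.uniform(-1, 1) for _ in range(inter)] for _ in range(hidden)]  # [hidden, inter]

# Unsharded reference.
reference = matvec(w_down, silu(matvec(w_up, x)))

# Sharded: each rank owns a slice of the intermediate dimension.
shard = inter // world_size
partials = []
for rank in range(world_size):
    lo, hi = rank * shard, (rank + 1) * shard
    up_local = w_up[lo:hi]                       # column-parallel: output features split
    down_local = [row[lo:hi] for row in w_down]  # row-parallel: input features split
    h_local = silu(matvec(up_local, x))          # elementwise op needs no communication
    partials.append(matvec(down_local, h_local)) # partial sum over this rank's slice

sharded = all_reduce_sum(partials)
assert all(math.isclose(a, b, abs_tol=1e-12) for a, b in zip(sharded, reference))
```

Each rank only ever holds a partial sum after the row-sharded projection, which is why the collective has to follow it.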

Reproduce — end-to-end (sharding-IR, 8 GPUs, bf16)

python examples/auto_deploy/build_and_run_ad.py \
  --model stepfun-ai/Step-3.7-Flash \
  --args.yaml-extra examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml

apply_sharding_hints is the sole sharding pass (MoE grid [ep x tp] = [8 x 1], NCCL):

[apply_sharding_hints] (TP + EP): 72 nodes processed    # 5-layer reduced sanity run
[apply_sharding_hints] (TP + EP): 672 nodes processed   # full 45-layer run
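
For intuition about the [ep x tp] = [8 x 1] grid: with 288 routed experts and 8 expert-parallel ranks, each rank holds 36 experts. The sketch below assumes a contiguous block layout, which is only one possible assignment and may not match how AutoDeploy actually places experts; the function name is hypothetical.

```python
def experts_for_rank(num_experts, ep_size, rank):
    """Contiguous expert block owned by `rank` (assumed layout, for illustration)."""
    if num_experts % ep_size != 0:
        raise ValueError("num_experts must be divisible by ep_size in this sketch")
    per_rank = num_experts // ep_size
    return range(rank * per_rank, (rank + 1) * per_rank)


NUM_EXPERTS, EP_SIZE = 288, 8  # values taken from the PR description

owned = [experts_for_rank(NUM_EXPERTS, EP_SIZE, r) for r in range(EP_SIZE)]
assert all(len(block) == 36 for block in owned)

# Every expert is owned by exactly one rank.
assert sorted(e for block in owned for e in block) == list(range(NUM_EXPERTS))
```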

Full 45-layer run — coherent generation, complete prompt → output pairs (verbatim):

"How big is the universe?" → Got it, let's tackle the question "How big is the universe?" First, I need to start by clarifying the difference between the observable universe and the entire universe... The age of the universe is about 13.8 billion years, so you might think the radius is [...]
"In simple words and a single sentence, explain the concept of gravity:" → ...Gravity is the natural force that pulls all objects that have mass toward one another, which is why things fall to the ground instead of floating off into space and why planets orbit the Sun. [...]
"How to fix slicing in golf?" → ...what a slice actually is... a shot that curves sharply... opposite of a hook which curves left for righties. First I should start with explaining the root causes [...]
"Where is the capital of Iceland?" → ...The capital of Iceland is Reykjavík. It's also the largest city and the main cultural/economic hub... a concise answer is sufficient [...]
"What are the three laws of thermodynamics?" → ...There are four laws traditionally, but the "three laws" often refer to the first, second, and third, excluding the zeroth law... the first law ([...]
"Summarize the plot of Romeo and Juliet in two sentences:" → ...in Verona, two teens from feuding noble families, the Montagues and Capulets, fall deeply in love at a Capulet ball, and secretly marry with the help of Friar Laurence... Then the second sentence needs the tragic chain of [...]
"Write a Python function that checks if a number is prime." → ...a prime number is: a number greater than 1 that has no divisors other than 1 and itself... numbers less than 2 aren't prime... 2? That's the only even prime [...]
"Explain the difference between a compiler and an interpreter:" → ...Both translate high-level code into machine code... a compiler translates the entire source code all at once into machine code (or sometimes an intermediate bytecode, like Java [...]
"What causes the northern lights?" → ...the aurora borealis... solar wind, Earth's magnetic field, interaction with atmosphere, resulting in photons... [...]
"What are the health benefits of drinking green tea?" → ...Antioxidant content: catechins, EGCG, fight oxidative stress. - Cardiovascular benefits: improved cholesterol (lower LDL, raise HDL), blood pressure, reduced risk of heart [...]

Unit tests

pytest tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py -v

test_mlp_equivalence PASSED
test_mlp_clamped_equivalence PASSED
test_attention_full_equivalence PASSED
test_attention_sliding_equivalence PASSED
test_moe_block_equivalence PASSED
test_decoder_layer_moe_equivalence PASSED
test_decoder_layer_dense_sliding_equivalence PASSED
test_full_model_equivalence PASSED
test_export PASSED
========================= 9 passed =========================

Sharding-IR equivalence test (4 parallelism configs)

pytest tests/unittest/auto_deploy/multigpu/transformations/library/test_step3p7_sharding_ir.py -v

test_step3p7_sharding_ir_equivalence[tp-only] PASSED
test_step3p7_sharding_ir_equivalence[ep-only] PASSED
test_step3p7_sharding_ir_equivalence[tep]     PASSED
test_step3p7_sharding_ir_equivalence[attn-dp] PASSED
========================= 4 passed in 105s =========================

This compares sharded vs unsharded prefill of the same tiny model. Step-3.7-Flash's correct-sharding rel_rmse is ~0.05 (head-gate scaling + MoE routed_scaling_factor=3.0 amplify the bf16 all_reduce rounding); the per-model tolerance is 0.08. The harness's sabotage control (removing the collectives) drives rel_rmse to ~0.92 (≈17×), confirming the collectives are present and the sharding is correct rather than the tolerance merely being loose.
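
As a rough illustration of the tolerance check, the sketch below computes a relative RMSE and compares it against the 0.08 per-model tolerance. The exact definition the harness uses is not shown in this thread, so the formula here (RMSE of the difference divided by the RMS of the reference) is an assumption, and the data is synthetic.

```python
import math


def rel_rmse(actual, reference):
    """RMSE of (actual - reference) normalized by the RMS of reference (assumed definition)."""
    n = len(reference)
    err = math.sqrt(sum((a - r) ** 2 for a, r in zip(actual, reference)) / n)
    scale = math.sqrt(sum(r * r for r in reference) / n)
    return err / scale


TOLERANCE = 0.08  # per-model tolerance quoted above

reference = [1.0, -2.0, 3.0, -4.0]
sharded_ok = [r * 1.05 for r in reference]  # synthetic 5% deviation
sabotaged = [r * 0.08 for r in reference]   # synthetic output missing most of its magnitude

assert rel_rmse(sharded_ok, reference) < TOLERANCE
assert rel_rmse(sabotaged, reference) > TOLERANCE
```

A check like this only distinguishes correct from broken sharding if the broken case lands far above the tolerance, which is the point of the sabotage control.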

All three — unit tests, the 4-config sharding-IR equivalence test, and the full 45-layer E2E — pass on this exact commit.

Split the stacked routed-expert block-FP8 dequant scales (moe.{gate,up,down}_proj.weight_scale_inv) per-expert in the MoE load hook, alongside the expert weights. This feeds AutoDeploy's quantize_finegrained_fp8_moe / fuse_finegrained_fp8_moe transforms the per-expert FP8 weight + weight_scale_inv they expect (the same path used for the per-expert DeepSeek-V3 checkpoint), enabling stepfun-ai/Step-3.7-Flash-FP8. The bf16 path is unaffected: bf16 checkpoints carry no weight_scale_inv, so the added split branch is a no-op for them.

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>

bmarimuthu-nv · 2026-05-30T02:41:11Z

[AGENT] Added FP8 support (stepfun-ai/Step-3.7-Flash-FP8) — new commit 1496fc8dba.

Change: the MoE load hook now splits the stacked routed-expert block-FP8 dequant scales (moe.{gate,up,down}_proj.weight_scale_inv) per-expert, alongside the expert weights. Only the routed experts are FP8 (F8_E4M3, block [128,128], with weight_scale_inv); the router/shared-expert/attention/dense layers stay bf16. The per-expert FP8 weight + weight_scale_inv is then consumed by AutoDeploy's default-on quantize_finegrained_fp8_moe / fuse_finegrained_fp8_moe transforms — the same path used for the (already per-expert) DeepSeek-V3 checkpoint. The bf16 path is unaffected (bf16 checkpoints carry no weight_scale_inv, so the split branch is a no-op).
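
A minimal sketch of the split described above, assuming stacked entries are indexable along their first (expert) dimension. The per-expert key layout (experts.{i}.{proj}.{suffix}) and the function name are hypothetical; the real hook's key names may differ. Nested lists stand in for tensors so the sketch runs without dependencies.

```python
def split_stacked_experts(state_dict, prefix, num_experts):
    """Replace stacked [E, ...] entries under `prefix` with per-expert entries, in place."""
    for proj in ("gate_proj", "up_proj", "down_proj"):
        for suffix in ("weight", "weight_scale_inv"):
            stacked_key = f"{prefix}{proj}.{suffix}"
            if stacked_key not in state_dict:
                # bf16 checkpoints carry no weight_scale_inv: nothing to split.
                continue
            stacked = state_dict.pop(stacked_key)
            if len(stacked) != num_experts:
                raise ValueError(f"{stacked_key}: expected {num_experts} experts")
            for e in range(num_experts):
                # Hypothetical per-expert key layout.
                state_dict[f"{prefix}experts.{e}.{proj}.{suffix}"] = stacked[e]


# Toy FP8-like checkpoint with 2 experts: weights plus per-block scales.
fp8_sd = {
    "moe.gate_proj.weight": [[[1, 2]], [[3, 4]]],
    "moe.gate_proj.weight_scale_inv": [[[0.5]], [[0.25]]],
}
split_stacked_experts(fp8_sd, "moe.", num_experts=2)
assert fp8_sd["moe.experts.1.gate_proj.weight"] == [[3, 4]]
assert fp8_sd["moe.experts.1.gate_proj.weight_scale_inv"] == [[0.25]]

# Toy bf16 checkpoint: the scale branch is skipped.
bf16_sd = {"moe.gate_proj.weight": [[[1, 2]], [[3, 4]]]}
split_stacked_experts(bf16_sd, "moe.", num_experts=2)
assert not any(key.endswith("weight_scale_inv") for key in bf16_sd)
```

Keeping each expert's scale next to its weight means the two always move together when experts are distributed across ranks.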

No config change: step-3.7-flash.yaml is shared; the FP8 quant path is auto-detected from the checkpoint's quantization_config.

Reproduce — FP8 end-to-end (8 GPUs)

python examples/auto_deploy/build_and_run_ad.py \
  --model stepfun-ai/Step-3.7-Flash-FP8 \
  --args.yaml-extra examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml

(On Hopper, DeepGEMM's fused FP8-block-scale MoE kernel must JIT via NVCC — TRTLLM_DG_JIT_USE_NVCC=1 — otherwise NVRTC fails to compile it for this moe_intermediate_size=1280 geometry.)
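
One way to apply the note above is to set the variable in the environment of the launched process. The sketch below only builds the command and environment from the values quoted in this comment; the helper name build_fp8_launch is hypothetical, and the launch itself is left commented out because it needs a TensorRT-LLM checkout and 8 GPUs.

```python
import os


def build_fp8_launch(model="stepfun-ai/Step-3.7-Flash-FP8"):
    """Return (cmd, env) for the FP8 run, with DeepGEMM JIT routed through NVCC."""
    env = dict(os.environ)
    env["TRTLLM_DG_JIT_USE_NVCC"] = "1"
    cmd = [
        "python",
        "examples/auto_deploy/build_and_run_ad.py",
        "--model",
        model,
        "--args.yaml-extra",
        "examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml",
    ]
    return cmd, env


cmd, env = build_fp8_launch()

# To actually launch (not executed here):
#   import subprocess
#   subprocess.run(cmd, env=env, check=True)
```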

Transforms on the full 45-layer run (all 8 ranks):

quantize_finegrained_fp8_moe [pattern_matcher]: matches=42
apply_sharding_hints [sharding]: moe grid [ep x tp] = [8 x 1] -> 672 nodes processed
fuse_finegrained_fp8_moe [post_load_fusion]: matches=42 -> trtllm_quant_finegrained_fp8_moe_fused

Coherent generation (verbatim samples from the full-layer FP8 run):

"Where is the capital of Iceland?" → "…Reykjavík is the capital, and it's also the largest city in Iceland…"
"What are the three laws of thermodynamics?" → correctly distinguishes the zeroth law from the first/second/third
"Write a Python function that checks if a number is prime." → "…a number greater than 1 that has no divisors other than 1 and itself… numbers less than 2 aren't prime… 0, negative numbers all return False."
"Explain the difference between a compiler and an interpreter:" → "…the big difference is how and when they do that translation."

All 10 prompts produce on-topic <think> chain-of-thought, matching the bf16 run's quality. FP8 fused MoE runs on all 8 GPUs.

bmarimuthu-nv · 2026-05-30T02:48:58Z

/bot run --disable-fail-fast

tensorrt-cicd · 2026-05-30T02:54:33Z

PR_Github #51144 [ run ] triggered by Bot. Commit: 1496fc8 Link to invocation

tensorrt-cicd · 2026-05-30T10:39:12Z

PR_Github #51144 [ run ] completed with state SUCCESS. Commit: 1496fc8
/LLM/main/L0_MergeRequest_PR pipeline #40580 completed with status: 'FAILURE'

CI Report

⚠️ Action Required:

Please check the failed tests and fix your PR
If you cannot view the failures, ask the CI triggerer to share details
Once fixed, request an NVIDIA team member to trigger CI again

bmarimuthu-nv requested review from a team as code owners May 29, 2026 19:36

bmarimuthu-nv requested review from Shixiaowei02, arysef and marinayanov May 29, 2026 19:36

github-actions Bot assigned bmarimuthu-nv May 29, 2026

coderabbitai Bot reviewed May 29, 2026

Comment thread docs/source/models/supported-models.md

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py

Comment thread tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py

bmarimuthu-nv force-pushed the bala/step3.7-flash branch from b96e0ad to 0d868fc Compare May 29, 2026 22:06

suyoggupta approved these changes May 29, 2026

bmarimuthu-nv force-pushed the bala/step3.7-flash branch from 0d868fc to 0c929ae Compare May 30, 2026 01:59

bmarimuthu-nv wants to merge 2 commits into NVIDIA:main from nv-auto-deploy:bala/step3.7-flash
