[None][feat] Add AutoDeploy support for StepFun Step-3.7-Flash #14759
bmarimuthu-nv wants to merge 2 commits into
📝 Walkthrough
This PR introduces comprehensive support for the StepFun Step-3.7-Flash text model within TensorRT-LLM. The changes include a complete AutoDeploy PyTorch implementation with custom state-dict hooks for checkpoint adaptation, YAML configuration and model registry entries, extensive equivalence testing against a reference implementation, and a deployment cookbook with usage examples.
Changes: StepFun Step-3.7-Flash Integration
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks: ✅ 4 passed | ❌ 1 failed (1 warning)
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@docs/source/models/supported-models.md`:
- Line 53: The docs list the model class as Step3p7ForConditionalGeneration but
the actual registered class is Step3p7ForCausalLM; update the table entries
(both occurrences) to use the implemented architecture name Step3p7ForCausalLM
and ensure the markdown link/text matches the registered model identifier
`stepfun-ai/Step-3.7-Flash`; verify any references to the old name
(Step3p7ForConditionalGeneration) are replaced so the docs reflect the real API
surface.
In `@tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py`:
- Line 185: Replace the runtime assertions with explicit exceptions: instead of
using assert rope_type == "llama3", raise a ValueError (or TypeError if testing
type) with the same message; do the same for the other assert checks noted
(around the code using rope_type at the other two locations referenced). Locate
the assertion lines (the rope_type assertion in modeling_step3p7.py and the two
other assert usages around the referenced blocks) and change them to raise
ValueError("Step-3.7 only supports llama3 rope-scaling, got {rope_type!r}") (or
raise TypeError(...) if the check is for type), preserving the original message
text. Ensure tests/consumers handle these exceptions instead of relying on
AssertionError.
- Line 527: The class-level attribute _no_split_modules is defined as a mutable
list which can be accidentally mutated and triggers RUF012; change it to an
immutable tuple containing "Step3p7DecoderLayer" (i.e., replace the list literal
with a tuple literal) so the attribute is immutable and cannot be modified at
runtime; update any code that expects list-specific methods to handle a tuple or
convert to list locally if mutation is required, referencing the
_no_split_modules attribute and the Step3p7DecoderLayer identifier.
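The two `modeling_step3p7.py` suggestions above can be sketched as follows. This is a minimal illustration, not the PR's code: `check_rope_type` and `ExampleModel` are hypothetical names, and only the error message, `_no_split_modules`, and `Step3p7DecoderLayer` come from the review text. Whether every consumer of `_no_split_modules` accepts a tuple is something to verify in the real code base.

```python
def check_rope_type(rope_type: str) -> None:
    """Validate the rope-scaling type with an explicit exception.

    Unlike `assert`, this check is not stripped when Python runs with -O.
    (Hypothetical helper; the real check lives inline in the model code.)
    """
    if rope_type != "llama3":
        raise ValueError(
            f"Step-3.7 only supports llama3 rope-scaling, got {rope_type!r}"
        )


class ExampleModel:  # hypothetical stand-in for the real model class
    # A tuple instead of a list: the class attribute cannot be mutated
    # in place, which is what the RUF012 lint rule is guarding against.
    _no_split_modules = ("Step3p7DecoderLayer",)


if __name__ == "__main__":
    check_rope_type("llama3")  # passes silently
    try:
        check_rope_type("yarn")
    except ValueError as exc:
        print(exc)
```

Callers that previously caught `AssertionError` would need to catch `ValueError` instead.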
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 58c67cdd-9066-4aa2-acfe-66ba61582ede
📒 Files selected for processing (7)
- docs/source/models/supported-models.md
- examples/auto_deploy/cookbooks/step_3.7_flash_trtllm_cookbook.ipynb
- examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml
- examples/auto_deploy/model_registry/models.yaml
- tensorrt_llm/_torch/auto_deploy/models/custom/__init__.py
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py
- tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py
b96e0ad to 0d868fc
Onboard the text decoder of stepfun-ai/Step-3.7-Flash (model_type step3p7 / step3p5) to the AutoDeploy backend as a prefill-only custom model with tensor-parallel sharding-IR hints.

Architecture: 45 layers (3 dense + 42 MoE; 288 routed experts top-8 plus a dense shared expert), grouped-query attention with mixed full/sliding-window layer types and per-type head counts (64 full / 96 sliding, 8 KV heads), a head-wise attention gate, per-head QK RMSNorm, and per-layer-type partial RoPE (llama3 scaling on full-attention layers).

Uses AutoDeploy canonical ops (torch_rmsnorm, torch_attention, torch_rope_with_explicit_cos_sin, torch_moe) with load hooks for the (1+weight) RMSNorm convention and stacked->per-expert MoE weights.

Sharding: every shardable projection carries explicit sharding-IR hints (torch_linear_simple tp_mode/layer_type, auto_deploy.view tp_scaled_dim, all_reduce after rowwise projections and at the MoE routed+shared merge point); the head-wise gate is a per-head column shard, and the routed MoE uses torch_moe(layer_type="moe"). The registry config disables the legacy heuristic sharding and enables apply_sharding_hints.

Adds hierarchical equivalence unit tests, a runnable model-specific sharding-IR equivalence test (parametrized over tp-only/ep-only/tep/attn-dp), a bundled config class for standalone/offline-test construction, a self-contained model-registry config + entries, a deployment cookbook, and supported-models matrix rows.

Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
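The "(1+weight) RMSNorm convention" load hook mentioned in this commit message can be illustrated with a small sketch. It assumes the checkpoint stores a weight `w` whose effective scale is `1 + w`, and that the hook folds the `+1` into the stored weight so a standard RMSNorm op can be used unchanged. The function names, the `eps` value, and the key name are hypothetical, and plain Python lists stand in for tensors so the sketch runs without torch.

```python
import math


def rmsnorm(x, weight, eps=1e-6):
    """Standard RMSNorm: x / rms(x) * weight."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def rmsnorm_one_plus(x, raw_weight, eps=1e-6):
    """Checkpoint-style RMSNorm (assumed): x / rms(x) * (1 + raw_weight)."""
    inv_rms = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [v * inv_rms * (1.0 + w) for v, w in zip(x, raw_weight)]


def fold_one_plus_weight(state_dict, norm_keys):
    """Load-hook sketch: rewrite each norm weight w as (1 + w) in place."""
    for key in norm_keys:
        state_dict[key] = [1.0 + w for w in state_dict[key]]


if __name__ == "__main__":
    x = [1.0, -2.0, 3.0]
    raw = [0.1, 0.0, -0.2]
    sd = {"input_layernorm.weight": list(raw)}  # hypothetical key name
    fold_one_plus_weight(sd, ["input_layernorm.weight"])
    # After folding, the standard op reproduces the checkpoint convention.
    print(rmsnorm(x, sd["input_layernorm.weight"]))
    print(rmsnorm_one_plus(x, raw))
```

Folding at load time keeps the exported graph on the canonical op rather than carrying an extra add per norm at runtime.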
0d868fc to 0c929ae
[AGENT] Updated this PR: rebased onto
What changed since the previous revision
Reproduce: end-to-end (sharding-IR, 8 GPUs, bf16)
python examples/auto_deploy/build_and_run_ad.py \
    --model stepfun-ai/Step-3.7-Flash \
    --args.yaml-extra examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml
Full 45-layer run (coherent generation), complete prompt → output pairs (verbatim):
Unit tests
pytest tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py -v
Sharding-IR equivalence test (4 parallelism configs)
pytest tests/unittest/auto_deploy/multigpu/transformations/library/test_step3p7_sharding_ir.py -v
This compares sharded vs unsharded prefill of the same tiny model. Step-3.7-Flash's correct-sharding rel_rmse is ~0.05 (head-gate scaling + MoE
All three (unit tests, the 4-config sharding-IR equivalence test, and the full 45-layer E2E) pass on this exact commit.
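For readers unfamiliar with the `rel_rmse` figure quoted above, here is one common way such a metric is defined. The comment does not show the test's actual formula, so this definition (RMSE of the difference divided by the RMS of the reference) is an assumption, and flat Python lists stand in for the flattened prefill outputs.

```python
import math


def rel_rmse(test, ref):
    """Relative RMSE: sqrt(sum((t - r)^2) / sum(r^2)).

    Equivalent to RMSE(test - ref) / RMS(ref). This is an assumed
    definition; the PR's test may normalize differently.
    """
    if len(test) != len(ref) or not ref:
        raise ValueError("inputs must be non-empty and the same length")
    err = sum((t - r) ** 2 for t, r in zip(test, ref))
    norm = sum(r * r for r in ref)
    if norm == 0.0:
        raise ValueError("reference has zero norm")
    return math.sqrt(err / norm)


if __name__ == "__main__":
    unsharded = [0.5, -1.0, 2.0, 0.25]    # illustrative values only
    sharded = [0.51, -0.98, 2.03, 0.24]
    print(rel_rmse(sharded, unsharded))
```

A relative metric is useful here because the absolute scale of the logits varies between layers and models, while the ratio stays comparable.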
Split the stacked routed-expert block-FP8 dequant scales
(moe.{gate,up,down}_proj.weight_scale_inv) per-expert in the MoE load hook,
alongside the expert weights. This feeds AutoDeploy's
quantize_finegrained_fp8_moe / fuse_finegrained_fp8_moe transforms the
per-expert FP8 weight + weight_scale_inv they expect (the same path used for
the per-expert DeepSeek-V3 checkpoint), enabling stepfun-ai/Step-3.7-Flash-FP8.
The bf16 path is unaffected: bf16 checkpoints carry no weight_scale_inv, so the
added split branch is a no-op for them.
Signed-off-by: Balamurugan Marimuthu <246387390+bmarimuthu-nv@users.noreply.github.com>
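The per-expert split described in this commit message could look roughly like the sketch below. Only the stacked source keys (`moe.{gate,up,down}_proj.weight` and `weight_scale_inv`) come from the text; the function name, the `prefix` argument, and the target key layout `moe.experts.{e}.{proj}.{suffix}` are assumptions for illustration. Nested lists stand in for tensors so it runs without torch; a stacked tensor supports the same `len()` and `[e]` indexing along the expert dimension.

```python
def split_stacked_experts(state_dict, prefix, num_experts):
    """Split stacked [E, ...] MoE entries into per-expert entries, in place.

    Handles both the weights and, when present, the block-FP8 dequant
    scales. The target key layout is hypothetical.
    """
    for proj in ("gate_proj", "up_proj", "down_proj"):
        for suffix in ("weight", "weight_scale_inv"):
            stacked_key = f"{prefix}moe.{proj}.{suffix}"
            if stacked_key not in state_dict:
                # bf16 checkpoints carry no weight_scale_inv: nothing to do.
                continue
            stacked = state_dict.pop(stacked_key)
            if len(stacked) != num_experts:
                raise ValueError(
                    f"{stacked_key}: expected {num_experts} experts, "
                    f"got {len(stacked)}"
                )
            for e in range(num_experts):
                state_dict[f"{prefix}moe.experts.{e}.{proj}.{suffix}"] = stacked[e]


if __name__ == "__main__":
    sd = {
        "layers.3.moe.gate_proj.weight": [[1.0], [2.0]],
        "layers.3.moe.gate_proj.weight_scale_inv": [[0.5], [0.25]],
    }
    split_stacked_experts(sd, "layers.3.", num_experts=2)
    print(sorted(sd))
```

Treating the scale tensor with the same loop as the weight is what makes the bf16 path a no-op: the `weight_scale_inv` branch is simply skipped when the key is absent.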
[AGENT] Added FP8 support (
Change: the MoE load hook now splits the stacked routed-expert block-FP8 dequant scales (
No config change:
Reproduce: FP8 end-to-end (8 GPUs)
python examples/auto_deploy/build_and_run_ad.py \
    --model stepfun-ai/Step-3.7-Flash-FP8 \
    --args.yaml-extra examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml
(On Hopper, DeepGEMM's fused FP8-block-scale MoE kernel must JIT via NVCC
Transforms on the full 45-layer run (all 8 ranks):
Coherent generation (verbatim samples from the full-layer FP8 run):
All 10 prompts produce on-topic
/bot run --disable-fail-fast
PR_Github #51144 [ run ] triggered by Bot. Commit:
PR_Github #51144 [ run ] completed with state
Summary
Adds AutoDeploy support for the StepFun Step-3.7-Flash text decoder (model_type step3p7 wrapping text config step3p5). Step-3.7-Flash is a 198B-param sparse-MoE vision-language model; this PR onboards the text generation path as a prefill-only AutoDeploy custom model.
Architecture highlights captured
- … g_proj, sigmoid per head)
- … rope_theta
- (1+weight) RMSNorm convention; MTP layers (45-47) skipped
Implementation notes
- … torch_rmsnorm, torch_attention, torch_rope_with_explicit_cos_sin, torch_moe.
- … (1+weight) norm convention, split stacked moe.{gate,up,down}_proj.weight [E,...] into per-expert Linears, and expand the head-gate weight [num_heads, hidden] -> [num_heads*head_dim, hidden].
- … num_heads) is smaller than head_dim, which breaks AutoDeploy's head-aligned TP heuristic (num_heads // head_dim = 0). g_proj is therefore sized [num_heads*head_dim, hidden] (the per-head gate repeated across each head's head_dim slots, applied in the flattened attention space) so it shards exactly like q_proj. Numerically identical.
Known limitations
- Step-3.7-Flash-FP8 ships F8_E4M3 expert weights but no weight-scale tensors, so the block-FP8 transform cannot load it; this PR validates on the bf16 stepfun-ai/Step-3.7-Flash (identical architecture). NVFP4 variant untested.
- … torch_moe; large numerical guards); it is applied on the dense/shared-expert path.
Files
- tensorrt_llm/_torch/auto_deploy/models/custom/modeling_step3p7.py (new), __init__.py (register)
- tests/unittest/auto_deploy/singlegpu/models/test_step3p7_modeling.py (new)
- examples/auto_deploy/model_registry/configs/step-3.7-flash.yaml (new) + models.yaml entries (bf16/FP8/NVFP4)
- examples/auto_deploy/cookbooks/step_3.7_flash_trtllm_cookbook.ipynb (new)
- docs/source/models/supported-models.md (matrix rows)
Reproduce: end-to-end run
Run on 8x H100 (world_size 8, bf16). Full 45-layer run completed all transforms in ~185s, 10/10 prompts coherent. Complete prompt -> generated-text pairs (verbatim):
P0 — "How big is the universe?"
P1 — "In simple words and a single sentence, explain the concept of gravity:"
P2 — "How to fix slicing in golf?"
P3 — "Where is the capital of Iceland?"
P4 — "What are the three laws of thermodynamics?"
P5 — "Summarize the plot of Romeo and Juliet in two sentences:"
P6 — "Write a Python function that checks if a number is prime."
P7 — "Explain the difference between a compiler and an interpreter:"
P8 — "What causes the northern lights?"
P9 — "What are the health benefits of drinking green tea?"
A reduced 5-layer run (--args.model-kwargs.text_config.num_hidden_layers=5) was used first to validate build/export/sharding/KV-cache; it passes the sharding stage in ~96s (garbled output expected for the truncated model).
Unit tests
Result on this commit (CPU, fp32):
🤖 Generated with Claude Code