[Common] Add dense router output for fused router #3129
Signed-off-by: Harry Zhou <hhanyu@nvidia.com>
Greptile Summary
This PR adds an optional dense
Confidence Score: 5/5
Safe to merge; the new dense-index path is well-isolated behind explicit guards and does not alter behavior for existing sparse-path callers. All new kernel paths correctly mirror existing routing-map counterparts across all three score functions and both pre/post-softmax modes. Guards prevent invalid combinations, shape preservation for multi-dimensional logits is tested, and autograd plumbing is correct. No files require special attention.
Flowchart
%%{init: {'theme': 'neutral'}}%%
flowchart TD
A["fused_topk_with_score_function(logits, topk, ...)"]
A --> B{topk_indices provided?}
B -- No --> C["existing sparse path: fused_topk_with_score_function_forward_v2"]
C --> D["Allocate routing_map (BYTEMAP or BITMAP)"]
D --> E["Returns probs + routing_map"]
B -- Yes --> F["check_dense_topk_indices (shape/dtype/device)"]
F --> G["fused_topk_with_score_function_forward_with_indices"]
G --> H{topk >= radix threshold?}
H -- No --> I["fused_topk_forward_simple_kernel (Naive, IndexType, routing_map=nullptr)"]
H -- Yes --> J["fused_topk_with_score_function_forward_kernel (Radix, ScoreFunc, IndexType, routing_map=nullptr)"]
I --> K["Write topk_indices_output; no routing_map allocation"]
J --> K
K --> L["Returns probs + topk_indices (aliased)"]
E --> M["Backward via routing_map path"]
L --> N["Backward via fused_topk_backward_selected_indices_kernel"]
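The branching in the flowchart can be read as the decision sketch below. It is a plain-Python illustration only: `select_router_path`, `RADIX_TOPK_THRESHOLD`, and the threshold value are hypothetical names chosen here, since this page does not state the actual threshold, and the real dispatch lives in the C++/CUDA sources of the PR. No kernels are launched; the function only returns labels that mirror the boxes in the diagram.

```python
from typing import Optional, Sequence

# Hypothetical placeholder: the real radix threshold is not given on this page.
RADIX_TOPK_THRESHOLD = 16


def select_router_path(
    topk: int,
    topk_indices: Optional[Sequence] = None,
    radix_threshold: int = RADIX_TOPK_THRESHOLD,
) -> dict:
    """Return labels describing which branch of the flowchart would be taken."""
    if topk_indices is None:
        # Existing sparse path: a routing_map (BYTEMAP or BITMAP) is allocated.
        return {
            "forward": "fused_topk_with_score_function_forward_v2",
            "kernel": None,
            "allocates_routing_map": True,
            "backward": "routing_map",
        }
    # Dense path: indices are written into the caller-provided buffer.
    kernel = "radix" if topk >= radix_threshold else "naive"
    return {
        "forward": "fused_topk_with_score_function_forward_with_indices",
        "kernel": kernel,
        "allocates_routing_map": False,
        "backward": "selected_indices",
    }


sparse = select_router_path(topk=2)
dense_small = select_router_path(topk=2, topk_indices=[[0, 0]])
dense_large = select_router_path(topk=32, topk_indices=[[0] * 32])
```

Under these assumptions, `sparse` reports a routing-map allocation, while `dense_small` and `dense_large` differ only in the kernel label.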
Reviews (3): Last reviewed commit: "[PyTorch] Preserve router leading dimens..."
The fused router forward kernels currently produce a boolean or bitmap sparse routing map format, which may need an additional conversion to `topk_indices` format (in the shape of `[*leading_dims, top_k]`) to be passed to the dispatcher. For example, NCCL EP accepts `topk_indices` as the routing map. To avoid needing an extra kernel for routing map conversion, the fused router could directly write into that format.

For NCCL EP, the dense `topk_indices` row is consumed as an order-insensitive selected-expert set. The dense output therefore does not promise score-sorted or expert-sorted order; it preserves the selected experts produced by the chosen top-k kernel path.

Summary
- `topk_indices` output path to fused router top-k.
- `[*leading_dims, topk]` index buffer.
- `int16`, `int32`, and `int64` dense index buffers.
- `int16` support for TE CUDA graph weak-ref tensors.

Testing
NVTE_BUILD_THREADS_PER_JOB=4 NVTE_CUDA_ARCHS="90;100;103a;120" NVTE_USE_CCACHE=1 pip install --no-build-isolation -e .[test] --verbose
python -m pytest -q tests/pytorch/test_fused_router.py
3203 passed, 444 skipped, 3 warnings in 44.17s
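
For context on the motivation above, the extra conversion that the dense output path avoids can be sketched in plain Python. This is an illustration, not Transformer Engine code: `routing_map_to_topk_indices` and `same_expert_sets` are hypothetical helper names, and nested lists stand in for GPU tensors. The second helper shows how a consumer that treats each row as an order-insensitive set might compare two index buffers, which is why no particular ordering within a row needs to be guaranteed.

```python
from typing import List


def routing_map_to_topk_indices(
    routing_map: List[List[bool]], topk: int
) -> List[List[int]]:
    """Convert a boolean [num_tokens, num_experts] map to [num_tokens, topk] indices."""
    indices = []
    for row in routing_map:
        # Collect the expert ids whose flag is set for this token.
        selected = [expert for expert, chosen in enumerate(row) if chosen]
        if len(selected) != topk:
            raise ValueError(f"expected {topk} selected experts, got {len(selected)}")
        indices.append(selected)
    return indices


def same_expert_sets(a: List[List[int]], b: List[List[int]]) -> bool:
    """Compare two dense index buffers row by row, ignoring order within a row."""
    return len(a) == len(b) and all(sorted(x) == sorted(y) for x, y in zip(a, b))


# Two tokens, four experts, top-2 routing.
routing_map = [
    [True, False, True, False],
    [False, True, False, True],
]
dense = routing_map_to_topk_indices(routing_map, topk=2)
permuted = [[2, 0], [3, 1]]  # same experts per row, different order
```

Here `same_expert_sets(dense, permuted)` holds even though the rows are ordered differently, which matches the order-insensitive contract described for the dense output.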