Scaled Bias Add support after CUBLAS GGEMM #2885
vthumbe1503 wants to merge 6 commits into NVIDIA:main
Conversation
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Greptile Summary

This PR adds optional per-row scale support to the grouped bias add that runs after the cuBLAS grouped GEMM.

Confidence Score: 5/5 — Safe to merge; all remaining findings are P2 style/improvement suggestions with no runtime risk for current callers. The core scaled-bias logic is correctly implemented: the `fmaf` argument order matches its documented semantics, the shared-memory cumsum is correctly initialized and synchronized, and the empty-tensor sentinel correctly disables scaling. The two P2 findings (dead pre-loop bias load; missing `tensor_offsets` guard) do not affect current callers, since bias GroupedTensors are always packed in the Python bindings.

Important file changed: `transformer_engine/common/gemm/cublaslt_grouped_gemm.cu` — dead pre-loop load and missing `tensor_offsets` guard in `nvte_grouped_bias_add`.
Sequence Diagram

```mermaid
sequenceDiagram
    participant Py as Python (gemm.py)
    participant Ext as C++ Extension (gemm.cpp)
    participant NVTE as nvte_grouped_gemm
    participant BiasKernel as nvte_grouped_bias_add
    Py->>Ext: general_grouped_gemm_for_grouped_tensor(A, B, out, bias, bias_scale)
    Ext->>Ext: prepare_grouped_gemm_config(alpha, beta, ...)
    Ext->>NVTE: nvte_grouped_gemm(A, B, C=D, D, alpha, beta, ...)
    NVTE-->>Ext: D = alpha * A @ B + beta * C
    alt bias is not None
        Ext->>BiasKernel: nvte_grouped_bias_add(D, bias, scale)
        Note over BiasKernel: Build shared cumsum for row-to-tensor map
        BiasKernel->>BiasKernel: grouped_bias_add_kernel UseScale=true/false
        Note over BiasKernel: D[row,col] += bias[col] * scale[row]
        BiasKernel-->>Ext: D updated in-place
    end
    Ext-->>Py: D (updated)
```
Reviews (3) — last reviewed commit: "[pre-commit.ci] auto fixes from pre-comm..."
```cpp
const size_t tensor_idx = blockIdx.y;
if (tensor_idx >= num_tensors) return;

const int64_t n = d_meta.last_dims ? d_meta.last_dims[0] : d_meta.uniform_last;
```
Hardcoded index `[0]` instead of `[tensor_idx]`

`d_meta.last_dims[0]` works only because the pre-launch `NVTE_CHECK(outputD->all_same_last_dim() ...)` enforces a uniform last dimension. The hardcoded index hides the per-tensor correctness at a glance — a future reader (or a refactor that relaxes the uniform check) would not immediately see why `[0]` is used instead of `[tensor_idx]`. A comment linking this to the uniformity invariant would make it self-documenting.
Suggested change:

```diff
-const int64_t n = d_meta.last_dims ? d_meta.last_dims[0] : d_meta.uniform_last;
+const int64_t n = d_meta.last_dims ? d_meta.last_dims[0]  // uniform across tensors (checked)
+                                   : d_meta.uniform_last;
```
```cpp
int64_t scale_row_offset = 0;
if constexpr (UseScale) {
  if (d_meta.first_dims) {
    for (size_t i = 0; i < tensor_idx; i++) {
      scale_row_offset += d_meta.first_dims[i];
    }
  } else {
    scale_row_offset = static_cast<int64_t>(tensor_idx) * d_meta.uniform_first;
  }
}
```
Redundant per-thread `scale_row_offset` loop

Every thread in the block (all 256 of them) independently computes `scale_row_offset` by iterating up to `tensor_idx` times over `d_meta.first_dims`. Since `tensor_idx == blockIdx.y`, all threads in a block produce the same value. For large `num_tensors`, moving this into shared memory (computed once by thread 0 and broadcast) would avoid the redundant iterations. The broadcast access pattern through L1 is benign for small `num_tensors`, but is worth noting for scalability.
```diff
 std::optional<SwizzledGroupedScales> maybe_swizzle_grouped_tensor(GroupedTensorWrapper &input,
                                                                  bool rowwise_usage,
                                                                  bool columnwise_usage) {
-  if (input.scaling_mode() != NVTE_MXFP8_1D_SCALING) {
+  if (input.scaling_mode() != NVTE_MXFP8_1D_SCALING &&
+      input.scaling_mode() != NVTE_NVFP4_1D_SCALING) {
     return std::nullopt;
   }
```
Unrelated FP4 swizzle change — should be documented

This guard extension (adding `NVTE_NVFP4_1D_SCALING`) is a separate fix that enables grouped-tensor scale swizzling for FP4 inputs; it is unrelated to the Scaled Bias Add feature described in the PR title. `nvte_swizzle_grouped_scaling_factors` does handle FP4 in swizzle.cu, so the change is mechanically correct, but it would be helpful to document the motivation in the PR description or add a comment here explaining why FP4 also needs this path.
for more information, see https://pre-commit.ci
Signed-off-by: Varun Thumbe <vthumbe@nvidia.com>
Description
Please include a brief summary of the changes, relevant motivation and context.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: