[Common] Enable NVFP4 2D block scaling in columnwise only#3027
Conversation
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
Greptile SummaryThis PR enables NVFP4 2D block-scaling in columnwise-only mode, supporting the HybridQuantizer use-case (MXFP8 forward + NVFP4 backward on weights). Two code paths are extended: the optimized Blackwell TMA kernel (
Confidence Score: 5/5Safe to merge — all new code paths are guarded by existing NVTE_CHECK runtime assertions and the logic correctly mirrors established patterns in both kernel implementations. The routing, kernel dispatch, and amax-reduction logic are all consistent with their existing counterparts. block_amax_matrix and amax_smem are populated/consumed in a well-synchronized manner in both the optimized and fallback paths. The one structural concern (the 1D kernel being compiled into dead RETURN_ROWWISE=false branches) is a compile-time code-size issue, not a runtime defect, and is fully protected by the NVTE_CHECK and the early BF16 return above the dispatch. transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh — the new RETURN_ROWWISE switch doubles the 2D kernel compile matrix; worth checking build-time impact. Important Files Changed
Flowchart%%{init: {'theme': 'neutral'}}%%
flowchart TD
A[nvte_quantize_v2] --> B{NVFP4 dtype?}
B -->|Yes| C{row_scaled_nvfp4?}
C -->|No| D{use_optimized_kernel?\nbf16 + rows%32=0 + cols%32=0\n+ has_data OR columnwise_only+2D}
C -->|Yes| E[compute_rowwise_amax only]
D -->|Yes| F{nvfp4_2d_quantization?}
D -->|No| G[quantize_transpose_vector_blockwise_fp4\nfallback path]
F -->|Yes 2D| H[quantize_transpose<use_2d=true>]
F -->|No 1D| I[quantize_transpose<use_2d=false>]
I --> J[quantize_transpose_tuned_1D\nearly return - always rowwise]
H --> K{return_rowwise?}
K -->|true| L[2D kernel\nRETURN_ROWWISE=true\nRETURN_TRANSPOSE=?]
K -->|false NEW| M[2D kernel\nRETURN_ROWWISE=false\nRETURN_TRANSPOSE=true]
G --> N{kReturnIdentity?}
N -->|true| O[Step 2: rowwise quant+store\nStep 2 2D amax → amax_smem\nStep 3: columnwise output]
N -->|false NEW| P[Step 2b: 2D amax-only\npopulates amax_smem\nStep 3: columnwise output]
Reviews (6): Last reviewed commit: "Enable rectangular shapes in tests" | Re-trigger Greptile |
| } | ||
| } | ||
|
|
||
| // Step 2.5: 2D-amax-only pass for columnwise-only mode. |
There was a problem hiding this comment.
Step label collision with existing substep
The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.
|
This is just the fallback kernel being changed. Does the main kernel already support this? |
Signed-off-by: Evgeny <etsykunov@nvidia.com>
for more information, see https://pre-commit.ci
Thanks for the catch. The main kernel does not support if as well. Enabled in f7953dd |
| NVTE_CHECK(output->scale_inv.dptr != nullptr, "Scaling tensor must be allocated"); | ||
| NVTE_CHECK(return_rowwise || return_transpose, | ||
| "At least one of rowwise/columnwise NVFP4 output must be allocated."); | ||
| NVTE_CHECK(return_rowwise || use_2d_quantization, |
There was a problem hiding this comment.
This is a bit confusing to read, especially if the kernel is extended in the future to support additional quantization schemes. It would be better to restrict the supported combinations explicitly, e.g.
NVTE_CHECK((return_transpose && use_2d_quantization) || (return_rowwise && !use_2d_quantization),
|
|
||
| // Step 2.5: 2D-amax-only pass for columnwise-only mode. | ||
| // When only the transposed output is requested but 2D block scaling is enabled, the columnwise | ||
| // reads in Step 3 (line ~660 below) still need amax_smem populated. Re-run the load + local-amax |
There was a problem hiding this comment.
The comment refers to line ~660, which is now line 637. Let’s maybe remove the line reference entirely to avoid confusion.
Oleg-Goncharov
left a comment
There was a problem hiding this comment.
We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library
Signed-off-by: Evgeny <etsykunov@nvidia.com>
Sure, added a test, please take a look if this is enough. |
|
/te-ci |
|
|
||
| // Columnwise-only 2D NVFP4 must produce the same columnwise data/scales as the columnwise half | ||
| // of (rowwise + columnwise) 2D. This exercises the RETURN_ROWWISE=false path of the optimized | ||
| // kernel quantize_transpose_nvfp4_2D_kernel (and its dispatch gate) added in this PR. |
There was a problem hiding this comment.
Please rewrite this comment to be more succint and not reference "this PR", as that remark stops making sense once the PR is merged.
| // Columnwise-only is supported on the optimized path only for 2D scaling; rowwise-only and | ||
| // both-directions keep their existing routing. Columnwise-only 1D and non-bf16 fall back to | ||
| // quantize_transpose_vector_blockwise_fp4. | ||
| bool use_optimized_kernel = | ||
| (dtype == DType::kBFloat16) && (rows % 32 == 0) && (cols % 32 == 0) && | ||
| (output_tensor->has_data() || | ||
| (output_tensor->has_columnwise_data() && quant_config_cpp.nvfp4_2d_quantization)); |
There was a problem hiding this comment.
That is a little sad, but I think this is fine as a stopgap before we have better more flexible kernels.
There was a problem hiding this comment.
Yes, this should be relaxed in the future
Signed-off-by: Evgeny <etsykunov@nvidia.com>
|
/te-ci |
Signed-off-by: Evgeny <etsykunov@nvidia.com>
|
/te-ci |
Oleg-Goncharov
left a comment
There was a problem hiding this comment.
LGTM from a functional point of view. There may be room for performance optimization, but we can revisit that later if needed.
Description
Enabling 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 fwd + NVFP4 bwd on W.
Fixes # (issue)
Type of change
Changes
Please list the changes introduced in this PR:
Checklist: