[Common] Enable NVFP4 2D block scaling in columnwise only by negvet · Pull Request #3027 · NVIDIA/TransformerEngine

negvet · 2026-05-21T17:39:35Z

Description

Enabling 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 fwd + NVFP4 bwd on W.

Signed-off-by: Evgeny <etsykunov@nvidia.com>

greptile-apps · 2026-05-21T17:46:37Z

Greptile Summary

This PR enables NVFP4 2D block-scaling in columnwise-only mode, supporting the HybridQuantizer use-case (MXFP8 forward + NVFP4 backward on weights). Two code paths are extended: the optimized Blackwell TMA kernel (quantize_transpose_nvfp4_2D_kernel) gains a RETURN_ROWWISE template parameter that gates all rowwise output writes, and the fallback scalar kernel (block_scaled_1d_cast_transpose) gains a new "Step 2b" block that replicates the 2D warp/smem amax reduction without the rowwise quantize-and-store, correctly populating amax_smem for Step 3.

Routing change (quantize.cuh): use_optimized_kernel now also activates for BF16 aligned columnwise-only tensors when nvfp4_2d_quantization=true; other combinations (1D, non-BF16) fall back to quantize_transpose_vector_blockwise_fp4.
Optimized 2D kernel (quantize_transpose_nvfp4.cuh): tensor_map_output and scales_ptr left uninitialized/null for the columnwise-only case; all rowwise stores guarded by if constexpr (RETURN_ROWWISE), which is verified safe because block_amax_matrix is populated unconditionally and consumed by the existing COLWISE section.
Fallback kernel (quantize_transpose_vector_blockwise_fp4.cu): Step 2b mirrors the 2D amax reduction from Step 2 with identical __syncthreads() barriers, ensuring amax_smem is fully written before Step 3 reads it.
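As a host-side illustration of the gating described above, the following C++17 sketch computes a per-tile amax unconditionally and guards only the rowwise store with `if constexpr`. It is not the TransformerEngine kernel: `quantize_2d`, `QuantResult`, `scale_by_amax`, and the 16x16 tile edge `kBlock` are hypothetical names and values chosen for this example, and the FP4 cast is replaced by a plain division by the tile amax.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; the real kernels' tile sizes are not shown in this thread.
constexpr std::size_t kBlock = 16;

struct QuantResult {
  std::vector<float> rowwise;     // rows x cols; left empty when RETURN_ROWWISE is false
  std::vector<float> columnwise;  // cols x rows (transposed); always written
  std::vector<float> block_amax;  // one value per kBlock x kBlock tile
};

// Stand-in for a real FP4 cast: divide by the tile amax so values land in [-1, 1].
inline float scale_by_amax(float x, float amax) { return amax > 0.0f ? x / amax : 0.0f; }

template <bool RETURN_ROWWISE>
QuantResult quantize_2d(const std::vector<float>& in, std::size_t rows, std::size_t cols) {
  assert(rows % kBlock == 0 && cols % kBlock == 0 && in.size() == rows * cols);
  const std::size_t block_cols = cols / kBlock;

  QuantResult out;
  out.block_amax.assign((rows / kBlock) * block_cols, 0.0f);
  out.columnwise.assign(rows * cols, 0.0f);
  if constexpr (RETURN_ROWWISE) out.rowwise.assign(rows * cols, 0.0f);

  // Pass 1: the 2D amax is computed unconditionally, because the columnwise
  // output below consumes it even when no rowwise output is requested.
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      float& amax = out.block_amax[(r / kBlock) * block_cols + c / kBlock];
      amax = std::max(amax, std::fabs(in[r * cols + c]));
    }
  }

  // Pass 2: scale and store. Only the rowwise store is gated at compile time.
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      const float amax = out.block_amax[(r / kBlock) * block_cols + c / kBlock];
      const float q = scale_by_amax(in[r * cols + c], amax);
      if constexpr (RETURN_ROWWISE) out.rowwise[r * cols + c] = q;
      out.columnwise[c * rows + r] = q;
    }
  }
  return out;
}
```

Calling `quantize_2d<false>` and `quantize_2d<true>` on the same input should yield identical `columnwise` and `block_amax` contents, which is the property the new tests check for the real kernels.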

Confidence Score: 5/5

Safe to merge — all new code paths are guarded by existing NVTE_CHECK runtime assertions and the logic correctly mirrors established patterns in both kernel implementations.

The routing, kernel dispatch, and amax-reduction logic are all consistent with their existing counterparts. block_amax_matrix and amax_smem are populated/consumed in a well-synchronized manner in both the optimized and fallback paths. The one structural concern (the 1D kernel being compiled into dead RETURN_ROWWISE=false branches) is a compile-time code-size issue, not a runtime defect, and is fully protected by the NVTE_CHECK and the early BF16 return above the dispatch.

transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh — the new RETURN_ROWWISE switch doubles the 2D kernel compile matrix; worth checking build-time impact.

Important Files Changed

Filename	Overview
transformer_engine/common/cast/dispatch/quantize.cuh	Routing logic updated so that columnwise-only + 2D quantization is allowed to use the optimized kernel for BF16 aligned shapes, while all other columnwise-only 1D cases remain on the fallback path.
transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh	Adds RETURN_ROWWISE template parameter to the 2D kernel and gates all rowwise output writes behind it; adds return_rowwise flag to dispatch; expands compile-time switch matrix from 8 to 16 2D-kernel instantiations, introducing dead-code paths that are guarded at runtime by NVTE_CHECK.
transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu	Removes the early-return guard for 2D columnwise-only mode and adds Step 2b: an amax-only pass that replicates the 2D warp/smem reduction from Step 2 without the rowwise quantize/store, correctly populating amax_smem for Step 3.
tests/cpp/operator/test_cast_nvfp4_transpose.cu	Adds CastNVFP4ColumnwiseOnly2DTestSuite that exercises the BF16 aligned path through the optimized 2D kernel; covers rectangular multiples-of-128 shapes.
tests/pytorch/nvfp4/test_nvfp4_quantize_exact.py	Adds test_nvfp4_2d_columnwise_only_matches_both_directions covering both the optimized kernel path (BF16 aligned) and the fallback path (fp32 / non-32-aligned) with bitwise comparison of columnwise data, scales, and amax.
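The growth from 8 to 16 instantiations follows from how a runtime boolean is turned into a template argument: each boolean switch instantiates both branches, so every added switch doubles the matrix. The C++17 sketch below shows the pattern with four nested switches. The helper `dispatch_bool`, the stand-in `kernel_id`, and the flag names `a`, `b`, `c` are hypothetical; the thread does not name the three pre-existing switches, and TransformerEngine uses its own dispatch macros rather than this helper.

```cpp
#include <cassert>
#include <type_traits>

// Turns a runtime bool into a compile-time constant by instantiating both branches.
template <typename F>
void dispatch_bool(bool value, F&& f) {
  if (value) {
    f(std::true_type{});
  } else {
    f(std::false_type{});
  }
}

// Hypothetical stand-in for a kernel launch: returns a distinct id per instantiation.
template <bool A, bool B, bool C, bool RETURN_ROWWISE>
int kernel_id() {
  return (int(A) << 3) | (int(B) << 2) | (int(C) << 1) | int(RETURN_ROWWISE);
}

// Four nested boolean switches give 2^4 = 16 instantiations of kernel_id.
int launch(bool a, bool b, bool c, bool return_rowwise) {
  int id = -1;
  dispatch_bool(a, [&](auto A) {
    dispatch_bool(b, [&](auto B) {
      dispatch_bool(c, [&](auto C) {
        dispatch_bool(return_rowwise, [&](auto R) {
          id = kernel_id<decltype(A)::value, decltype(B)::value,
                         decltype(C)::value, decltype(R)::value>();
        });
      });
    });
  });
  assert(id >= 0);
  return id;
}
```

All 16 instantiations are compiled even if a runtime check rejects some flag combinations before launch, which is why the summary treats this as a code-size and build-time concern rather than a runtime defect.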

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_quantize_v2] --> B{NVFP4 dtype?}
    B -->|Yes| C{row_scaled_nvfp4?}
    C -->|No| D{use_optimized_kernel?\nbf16 + rows%32=0 + cols%32=0\n+ has_data OR columnwise_only+2D}
    C -->|Yes| E[compute_rowwise_amax only]
    D -->|Yes| F{nvfp4_2d_quantization?}
    D -->|No| G[quantize_transpose_vector_blockwise_fp4\nfallback path]
    F -->|Yes 2D| H[quantize_transpose&lt;use_2d=true&gt;]
    F -->|No 1D| I[quantize_transpose&lt;use_2d=false&gt;]
    I --> J[quantize_transpose_tuned_1D\nearly return - always rowwise]
    H --> K{return_rowwise?}
    K -->|true| L[2D kernel\nRETURN_ROWWISE=true\nRETURN_TRANSPOSE=?]
    K -->|false NEW| M[2D kernel\nRETURN_ROWWISE=false\nRETURN_TRANSPOSE=true]
    G --> N{kReturnIdentity?}
    N -->|true| O[Step 2: rowwise quant+store\nStep 2 2D amax → amax_smem\nStep 3: columnwise output]
    N -->|false NEW| P[Step 2b: 2D amax-only\npopulates amax_smem\nStep 3: columnwise output]

greptile-apps · 2026-05-21T17:46:44Z

    }
  }

+  // Step 2.5: 2D-amax-only pass for columnwise-only mode.


Step label collision with existing substep

The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.

ptrendx · 2026-05-21T22:24:13Z

This is just the fallback kernel being changed. Does the main kernel already support this?

negvet · 2026-06-01T11:36:15Z

This is just the fallback kernel being changed. Does the main kernel already support this?

Thanks for the catch. The main kernel does not support it either. Enabled in f7953dd

Oleg-Goncharov · 2026-06-01T12:26:05Z

-  NVTE_CHECK(output->scale_inv.dptr != nullptr, "Scaling tensor must be allocated");
+  NVTE_CHECK(return_rowwise || return_transpose,
+             "At least one of rowwise/columnwise NVFP4 output must be allocated.");
+  NVTE_CHECK(return_rowwise || use_2d_quantization,


This is a bit confusing to read, especially if the kernel is extended in the future to support additional quantization schemes. It would be better to restrict the supported combinations explicitly, e.g.
NVTE_CHECK((return_transpose && use_2d_quantization) || (return_rowwise && !use_2d_quantization),

Right, fixed in 783b45b
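To make the accepted flag combinations easier to see, here is a small sketch that folds the two conditions quoted in the diff above into one predicate. The function name `passes_quoted_checks` is hypothetical, and the sketch reflects only the conditions visible in the quoted hunk; the form that landed in 783b45b is not shown in this thread.

```cpp
// Mirrors the two conditions quoted in the diff above. This is an illustration,
// not the check as merged.
constexpr bool passes_quoted_checks(bool return_rowwise, bool return_transpose,
                                    bool use_2d_quantization) {
  // First check: at least one of the rowwise/columnwise outputs is requested.
  const bool some_output = return_rowwise || return_transpose;
  // Second check: omitting the rowwise output is only allowed with 2D quantization.
  const bool rowwise_or_2d = return_rowwise || use_2d_quantization;
  return some_output && rowwise_or_2d;
}

// Columnwise-only with 2D scaling is accepted; columnwise-only with 1D is rejected.
static_assert(passes_quoted_checks(false, true, true), "columnwise-only 2D");
static_assert(!passes_quoted_checks(false, true, false), "columnwise-only 1D");
```

Reading the condition as a single predicate over three booleans makes it straightforward to enumerate all eight combinations, which is one way to confirm that a rewritten, more explicit check accepts exactly the intended set.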

Oleg-Goncharov · 2026-06-01T12:29:36Z


+  // Step 2.5: 2D-amax-only pass for columnwise-only mode.
+  // When only the transposed output is requested but 2D block scaling is enabled, the columnwise
+  // reads in Step 3 (line ~660 below) still need amax_smem populated. Re-run the load + local-amax


The comment refers to line ~660, which is now line 637. Let’s maybe remove the line reference entirely to avoid confusion.

Fixed in 783b45b

Oleg-Goncharov

We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library.

negvet · 2026-06-01T17:36:51Z

We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library

Sure, added a test. Please take a look and let me know if this is enough.

negvet · 2026-06-02T11:43:20Z

/te-ci

ptrendx · 2026-06-05T22:43:46Z


+// Columnwise-only 2D NVFP4 must produce the same columnwise data/scales as the columnwise half
+// of (rowwise + columnwise) 2D. This exercises the RETURN_ROWWISE=false path of the optimized
+// kernel quantize_transpose_nvfp4_2D_kernel (and its dispatch gate) added in this PR.


Please rewrite this comment to be more succinct and not reference "this PR", as that remark stops making sense once the PR is merged.

Fixed in 6cb4f8a

ptrendx · 2026-06-05T23:32:30Z

+      // Columnwise-only is supported on the optimized path only for 2D scaling; rowwise-only and
+      // both-directions keep their existing routing. Columnwise-only 1D and non-bf16 fall back to
+      // quantize_transpose_vector_blockwise_fp4.
+      bool use_optimized_kernel =
+          (dtype == DType::kBFloat16) && (rows % 32 == 0) && (cols % 32 == 0) &&
+          (output_tensor->has_data() ||
+           (output_tensor->has_columnwise_data() && quant_config_cpp.nvfp4_2d_quantization));


That is a little sad, but I think this is fine as a stopgap until we have better, more flexible kernels.

Yes, this should be relaxed in the future
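The routing condition quoted above can be restated as a standalone predicate, which makes the fallback cases easy to check by hand. This is a sketch under assumptions: the `DType` subset and the `OutputLayout` struct are hypothetical stand-ins for the library's dtype enum and for the `has_data()` / `has_columnwise_data()` queries on the output tensor.

```cpp
#include <cstddef>

// Hypothetical subset of the library's dtype enum.
enum class DType { kFloat32, kFloat16, kBFloat16 };

// Hypothetical stand-in for the output tensor's allocation queries.
struct OutputLayout {
  bool has_data;             // rowwise output allocated
  bool has_columnwise_data;  // columnwise output allocated
};

// Same boolean structure as the quoted hunk: BF16 input, both dimensions
// multiples of 32, and either a rowwise output or columnwise output with 2D scaling.
constexpr bool use_optimized_kernel(DType dtype, std::size_t rows, std::size_t cols,
                                    OutputLayout out, bool nvfp4_2d_quantization) {
  return dtype == DType::kBFloat16 && rows % 32 == 0 && cols % 32 == 0 &&
         (out.has_data || (out.has_columnwise_data && nvfp4_2d_quantization));
}
```

Under this predicate, a BF16 128x256 columnwise-only tensor takes the optimized path only when 2D quantization is enabled; the same tensor with 1D scaling, an FP32 input, or a row count such as 100 would be routed to `quantize_transpose_vector_blockwise_fp4`.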

negvet · 2026-06-09T07:38:27Z

/te-ci

negvet · 2026-06-10T18:08:39Z

/te-ci

Oleg-Goncharov

LGTM from a functional point of view. There may be room for performance optimization, but we can revisit that later if needed.

negvet and others added 2 commits May 21, 2026 17:35

Enable colwise only 2d nvfp4

56780d1

Signed-off-by: Evgeny <etsykunov@nvidia.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

61a2387

for more information, see https://pre-commit.ci

greptile-apps Bot reviewed May 21, 2026

Enable colwise only for the main 2d kernel

f7953dd

Signed-off-by: Evgeny <etsykunov@nvidia.com>

negvet requested a review from Oleg-Goncharov as a code owner June 1, 2026 11:34

[pre-commit.ci] auto fixes from pre-commit.com hooks

c069cea

for more information, see https://pre-commit.ci

Oleg-Goncharov reviewed Jun 1, 2026

Comment thread transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh

Oleg-Goncharov reviewed Jun 1, 2026

Resolve comments

783b45b

Signed-off-by: Evgeny <etsykunov@nvidia.com>

negvet requested a review from Oleg-Goncharov June 2, 2026 11:54

negvet mentioned this pull request Jun 3, 2026

[Pytorch][Common] Hybrid quantization #2817

Open

13 tasks

ptrendx reviewed Jun 5, 2026

Fix test comment

6cb4f8a

Signed-off-by: Evgeny <etsykunov@nvidia.com>

negvet requested a review from ptrendx June 8, 2026 12:21

Merge branch 'main' into nvfp4_2d_colwise_only

e410cf6

Enable rectangular shapes in tests

0906385

Signed-off-by: Evgeny <etsykunov@nvidia.com>

Oleg-Goncharov added enhancement New feature or request fp4 labels Jun 12, 2026

Oleg-Goncharov approved these changes Jun 12, 2026

Oleg-Goncharov merged commit 318dd94 into NVIDIA:main Jun 12, 2026
38 of 43 checks passed
