
[Common] Enable NVFP4 2D block scaling in columnwise only #3027

Merged
Oleg-Goncharov merged 8 commits into NVIDIA:main from negvet:nvfp4_2d_colwise_only
Jun 12, 2026

Conversation

@negvet

@negvet negvet commented May 21, 2026

Collaborator

Description

Enabling 2D NVFP4 quantization in columnwise-only mode.
Needed by HybridQuantizer (PR #2817) for MXFP8 fwd + NVFP4 bwd on W.

Fixes # (issue)

Type of change

  • Documentation change (change only to the documentation, either a fix or a new content)
  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to not work as expected)
  • Infra/Build change
  • Code refactoring

Changes

Please list the changes introduced in this PR:

  • Change A
  • Change B

Checklist:

  • I have read and followed the contributing guidelines
  • The functionality is complete
  • I have commented my code, particularly in hard-to-understand areas
  • I have made corresponding changes to the documentation
  • My changes generate no new warnings
  • I have added tests that prove my fix is effective or that my feature works
  • New and existing unit tests pass locally with my changes

negvet and others added 2 commits May 21, 2026 17:35
Signed-off-by: Evgeny <etsykunov@nvidia.com>
@greptile-apps

greptile-apps Bot commented May 21, 2026

Contributor

Greptile Summary

This PR enables NVFP4 2D block-scaling in columnwise-only mode, supporting the HybridQuantizer use-case (MXFP8 forward + NVFP4 backward on weights). Two code paths are extended: the optimized Blackwell TMA kernel (quantize_transpose_nvfp4_2D_kernel) gains a RETURN_ROWWISE template parameter that gates all rowwise output writes, and the fallback scalar kernel (block_scaled_1d_cast_transpose) gains a new "Step 2b" block that replicates the 2D warp/smem amax reduction without the rowwise quantize-and-store, correctly populating amax_smem for Step 3.

  • Routing change (quantize.cuh): use_optimized_kernel now also activates for BF16 aligned columnwise-only tensors when nvfp4_2d_quantization=true; other combinations (1D, non-BF16) fall back to quantize_transpose_vector_blockwise_fp4.
  • Optimized 2D kernel (quantize_transpose_nvfp4.cuh): tensor_map_output and scales_ptr left uninitialized/null for the columnwise-only case; all rowwise stores guarded by if constexpr (RETURN_ROWWISE), which is verified safe because block_amax_matrix is populated unconditionally and consumed by the existing COLWISE section.
  • Fallback kernel (quantize_transpose_vector_blockwise_fp4.cu): Step 2b mirrors the 2D amax reduction from Step 2 with identical __syncthreads() barriers, ensuring amax_smem is fully written before Step 3 reads it.
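
The gating pattern described in these bullets can be illustrated with a small host-side sketch. This is not the kernel code from the PR: `quantize_2d_sketch`, `Outputs`, and `kBlock` are hypothetical names, the tile size is arbitrary, and dividing by the tile amax is a stand-in for the real scale computation and FP4 encoding. It only shows the structure, where the per-tile amax is always computed and only the rowwise store sits behind `if constexpr (RETURN_ROWWISE)`.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical tile edge; the real kernels use their own block geometry.
constexpr std::size_t kBlock = 4;

struct Outputs {
  std::vector<float> rowwise;     // scaled values, original layout (may stay empty)
  std::vector<float> columnwise;  // scaled values, transposed layout
  std::vector<float> block_amax;  // one amax per kBlock x kBlock tile
};

// Toy stand-in for a 2D block-scaled quantize + transpose.
template <bool RETURN_ROWWISE>
Outputs quantize_2d_sketch(const std::vector<float>& in, std::size_t rows, std::size_t cols) {
  assert(rows % kBlock == 0 && cols % kBlock == 0 && in.size() == rows * cols);
  const std::size_t tiles_c = cols / kBlock;

  Outputs out;
  out.block_amax.assign((rows / kBlock) * tiles_c, 0.0f);
  out.columnwise.resize(rows * cols);
  if constexpr (RETURN_ROWWISE) out.rowwise.resize(rows * cols);

  // Pass 1: per-tile amax, computed regardless of RETURN_ROWWISE.
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      float& amax = out.block_amax[(r / kBlock) * tiles_c + c / kBlock];
      amax = std::max(amax, std::fabs(in[r * cols + c]));
    }
  }

  // Pass 2: scale by the tile amax; only the rowwise store is gated.
  for (std::size_t r = 0; r < rows; ++r) {
    for (std::size_t c = 0; c < cols; ++c) {
      const float amax = out.block_amax[(r / kBlock) * tiles_c + c / kBlock];
      const float scaled = amax > 0.0f ? in[r * cols + c] / amax : 0.0f;
      if constexpr (RETURN_ROWWISE) out.rowwise[r * cols + c] = scaled;
      out.columnwise[c * rows + r] = scaled;  // transposed store
    }
  }
  return out;
}
```

In this sketch, `quantize_2d_sketch<false>` and `quantize_2d_sketch<true>` produce identical `columnwise` and `block_amax` buffers, which is the same property the new tests check bitwise on the real kernels.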

Confidence Score: 5/5

Safe to merge — all new code paths are guarded by existing NVTE_CHECK runtime assertions and the logic correctly mirrors established patterns in both kernel implementations.

The routing, kernel dispatch, and amax-reduction logic are all consistent with their existing counterparts. block_amax_matrix and amax_smem are populated/consumed in a well-synchronized manner in both the optimized and fallback paths. The one structural concern (the 1D kernel being compiled into dead RETURN_ROWWISE=false branches) is a compile-time code-size issue, not a runtime defect, and is fully protected by the NVTE_CHECK and the early BF16 return above the dispatch.

transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh — the new RETURN_ROWWISE switch doubles the 2D kernel compile matrix; worth checking build-time impact.

Important Files Changed

| Filename | Overview |
| --- | --- |
| transformer_engine/common/cast/dispatch/quantize.cuh | Routing logic updated so that columnwise-only + 2D quantization is allowed to use the optimized kernel for BF16 aligned shapes, while all other columnwise-only 1D cases remain on the fallback path. |
| transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh | Adds RETURN_ROWWISE template parameter to the 2D kernel and gates all rowwise output writes behind it; adds return_rowwise flag to dispatch; expands compile-time switch matrix from 8 to 16 2D-kernel instantiations, introducing dead-code paths that are guarded at runtime by NVTE_CHECK. |
| transformer_engine/common/transpose/quantize_transpose_vector_blockwise_fp4.cu | Removes the early-return guard for 2D columnwise-only mode and adds Step 2b: an amax-only pass that replicates the 2D warp/smem reduction from Step 2 without the rowwise quantize/store, correctly populating amax_smem for Step 3. |
| tests/cpp/operator/test_cast_nvfp4_transpose.cu | Adds CastNVFP4ColumnwiseOnly2DTestSuite that exercises the BF16 aligned path through the optimized 2D kernel; covers rectangular multiples-of-128 shapes. |
| tests/pytorch/nvfp4/test_nvfp4_quantize_exact.py | Adds test_nvfp4_2d_columnwise_only_matches_both_directions covering both the optimized kernel path (BF16 aligned) and the fallback path (fp32 / non-32-aligned) with bitwise comparison of columnwise data, scales, and amax. |

Flowchart

%%{init: {'theme': 'neutral'}}%%
flowchart TD
    A[nvte_quantize_v2] --> B{NVFP4 dtype?}
    B -->|Yes| C{row_scaled_nvfp4?}
    C -->|No| D{use_optimized_kernel?\nbf16 + rows%32=0 + cols%32=0\n+ has_data OR columnwise_only+2D}
    C -->|Yes| E[compute_rowwise_amax only]
    D -->|Yes| F{nvfp4_2d_quantization?}
    D -->|No| G[quantize_transpose_vector_blockwise_fp4\nfallback path]
    F -->|Yes 2D| H[quantize_transpose&lt;use_2d=true&gt;]
    F -->|No 1D| I[quantize_transpose&lt;use_2d=false&gt;]
    I --> J[quantize_transpose_tuned_1D\nearly return - always rowwise]
    H --> K{return_rowwise?}
    K -->|true| L[2D kernel\nRETURN_ROWWISE=true\nRETURN_TRANSPOSE=?]
    K -->|false NEW| M[2D kernel\nRETURN_ROWWISE=false\nRETURN_TRANSPOSE=true]
    G --> N{kReturnIdentity?}
    N -->|true| O[Step 2: rowwise quant+store\nStep 2 2D amax → amax_smem\nStep 3: columnwise output]
    N -->|false NEW| P[Step 2b: 2D amax-only\npopulates amax_smem\nStep 3: columnwise output]

Reviews (6): Last reviewed commit: "Enable rectangular shapes in tests"

}
}

// Step 2.5: 2D-amax-only pass for columnwise-only mode.

Contributor

P2 Step label collision with existing substep

The new outer-level block is named "Step 2.5" at line 576, but that same label is already used at line 522 for the "Write scale_inv" substep inside Step 2's loop (if constexpr (kReturnIdentity)). A future reader scanning the file will find two distinct "Step 2.5" sections with different semantics. Consider renaming the new block to something like "Step 2b" or "Step 2.5 (outer)" to distinguish it from the // Step 2.5: Write scale_inv substep inside the inner loop.

@ptrendx

ptrendx commented May 21, 2026

Member

This is just the fallback kernel being changed. Does the main kernel already support this?

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet negvet requested a review from Oleg-Goncharov as a code owner June 1, 2026 11:34
@negvet

negvet commented Jun 1, 2026

Collaborator Author

This is just the fallback kernel being changed. Does the main kernel already support this?

Thanks for the catch. The main kernel does not support it either. Enabled in f7953dd

Comment thread transformer_engine/common/cast/nvfp4/quantize_transpose_nvfp4.cuh
NVTE_CHECK(output->scale_inv.dptr != nullptr, "Scaling tensor must be allocated");
NVTE_CHECK(return_rowwise || return_transpose,
"At least one of rowwise/columnwise NVFP4 output must be allocated.");
NVTE_CHECK(return_rowwise || use_2d_quantization,

Collaborator

This is a bit confusing to read, especially if the kernel is extended in the future to support additional quantization schemes. It would be better to restrict the supported combinations explicitly, e.g.
NVTE_CHECK((return_transpose && use_2d_quantization) || (return_rowwise && !use_2d_quantization),

Collaborator Author

Right, fixed in 783b45b
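
To make the supported combinations in this thread easier to see, here is a compile-time sketch of the two forms of the check as they are quoted above. The function names `original_checks` and `suggested_check` are hypothetical, and the condition that actually landed in 783b45b is not shown on this page, so treat this only as a comparison of the quoted snippets.

```cpp
// The two original checks quoted in the thread, folded into one predicate:
//   return_rowwise || return_transpose
//   return_rowwise || use_2d_quantization
constexpr bool original_checks(bool return_rowwise, bool return_transpose, bool use_2d) {
  return (return_rowwise || return_transpose) && (return_rowwise || use_2d);
}

// The explicit form suggested in the review comment.
constexpr bool suggested_check(bool return_rowwise, bool return_transpose, bool use_2d) {
  return (return_transpose && use_2d) || (return_rowwise && !use_2d);
}

// Columnwise-only with 2D scaling (the case this PR enables): accepted by both.
static_assert(original_checks(false, true, true), "original accepts colwise-only 2D");
static_assert(suggested_check(false, true, true), "suggested accepts colwise-only 2D");

// Columnwise-only with 1D scaling: rejected by both.
static_assert(!original_checks(false, true, false), "original rejects colwise-only 1D");
static_assert(!suggested_check(false, true, false), "suggested rejects colwise-only 1D");

// Rowwise-only with 2D scaling: the one combination where the quoted forms differ.
static_assert(original_checks(true, false, true), "original accepts rowwise-only 2D");
static_assert(!suggested_check(true, false, true), "suggested rejects rowwise-only 2D");
```

Under these quoted forms, the two predicates agree on every input except rowwise-only with 2D scaling, so anyone adopting the explicit form may want to confirm how that case is meant to be handled.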


// Step 2.5: 2D-amax-only pass for columnwise-only mode.
// When only the transposed output is requested but 2D block scaling is enabled, the columnwise
// reads in Step 3 (line ~660 below) still need amax_smem populated. Re-run the load + local-amax

Collaborator

The comment refers to line ~660, which is now line 637. Let’s maybe remove the line reference entirely to avoid confusion.

Collaborator Author

Fixed in 783b45b

@Oleg-Goncharov Oleg-Goncharov left a comment

Collaborator

We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet

negvet commented Jun 1, 2026

Collaborator Author

We should also add a corresponding C++ unit test to cover this, since this PR changes logic in the common part of the library

Sure, added a test, please take a look if this is enough.

@negvet

negvet commented Jun 2, 2026

Collaborator Author

/te-ci

@negvet negvet requested a review from Oleg-Goncharov June 2, 2026 11:54

// Columnwise-only 2D NVFP4 must produce the same columnwise data/scales as the columnwise half
// of (rowwise + columnwise) 2D. This exercises the RETURN_ROWWISE=false path of the optimized
// kernel quantize_transpose_nvfp4_2D_kernel (and its dispatch gate) added in this PR.

Member

Please rewrite this comment to be more succinct and not reference "this PR", as that remark stops making sense once the PR is merged.

Collaborator Author

Fixed in 6cb4f8a

Comment on lines +111 to +117
// Columnwise-only is supported on the optimized path only for 2D scaling; rowwise-only and
// both-directions keep their existing routing. Columnwise-only 1D and non-bf16 fall back to
// quantize_transpose_vector_blockwise_fp4.
bool use_optimized_kernel =
(dtype == DType::kBFloat16) && (rows % 32 == 0) && (cols % 32 == 0) &&
(output_tensor->has_data() ||
(output_tensor->has_columnwise_data() && quant_config_cpp.nvfp4_2d_quantization));

Member

That is a little sad, but I think this is fine as a stopgap until we have better, more flexible kernels.

Collaborator Author

Yes, this should be relaxed in the future
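
The routing condition discussed in this thread can be restated as a standalone predicate to check which inputs reach the optimized path. This is a sketch, not the dispatch code itself: `RoutingInput` and the reduced `DType` enum are hypothetical stand-ins for the tensor and config objects, and only the boolean expression is taken from the snippet above.

```cpp
#include <cstddef>

// Reduced stand-in for the library's dtype enum (only the values needed here).
enum class DType { kFloat32, kBFloat16 };

// Hypothetical flattened view of the fields the routing condition reads.
struct RoutingInput {
  DType dtype;
  std::size_t rows;
  std::size_t cols;
  bool has_data;               // rowwise output allocated
  bool has_columnwise_data;    // columnwise output allocated
  bool nvfp4_2d_quantization;  // 2D block scaling requested
};

// Same boolean expression as the quoted use_optimized_kernel assignment.
constexpr bool use_optimized_kernel(const RoutingInput& in) {
  return (in.dtype == DType::kBFloat16) && (in.rows % 32 == 0) && (in.cols % 32 == 0) &&
         (in.has_data || (in.has_columnwise_data && in.nvfp4_2d_quantization));
}

// Columnwise-only, 2D, BF16, aligned: takes the optimized path.
static_assert(use_optimized_kernel({DType::kBFloat16, 128, 256, false, true, true}),
              "colwise-only 2D bf16 aligned -> optimized");
// Columnwise-only, 1D: falls back.
static_assert(!use_optimized_kernel({DType::kBFloat16, 128, 256, false, true, false}),
              "colwise-only 1D -> fallback");
// Non-BF16 input: falls back even with 2D scaling.
static_assert(!use_optimized_kernel({DType::kFloat32, 128, 256, false, true, true}),
              "fp32 -> fallback");
```

Anything for which the predicate is false goes to `quantize_transpose_vector_blockwise_fp4`, which is why the fallback kernel also needed the amax-only pass for the columnwise-only 2D case.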

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet negvet requested a review from ptrendx June 8, 2026 12:21
@negvet

negvet commented Jun 9, 2026

Collaborator Author

/te-ci

Signed-off-by: Evgeny <etsykunov@nvidia.com>
@negvet

negvet commented Jun 10, 2026

Collaborator Author

/te-ci

@Oleg-Goncharov Oleg-Goncharov added enhancement New feature or request fp4 labels Jun 12, 2026

@Oleg-Goncharov Oleg-Goncharov left a comment

Collaborator

LGTM from a functional point of view. There may be room for performance optimization, but we can revisit that later if needed.

@Oleg-Goncharov Oleg-Goncharov merged commit 318dd94 into NVIDIA:main Jun 12, 2026
38 of 43 checks passed

Labels

enhancement New feature or request fp4


3 participants