Skip to content

fix(quant): sync NVFP4StaticQuantizer global_amax across TP and EP#1553

Draft
cjluo-nv wants to merge 1 commit into
mainfrom
fix/nvfp4-static-global-amax-tp-ep-sync
Draft

fix(quant): sync NVFP4StaticQuantizer global_amax across TP and EP#1553
cjluo-nv wants to merge 1 commit into
mainfrom
fix/nvfp4-static-global-amax-tp-ep-sync

Conversation

@cjluo-nv
Copy link
Copy Markdown
Collaborator

What does this PR do?

Type of change: Bug fix

Background. During max_calibrate, promote_nvfp4_static_quantizers(model) is called before the distributed-sync block. Promotion registers _global_amax on each NVFP4StaticQuantizer from reduce_amax(local _amax, axis=None) — i.e. the max over this rank's per-block _amax. With TP (weight sharded across Cout or Cin) or MoE EP (different experts per rank), each rank's local _amax covers a different slice of the global weight, so _global_amax diverges across ranks.

The existing sync block only all-reduces _amax:

  • sync_quantizer_amax_across_dp_ep syncs _amax across DP/EP.
  • sync_quantizer_amax_across_tp syncs _amax across TP.

Nothing in the pipeline touches _global_amax after promotion, so the upper level of the two-level NVFP4 scale ends up TP/EP-layout-dependent. This violates the stated invariant on model_calib.py:303 ("TP=8 ↔ TP=4 ↔ TP=8 should give the same quantization parameters") and produces inconsistent block-scale FP8 grids across MoE experts when EP > 1.

Fix. Add NVFP4StaticQuantizer.sync_global_amax_across_distributed_group mirroring TensorQuantizer.sync_amax_across_distributed_group, then call it from both sync_quantizer_amax_across_dp_ep (EP group) and sync_quantizer_amax_across_tp (TP group), right after the existing _amax sync. DP is omitted on purpose: weights are replicated across DP, so each rank already has the same local _amax and therefore the same locally-computed _global_amax.

Usage

No API change. The existing calibration flow now produces consistent _global_amax across TP/EP automatically:

import modelopt.torch.quantization as mtq
mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG, forward_loop)
# Under TP and/or EP, every rank now holds the same _global_amax on
# each NVFP4StaticQuantizer after max_calibrate / mse_calibrate.

Testing

  • pre-commit run --files modelopt/torch/quantization/model_calib.py modelopt/torch/quantization/nn/modules/tensor_quantizer.py — passes (ruff / mypy / format / bandit clean).
  • pytest tests/unit/torch/quantization/test_calib.py -q — 12 passed, 1 skipped (CUDA-only).
  • Multi-rank verification: deferred (draft). Planning to add a dist-launched unit test that runs max_calibrate under fake TP and EP groups and asserts _global_amax is bit-identical across ranks for NVFP4-static weight quantizers.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ — purely additive; only behavior change is that _global_amax is now consistent across TP/EP ranks, which was the documented intent.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
  • Did you write any new necessary tests?: ❌ — draft; will add multi-rank test before flipping to ready-for-review.
  • Did you update Changelog?: ❌ — will add a bug-fix entry before flipping to ready-for-review.
  • Did you get Claude approval on this PR?: N/A — draft.

Additional Information

Related: #1536 (in-flight refactor of static-NVFP4 finalization). That PR consolidates promote_nvfp4_static_quantizers + _sync_grouped_weight_global_amax into max_calibrate but does not change the promote-vs-TP/EP-sync ordering and does not add a cross-TP/EP _global_amax reduction — _sync_grouped_weight_global_amax only unifies grouped Q/K/V/gate/up siblings within a rank. This PR layers cleanly on top of #1536 or stands alone on main.

🤖 Generated with Claude Code

`promote_nvfp4_static_quantizers` registers `_global_amax` from
`reduce_amax(local _amax)` before the distributed-sync block in
`max_calibrate`. With TP (different weight shards per rank) or MoE EP
(different experts per rank), each rank's local `_amax` covers a
different subset of the global weight, so the resulting `_global_amax`
diverges across ranks. The existing TP/EP sync only touches `_amax`,
leaving `_global_amax` permanently rank-local — making the upper level
of the two-level NVFP4 scale TP/EP-layout-dependent.

Add `NVFP4StaticQuantizer.sync_global_amax_across_distributed_group`
(mirroring the existing `sync_amax_across_distributed_group` on
`TensorQuantizer`) and call it from both `sync_quantizer_amax_across_dp_ep`
(EP group) and `sync_quantizer_amax_across_tp` (TP group) alongside the
existing `_amax` sync. DP doesn't need a separate call because weights
are replicated across DP, so `_global_amax` is naturally identical.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
@copy-pr-bot
Copy link
Copy Markdown

copy-pr-bot Bot commented May 28, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 28, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 16f206ab-1dc4-43eb-9e52-ec79534a5d67

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fix/nvfp4-static-global-amax-tp-ep-sync

Comment @coderabbitai help to get the list of available commands and usage tips.

@github-actions
Copy link
Copy Markdown
Contributor

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1553/

Built to branch gh-pages at 2026-05-28 06:54 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

@codecov
Copy link
Copy Markdown

codecov Bot commented May 28, 2026

Codecov Report

❌ Patch coverage is 50.00000% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 75.96%. Comparing base (a5bc6f8) to head (a03db8e).
⚠️ Report is 18 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/quantization/model_calib.py 25.00% 3 Missing ⚠️
.../torch/quantization/nn/modules/tensor_quantizer.py 75.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1553      +/-   ##
==========================================
- Coverage   76.79%   75.96%   -0.83%     
==========================================
  Files         474      477       +3     
  Lines       51560    53862    +2302     
==========================================
+ Hits        39593    40916    +1323     
- Misses      11967    12946     +979     
Flag Coverage Δ
unit 52.75% <50.00%> (+0.12%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant