fix(quant): sync NVFP4StaticQuantizer global_amax across TP and EP #1553
cjluo-nv wants to merge 1 commit into
`promote_nvfp4_static_quantizers` registers `_global_amax` from `reduce_amax(local _amax)` before the distributed-sync block in `max_calibrate`. With TP (different weight shards per rank) or MoE EP (different experts per rank), each rank's local `_amax` covers a different subset of the global weight, so the resulting `_global_amax` diverges across ranks. The existing TP/EP sync only touches `_amax`, leaving `_global_amax` permanently rank-local, which makes the upper level of the two-level NVFP4 scale TP/EP-layout-dependent.

Add `NVFP4StaticQuantizer.sync_global_amax_across_distributed_group` (mirroring the existing `sync_amax_across_distributed_group` on `TensorQuantizer`) and call it from both `sync_quantizer_amax_across_dp_ep` (EP group) and `sync_quantizer_amax_across_tp` (TP group) alongside the existing `_amax` sync. DP doesn't need a separate call because weights are replicated across DP, so `_global_amax` is naturally identical.

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
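The divergence described in the commit message can be illustrated with a small toy simulation. This is only a sketch: plain Python lists stand in for tensors and ranks, and `block_amax` and `all_reduce_max` are hypothetical helpers written for this example (the latter mimics what a MAX all-reduce such as `torch.distributed.all_reduce` with `ReduceOp.MAX` leaves on every rank). It is not the code added by this PR.

```python
def block_amax(weights, block_size):
    """Per-block absolute maximum (plays the role of the lower-level `_amax`)."""
    return [
        max(abs(w) for w in weights[i:i + block_size])
        for i in range(0, len(weights), block_size)
    ]


def all_reduce_max(per_rank_values):
    """Stand-in for a MAX all-reduce: every rank ends up with the group maximum."""
    group_max = max(per_rank_values)
    return [group_max] * len(per_rank_values)


# One weight row, sharded across two hypothetical TP ranks.
full_weight = [0.5, -1.25, 0.75, 0.1, 3.0, -0.2, 0.4, -0.6]
shards = [full_weight[:4], full_weight[4:]]

# Promotion: each rank reduces only its own per-block amax, so the values differ.
local_global_amax = [max(block_amax(shard, block_size=2)) for shard in shards]
print(local_global_amax)  # [1.25, 3.0]

# After a MAX reduction over the group, every rank holds the same value,
# and it matches what an unsharded run would compute.
synced = all_reduce_max(local_global_amax)
print(synced)  # [3.0, 3.0]
print(max(block_amax(full_weight, block_size=2)))  # 3.0
```

Because the maximum of per-shard maxima equals the maximum over the whole weight, a MAX reduction is enough to recover the layout-independent value.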
Codecov Report
@@ Coverage Diff @@
## main #1553 +/- ##
==========================================
- Coverage 76.79% 75.96% -0.83%
==========================================
Files 474 477 +3
Lines 51560 53862 +2302
==========================================
+ Hits 39593 40916 +1323
- Misses 11967 12946 +979
What does this PR do?
Type of change: Bug fix
Background. During `max_calibrate`, `promote_nvfp4_static_quantizers(model)` is called before the distributed-sync block. Promotion registers `_global_amax` on each `NVFP4StaticQuantizer` from `reduce_amax(local _amax, axis=None)`, i.e. the max over this rank's per-block `_amax`. With TP (weight sharded across Cout or Cin) or MoE EP (different experts per rank), each rank's local `_amax` covers a different slice of the global weight, so `_global_amax` diverges across ranks.

The existing sync block only all-reduces `_amax`:

- `sync_quantizer_amax_across_dp_ep` syncs `_amax` across DP/EP.
- `sync_quantizer_amax_across_tp` syncs `_amax` across TP.

Nothing in the pipeline touches `_global_amax` after promotion, so the upper level of the two-level NVFP4 scale ends up TP/EP-layout-dependent. This violates the stated invariant on `model_calib.py:303` ("TP=8 ↔ TP=4 ↔ TP=8 should give the same quantization parameters") and produces inconsistent block-scale FP8 grids across MoE experts when EP > 1.

Fix. Add `NVFP4StaticQuantizer.sync_global_amax_across_distributed_group` mirroring `TensorQuantizer.sync_amax_across_distributed_group`, then call it from both `sync_quantizer_amax_across_dp_ep` (EP group) and `sync_quantizer_amax_across_tp` (TP group), right after the existing `_amax` sync. DP is omitted on purpose: weights are replicated across DP, so each rank already has the same local `_amax` and therefore the same locally-computed `_global_amax`.

Usage

No API change. The existing calibration flow now produces consistent `_global_amax` across TP/EP automatically.

Testing

- `pre-commit run --files modelopt/torch/quantization/model_calib.py modelopt/torch/quantization/nn/modules/tensor_quantizer.py`: passes (ruff / mypy / format / bandit clean).
- `pytest tests/unit/torch/quantization/test_calib.py -q`: 12 passed, 1 skipped (CUDA-only).
- `max_calibrate` under fake TP and EP groups and asserts `_global_amax` is bit-identical across ranks for NVFP4-static weight quantizers.

Before your PR is "Ready for review"

- `_global_amax` is now consistent across TP/EP ranks, which was the documented intent.
- `CONTRIBUTING.md`: N/A

Additional Information

Related: #1536 (in-flight refactor of static-NVFP4 finalization). That PR consolidates `promote_nvfp4_static_quantizers` + `_sync_grouped_weight_global_amax` into `max_calibrate` but does not change the promote-vs-TP/EP-sync ordering and does not add a cross-TP/EP `_global_amax` reduction; `_sync_grouped_weight_global_amax` only unifies grouped Q/K/V/gate/up siblings within a rank. This PR layers cleanly on top of #1536 or stands alone on main.

🤖 Generated with Claude Code
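The layout-invariance property this PR restores can be sketched in a few lines of plain Python. Several things here are assumptions made for illustration and do not come from the PR: the helper `global_scale_after_sync` is hypothetical, the upper-level scale is assumed to be `global_amax / (6.0 * 448.0)` (FP4 E2M1 maximum times FP8 E4M3 maximum), and the per-rank global amax is taken directly as the max absolute value of the shard, which equals the max over that shard's per-block amax.

```python
FP4_E2M1_MAX = 6.0    # largest magnitude representable in FP4 E2M1
FP8_E4M3_MAX = 448.0  # largest finite magnitude in FP8 E4M3


def global_scale_after_sync(weights, tp_size):
    """Shard `weights` over `tp_size` ranks, then MAX-reduce the per-rank global amax."""
    shard_len = len(weights) // tp_size
    per_rank_global_amax = [
        max(abs(w) for w in weights[rank * shard_len:(rank + 1) * shard_len])
        for rank in range(tp_size)
    ]
    # A MAX all-reduce over the TP group would leave this value on every rank.
    synced_global_amax = max(per_rank_global_amax)
    # Assumed formula for the upper-level NVFP4 scale (illustration only).
    return synced_global_amax / (FP4_E2M1_MAX * FP8_E4M3_MAX)


# Synthetic weights; the length is divisible by every TP size tried below.
weights = [((-1) ** i) * (i % 7 + 1) * 0.25 for i in range(16)]

# With the sync in place, the scale does not depend on the TP layout.
scales = {tp: global_scale_after_sync(weights, tp) for tp in (2, 4, 8)}
assert len(set(scales.values())) == 1
```

Without the reduction, each rank would derive its scale from its own shard maximum, so the result would change with the sharding.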