WIP Support per expert amax in TEGroupedMLP#1550
Conversation
…nfra fixes
modelopt/torch/quantization/plugins/transformer_engine.py:
MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER=1 opts into per-gemm
weight_quantizer_0..N-1 inside _QuantTEGroupedLinear (deepcopied from
the shared weight_quantizer). Lets TEGroupedMLP recover per-expert
amax granularity, matching SequentialMLP's default behavior.
modelopt/torch/distill/plugins/megatron.py:
LogitsKLLoss.forward prints student/teacher logit stats (mean/std/
min/max/shape) on rank 0 each call. Diagnostic for the QAD loss-spike
investigation — confirms which spec produces which logits without
changing the KL math.
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py:
New test_te_grouped_vs_sequential_default_amax + ..._default_loss
cover the structural amax asymmetry between TEGroupedMLP and
SequentialMLP (TEGrouped per-linear amax = max-over-Sequential-experts
amax) and a finiteness sanity check on the resulting quant error.
tools/launcher/common/service_utils.sh:
- Fall back to SLURM_PROCID / SLURM_LOCALID when PMIX_*/OMPI_* are
unset, so `[[ "$mpi_local_rank" -eq 0 ]]` doesn't silently pass on
every rank under plain srun.
- util_install_extra_dep: per-node marker so concurrent ranks wait
for rank 0 to finish installing (concurrent pip on a shared FS
leaves a broken state); also installs nvidia-resiliency-ext.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
- transformer_engine.py: dedup `import copy`/`import os` left over from the
rebase, sort the four imports alphabetically.
- transformer_engine.py: comment near the per-expert weight_quantizer setup
explaining that base modelopt_post_restore won't re-calibrate the
weight_quantizer_{i} modules, so save/restore is only safe when TP/EP is
unchanged. Per-channel _amax shape depends on the TP-sliced output dim.
- service_utils.sh: drop the duplicated mpi_rank / mpi_local_rank
re-assignments — main already carries the SLURM fallback, the extra two
lines were leftover rebase noise.
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Path: .coderabbit.yaml Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (1)
🚧 Files skipped from review as they are similar to previous changes (1)
📝 WalkthroughWalkthroughThis PR adds an environment-gated per-GEMM weight-quantizer mode to Transformer Engine's TEGroupedLinear, implements deep-copied per-GEMM quantizers, updates restore/calibration and forward paths to use per-GEMM quantizers, and adds tests comparing TEGrouped vs Sequential MoE quantization behavior. ChangesPer-expert weight quantization for TEGroupedLinear
🎯 3 (Moderate) | ⏱️ ~20 minutes 🚥 Pre-merge checks | ✅ 5 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches📝 Generate docstrings
🧪 Generate unit tests (beta)
Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out. Comment |
|
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
There was a problem hiding this comment.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/plugins/transformer_engine.py (1)
147-166:⚠️ Potential issue | 🟠 Major | 🏗️ Heavy liftPer-expert restore is still unsafe when TP/EP changes.
This branch creates
weight_quantizer_{i}modules, butmodelopt_post_restore()still runs the base restore path that only knows aboutself.weight_quantizer. If a checkpoint is restored under different TP/EP, the per-expert_amaxtensors can keep the old shape and no code here reinitializes them. Please either block this mode on topology changes or add a per-expert reset/recalibration path during restore.🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/transformer_engine.py` around lines 147 - 166, The per-expert weight-quantizer path (_per_expert_weight_quantizer and the created modules weight_quantizer_{i}) is unsafe on TP/EP topology change because modelopt_post_restore currently only runs the base path and does not reinitialize per-expert _amax buffers; update modelopt_post_restore to detect topology/shape mismatch and either (a) fail-fast by raising a clear error when _per_expert_weight_quantizer is enabled and the restored per-expert _amax shapes don't match the current weight shapes, or (b) perform a per-expert recalibration: iterate over each created module name weight_quantizer_{i} (use self.num_gemms to enumerate), reset/recreate their _amax buffers to the correct shape for the current self.weight{i} and call the quantizer's max_calibrate (or its equivalent reinit routine) so each per-expert quantizer is re-calibrated after restore; implement one of these branches in modelopt_post_restore to ensure shapes/amax are consistent after load.
🧹 Nitpick comments (1)
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py (1)
738-750: ⚡ Quick winKeep these GPU assertions on-device.
tensor.item()andmath.isfinite(...)force host syncs on every worker here. Prefer tensor comparisons/reductions plustorch.isfinite, and only materialize Python scalars for failure messages or rank-0 logging.As per coding guidelines, "Performance note: avoid GPU syncs in hot paths/tests (don’t use
tensor.item()/float(tensor)/min(tensor))—use tensor ops instead."Also applies to: 800-813
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` around lines 738 - 750, The test is forcing GPU->CPU syncs by calling .item() inside the assert message and in the divergence check; change the assert to keep comparisons on-device and only materialize scalars when needed: replace the assert torch.allclose(te_amax.view(()), seq_max, ...) with an if-not pattern (if not torch.allclose(...): raise AssertionError(f"...{te_amax.item()}...{seq_max.item()}...")) so the .item() calls only run on failure, and compute the per-expert divergence using tensor ops (delta = seq_amaxes.max() - seq_amaxes.min(); if (delta > 1e-5): saw_per_expert_divergence = True) while ensuring you only call .item() when you must convert to a Python bool for logging or failure paths; update references to seq_amaxes, seq_mlp.local_experts, linear_name, te_amax, and saw_per_expert_divergence accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py`:
- Around line 698-761: The default-path helper
_test_te_grouped_vs_sequential_default_amax_helper must explicitly disable the
feature flag: set os.environ["MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER"]="0"
(saving the original and restoring it after the test) before calling
initialize_for_megatron so the shared-quantizer behavior is forced; then add a
new test variant that sets
os.environ["MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER"]="1" and runs the same
checks to cover the env-on path. Apply the same explicit env-off/restore and
corresponding env-on test change to the other similar helper/test pair
exercising TEGrouped vs Sequential (the companion helper referenced in the
review) so both default and env-enabled code paths have dedicated coverage.
---
Outside diff comments:
In `@modelopt/torch/quantization/plugins/transformer_engine.py`:
- Around line 147-166: The per-expert weight-quantizer path
(_per_expert_weight_quantizer and the created modules weight_quantizer_{i}) is
unsafe on TP/EP topology change because modelopt_post_restore currently only
runs the base path and does not reinitialize per-expert _amax buffers; update
modelopt_post_restore to detect topology/shape mismatch and either (a) fail-fast
by raising a clear error when _per_expert_weight_quantizer is enabled and the
restored per-expert _amax shapes don't match the current weight shapes, or (b)
perform a per-expert recalibration: iterate over each created module name
weight_quantizer_{i} (use self.num_gemms to enumerate), reset/recreate their
_amax buffers to the correct shape for the current self.weight{i} and call the
quantizer's max_calibrate (or its equivalent reinit routine) so each per-expert
quantizer is re-calibrated after restore; implement one of these branches in
modelopt_post_restore to ensure shapes/amax are consistent after load.
---
Nitpick comments:
In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py`:
- Around line 738-750: The test is forcing GPU->CPU syncs by calling .item()
inside the assert message and in the divergence check; change the assert to keep
comparisons on-device and only materialize scalars when needed: replace the
assert torch.allclose(te_amax.view(()), seq_max, ...) with an if-not pattern (if
not torch.allclose(...): raise
AssertionError(f"...{te_amax.item()}...{seq_max.item()}...")) so the .item()
calls only run on failure, and compute the per-expert divergence using tensor
ops (delta = seq_amaxes.max() - seq_amaxes.min(); if (delta > 1e-5):
saw_per_expert_divergence = True) while ensuring you only call .item() when you
must convert to a Python bool for logging or failure paths; update references to
seq_amaxes, seq_mlp.local_experts, linear_name, te_amax, and
saw_per_expert_divergence accordingly.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 5fbda55f-b0c7-4054-bc00-63b4bc3a150e
📒 Files selected for processing (2)
modelopt/torch/quantization/plugins/transformer_engine.pytests/gpu_megatron/torch/quantization/plugins/test_megatron.py
| def _test_te_grouped_vs_sequential_default_amax_helper(tp_size, ep_size, quant_cfg, rank, size): | ||
| """TEGrouped per-linear amax should equal max-over-Sequential-experts under default sync=False.""" | ||
| initialize_for_megatron( | ||
| tensor_model_parallel_size=tp_size, | ||
| expert_model_parallel_size=ep_size, | ||
| seed=SEED, | ||
| ) | ||
|
|
||
| te_grouped = _gpt_model_provider( | ||
| tp_size=tp_size, ep_size=ep_size, hidden_size=32, moe_grouped_gemm=True, | ||
| transformer_impl="transformer_engine", num_moe_experts=4, | ||
| ) | ||
| forward = get_forward(te_grouped, batch_size=8) | ||
|
|
||
| sequential = _gpt_model_provider( | ||
| tp_size=tp_size, ep_size=ep_size, hidden_size=32, moe_grouped_gemm=False, | ||
| num_moe_experts=4, transformer_impl="modelopt", | ||
| ) | ||
| copy_weights_from_grouped_to_non_grouped(te_grouped, sequential) | ||
|
|
||
| for module in te_grouped.modules(): | ||
| if isinstance(module, TopKRouter): | ||
| module.topk = module.num_experts | ||
| for module in sequential.modules(): | ||
| if isinstance(module, TopKRouter): | ||
| module.topk = module.num_experts | ||
|
|
||
| mtq.quantize(te_grouped, quant_cfg, forward) | ||
| mtq.quantize(sequential, quant_cfg, forward) | ||
|
|
||
| te_modules = [m for m in te_grouped.modules() if isinstance(m, TEGroupedMLP)] | ||
| seq_modules = [m for m in sequential.modules() if isinstance(m, SequentialMLP)] | ||
| assert len(te_modules) == len(seq_modules) | ||
|
|
||
| saw_per_expert_divergence = False | ||
| for te_mlp, seq_mlp in zip(te_modules, seq_modules): | ||
| for linear_name in ("linear_fc1", "linear_fc2"): | ||
| te_amax = getattr(te_mlp, linear_name).weight_quantizer.amax | ||
| assert te_amax is not None and te_amax.numel() == 1 | ||
|
|
||
| seq_amaxes = torch.stack([ | ||
| getattr(expert, linear_name).weight_quantizer.amax.view(()) | ||
| for expert in seq_mlp.local_experts | ||
| ]) | ||
| seq_max = seq_amaxes.max() | ||
|
|
||
| assert torch.allclose(te_amax.view(()), seq_max, atol=1e-5, rtol=1e-5), ( | ||
| f"TEGrouped per-linear amax ({te_amax.item()}) != " | ||
| f"max-over-Sequential-experts ({seq_max.item()}) for {linear_name}" | ||
| ) | ||
|
|
||
| if (seq_amaxes.max() - seq_amaxes.min()).item() > 1e-5: | ||
| saw_per_expert_divergence = True | ||
|
|
||
| assert saw_per_expert_divergence, ( | ||
| "Expected per-expert weight amax to diverge across SequentialMLP experts." | ||
| ) | ||
|
|
||
|
|
||
| @pytest.mark.parametrize("quant_cfg", [mtq.FP8_DEFAULT_CFG, mtq.NVFP4_DEFAULT_CFG]) | ||
| def test_te_grouped_vs_sequential_default_amax(dist_workers_size_4, quant_cfg): | ||
| dist_workers_size_4.run( | ||
| partial(_test_te_grouped_vs_sequential_default_amax_helper, 1, 2, quant_cfg) | ||
| ) |
There was a problem hiding this comment.
Make these “default” tests control the feature flag explicitly.
These helpers never clear or set MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER, so a worker process with that flag exported will exercise the new per-expert path instead of the default shared-quantizer path. That also means the env-enabled branch added in this PR still has no dedicated coverage. Please force the flag off in these default-behavior tests and add a separate env-on case for the new path.
Also applies to: 764-820
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@tests/gpu_megatron/torch/quantization/plugins/test_megatron.py` around lines
698 - 761, The default-path helper
_test_te_grouped_vs_sequential_default_amax_helper must explicitly disable the
feature flag: set os.environ["MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER"]="0"
(saving the original and restoring it after the test) before calling
initialize_for_megatron so the shared-quantizer behavior is forced; then add a
new test variant that sets
os.environ["MODELOPT_TEGROUPED_PER_EXPERT_QUANTIZER"]="1" and runs the same
checks to cover the env-on path. Apply the same explicit env-off/restore and
corresponding env-on test change to the other similar helper/test pair
exercising TEGrouped vs Sequential (the companion helper referenced in the
review) so both default and env-enabled code paths have dedicated coverage.
Codecov Report❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1550 +/- ##
==========================================
- Coverage 76.70% 69.45% -7.26%
==========================================
Files 477 477
Lines 51977 52002 +25
==========================================
- Hits 39868 36116 -3752
- Misses 12109 15886 +3777
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
What does this PR do?
Type of change: ?
note: if different TP/EP is used during checkpoint restore, the per expert weight quantizer may not restore properly. TODO fix this
Usage
# Add a code snippet demonstrating how to use thisTesting
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (
git commit -s -S).Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded
trust_remote_code=True,torch.load(..., weights_only=False),pickle, etc.).CONTRIBUTING.md: ✅ / ❌ / N/AAdditional Information
Summary by CodeRabbit
New Features
Tests