Autoquant and GPTQ support in Megatron-Core [OMNIML-3151] #1562
jenchen13 wants to merge 1 commit into
Signed-off-by: Jennifer Chen <jennifchen@nvidia.com>
📝 Walkthrough
AutoQuantize internals are updated to consistently incorporate expert model parallelism in distributed synchronization, refactor weight-size computation to derive from candidate statistics, introduce Megatron-specific auto-quantization support with lazy plugin registration, add block quantization detection properties, include a calibration zero-input guard, and add distributed and unit tests.
Changes
AutoQuantize Expert Model Parallelism and Megatron Support
Estimated code review effort
🎯 3 (Moderate) | ⏱️ ~22 minutes
Suggested reviewers
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 1
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/quantization/utils/calib_utils.py (1)
60-61: ⚠️ Potential issue | 🟡 Minor | ⚡ Quick win
Update the docstring note to reflect the new behavior.
The note states that "input must be non-empty" and "a zero-sized input causes division by zero", but the new guard clause at lines 66-67 now handles batch_size == 0 gracefully. Update the docstring to reflect that empty inputs are now supported.
📝 Proposed docstring update
- Note: input must be non-empty (batch_size > 0); a zero-sized input causes division by zero.
+ Note: Empty inputs (batch_size == 0) are handled gracefully and return unchanged hessian/n_samples.
+ This can occur in MoE models when some experts receive no tokens.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/utils/calib_utils.py` around lines 60 - 61, Update the docstring Note to reflect that empty inputs are now supported: replace "input must be non-empty (batch_size > 0); a zero-sized input causes division by zero" with a sentence stating that the function now handles batch_size == 0 via the guard clause (which returns early when batch_size == 0) and will not raise a division-by-zero error; mention that non-empty inputs are still processed normally. Target the docstring for the function that contains the guard checking batch_size == 0 (the docstring immediately above that guard) and keep the wording brief and clear.
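For readers unfamiliar with why an empty batch matters here, the guard can be illustrated with a small pure-Python sketch. This is not the code in calib_utils.py: the name `update_hessian`, the list-of-lists matrix representation, and the GPTQ-style running-average formula are all assumptions made for illustration.

```python
def update_hessian(inputs, hessian, n_samples):
    """Running-average Hessian update (hypothetical stand-in, not the modelopt code).

    inputs:    list of feature vectors, one per sample (may be empty)
    hessian:   square matrix as a list of lists
    n_samples: number of samples already folded into `hessian`
    """
    batch_size = len(inputs)
    # Guard: an expert that received no tokens contributes nothing.
    # Without it, n_samples == 0 and batch_size == 0 would divide by zero below.
    if batch_size == 0:
        return hessian, n_samples

    total = n_samples + batch_size
    scale_old = n_samples / total  # down-weight the existing average
    weight_new = 2.0 / total       # weight applied to each new outer product
    dim = len(hessian)
    updated = [[hessian[i][j] * scale_old for j in range(dim)] for i in range(dim)]
    for x in inputs:
        for i in range(dim):
            for j in range(dim):
                updated[i][j] += weight_new * x[i] * x[j]
    return updated, total


zeros = [[0.0, 0.0], [0.0, 0.0]]
h, n = update_hessian([], zeros, 0)            # empty batch: state is returned unchanged
h, n = update_hessian([[1.0, 2.0]], zeros, 0)  # one sample: normal update
```

With the guard in place, the empty call returns the original state, and only the non-empty call changes the accumulated matrix and sample count.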
🧹 Nitpick comments (2)
modelopt/torch/quantization/plugins/megatron.py (1)
810-837: ⚡ Quick win
Document and export the newly added public APIs.
register_megatron_autoquant_support and get_mcore_decoder_layers are public (non-underscore) but only one has a docstring, and neither is reflected in __all__.
As per coding guidelines, "Document public APIs with docstrings, including examples when useful" and "Define the public API with __all__ at the top of each module".
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/plugins/megatron.py` around lines 810 - 837, Add a docstring to the newly public function get_mcore_decoder_layers describing purpose, parameters, return type and an example, and ensure register_megatron_autoquant_support also has appropriate public-docstring coverage if needed; then export both symbols by adding "register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the module's __all__ list at the top of the file so they are part of the public API surface.
modelopt/torch/quantization/model_quant.py (1)
510-515: ⚡ Quick win
Don’t silently swallow plugin import failures.
Line 514 currently suppresses all ImportErrors, which can hide real regressions and make Megatron auto-quant support silently disappear. Emit a warning (or gate the exception type more narrowly) so failures are diagnosable.
Proposed change
 try:
     from .plugins.megatron import register_megatron_autoquant_support
     register_megatron_autoquant_support()
-except ImportError:
-    pass
+except ImportError as exc:
+    warnings.warn(
+        f"Skipping Megatron auto-quant support registration due to import error: {exc}",
+        RuntimeWarning,
+        stacklevel=2,
+    )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the rest with a brief reason, keep changes minimal, and validate. In `@modelopt/torch/quantization/model_quant.py` around lines 510 - 515, The current try/except around importing and calling register_megatron_autoquant_support silently swallows ImportError; update the block to either catch a more specific exception (e.g., ModuleNotFoundError for the plugin import) or log a warning when import/call fails so failures are visible; specifically wrap the import and call to register_megatron_autoquant_support() and on failure call the module's logger or warnings.warn/processLogger.warning with a clear message including the exception text and that Megatron auto-quant support is disabled.
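The "gate the exception type more narrowly" option mentioned in this comment could look roughly like the sketch below. The helper `register_optional_plugin` and its parameters are hypothetical and do not exist in modelopt. The point is only that a missing module produces a visible warning, while an error raised inside the registration call itself is not swallowed.

```python
import importlib
import warnings


def register_optional_plugin(module_name, register_fn_name):
    """Import an optional plugin and call its registration hook (hypothetical helper).

    Returns True if the hook ran, False if the plugin module could not be found.
    """
    try:
        module = importlib.import_module(module_name)
    except ModuleNotFoundError as exc:
        # Only the "not installed" case is tolerated, and it is reported.
        warnings.warn(
            f"Skipping optional plugin {module_name!r}: {exc}",
            RuntimeWarning,
            stacklevel=2,
        )
        return False
    # Errors raised by the hook itself propagate to the caller.
    getattr(module, register_fn_name)()
    return True


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    ok = register_optional_plugin("surely_missing_plugin_module_xyz", "register")
```

One caveat of this design: `ModuleNotFoundError` is also raised when the plugin exists but one of its own dependencies is missing, so the warning text should carry the exception message, as it does here.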
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 830-831: get_mcore_decoder_layers is mutating model.decoder.layers
by appending model.output_layer which causes duplicated entries on repeated
calls; instead return a new nn.ModuleList (e.g., copy model.decoder.layers into
a fresh list/ModuleList) and append the output_layer to that new collection or
check for existence before appending so augmentation is idempotent; update
get_mcore_decoder_layers (and calls from
LayerActivationCollector.get_decoder_layers /
LayerActivationCollector._patch_all_layers) to use the non-mutating copy so
_cleanup_layers need not undo permanent changes.
---
Outside diff comments:
In `@modelopt/torch/quantization/utils/calib_utils.py`:
- Around line 60-61: Update the docstring Note to reflect that empty inputs are
now supported: replace "input must be non-empty (batch_size > 0); a zero-sized
input causes division by zero" with a sentence stating that the function now
handles batch_size == 0 via the guard clause (which returns early when
batch_size == 0) and will not raise a division-by-zero error; mention that
non-empty inputs are still processed normally. Target the docstring for the
function that contains the guard checking batch_size == 0 (the docstring
immediately above that guard) and keep the wording brief and clear.
---
Nitpick comments:
In `@modelopt/torch/quantization/model_quant.py`:
- Around line 510-515: The current try/except around importing and calling
register_megatron_autoquant_support silently swallows ImportError; update the
block to either catch a more specific exception (e.g., ModuleNotFoundError for
the plugin import) or log a warning when import/call fails so failures are
visible; specifically wrap the import and call to
register_megatron_autoquant_support() and on failure call the module's logger or
warnings.warn/processLogger.warning with a clear message including the exception
text and that Megatron auto-quant support is disabled.
In `@modelopt/torch/quantization/plugins/megatron.py`:
- Around line 810-837: Add a docstring to the newly public function
get_mcore_decoder_layers describing purpose, parameters, return type and an
example, and ensure register_megatron_autoquant_support also has appropriate
public-docstring coverage if needed; then export both symbols by adding
"register_megatron_autoquant_support" and "get_mcore_decoder_layers" to the
module's __all__ list at the top of the file so they are part of the public API
surface.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: 30c2390a-c99c-4b41-8c0c-0be68734dc77
📒 Files selected for processing (7)
modelopt/torch/quantization/algorithms.py
modelopt/torch/quantization/model_quant.py
modelopt/torch/quantization/nn/modules/tensor_quantizer.py
modelopt/torch/quantization/plugins/megatron.py
modelopt/torch/quantization/utils/calib_utils.py
tests/gpu_megatron/torch/quantization/plugins/test_megatron.py
tests/unit/torch/quantization/test_autoquant.py
    if hasattr(model, "output_layer") and layers:
        layers.append(model.output_layer)
🧩 Analysis chain
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n -C3 'def get_mcore_decoder_layers|layers\.append\(model\.output_layer\)' modelopt/torch/quantization/plugins/megatron.py
rg -n -C4 'class LayerActivationCollector|register_decoder_layer_support|get_mcore_decoder_layers' --type=py modelopt/torch/quantization
Repository: NVIDIA/Model-Optimizer
Length of output: 4838
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="modelopt/torch/quantization/utils/layerwise_calib.py"
rg -n "def get_decoder_layers\\b|get_decoder_layers\\(" "$file"
rg -n "LayerActivationCollector\\.get_decoder_layers|get_decoder_layers\\b" modelopt/torch/quantization -S --type=py
# Show the implementation region in layerwise_calib.py around get_decoder_layers and where layers are used.
# Use line ranges guessed from rg output by printing ~120 lines around matches.
python3 - <<'PY'
import re, subprocess, os, sys
file="modelopt/torch/quantization/utils/layerwise_calib.py"
out=subprocess.check_output(["rg","-n","def get_decoder_layers\\b",file],text=True)
m=re.search(r":(\d+):",out)
if not m:
    print("No get_decoder_layers def found")
    sys.exit(0)
ln=int(m.group(1))
start=max(1,ln-60); end=ln+160
print(f"\n--- {file}:{start}-{end} ---")
subprocess.run(["sed","-n",f"{start},{end}p",file],check=False)
PY
Repository: NVIDIA/Model-Optimizer
Length of output: 959
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="modelopt/torch/quantization/utils/layerwise_calib.py"
rg -n "def get_decoder_layers\\b|get_decoder_layers\\(" "$file"
rg -n "LayerActivationCollector\\.get_decoder_layers|get_decoder_layers\\b" modelopt/torch/quantization -S --type=py
# Show implementation region around get_decoder_layers and where layers are used.
python3 - <<'PY'
import re, subprocess, sys
file="modelopt/torch/quantization/utils/layerwise_calib.py"
out=subprocess.check_output(["rg","-n","def get_decoder_layers\\b",file],text=True)
m=re.search(r":(\d+):",out)
if not m:
    print("No get_decoder_layers def found")
    sys.exit(0)
ln=int(m.group(1))
start=max(1,ln-60); end=ln+200
print(f"\n--- {file}:{start}-{end} ---")
subprocess.run(["sed","-n",f"{start},{end}p",file],check=False)
PY
Repository: NVIDIA/Model-Optimizer
Length of output: 959
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="modelopt/torch/quantization/utils/layerwise_calib.py"
# Print the get_decoder_layers implementation and relevant callers/caching.
python3 - <<'PY'
import subprocess, re
file="modelopt/torch/quantization/utils/layerwise_calib.py"
rg_out=subprocess.check_output(["rg","-n","def get_decoder_layers\\b",file],text=True)
print(rg_out)
m=re.search(r":(\d+):", rg_out)
ln=int(m.group(1))
start=max(1, ln-40); end=ln+120
print(f"\n--- {file}:{start}-{end} ---")
subprocess.run(["sed","-n",f"{start},{end}p",file],check=False)
PY
# Also inspect where _decoder_layers is set and how often.
rg -n "_decoder_layers\\s*=\\s*self\\.get_decoder_layers|self\\._decoder_layers" -n modelopt/torch/quantization/utils/layerwise_calib.py
python3 - <<'PY'
import subprocess, re
file="modelopt/torch/quantization/utils/layerwise_calib.py"
out=subprocess.check_output(["rg","-n","_decoder_layers\\s*=\\s*self\\.get_decoder_layers|self\\._decoder_layers",file],text=True)
print(out)
PY
# Show the LayerActivationCollector class header and nearby doc/comments.
python3 - <<'PY'
import subprocess
file="modelopt/torch/quantization/utils/layerwise_calib.py"
out=subprocess.check_output(["rg","-n","class LayerActivationCollector\\b",file],text=True)
print(out)
ln=int(out.strip().split(":")[0])
start=max(1,ln-30); end=ln+60
print(f"\n--- {file}:{start}-{end} ---")
subprocess.run(["sed","-n",f"{start},{end}p",file],check=False)
PY
# Inspect any caching on the class method get_decoder_layers (class-level)
rg -n "cache|get_decoder_layers\\.|_decoder" modelopt/torch/quantization/utils/layerwise_calib.py
Repository: NVIDIA/Model-Optimizer
Length of output: 291
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="modelopt/torch/quantization/utils/layerwise_calib.py"
# Show class and the full get_decoder_layers method body + immediate usages.
sed -n '90,320p' "$file" | nl -ba | sed -n '1,260p'
# Show how discoverer registry is used inside get_decoder_layers.
rg -n "get_decoder_layers\\b|_decoder_layer_support|register_decoder_layer_support" "$file" --context=2
# Show where get_decoder_layers is called and if it’s cached.
rg -n "_decoder_layers\\s*=\\s*self\\.get_decoder_layers|_decoder_layers\\b" "$file" --context=2
# Show calls from model_calib.py.
rg -n "LayerActivationCollector\\.get_decoder_layers" -S modelopt/torch/quantization/model_calib.py modelopt/torch/quantization -g'*.py' --context=2
Repository: NVIDIA/Model-Optimizer
Length of output: 109
🏁 Script executed:
#!/bin/bash
set -euo pipefail
file="modelopt/torch/quantization/utils/layerwise_calib.py"
echo "=== LayerActivationCollector.get_decoder_layers section ==="
sed -n '120,320p' "$file" | awk '{printf "%4d:%s\n", NR+119, $0}'
echo
echo "=== Registry usage inside get_decoder_layers ==="
rg -n "register_decoder_layer_support|_decoder_layer_support|get_decoder_layers\\b" "$file" --context=2
echo
echo "=== _decoder_layers caching/assignment ==="
rg -n "_decoder_layers\\s*=\\s*self\\.get_decoder_layers|_decoder_layers\\b" "$file" --context=2
echo
echo "=== Calls into get_decoder_layers from other files ==="
rg -n "LayerActivationCollector\\.get_decoder_layers" -S modelopt/torch/quantization -g'*.py' --context=2
Repository: NVIDIA/Model-Optimizer
Length of output: 17318
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect model_calib usage around transformer_layers = LayerActivationCollector.get_decoder_layers(model)
sed -n '1720,1860p' modelopt/torch/quantization/model_calib.py | awk '{print NR+1720-1 ":" $0}'
# Find where LayerActivationCollector is instantiated in model_calib / elsewhere
rg -n "LayerActivationCollector\\(" modelopt/torch/quantization --type=py
rg -n "new LayerActivationCollector|_patch_all_layers|_decoder_layers" modelopt/torch/quantization/model_calib.py --type=py
# Count how many times get_decoder_layers is called for a given flow (static search)
rg -n "LayerActivationCollector\\.get_decoder_layers\\(" modelopt/torch/quantization --type=py
rg -n "is_supported\\(model\\)|LayerActivationCollector\\.is_supported\\(" modelopt/torch/quantization --type=py
Repository: NVIDIA/Model-Optimizer
Length of output: 6793
🏁 Script executed:
#!/bin/bash
set -euo pipefail
rg -n "LayerActivationCollector\\.get_decoder_layers\\(" -S . --type=py
rg -n "LayerActivationCollector\\.is_supported\\(" -S . --type=py
rg -n "get_mcore_decoder_layers\\(" -S modelopt/torch/quantization --type=py
Repository: NVIDIA/Model-Optimizer
Length of output: 1600
🏁 Script executed:
#!/bin/bash
set -euo pipefail
# Inspect tests that cover LayerActivationCollector.get_decoder_layers behavior.
sed -n '1,260p' tests/unit/torch/quantization/test_utils.py | awk '{print NR ":" $0}'
# Inspect megatron.py around get_mcore_decoder_layers and any __all__/new API functions.
sed -n '760,900p' modelopt/torch/quantization/plugins/megatron.py | awk '{print NR ":" $0}'
Repository: NVIDIA/Model-Optimizer
Length of output: 13437
Avoid in-place mutation in get_mcore_decoder_layers (prevents permanent side effects/duplicate output_layer).
In modelopt/torch/quantization/plugins/megatron.py:826-832, get_mcore_decoder_layers() returns model.decoder.layers and appends model.output_layer to it in place on every call, permanently mutating the model. layerwise_calibrate() passes these layers directly into LayerActivationCollector._patch_all_layers, and _cleanup_layers() does not undo the appended module, so repeated calls to LayerActivationCollector.get_decoder_layers(model) will duplicate output_layer and lead to duplicated calibration/forward patching. Construct a fresh nn.ModuleList (copy) when augmenting with output_layer, or make the augmentation idempotent.
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
In `@modelopt/torch/quantization/plugins/megatron.py` around lines 830 - 831,
get_mcore_decoder_layers is mutating model.decoder.layers by appending
model.output_layer which causes duplicated entries on repeated calls; instead
return a new nn.ModuleList (e.g., copy model.decoder.layers into a fresh
list/ModuleList) and append the output_layer to that new collection or check for
existence before appending so augmentation is idempotent; update
get_mcore_decoder_layers (and calls from
LayerActivationCollector.get_decoder_layers /
LayerActivationCollector._patch_all_layers) to use the non-mutating copy so
_cleanup_layers need not undo permanent changes.
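The difference between the mutating and non-mutating variants can be shown with plain Python stand-ins. `FakeModel`, `FakeDecoder`, and both helper functions are hypothetical and use ordinary lists and strings instead of torch modules; with real models the copy would be a fresh nn.ModuleList, as the comment suggests.

```python
class FakeDecoder:
    """Stand-in for a decoder that owns a list of layers."""

    def __init__(self, layers):
        self.layers = layers


class FakeModel:
    """Stand-in for a model with decoder layers and an output layer."""

    def __init__(self):
        self.decoder = FakeDecoder(["layer0", "layer1"])
        self.output_layer = "output_layer"


def get_layers_mutating(model):
    # Mirrors the flagged pattern: appends to the model's own list.
    layers = model.decoder.layers
    if hasattr(model, "output_layer") and layers:
        layers.append(model.output_layer)
    return layers


def get_layers_copy(model):
    # Non-mutating variant: build a new list, leave the model untouched.
    layers = list(model.decoder.layers)
    if hasattr(model, "output_layer") and layers:
        layers.append(model.output_layer)
    return layers


mutated = FakeModel()
get_layers_mutating(mutated)
get_layers_mutating(mutated)  # second call appends output_layer again

untouched = FakeModel()
first = get_layers_copy(untouched)
second = get_layers_copy(untouched)  # same result, model unchanged
```

After two calls, `mutated.decoder.layers` holds the output layer twice, while `untouched.decoder.layers` still has its two original entries.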
Codecov Report
❌ Patch coverage is
Additional details and impacted files
@@                         Coverage Diff                         @@
##           feature/mcore_mse_mixed_precision    #1562    +/-   ##
=================================================================
- Coverage                              69.35%   69.32%   -0.03%
=================================================================
  Files                                    478      478
  Lines                                  52203    52242     +39
=================================================================
+ Hits                                   36203    36218     +15
- Misses                                 16000    16024     +24
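The coverage figures in this table can be cross-checked from the hit and line counts. The sketch below uses integer arithmetic and truncates to two decimals; that truncation is an assumption, adopted only because it reproduces the reported 69.35% and 69.32%, and the helper name `coverage_hundredths` is made up for the example.

```python
def coverage_hundredths(hits, lines):
    """Coverage in hundredths of a percent, truncated (integer math, no rounding)."""
    return hits * 10000 // lines


base = coverage_hundredths(36203, 52203)  # base branch
head = coverage_hundredths(36218, 52242)  # this PR

print(f"base  {base / 100:.2f}%")
print(f"head  {head / 100:.2f}%")
print(f"delta {(head - base) / 100:+.2f}%")

# Misses are simply lines minus hits.
base_misses = 52203 - 36203
head_misses = 52242 - 36218
```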
What does this PR do?
Type of change: New Feature
Autoquant and GPTQ support in Megatron-Core
Usage
# Add a code snippet demonstrating how to use this
Testing
TODO add a test for GPTQ mcore
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).
CONTRIBUTING.md: ✅ / ❌ / N/A
Additional Information
Summary by CodeRabbit
Release Notes
New Features
Bug Fixes
Tests