feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs #1571
Fridah-nv wants to merge 5 commits into
Conversation
…layer Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
📝 Walkthrough
This PR restructures layerwise quantization calibration configuration from flat boolean and separate checkpoint-directory fields to a nested
Changes
Layerwise Configuration Restructuring
Sequence Diagram(s)
sequenceDiagram
participant User
participant QuantizationConfig as Config Schema
participant CalibPipeline as Calibration Pipeline
participant CheckpointState as Checkpoint Persistence
User->>QuantizationConfig: provide legacy or nested layerwise config
QuantizationConfig->>QuantizationConfig: coerce bool to nested dict (emit DeprecationWarning)
QuantizationConfig->>QuantizationConfig: migrate layerwise_checkpoint_dir → layerwise.checkpoint_dir
QuantizationConfig->>CalibPipeline: return normalized LayerwiseConfig
CalibPipeline->>CalibPipeline: extract layerwise.enable, checkpoint_dir, save_every, save_quantizers_only, qdq_from_prev
alt layerwise.enable=True
CalibPipeline->>CheckpointState: create _CheckpointState(checkpoint_dir, save_every, save_quantizers_only)
loop per layer
CalibPipeline->>CalibPipeline: calibrate layer
alt qdq_from_prev=False
CalibPipeline->>CalibPipeline: disable quantizers under calibrated layer
end
CalibPipeline->>CalibPipeline: capture next-layer inputs
alt layer_idx % save_every == 0 OR final layer
CalibPipeline->>CheckpointState: save(layer_idx, model, layers)
CheckpointState->>CheckpointState: save qstate, (weights.pt OR quantizer_buffers.pt), output_meta
CheckpointState->>CheckpointState: write next_inputs.pt only for final layer in window
end
end
else layerwise.enable=False
CalibPipeline->>CalibPipeline: sequential calibration
end
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~60 minutes
🚥 Pre-merge checks | ✅ 5 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (5 passed)
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
/claude review
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
The nested LayerwiseConfig refactor, get_qdq_activations_from_prev_layer flag, and save_quantizers_only whitelist look well-designed and the back-compat shims (bool coercion + flat→nested migration with the use_sequential rescue) are thorough. Tests around config validation and the qdq=False-matches-sequential-amax behavioral check are good.
However, save_every > 1 looks broken: by the time _CheckpointState.save() is invoked for the boundary layer, the earlier layers in the window have already been replaced by _SkipLayer instances in transformer_layers (via _set_layer_states(layer_idx+1) swapping layer_idx-1 to a dummy during cache_outputs_for_next_layer_calib). Reading layers[i].state_dict() / quantizer_state(layers[i]) / layers[i]._layerwise_calib.output_meta on a _SkipLayer returns empty/None state because _original is attached via object.__setattr__ and is not a registered submodule. The new test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries only checks file existence, so the silent corruption isn't caught — there is no resume test for save_every > 1. See inline comment for details and the suggested fix (snapshot to memory immediately after each layer is calibrated, then flush at boundaries; or save real layers eagerly and only window-gate next_inputs.pt).
Minor: changelog says "TODO before marking ready for review" in the PR body but the changelog entry is already added — please confirm the description is up to date before merge.
Warning
CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.
Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.
Actionable comments posted: 3
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 153-160: Add an explicit __all__ declaration near the top of the
module that lists the new public API symbols (e.g. "LayerwiseConfig" and any
other public config classes, functions or constants defined in this file and the
other block referenced around 636-691); place it after the imports and ensure it
enumerates every symbol intended for export so star-imports are safe and stable,
and update corresponding package __init__.py re-exports if needed to re-export
these names.
- Around line 706-708: _coerce_layerwise_input currently returns
value.model_dump() for LayerwiseConfig which expands defaults and loses the
original model_fields_set causing GPTQ's _gptq_qdq_default to miss injecting
defaults; change _coerce_layerwise_input to return the LayerwiseConfig instance
(value) unchanged when isinstance(value, LayerwiseConfig) so downstream parsing
preserves model_fields_set (e.g., self.layerwise.model_fields_set) and
GPTQCalibConfig._gptq_qdq_default can correctly detect which fields the user
actually set.
In `@modelopt/torch/quantization/utils/layerwise_calib.py`:
- Around line 729-761: The loop that saves windowed checkpoints is capturing
_SkipLayer placeholders rather than the underlying calibrated layer because
cache_outputs_for_next_layer_calib() can replace entries in layers with
_SkipLayer before save; update the save loop in the function that iterates
layers[i] to detect instances of _SkipLayer and use layer = layer._original (or
otherwise snapshot the original layer list before cache outputs mutate it) so
_save_layer receives the real calibrated layer, its quantizer state
(quantizer_state(layer)), and correct output_meta rather than the placeholder.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Enterprise
Run ID: e32cbbac-64bc-4625-80b8-93a1790c3d62
📒 Files selected for processing (16)
- CHANGELOG.rst
- examples/llm_ptq/example_utils.py
- examples/llm_ptq/hf_ptq.py
- modelopt/torch/quantization/config.py
- modelopt/torch/quantization/mode.py
- modelopt/torch/quantization/model_calib.py
- modelopt/torch/quantization/utils/layerwise_calib.py
- modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml
- modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml
- modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml
- modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
- modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml
- modelopt_recipes/huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
- modelopt_recipes/huggingface/qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
- tests/unit/torch/quantization/test_config_validation.py
- tests/unit/torch/quantization/test_layerwise_calibrate.py
Claude review — 1 CRITICAL, 0 IMPORTANT, 2 SUGGESTIONs.
Most impactful finding (inline on modelopt/torch/quantization/utils/layerwise_calib.py):
_save_layer calls _write_manifest(...) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. So if the process is killed (or torch.save raises) between the first and last _save_layer call within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, _CheckpointState.setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — there's no fallback that re-calibrates the partial window. The PR description's claim that "interrupting mid-window re-calibrates that window on resume" is the intended invariant, but the implementation doesn't enforce it.
The fix is small: only advance the manifest after the entire window has been flushed (drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() with last_completed_layer = layer_idx). A regression test that monkeypatches _write_manifest to raise mid-window would lock this in. With save_every=1 (the default) the bug isn't reachable, so the existing tests don't catch it — this only bites users who opt into the new knob.
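The ordering this fix asks for can be illustrated with a small, dependency-free simulation. This is only a sketch under stated assumptions: the file names (`state.json`, `next_inputs.json`, `manifest.json`), the helper names, and the 0-based boundary rule `(idx + 1) % save_every == 0` are hypothetical stand-ins, not the PR's actual `_CheckpointState` code.

```python
import json
import tempfile
from pathlib import Path


def save_layer(ckpt: Path, idx: int, num_layers: int, save_every: int, fail_at=None) -> None:
    """Write per-layer files eagerly; advance the manifest only at a window boundary."""
    layer_dir = ckpt / f"layer_{idx}"
    layer_dir.mkdir(parents=True, exist_ok=True)
    if idx == fail_at:
        # Hypothetical failure injection, standing in for a crash inside torch.save.
        raise RuntimeError(f"simulated crash while saving layer {idx}")
    (layer_dir / "state.json").write_text(json.dumps({"layer": idx}))

    # Assumed boundary rule: every save_every-th layer, plus the final layer.
    is_boundary = (idx + 1) % save_every == 0 or idx == num_layers - 1
    if is_boundary:
        # The activation cache lands first, so the manifest never points at a
        # layer whose next-layer inputs are missing.
        (layer_dir / "next_inputs.json").write_text("[]")
        (ckpt / "manifest.json").write_text(json.dumps({"last_completed_layer": idx}))


def last_completed_after_run(num_layers: int, save_every: int, fail_at=None):
    """Run the save loop in a temp dir and report what the manifest ends up saying."""
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = Path(tmp)
        try:
            for idx in range(num_layers):
                save_layer(ckpt, idx, num_layers, save_every, fail_at)
        except RuntimeError:
            pass  # emulate the process dying mid-run
        manifest = ckpt / "manifest.json"
        if not manifest.exists():
            return None
        return json.loads(manifest.read_text())["last_completed_layer"]
```

With `save_every=2` and a simulated crash on layer 2, the manifest in this sketch still names layer 1, so a resume would restart from the previous boundary instead of failing on a missing activation cache.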
Other:
- SUGGESTION: legacy flat `layerwise_checkpoint_dir` is migrated silently in `_migrate_layerwise_checkpoint_dir`. The bool-form `layerwise` emits a `DeprecationWarning`; consider symmetric treatment so the migration window for the flat key is signaled.
- SUGGESTION: `resolve_checkpoint_dir` in `examples/llm_ptq/example_utils.py` only updates whichever shape it picks. If the user happens to specify both flat and nested with the same value, the rewrite produces a Pydantic conflict whose error message points at the resolver's output, not at the user's redundancy. Edge case but worth a guard.
Risk: Low–medium. The default path (save_every=1, no save_quantizers_only) is well-tested and the algorithm-correctness pieces (qdq-from-prev semantics, GPTQ default via model_fields_set, save_quantizers_only whitelist) look right and are covered by the new tests. The save_every>1 path has the resume-corruption bug above, which only opt-in users will hit.
+def _layerwise_checkpoint_dir_location(algorithm) -> tuple[str, str] | None:
+    """Return ``("flat"/"nested", checkpoint_dir)`` for the layerwise checkpoint dir, or None."""
     if not isinstance(algorithm, dict):
-        return False
-    return algorithm.get("layerwise_checkpoint_dir") is not None
+        return None
+    flat = algorithm.get("layerwise_checkpoint_dir")
+    if flat is not None:
+        return "flat", flat
+    nested = algorithm.get("layerwise") or {}
+    ckpt = nested.get("checkpoint_dir") if isinstance(nested, dict) else None
+    return ("nested", ckpt) if ckpt is not None else None


 def needs_checkpoint_path_update(quant_cfg: dict) -> bool:
[SUGGESTION] When the user redundantly specifies the checkpoint dir under both shapes with the same value (algorithm = {"layerwise": {"checkpoint_dir": "/x"}, "layerwise_checkpoint_dir": "/x"}), this helper picks "flat" and resolve_checkpoint_dir writes the resolved path back only into the flat key — leaving the nested layerwise.checkpoint_dir at the original /x. Pydantic's _migrate_layerwise_checkpoint_dir then sees a conflict (/x_<hash> vs /x) and raises a ValidationError that points at the resolver's output rather than the user's input, which is hard to debug.
Two options:
- In `resolve_checkpoint_dir`, when the user has both keys, also overwrite the nested `layerwise.checkpoint_dir` (so the rewrite is consistent).
- Detect and raise here, before `resolve_checkpoint_dir` runs, with a message that names both keys.
Edge case, low priority — but the resulting error is confusing enough to be worth a guard.
Applied. resolve_checkpoint_dir now writes the resolved path back to both shapes when both are present in the input (the Pydantic migration validator already accepts matching values), so the post-resolve dict stays internally consistent and the subsequent validator doesn't see a stale value.
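A minimal sketch of that write-back rule, assuming the algorithm config is still a plain dict at this point. The helper name `write_back_checkpoint_dir` is hypothetical and is not the real `resolve_checkpoint_dir` signature.

```python
import copy


def write_back_checkpoint_dir(algorithm: dict, resolved: str) -> dict:
    """Return a copy of ``algorithm`` with every checkpoint-dir key that is present set to ``resolved``."""
    out = copy.deepcopy(algorithm)

    # Flat legacy key: update it only if the user supplied it.
    if out.get("layerwise_checkpoint_dir") is not None:
        out["layerwise_checkpoint_dir"] = resolved

    # Nested key: likewise updated only if already present, so no new shape is introduced.
    nested = out.get("layerwise")
    if isinstance(nested, dict) and nested.get("checkpoint_dir") is not None:
        nested["checkpoint_dir"] = resolved

    return out
```

Because both shapes end up with the same value, a later validator that rejects conflicting flat and nested values would see a consistent dict.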
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
Re-review: the previously flagged critical save_every > 1 corruption bug is still present in modelopt/torch/quantization/utils/layerwise_calib.py. The new test only checks file existence, not file contents, so the corruption remains silent. Other previous comments (manifest update during partial-window crash, _coerce_layerwise_input model_dump() losing model_fields_set) are also unaddressed. Marking comment rather than nudge because the critical bug from the previous review is unresolved.
Status of previous comments:
- (critical, UNRESOLVED) `save_every > 1` reads from `_SkipLayer` stand-ins → silent on-disk corruption. Still no resume test for `save_every > 1`.
- (critical, UNRESOLVED) Manifest is advanced per-`_save_layer` call, so a mid-window crash leaves `manifest=K` with no `next_inputs.pt` for layer `K`, breaking resume permanently.
- (major, UNRESOLVED) `_coerce_layerwise_input` calls `.model_dump()` on a `LayerwiseConfig` instance, which expands defaults and breaks `_gptq_qdq_default`'s reliance on `model_fields_set`.
- (minor) Cross-version `save_every`/`save_quantizers_only` mismatch on resume: still no manifest-side guard.
- (minor) `__all__` declaration in `config.py`: still missing.
- (minor) Symmetric `DeprecationWarning` for flat `layerwise_checkpoint_dir` migration: still silent.
- (minor) `resolve_checkpoint_dir` flat+nested redundancy: still no guard.
Resolved/no-longer-flagging:
- Changelog entry is present and the "TODO before marking ready" note appears to have been dropped from the PR body.
Design-review pass: the PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the design protocol's "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable; correctness, not design, is the issue here.
Additional comments (outside the PR diff):
modelopt/torch/quantization/utils/layerwise_calib.py:522— > Bot comment.
Critical (carried over, still unresolved): _save_layer calls _write_manifest(checkpoint_dir, idx, num_layers) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. If the process is killed (or torch.save raises) between the first _save_layer and the boundary _save_layer within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — the PR description's claim that "interrupting mid-window re-calibrates that window on resume" is not what the code does.
Fix: drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() (with last_completed_layer = layer_idx) so the manifest only advances after the boundary next_inputs.pt is on disk. A regression test that monkeypatches _write_manifest to raise mid-window would lock this in.
Codecov Report
✅ All modified and coverable lines are covered by tests.
Additional details and impacted files
@@ Coverage Diff @@
## main #1571 +/- ##
==========================================
+ Coverage 76.66% 76.92% +0.25%
==========================================
Files 478 478
Lines 52376 53211 +835
==========================================
+ Hits 40156 40931 +775
- Misses 12220 12280 +60
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
/claude review
cjluo-nv
left a comment
Bot review — DM the bot to share feedback.
Re-review: all previously-flagged critical issues are resolved with regression tests.
- `save_every > 1` `_SkipLayer` corruption (critical, was unresolved): Fixed. `_CheckpointState.save()` now reads from `layers[layer_idx]` (the just-calibrated layer) and is called every iteration before the next-layer capture forward swaps prior layers to `_SkipLayer`. Per-layer files (weights/quantizer state/output_meta) are written eagerly; only `next_inputs.pt` + manifest are deferred to window boundaries.
- Mid-window crash advancing manifest (critical, was unresolved): Fixed. `_save_layer_files` no longer writes the manifest; `_CheckpointState.save()` writes it once at boundary/final, so a torch.save crash mid-window leaves the manifest pointing at the previous boundary. Covered by `test_layerwise_save_every_mid_window_crash_recovers_at_prev_boundary` (monkeypatches torch.save to crash on layer 2 mid-window, asserts manifest stays at layer 1).
- `_coerce_layerwise_input` `.model_dump()` expanding defaults (critical/major, was unresolved): Fixed. Now uses `model_dump(exclude_unset=True)` so `LayerwiseConfig` instance inputs preserve `model_fields_set`, and GPTQ's `_gptq_qdq_default` correctly injects `True` for fields the user didn't explicitly set. Test parametrization extended to cover `LayerwiseConfig(enable=True)` + `LayerwiseConfig(enable=True, get_qdq_activations_from_prev_layer=False)`.
- Save_every>1 resume parity test (was missing): Added. `test_layerwise_checkpoint_resume_matches_one_shot_amax` parametrized over `(save_every=2, save_quantizers_only=False)` and `(save_every=1, save_quantizers_only=True)` rewinds the manifest and asserts amax equivalence with a one-shot baseline, i.e. it would fail under the previous corruption.
- Cross-version `save_every`/`save_quantizers_only` mismatch (minor): Addressed. Manifest now persists both keys; `from_folder` raises on any mismatch. Covered by `test_layerwise_checkpoint_mismatch_save_every_raises`.
- Symmetric DeprecationWarning for flat `layerwise_checkpoint_dir` (minor): Addressed; warning emitted in `_migrate_layerwise_checkpoint_dir`.
- `resolve_checkpoint_dir` flat+nested redundancy (minor): Addressed; resolver writes back to both shapes when both present, keeping the dict internally consistent.
- `__all__` declaration (minor): Author marked out of scope; acceptable.
Design-review pass: this PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable.
Complex PR: spans 8 directories (≥ 5). Looping in a human for approval.
What does this PR do?
Type of change: new feature
Groups all layerwise-calibration options under a nested `LayerwiseConfig` and adds three new behavior knobs to it. All changes are backward compatible.
1. Nested `layerwise` config
`QuantizeAlgorithmConfig.layerwise` changes from `bool` to a Pydantic submodel.
Backward compatibility:
- `layerwise: True/False` still accepted (emits `DeprecationWarning`).
- `layerwise_checkpoint_dir` silently migrated into `layerwise.checkpoint_dir`.
- `use_sequential` alias preserved (and resolved during flat-key migration so it can't be dropped).
- Conflicting flat and nested `checkpoint_dir` values raise.
- Recipes (`modelopt_recipes/general/ptq/*.yaml`, `huggingface/qwen3_5*/ptq/*.yaml`) migrated to the canonical nested shape; no semantic change.
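The backward-compatibility rules above can be sketched on plain dicts. This is an illustration only, not the PR's Pydantic validators: the function name `normalize_layerwise` is hypothetical, and the `use_sequential` alias is deliberately not modeled.

```python
import warnings


def normalize_layerwise(algorithm: dict) -> dict:
    """Return a copy of ``algorithm`` with ``layerwise`` in nested-dict form."""
    cfg = dict(algorithm)
    layerwise = cfg.get("layerwise")

    if layerwise is None:
        layerwise = {"enable": False}
    elif isinstance(layerwise, bool):
        # Legacy bool form: coerce to the nested shape and warn.
        warnings.warn(
            "bool `layerwise` is deprecated; pass a nested dict instead",
            DeprecationWarning,
            stacklevel=2,
        )
        layerwise = {"enable": layerwise}
    else:
        layerwise = dict(layerwise)

    # Legacy flat key: fold it into the nested dict, rejecting conflicts.
    flat = cfg.pop("layerwise_checkpoint_dir", None)
    if flat is not None:
        nested = layerwise.get("checkpoint_dir")
        if nested is not None and nested != flat:
            raise ValueError(
                "conflicting checkpoint dirs: "
                f"layerwise_checkpoint_dir={flat!r} vs layerwise.checkpoint_dir={nested!r}"
            )
        layerwise["checkpoint_dir"] = flat

    cfg["layerwise"] = layerwise
    return cfg
```

For example, `{"layerwise": True, "layerwise_checkpoint_dir": "/tmp/ckpt"}` would normalize to `{"layerwise": {"enable": True, "checkpoint_dir": "/tmp/ckpt"}}` while emitting a `DeprecationWarning`.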
2. `get_qdq_activations_from_prev_layer`: correct GPTQ vs max-calib semantics
Controls what layer N's calibration sees:
goal.
The False branch wraps the next-layer input capture forward with the existing `set_quantizer_by_cfg_context` deny-all idiom (`{"quantizer_name": "*", "enable": False}`).
GPTQ's per-algorithm default is enforced via `@model_validator(mode="after")` that reads `LayerwiseConfig.model_fields_set`. This works for every input shape (empty constructor, bool, partial dict, full dict) and lets explicit user values override.
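The "inject a default only when the user was silent" idea can be shown without Pydantic. The sketch below mimics `model_fields_set` with an explicit set. The class and function names are hypothetical, and the base default of `False` for the flag is an assumption rather than something this description states.

```python
class LayerwiseSketch:
    """Toy stand-in for a nested config that remembers which keys the user passed."""

    # Assumed base defaults (hypothetical); only the GPTQ override is described above.
    DEFAULTS = {"enable": False, "get_qdq_activations_from_prev_layer": False}

    def __init__(self, **user_values):
        unknown = set(user_values) - set(self.DEFAULTS)
        if unknown:
            raise TypeError(f"unknown fields: {sorted(unknown)}")
        self.fields_set = set(user_values)  # analogue of Pydantic's model_fields_set
        self.values = {**self.DEFAULTS, **user_values}


def apply_gptq_default(layerwise: LayerwiseSketch) -> LayerwiseSketch:
    """After-validation hook: set the flag to True unless the user set it explicitly."""
    key = "get_qdq_activations_from_prev_layer"
    if key not in layerwise.fields_set:
        layerwise.values[key] = True
    return layerwise
```

A partial input such as `LayerwiseSketch(enable=True)` picks up the algorithm default, while an explicit `get_qdq_activations_from_prev_layer=False` is left alone, which is the behavior a plain field default could not distinguish.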
3. `save_every`: gate the large activation-cache writes
`save_every: int = 1` (`ge=1`). With N > 1, the per-layer `next_inputs.pt` (cached activation tensors, the largest checkpoint artifact for most models) is only written for the boundary layer of each N-layer window. Per-layer weight/quantizer/output_meta files are still written every layer (resume needs them to replay skip layers correctly).
Interrupting mid-window re-calibrates that window on resume.
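To make the windowing concrete, here is a small sketch that lists which layers would receive `next_inputs.pt`. It assumes 0-based layer indices and windows that close on every `save_every`-th layer plus the final layer. The exact boundary arithmetic in the PR is not shown here, and `layers_with_next_inputs` is a hypothetical helper.

```python
def layers_with_next_inputs(num_layers: int, save_every: int) -> list[int]:
    """Return the 0-based layer indices that close a window (assumed boundary rule)."""
    if save_every < 1:
        raise ValueError("save_every must be >= 1")
    last = num_layers - 1
    return [
        idx
        for idx in range(num_layers)
        # A window closes every save_every layers; the final layer always closes one.
        if (idx + 1) % save_every == 0 or idx == last
    ]
```

Under these assumptions a 7-layer model with `save_every=3` writes the activation cache at layers 2, 5 and 6, so an interruption at layer 4 would resume from layer 2 and redo layers 3 and 4.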
4. `save_quantizers_only`: algorithm-aware weight-blob skipping
`save_quantizers_only: bool = False`. When True, skip `weights.pt` entirely and persist just the per-quantizer `state_dict` slice (carries `_amax`) to a new `quantizer_buffers.pt`. On resume, `full_restore` reloads only the quantizer slice and trusts that algorithm semantics didn't mutate `layer.weight`.
Safety is enforced by a whitelist: `_supports_save_quantizers_only: ClassVar[bool] = False` on `QuantizeAlgorithmConfig`, overridden to `True` only on `MaxCalibConfig`, `MseCalibConfig`, `LocalHessianCalibConfig` (audited; these only touch `_amax`). Weight-mutating algorithms (GPTQ folds Hessian updates, AWQ/SmoothQuant fold pre-quant scales) reject the flag at config-construction time so in-place weight updates can't be silently lost on resume.
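The whitelist pattern can be sketched with plain classes. The class names below are hypothetical stand-ins for the real config classes, and the check runs in `__init__` rather than in a Pydantic validator.

```python
from typing import ClassVar


class AlgoConfigSketch:
    """Base config: the flag is rejected unless a subclass opts in."""

    _supports_save_quantizers_only: ClassVar[bool] = False  # safe default

    def __init__(self, save_quantizers_only: bool = False):
        if save_quantizers_only and not self._supports_save_quantizers_only:
            raise ValueError(
                f"{type(self).__name__} does not support save_quantizers_only"
            )
        self.save_quantizers_only = save_quantizers_only


class MaxSketch(AlgoConfigSketch):
    # Opted in: assumed to only update quantizer buffers, never the weights.
    _supports_save_quantizers_only = True


class GPTQSketch(AlgoConfigSketch):
    # Mutates weights, so it inherits the rejecting default.
    pass


class FutureAlgoSketch(AlgoConfigSketch):
    # A new algorithm that nobody audited is rejected automatically.
    pass
```

The point of a whitelist over a blacklist shows up in `FutureAlgoSketch`: it needs no extra code to be safe, whereas a blacklist would have to be updated for every new weight-mutating algorithm.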
Usage
Testing
New / updated unit tests in `tests/unit/torch/quantization/`:
- `test_config_validation.py`: `TestLayerwiseNestedConfig` covers nested-form acceptance, bool-form `DeprecationWarning`, flat `layerwise_checkpoint_dir` migration, conflicting flat+nested checkpoint_dir, `use_sequential` alias survival under migration, per-algorithm qdq defaults (parametrized Max/GPTQ), `save_every` `ge=1` validation, and the `save_quantizers_only` whitelist (parametrized rejection on `[GPTQ, AWQLite, SmoothQuant]` and acceptance on `[Max, Mse, LocalHessian]`).
- `test_layerwise_calibrate.py`:
  - `test_layerwise_no_qdq_matches_sequential_amax`: behavioral equivalence; layerwise + `qdq=False` produces the same per-quantizer `_amax` as the non-layerwise (sequential) max-calibration flow (verified via `torch.testing.assert_close`).
  - `test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries`: window-save layout (all layer dirs present, `next_inputs.pt` only at boundaries).
  - `test_layerwise_save_quantizers_only_resume_matches_one_shot_amax`: end-to-end resume; full run → manifest rewound → fresh model resumes → final `_amax` matches the one-shot baseline. Also pins the on-disk shape (no `weights.pt`, `quantizer_buffers.pt` present).
End-to-end correctness verification
Ran 4 PTQ jobs on Qwen3-8B with NVFP4 W4A16 quant_cfg and `--calib_size 16`, one GPU each on 4 GPUs:
- `enable=true, get_qdq_activations_from_prev_layer=true, save_every=5` (`save_every` knob)
- `enable=true, get_qdq_activations_from_prev_layer=false, save_quantizers_only=true` (`save_quantizers_only` knob)
- `layerwise: true, layerwise_checkpoint_dir: ...`
- `enable=false`
Across two pairwise comparisons (A vs C; B vs D), all 905 tensors in the exported HF checkpoints are bit-identical with hf_quant_config.json matching exactly, confirming the new layerwise knobs preserve correctness and the flat-form backward-compat path is intact.
Before your PR is "Ready for review"
Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- `layerwise: True/False` still accepted with a `DeprecationWarning`; flat `layerwise_checkpoint_dir` silently migrated; `use_sequential` alias preserved. The two new knobs default to no-op behavior (`save_every=1`, `save_quantizers_only=False`).
- `CONTRIBUTING.md`: N/A, no new dependencies.
- ready for review.
- `/claude review` consulted iteratively; review findings (GPTQ default survival, save_quantizers_only whitelist scope, docstring accuracy, dead `layer` param) addressed in-PR.
Additional Information
Notes on design choices that came out of internal review:
- `qdq=True` default uses a `model_validator(mode="after")` + `model_fields_set` check rather than a `default_factory`: a `default_factory` is only fired when the field is absent, so any user-supplied dict (the natural way to enable layerwise) would silently lose the GPTQ default.
- `save_quantizers_only` is enforced as a whitelist (`_supports_save_quantizers_only`) rather than a per-algorithm blacklist, which keeps future weight-mutating algorithms safe by default.
- `set_quantizer_by_cfg_context` is reused for the qdq-disable scope instead of a bespoke helper, keeping `model_calib.py` aligned with the existing "deny-all" idiom documented at `conversion.py:240`.
- `examples/llm_ptq/example_utils.py` (`needs_checkpoint_path_update` / `resolve_checkpoint_dir`) accept both flat and nested shapes since they run before Pydantic validation; `resolve_checkpoint_dir` now also returns the resolved path so the caller doesn't re-derive it.
Summary by CodeRabbit