Skip to content

feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs#1571

Open
Fridah-nv wants to merge 5 commits into
mainfrom
fridah/layerwise-config
Open

feat: Layerwise calibration: nested config + QDQ-from-prev-layer flag + checkpoint I/O knobs#1571
Fridah-nv wants to merge 5 commits into
mainfrom
fridah/layerwise-config

Conversation

@Fridah-nv
Copy link
Copy Markdown
Contributor

@Fridah-nv Fridah-nv commented May 29, 2026

What does this PR do?

Type of change: new feature

Groups all layerwise-calibration options under a nested LayerwiseConfig and adds three new behavior knobs to it. All changes are backward compatible.

1. Nested layerwise config

QuantizeAlgorithmConfig.layerwise changes from bool to a Pydantic submodel:

class LayerwiseConfig(ModeloptBaseConfig):
    enable: bool = False
    get_qdq_activations_from_prev_layer: bool = False
    checkpoint_dir: str | None = None
    save_every: int = 1
    save_quantizers_only: bool = False

Backward compatibility:

  • layerwise: True/False still accepted (emits DeprecationWarning).
  • Flat layerwise_checkpoint_dir silently migrated into layerwise.checkpoint_dir.
  • Legacy use_sequential alias preserved (and resolved during flat-key migration so it can't be dropped).
  • Conflicting flat+nested checkpoint_dir values raise.
  • All 7 shipped PTQ recipes (modelopt_recipes/general/ptq/*.yaml, huggingface/qwen3_5*/ptq/*.yaml) migrated to the
    canonical nested shape — no semantic change.

2. get_qdq_activations_from_prev_layer — correct GPTQ vs max-calib semantics

Controls what layer N's calibration sees:

  • True (GPTQ default): activations carry the quantize-dequantize error of layers 0..N-1 — GPTQ's Hessian-compensation
    goal.
  • False (max/mse/local_hessian default): full-precision activations, matching the non-layerwise pass exactly.

The False branch wraps the next-layer input capture forward with the existing set_quantizer_by_cfg_context deny-all idiom
({"quantizer_name": "*", "enable": False}).

GPTQ's per-algorithm default is enforced via @model_validator(mode="after") that reads LayerwiseConfig.model_fields_set
works for every input shape (empty constructor, bool, partial dict, full dict) and lets explicit user values override.

3. save_every — gate the large activation-cache writes

save_every: int = 1 (ge=1). With N > 1, the per-layer next_inputs.pt (cached activation tensors, the largest checkpoint
artifact for most models) is only written for the boundary layer of each N-layer window. Per-layer
weight/quantizer/output_meta files are still written every layer (resume needs them to replay skip layers correctly).
Interrupting mid-window re-calibrates that window on resume.

4. save_quantizers_only — algorithm-aware weight-blob skipping

save_quantizers_only: bool = False. When True, skip weights.pt entirely and persist just the per-quantizer state_dict
slice (carries _amax) to a new quantizer_buffers.pt. On resume, full_restore reloads only the quantizer slice and
trusts that algorithm semantics didn't mutate layer.weight.

Safety is enforced by a whitelist: _supports_save_quantizers_only: ClassVar[bool] = False on QuantizeAlgorithmConfig,
overridden to True only on MaxCalibConfig, MseCalibConfig, LocalHessianCalibConfig (audited — these only touch
_amax). Weight-mutating algorithms (GPTQ folds Hessian updates, AWQ/SmoothQuant fold pre-quant scales) reject the flag at
config-construction time so in-place weight updates can't be silently lost on resume.

Usage

import modelopt.torch.quantization as mtq

# GPTQ — `get_qdq_activations_from_prev_layer` defaults to True (Hessian semantics).
# save_every reduces activation-cache I/O.
mtq.quantize(
    model,
    {
        "quant_cfg": [...],
        "algorithm": {
            "method": "gptq",
            "layerwise": {
                "enable": True,
                "checkpoint_dir": "/path/to/ckpts",
                "save_every": 4,
            },
        },
    },
    forward_loop=forward_loop,
)

# Max-calibration — `get_qdq_activations_from_prev_layer` defaults to False (FP from prior layers).
# save_quantizers_only skips the weights blob since max only updates _amax.
mtq.quantize(
    model,
    {
        "quant_cfg": [...],
        "algorithm": {
            "method": "max",
            "layerwise": {
                "enable": True,
                "checkpoint_dir": "/path/to/ckpts",
                "save_quantizers_only": True,
            },
        },
    },
    forward_loop=forward_loop,
)

Testing

New / updated unit tests in tests/unit/torch/quantization/:

  • test_config_validation.pyTestLayerwiseNestedConfig covers nested-form acceptance, bool-form
    DeprecationWarning, flat layerwise_checkpoint_dir migration, conflicting flat+nested checkpoint_dir, use_sequential
    alias survival under migration, per-algorithm qdq defaults (parametrized Max/GPTQ), save_every ge=1 validation, and the
    save_quantizers_only whitelist — parametrized rejection on [GPTQ, AWQLite, SmoothQuant] and acceptance on [Max, Mse, LocalHessian].

  • test_layerwise_calibrate.py

    • test_layerwise_no_qdq_matches_sequential_amax — behavioral equivalence: layerwise + qdq=False produces the same
      per-quantizer _amax as the non-layerwise (sequential) max-calibration flow (verified via torch.testing.assert_close).
    • test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries — window-save layout (all layer dirs present,
      next_inputs.pt only at boundaries).
    • test_layerwise_save_quantizers_only_resume_matches_one_shot_amax — end-to-end resume: full run → manifest rewound →
      fresh model resumes → final _amax matches the one-shot baseline; also pins the on-disk shape (no weights.pt,
      quantizer_buffers.pt present).

End-to-end correctness verification

Ran 4 PTQ jobs on Qwen3-8B with NVFP4 W4A16 quant_cfg and --calib_size 16, one GPU
each on 4 GPUs

Run Algorithm Layerwise config Purpose
A GPTQ enable=true, get_qdq_activations_from_prev_layer=true, save_every=5 New nested form + the newsave_every knob
B MSE enable=true, get_qdq_activations_from_prev_layer=false, save_quantizers_only=true New nested form + the new save_quantizers_only knob
C GPTQ Legacy flat form: layerwise: true, layerwise_checkpoint_dir: ... Backward-compat baseline for A (exercises the migration validator)
D MSE enable=false Non-layerwise baseline for B

Across two pairwise comparisons (A vs C ; B vs D), all 905 tensors in the exported HF checkpoints are bit-identical with hf_quant_config.json matching exactly — confirming the new layerwise knobs preserve correctness and the flat-form backward-compat path is intact.

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines
and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best
Practices
(e.g.
avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ — layerwise: True/False still accepted with a DeprecationWarning; flat
    layerwise_checkpoint_dir silently migrated; use_sequential alias preserved. The two new knobs default to no-op behavior
    (save_every=1, save_quantizers_only=False).
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A
    — no new dependencies.
  • Did you write any new necessary tests?: ✅ — see Testing section.
  • Did you update Changelog?: ✅
    ready for review.
  • Did you get Claude approval on this PR?: ✅ — /claude review consulted iteratively; review findings (GPTQ default
    survival, save_quantizers_only whitelist scope, docstring accuracy, dead layer param) addressed in-PR.

Additional Information

Notes on design choices that came out of internal review:

  • GPTQ's qdq=True default uses a model_validator(mode="after") + model_fields_set check rather than a default_factory
    — a default_factory is only fired when the field is absent, so any user-supplied dict (the natural way to enable
    layerwise) would silently lose the GPTQ default.
  • save_quantizers_only is enforced as a whitelist (_supports_save_quantizers_only) rather than a per-algorithm blacklist,
    which keeps future weight-mutating algorithms safe by default.
  • set_quantizer_by_cfg_context is reused for the qdq-disable scope instead of a bespoke helper, keeping model_calib.py
    aligned with the existing "deny-all" idiom documented at conversion.py:240.
  • Pre-validation recipe helpers in examples/llm_ptq/example_utils.py (needs_checkpoint_path_update /
    resolve_checkpoint_dir) accept both flat and nested shapes since they run before Pydantic validation;
    resolve_checkpoint_dir now also returns the resolved path so the caller doesn't re-derive it.

Summary by CodeRabbit

  • New Features
    • Layerwise calibration adds nested configuration with new options: get_qdq_activations_from_prev_layer, save_every, save_quantizers_only, and clearer enable/checkpoint controls.
  • Deprecations
    • Legacy flat-style layerwise keys are deprecated but still supported with warnings; migrated to nested form.
  • Tests
    • Expanded tests for nested layerwise configs, checkpointing semantics, windowed saves, resume/recovery, and validation.

Review Change Stack

Fridah-nv added 3 commits May 29, 2026 21:50
…layer

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv Fridah-nv requested review from a team as code owners May 29, 2026 23:02
@Fridah-nv Fridah-nv requested a review from cjluo-nv May 29, 2026 23:02
@coderabbitai
Copy link
Copy Markdown
Contributor

coderabbitai Bot commented May 29, 2026

📝 Walkthrough

Walkthrough

This PR restructures layerwise quantization calibration configuration from flat boolean and separate checkpoint-directory fields to a nested LayerwiseConfig object with support for checkpoint windowing, quantizer-only persistence, and QDQ activation control. It includes backward-compatible migration logic, per-algorithm feature allowlisting, and updates to the calibration pipeline, persistence layer, example utilities, recipe files, and tests.

Changes

Layerwise Configuration Restructuring

Layer / File(s) Summary
LayerwiseConfig schema, validation, and legacy migration
modelopt/torch/quantization/config.py, tests/unit/torch/quantization/test_config_validation.py
New LayerwiseConfig model replaces flat layerwise: bool + layerwise_checkpoint_dir with nested enable/checkpoint_dir/save_every/save_quantizers_only/get_qdq_activations_from_prev_layer. _coerce_layerwise_input normalizes legacy boolean/alias inputs (emits DeprecationWarning) and migrates flat checkpoint_dir into nested form. Per-algorithm _supports_save_quantizers_only flags control quantizer-only checkpoint support. GPTQ validator injects a default for get_qdq_activations_from_prev_layer. Tests cover parsing, migration, conflicts, and algorithm-specific rules.
Config consumption in calibration pipeline
modelopt/torch/quantization/mode.py, modelopt/torch/quantization/model_calib.py
wrapped_calib_func and layerwise_calibrate extract nested layerwise parameters (enable, checkpoint_dir, get_qdq_activations_from_prev_layer, save_every, save_quantizers_only). Pipeline uses a conditional context to control QDQ activation capture for next-layer inputs and initializes _CheckpointState with the new options. Error strings and GPTQ docstring updated to reference layerwise.enable.
Checkpoint persistence with windowing and quantizer-only support
modelopt/torch/quantization/utils/layerwise_calib.py, tests/unit/torch/quantization/test_layerwise_calibrate.py
Manifest now stores save_every and save_quantizers_only. Adds _save_layer_files to write per-layer qstate/output_meta plus either weights.pt or quantizer_buffers.pt. _CheckpointState accepts save_every and save_quantizers_only, tracks _last_saved_layer for window commits, and from_folder validates/propagates these keys. save(...) snapshots per layer but only commits manifest and next_inputs.pt at window boundaries or final layer. Tests verify windowing, quantizer-only layout, resume, and crash-recovery semantics.
User-facing integration: examples, recipes, and behavior tests
examples/llm_ptq/example_utils.py, examples/llm_ptq/hf_ptq.py, modelopt_recipes/general/ptq/*.yaml, modelopt_recipes/huggingface/*/*.yaml, tests/unit/torch/quantization/test_layerwise_calibrate.py
_layerwise_checkpoint_dir_location() detects flat vs nested shapes; resolve_checkpoint_dir() returns (quant_cfg, resolved_path) and mutates the correct shape. HF PTQ example reads layerwise.enable and unpacks the resolved path. Multiple PTQ recipes updated to nested layerwise: { enable: bool, checkpoint_dir: ... }. End-to-end tests validate qdq-from-prev behavior and checkpoint resume correctness.
Changelog documentation
CHANGELOG.rst
Documents introduction of nested LayerwiseConfig with new checkpoint and QDQ activation options, and deprecation of legacy boolean and flat checkpoint_dir fields.

Sequence Diagram(s)

sequenceDiagram
  participant User
  participant QuantizationConfig as Config Schema
  participant CalibPipeline as Calibration Pipeline
  participant CheckpointState as Checkpoint Persistence
  
  User->>QuantizationConfig: provide legacy or nested layerwise config
  QuantizationConfig->>QuantizationConfig: coerce bool to nested dict (emit DeprecationWarning)
  QuantizationConfig->>QuantizationConfig: migrate layerwise_checkpoint_dir → layerwise.checkpoint_dir
  QuantizationConfig->>CalibPipeline: return normalized LayerwiseConfig
  
  CalibPipeline->>CalibPipeline: extract layerwise.enable, checkpoint_dir, save_every, save_quantizers_only, qdq_from_prev
  alt layerwise.enable=True
    CalibPipeline->>CheckpointState: create _CheckpointState(checkpoint_dir, save_every, save_quantizers_only)
    loop per layer
      CalibPipeline->>CalibPipeline: calibrate layer
      alt qdq_from_prev=False
        CalibPipeline->>CalibPipeline: disable quantizers under calibrated layer
      end
      CalibPipeline->>CalibPipeline: capture next-layer inputs
      alt layer_idx % save_every == 0 OR final layer
        CalibPipeline->>CheckpointState: save(layer_idx, model, layers)
        CheckpointState->>CheckpointState: save qstate, (weights.pt OR quantizer_buffers.pt), output_meta
        CheckpointState->>CheckpointState: write next_inputs.pt only for final layer in window
      end
    end
  else layerwise.enable=False
    CalibPipeline->>CalibPipeline: sequential calibration
  end
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~60 minutes

Possibly related PRs

  • NVIDIA/Model-Optimizer#1251: Prior layerwise + checkpoint resume work that this change builds upon (nested layerwise config and checkpoint mechanics).

Suggested reviewers

  • realAsma
  • meenchen
  • cjluo-nv
  • Edwardf0t1
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 54.17% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (5 passed)
Check name Status Explanation
Title check ✅ Passed The title accurately describes the main changes: refactoring layerwise calibration into a nested config structure with new QDQ-from-prev-layer and checkpoint I/O options, which is the primary focus across config, calibration, and example files.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Security Anti-Patterns ✅ Passed All torch.load calls with weights_only=False have inline comments justifying safety; trust_remote_code are parameters with False defaults; no eval/exec, # nosec, or new dependencies found.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch fridah/layerwise-config

Comment @coderabbitai help to get the list of available commands and usage tips.

@Fridah-nv Fridah-nv requested review from realAsma and sugunav14 May 29, 2026 23:03
Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv
Copy link
Copy Markdown
Contributor Author

/claude review

@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 29, 2026

PR Preview Action v1.8.1

QR code for preview link

🚀 View preview at
https://NVIDIA.github.io/Model-Optimizer/pr-preview/pr-1571/

Built to branch gh-pages at 2026-05-30 00:11 UTC.
Preview will be ready when the GitHub Pages deployment is complete.

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

The nested LayerwiseConfig refactor, get_qdq_activations_from_prev_layer flag, and save_quantizers_only whitelist look well-designed and the back-compat shims (bool coercion + flat→nested migration with the use_sequential rescue) are thorough. Tests around config validation and the qdq=False-matches-sequential-amax behavioral check are good.

However, save_every > 1 looks broken: by the time _CheckpointState.save() is invoked for the boundary layer, the earlier layers in the window have already been replaced by _SkipLayer instances in transformer_layers (via _set_layer_states(layer_idx+1) swapping layer_idx-1 to a dummy during cache_outputs_for_next_layer_calib). Reading layers[i].state_dict() / quantizer_state(layers[i]) / layers[i]._layerwise_calib.output_meta on a _SkipLayer returns empty/None state because _original is attached via object.__setattr__ and is not a registered submodule. The new test_layerwise_save_every_writes_next_inputs_only_at_window_boundaries only checks file existence, so the silent corruption isn't caught — there is no resume test for save_every > 1. See inline comment for details and the suggested fix (snapshot to memory immediately after each layer is calibrated, then flush at boundaries; or save real layers eagerly and only window-gate next_inputs.pt).

Minor: changelog says "TODO before marking ready for review" in the PR body but the changelog entry is already added — please confirm the description is up to date before merge.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Comment thread modelopt/torch/quantization/utils/layerwise_calib.py
Comment thread modelopt/torch/quantization/config.py
Copy link
Copy Markdown
Contributor

@coderabbitai coderabbitai Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 3

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@modelopt/torch/quantization/config.py`:
- Around line 153-160: Add an explicit __all__ declaration near the top of the
module that lists the new public API symbols (e.g. "LayerwiseConfig" and any
other public config classes, functions or constants defined in this file and the
other block referenced around 636-691); place it after the imports and ensure it
enumerates every symbol intended for export so star-imports are safe and stable,
and update corresponding package __init__.py re-exports if needed to re-export
these names.
- Around line 706-708: _coerce_layerwise_input currently returns
value.model_dump() for LayerwiseConfig which expands defaults and loses the
original model_fields_set causing GPTQ's _gptq_qdq_default to miss injecting
defaults; change _coerce_layerwise_input to return the LayerwiseConfig instance
(value) unchanged when isinstance(value, LayerwiseConfig) so downstream parsing
preserves model_fields_set (e.g., self.layerwise.model_fields_set) and
GPTQCalibConfig._gptq_qdq_default can correctly detect which fields the user
actually set.

In `@modelopt/torch/quantization/utils/layerwise_calib.py`:
- Around line 729-761: The loop that saves windowed checkpoints is capturing
_SkipLayer placeholders rather than the underlying calibrated layer because
cache_outputs_for_next_layer_calib() can replace entries in layers with
_SkipLayer before save; update the save loop in the function that iterates
layers[i] to detect instances of _SkipLayer and use layer = layer._original (or
otherwise snapshot the original layer list before cache outputs mutate it) so
_save_layer receives the real calibrated layer, its quantizer state
(quantizer_state(layer)), and correct output_meta rather than the placeholder.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: e32cbbac-64bc-4625-80b8-93a1790c3d62

📥 Commits

Reviewing files that changed from the base of the PR and between 40a4dd3 and dbca1a5.

📒 Files selected for processing (16)
  • CHANGELOG.rst
  • examples/llm_ptq/example_utils.py
  • examples/llm_ptq/hf_ptq.py
  • modelopt/torch/quantization/config.py
  • modelopt/torch/quantization/mode.py
  • modelopt/torch/quantization/model_calib.py
  • modelopt/torch/quantization/utils/layerwise_calib.py
  • modelopt_recipes/general/ptq/nvfp4_default-kv_none-gptq.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only-kv_fp8_layerwise.yaml
  • modelopt_recipes/general/ptq/nvfp4_experts_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/general/ptq/nvfp4_mlp_only_mse-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/qwen3_5/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
  • modelopt_recipes/huggingface/qwen3_5_moe/ptq/w4a16_nvfp4-fp8_attn-kv_fp8_cast.yaml
  • tests/unit/torch/quantization/test_config_validation.py
  • tests/unit/torch/quantization/test_layerwise_calibrate.py

Comment thread modelopt/torch/quantization/config.py
Comment thread modelopt/torch/quantization/config.py
Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review — 1 CRITICAL, 0 IMPORTANT, 2 SUGGESTIONs.

Most impactful finding (inline on modelopt/torch/quantization/utils/layerwise_calib.py):

_save_layer calls _write_manifest(...) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. So if the process is killed (or torch.save raises) between the first and last _save_layer call within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, _CheckpointState.setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — there's no fallback that re-calibrates the partial window. The PR description's claim that "interrupting mid-window re-calibrates that window on resume" is the intended invariant, but the implementation doesn't enforce it.

The fix is small: only advance the manifest after the entire window has been flushed (drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() with last_completed_layer = layer_idx). A regression test that monkeypatches _write_manifest to raise mid-window would lock this in. With save_every=1 (the default) the bug isn't reachable, so the existing tests don't catch it — this only bites users who opt into the new knob.

Other:

  • SUGGESTION: legacy flat layerwise_checkpoint_dir is migrated silently in _migrate_layerwise_checkpoint_dir. The bool-form layerwise emits a DeprecationWarning; consider symmetric treatment so the migration window for the flat key is signaled.
  • SUGGESTION: resolve_checkpoint_dir in examples/llm_ptq/example_utils.py only updates whichever shape it picks — if the user happens to specify both flat and nested with the same value, the rewrite produces a Pydantic conflict whose error message points at the resolver's output, not at the user's redundancy. Edge case but worth a guard.

Risk: Low–medium. The default path (save_every=1, no save_quantizers_only) is well-tested and the algorithm-correctness pieces (qdq-from-prev semantics, GPTQ default via model_fields_set, save_quantizers_only whitelist) look right and are covered by the new tests. The save_every>1 path has the resume-corruption bug above, which only opt-in users will hit.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py
Comment thread modelopt/torch/quantization/config.py
Comment on lines +853 to +865
def _layerwise_checkpoint_dir_location(algorithm) -> tuple[str, str] | None:
"""Return ``("flat"/"nested", checkpoint_dir)`` for the layerwise checkpoint dir, or None."""
if not isinstance(algorithm, dict):
return False
return algorithm.get("layerwise_checkpoint_dir") is not None
return None
flat = algorithm.get("layerwise_checkpoint_dir")
if flat is not None:
return "flat", flat
nested = algorithm.get("layerwise") or {}
ckpt = nested.get("checkpoint_dir") if isinstance(nested, dict) else None
return ("nested", ckpt) if ckpt is not None else None


def needs_checkpoint_path_update(quant_cfg: dict) -> bool:
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[SUGGESTION] When the user redundantly specifies the checkpoint dir under both shapes with the same value (algorithm = {"layerwise": {"checkpoint_dir": "/x"}, "layerwise_checkpoint_dir": "/x"}), this helper picks "flat" and resolve_checkpoint_dir writes the resolved path back only into the flat key — leaving the nested layerwise.checkpoint_dir at the original /x. Pydantic's _migrate_layerwise_checkpoint_dir then sees a conflict (/x_<hash> vs /x) and raises a ValidationError that points at the resolver's output rather than the user's input, which is hard to debug.

Two options:

  1. In resolve_checkpoint_dir, when the user has both keys, also overwrite the nested layerwise.checkpoint_dir (so the rewrite is consistent).
  2. Detect and raise here, before resolve_checkpoint_dir runs, with a message that names both keys.

Edge case, low priority — but the resulting error is confusing enough to be worth a guard.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Applied. resolve_checkpoint_dir now writes the resolved path back to both shapes when both are present in the input (the Pydantic migration validator already accepts matching values), so the post-resolve dict stays internally consistent and the subsequent validator doesn't see a stale value.

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: the previously flagged critical save_every > 1 corruption bug is still present in modelopt/torch/quantization/utils/layerwise_calib.py. The new test only checks file existence, not file contents, so the corruption remains silent. Other previous comments (manifest update during partial-window crash, _coerce_layerwise_input model_dump() losing model_fields_set) are also unaddressed. Marking comment rather than nudge because the critical bug from the previous review is unresolved.

Status of previous comments:

  • (critical, UNRESOLVED) save_every > 1 reads from _SkipLayer stand-ins → silent on-disk corruption. Still no resume test for save_every > 1.
  • (critical, UNRESOLVED) Manifest is advanced per-_save_layer call, so a mid-window crash leaves manifest=K with no next_inputs.pt for layer K, breaking resume permanently.
  • (major, UNRESOLVED) _coerce_layerwise_input calls .model_dump() on a LayerwiseConfig instance, which expands defaults and breaks _gptq_qdq_default's reliance on model_fields_set.
  • (minor) Cross-version save_every / save_quantizers_only mismatch on resume — still no manifest-side guard.
  • (minor) __all__ declaration in config.py — still missing.
  • (minor) Symmetric DeprecationWarning for flat layerwise_checkpoint_dir migration — still silent.
  • (minor) resolve_checkpoint_dir flat+nested redundancy — still no guard.

Resolved/no-longer-flagging:

  • Changelog entry is present and the "TODO before marking ready" note appears to have been dropped from the PR body.

Design-review pass: the PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the design protocol's "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable; correctness, not design, is the issue here.


Additional comments (outside the PR diff):

  • modelopt/torch/quantization/utils/layerwise_calib.py:522 — > Bot comment.

Critical (carried over, still unresolved): _save_layer calls _write_manifest(checkpoint_dir, idx, num_layers) at the end of every individual call, but during a save_every > 1 window save the non-boundary layers in the window are written without a next_inputs.pt. If the process is killed (or torch.save raises) between the first _save_layer and the boundary _save_layer within a window, the manifest advances to a layer that has no next_inputs.pt. On resume, setup_resume then raises FileNotFoundError("Cannot resume: next_inputs.pt missing for layer N") and the checkpoint is permanently broken — the PR description's claim that "interrupting mid-window re-calibrates that window on resume" is not what the code does.

Fix: drop the _write_manifest call from _save_layer and call it once at the end of _CheckpointState.save() (with last_completed_layer = layer_idx) so the manifest only advances after the boundary next_inputs.pt is on disk. A regression test that monkeypatches _write_manifest to raise mid-window would lock this in.

Comment thread modelopt/torch/quantization/utils/layerwise_calib.py Outdated
Comment thread modelopt/torch/quantization/config.py Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 29, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.92%. Comparing base (eb5ed2d) to head (eba56ee).
⚠️ Report is 6 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1571      +/-   ##
==========================================
+ Coverage   76.66%   76.92%   +0.25%     
==========================================
  Files         478      478              
  Lines       52376    53211     +835     
==========================================
+ Hits        40156    40931     +775     
- Misses      12220    12280      +60     
Flag Coverage Δ
examples 41.50% <44.14%> (+8.62%) ⬆️
gpu 59.52% <90.99%> (-0.55%) ⬇️
regression 15.20% <37.83%> (+0.02%) ⬆️
unit 53.69% <100.00%> (+0.19%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Signed-off-by: Frida Hou <201670829+Fridah-nv@users.noreply.github.com>
@Fridah-nv
Copy link
Copy Markdown
Contributor Author

/claude review

Copy link
Copy Markdown
Collaborator

@cjluo-nv cjluo-nv left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Bot review — DM the bot to share feedback.

Re-review: all previously-flagged critical issues are resolved with regression tests.

  • save_every > 1 _SkipLayer corruption (critical, was unresolved) — Fixed. _CheckpointState.save() now reads from layers[layer_idx] (the just-calibrated layer) and is called every iteration before the next-layer capture forward swaps prior layers to _SkipLayer. Per-layer files (weights/quantizer state/output_meta) are written eagerly; only next_inputs.pt + manifest are deferred to window boundaries.
  • Mid-window crash advancing manifest (critical, was unresolved) — Fixed. _save_layer_files no longer writes the manifest; _CheckpointState.save() writes it once at boundary/final, so a torch.save crash mid-window leaves the manifest pointing at the previous boundary. Covered by test_layerwise_save_every_mid_window_crash_recovers_at_prev_boundary (monkeypatches torch.save to crash on layer 2 mid-window, asserts manifest stays at layer 1).
  • _coerce_layerwise_input .model_dump() expanding defaults (critical/major, was unresolved) — Fixed. Now uses model_dump(exclude_unset=True) so LayerwiseConfig instance inputs preserve model_fields_set, and GPTQ's _gptq_qdq_default correctly injects True for fields the user didn't explicitly set. Test parametrization extended to cover LayerwiseConfig(enable=True) + LayerwiseConfig(enable=True, get_qdq_activations_from_prev_layer=False).
  • Save_every>1 resume parity test (was missing) — Added. test_layerwise_checkpoint_resume_matches_one_shot_amax parametrized over (save_every=2, save_quantizers_only=False) and (save_every=1, save_quantizers_only=True) rewinds the manifest and asserts amax equivalence with a one-shot baseline — i.e. it would fail under the previous corruption.
  • Cross-version save_every/save_quantizers_only mismatch (minor) — Addressed. Manifest now persists both keys; from_folder raises on any mismatch. Covered by test_layerwise_checkpoint_mismatch_save_every_raises.
  • Symmetric DeprecationWarning for flat layerwise_checkpoint_dir (minor) — Addressed; warning emitted in _migrate_layerwise_checkpoint_dir.
  • resolve_checkpoint_dir flat+nested redundancy (minor) — Addressed; resolver writes back to both shapes when both present, keeping the dict internally consistent.
  • __all__ declaration (minor) — Author marked out of scope; acceptable.

Design-review pass: this PR refactors an existing in-repo subsystem (layerwise calibration + checkpoint format) rather than introducing a new one, so the "is there an existing tool that already does this?" question doesn't apply — the existing tool is what's being refactored. The nested-config + whitelist + per-algorithm default approach is reasonable.

Complex PR: spans 8 directories (≥ 5). Looping in a human for approval.

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude review passed — no blocking issues found. LGTM

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants