
Drive hf_ptq qformat choices from preset YAMLs (remove hardcoded CLI quant configs) #1525

Open
shengliangxu wants to merge 4 commits into main from shengliangx/hf-ptq-dereference-hardcoded-configs


@shengliangxu
Collaborator

@shengliangxu shengliangxu commented May 21, 2026

What does this PR do?

Type of change: Refactor

Replace the hardcoded QUANT_CFG_CHOICES / KV_QUANT_CFG_CHOICES dicts in examples/llm_ptq/hf_ptq.py with a lazy Mapping that discovers available qformat names by listing modelopt_recipes/configs/ptq/presets/{model,kv}/ and loads each YAML on first access via the existing load_config(..., schema_type=QuantizeConfig) path. The directory listing becomes the source of truth for --qformat / --kv_cache_qformat CLI vocabulary.
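
The lazy-mapping idea described above can be sketched roughly as follows. This is a minimal illustration, not the PR's actual class: `LazyPresetChoices` and the injected `loader` callable are hypothetical names, and the real script loads files through `load_config(..., schema_type=QuantizeConfig)` rather than an arbitrary callable.

```python
from collections.abc import Callable, Iterator, Mapping
from pathlib import Path
from typing import Any


class LazyPresetChoices(Mapping[str, Any]):
    """Hypothetical stand-in for the PR's lazy mapping; not the actual class."""

    def __init__(self, preset_dir: Path, loader: Callable[[Path], Any]) -> None:
        # Keys come from the directory listing: one name per *.yaml file.
        self._paths = {p.stem: p for p in sorted(preset_dir.glob("*.yaml"))}
        self._loader = loader
        self._cache: dict[str, Any] = {}

    def __getitem__(self, name: str) -> Any:
        if name not in self._paths:
            raise KeyError(name)
        if name not in self._cache:
            # Parse the file only on first access, then memoize the result.
            self._cache[name] = self._loader(self._paths[name])
        return self._cache[name]

    def __iter__(self) -> Iterator[str]:
        return iter(self._paths)

    def __len__(self) -> int:
        return len(self._paths)
```

Because iteration and membership tests only touch the directory listing, building an argparse `choices` list from such a mapping would not parse any YAML; a file is read only when its key is looked up.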

A small _QFORMAT_ALIASES table preserves previously-supported short CLI names (int8_sq, nvfp4_awq, fp8_pb_wo, ...) as deprecation shims. It is documented as not-for-extension — new formats land as preset YAMLs, and longer term, configurations should be authored as full recipes (--recipe).
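
A rough sketch of how such an alias shim could resolve names is shown below. Only the `int8_sq` to `int8_smoothquant` pair is taken from this PR's discussion; the helper name `canonical_qformat`, the `known` parameter, and the decision to emit a `DeprecationWarning` are assumptions made for illustration.

```python
import warnings

# Only the int8_sq pair is taken from this PR's discussion; the real
# _QFORMAT_ALIASES table has more entries that are not reproduced here.
ALIASES = {"int8_sq": "int8_smoothquant"}


def canonical_qformat(name: str, known: set[str]) -> str:
    """Hypothetical helper: map a CLI token to a preset basename."""
    if name in known:
        # Already a canonical preset basename.
        return name
    target = ALIASES.get(name)
    if target in known:
        # Assumption: legacy names warn instead of failing.
        warnings.warn(
            f"--qformat {name} is a legacy alias; use {target}",
            DeprecationWarning,
            stacklevel=2,
        )
        return target
    raise KeyError(name)
```

Checking canonical names first means a preset file can never be shadowed by an alias with the same spelling.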

Also adds presets/kv/fp8_cast.yaml and presets/kv/nvfp4_cast.yaml, composed from the existing kv_fp8_cast / kv_nvfp4_cast unit fragments. This promotes fp8_cast / nvfp4_cast to first-class KV presets and lets us delete the runtime _set_kv_cache_constant_amax helper and all three of its call sites — use_constant_amax is now authoritative in the YAML.

Side effect: every preset YAML under presets/model/ (mxfp4, mxfp6, mxint8, nvfp4_awq_full, nvfp4_fp8_mha, mamba_moe_*, ...) is now automatically exposed as a valid --qformat value with no further code change.

Usage

# Old short names still work via the alias shim
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat int8_sq \
    --kv_cache_qformat fp8_cast \
    --export_path out/

# New canonical preset basenames work directly
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat int8_smoothquant \
    --kv_cache_qformat fp8_cast \
    --export_path out/

# Newly-exposed presets (previously not on the CLI)
python examples/llm_ptq/hf_ptq.py \
    --pyt_ckpt_path <model> \
    --qformat nvfp4_awq_full \
    --export_path out/

Testing

Verified locally with both .venv (uv, py3.13) and the dev-py310-modelopt conda env:

  • All 20 previously-supported --qformat short names resolve and produce dicts that are exactly equal to the corresponding mtq.X_DEFAULT_CFG constants.
  • All 7 KV qformat names (fp8, fp8_cast, fp8_affine, nvfp4, nvfp4_cast, nvfp4_affine, nvfp4_rotate) resolve and match.
  • fp8_cast / nvfp4_cast YAML presets now contain use_constant_amax: true baked into the [kv]_bmm_quantizer cfg.
  • Non-cast variants (fp8, nvfp4) still do not set use_constant_amax (data-driven calibration preserved).
  • argparse accepts --kv_cache_qformat none plus all cast / affine / rotate variants.
  • Unknown qformats raise KeyError at lookup time and argparse choice error at the CLI.
  • All pre-commit hooks pass (ruff, mypy, bandit, license, yaml format, recipe validation).
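
The two argparse bullets above can be illustrated with a small sketch. It assumes the preset names are passed in as a plain list; `build_parser` and `KV_NONE` are hypothetical names standing in for the script's parser setup and its `_KV_NONE` sentinel.

```python
import argparse

KV_NONE = "none"  # sentinel meaning "do not quantize the KV cache"


def build_parser(kv_presets: list[str]) -> argparse.ArgumentParser:
    """Hypothetical parser builder; preset names are passed in for the demo."""
    # Guard against a preset file that happens to be called "none".
    assert KV_NONE not in kv_presets, "sentinel collides with a preset name"
    parser = argparse.ArgumentParser()
    parser.add_argument(
        "--kv_cache_qformat",
        default="fp8_cast",
        # The sentinel plus every discovered preset name is accepted.
        choices=[KV_NONE, *kv_presets],
    )
    return parser
```

With this shape, an unknown value is rejected by argparse before any preset lookup happens, which matches the behavior described in the testing notes.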

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ — all previously-valid --qformat and --kv_cache_qformat values continue to work via the alias table; output configs are bit-equivalent to the prior hardcoded path.
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: N/A — no new deps.
  • Did you write any new necessary tests?: ❌ — existing PTQ integration tests exercise these qformats; the refactor is config-equivalence-preserving and was spot-verified against mtq.X_DEFAULT_CFG constants.
  • Did you update Changelog?: ❌ — happy to add an entry; treating this as an internal refactor for now.
  • Did you get Claude approval on this PR?: ❌ — will run /claude review once ready.

Additional Information

  • Two new YAML presets: modelopt_recipes/configs/ptq/presets/kv/{fp8_cast,nvfp4_cast}.yaml.
  • Deletes: _set_kv_cache_constant_amax helper + all 3 call sites in hf_ptq.py.
  • multinode_ptq.py is intentionally untouched (out of scope for this branch).

Summary by CodeRabbit

  • New Features

    • Added FP8-cast and NVFP4-cast KV-cache quantization presets to expand supported quantization options.
  • Refactor

    • qformat and KV-cache qformat choices are now discovered from preset files, with backward-compatible aliases preserved.
    • KV-cache calibration/enabling behavior is driven by presets rather than a runtime override; CLI validation and help updated to reflect preset-driven choices.
  • Documentation

    • Changelog updated to describe preset-driven options and new KV presets.

@copy-pr-bot

copy-pr-bot Bot commented May 21, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai Bot commented May 21, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.


Walkthrough

This PR refactors quantization configuration discovery in a PTQ example script from static mtq.*_CFG dictionaries to lazy-loaded YAML presets. It introduces a memoized preset loader, applies it across all quantization pipelines, removes post-hoc configuration overrides, and adds new KV-cache cast preset files.

Changes

PTQ quantization preset refactoring

| Layer / File(s) | Summary |
|---|---|
| Preset infrastructure foundation: examples/llm_ptq/hf_ptq.py (lines 21, 59, 70–75, 95–229) | Imports support for lazy mappings and recipe utilities. Defines BUILTIN_CONFIG_ROOT, _PresetCfgChoices lazy mapping class, preset directory constants, backward-compatible qformat aliases, _KV_NONE sentinel, and _AUTO_QUANTIZE_QFORMATS validation set. Replaces static QUANT_CFG_CHOICES and KV_QUANT_CFG_CHOICES dicts with discovered preset mappings. |
| Using presets across quantization pipelines: examples/llm_ptq/hf_ptq.py (lines 406–408, 484–491, 517–524, 1174, 1186–1193, 1206) | Auto-quantize, low-memory, and mono quantization paths now retrieve KV-cache configs from KV_QUANT_CFG_CHOICES preset mappings instead of mtq module lookups. KV-cache enabling switches from string "none" to _KV_NONE sentinel. Removes _set_kv_cache_constant_amax helper and post-hoc override logic that forced use_constant_amax for cast formats. Updates error messages and validation to reflect the new preset mapping type. |
| CLI argument updates: examples/llm_ptq/hf_ptq.py (line 1356) | --kv_cache_qformat argument choices now include _KV_NONE and dynamic keys from the KV_QUANT_CFG_CHOICES preset mapping instead of static dict keys. |
| New KV-cache cast presets & changelog: modelopt_recipes/configs/ptq/presets/kv/fp8_cast.yaml, modelopt_recipes/configs/ptq/presets/kv/nvfp4_cast.yaml, CHANGELOG.rst | Adds FP8 E4M3 and NVFP4 KV-cache cast preset YAML files (each with imports and $import usage in quant_cfg) and updates the changelog describing preset-driven CLI discovery and removal of runtime use_constant_amax patching. |

🎯 3 (Moderate) | ⏱️ ~25 minutes

Suggested reviewers:

  • Edwardf0t1
  • realAsma
  • meenchen
🚥 Pre-merge checks | ✅ 5 | ❌ 1

❌ Failed checks (1 warning)

| Check name | Status | Explanation | Resolution |
|---|---|---|---|
| Docstring Coverage | ⚠️ Warning | Docstring coverage is 25.00% which is insufficient. The required threshold is 80.00%. | Write docstrings for the functions missing them to satisfy the coverage threshold. |

✅ Passed checks (5 passed)

| Check name | Status | Explanation |
|---|---|---|
| Description Check | ✅ Passed | Check skipped - CodeRabbit’s high-level summary is enabled. |
| Title check | ✅ Passed | The title accurately summarizes the main change: replacing hardcoded CLI quantization config choices with YAML preset discovery. |
| Linked Issues check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Out of Scope Changes check | ✅ Passed | Check skipped because no linked issues were found for this pull request. |
| Security Anti-Patterns | ✅ Passed | No SECURITY.md violations found: safe yaml.safe_load_all in config_loader, trust_remote_code is a CLI arg with default=False, no eval/exec/nosec, no new unsafe dependencies. |

✏️ Tip: You can configure your own custom pre-merge checks in the settings.


Comment @coderabbitai help to get the list of available commands and usage tips.

@shengliangxu shengliangxu changed the title from "Drive hf_ptq qformat choices from preset YAMLs" to "Drive hf_ptq qformat choices from preset YAMLs (remove hardcoded CLI quant configs)" May 21, 2026
@shengliangxu shengliangxu marked this pull request as ready for review May 21, 2026 19:12
@shengliangxu shengliangxu requested review from a team as code owners May 21, 2026 19:12
@codecov

codecov Bot commented May 21, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 77.32%. Comparing base (9d0d978) to head (6204de4).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1525      +/-   ##
==========================================
+ Coverage   76.82%   77.32%   +0.50%     
==========================================
  Files         477      477              
  Lines       51957    51957              
==========================================
+ Hits        39916    40178     +262     
+ Misses      12041    11779     -262     
| Flag | Coverage Δ |
|---|---|
| examples | 41.69% <ø> (+0.96%) ⬆️ |
| unit | 52.76% <ø> (ø) |

Flags with carried forward coverage won't be shown. Click here to find out more.


Contributor

@coderabbitai coderabbitai Bot left a comment

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)

130-133: ⚡ Quick win

Derive KV calibration skip from config semantics, not hardcoded format names.

Line 476 hardcodes cast-format names via _KV_CAST_FORMATS. Since presets are YAML-driven now, this risks drift when presets evolve. Prefer checking whether the selected KV config actually needs calibration.

Suggested refactor
-        if args.kv_cache_qformat not in _KV_CAST_FORMATS:
+        if need_calibration({"quant_cfg": kv_cache_quant_cfg, "algorithm": "max"}):
             # Calibrate only the KV cache quantizers; disable all others.
             with mtq.set_quantizer_by_cfg_context(
                 language_model,
                 [{"quantizer_name": "*", "enable": False}, *kv_cache_quant_cfg],
             ):
                 mtq.calibrate(language_model, algorithm="max", forward_loop=calibrate_loop)

Also applies to: 476-483

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 130 - 133, Replace the hardcoded
_KV_CAST_FORMATS check with a semantic check on the chosen KV preset: instead of
testing the format name via _KV_CAST_FORMATS, inspect the selected KV
configuration object (the loaded preset used for KV, e.g., the variable that
selects the KV preset in this module—refer to the code that chooses the "kv"
preset) and decide to skip calibration when that config explicitly pins
use_constant_amax (or an equivalent flag like
requires_calibration/use_constant_amax) — remove the frozenset usage and branch
on the KV config's semantic field so YAML-driven presets control the behavior.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 194-211: The CLI validation currently tests raw tokens against
_AUTO_QUANTIZE_QFORMATS, which rejects valid canonical names because
canonical/alias resolution happens later via QUANT_CFG_CHOICES; change the
validation to check against the full set of accepted keys (e.g., use
QUANT_CFG_CHOICES.keys() or build a normalized set of canonical names/aliases)
or resolve each token through the same lookup used later before rejecting.
Update the checks that reference _AUTO_QUANTIZE_QFORMATS (and any logic around
parsing auto-quantize tokens) to use QUANT_CFG_CHOICES (or a derived normalized
set) so canonical names and aliases are accepted consistently.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 61cb3534-3f30-4331-b8b7-3a3cf32cca68

📥 Commits

Reviewing files that changed from the base of the PR and between c9098b6 and aae0fe1.

📒 Files selected for processing (3)
  • examples/llm_ptq/hf_ptq.py
  • modelopt_recipes/configs/ptq/presets/kv/fp8_cast.yaml
  • modelopt_recipes/configs/ptq/presets/kv/nvfp4_cast.yaml

Comment thread examples/llm_ptq/hf_ptq.py
Contributor

@coderabbitai coderabbitai Bot left a comment

♻️ Duplicate comments (1)
examples/llm_ptq/hf_ptq.py (1)

403-407: ⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Normalize auto-quantize qformats before checking _AUTO_QUANTIZE_QFORMATS.

Line 406 validates raw CLI tokens, but Line 465 resolves them through QUANT_CFG_CHOICES, which now accepts both canonical preset basenames and legacy aliases. A canonical preset like int8_smoothquant is therefore accepted later but rejected here first.

💡 Suggested fix
-    qformat_list = args.qformat.split(",")
+    qformat_list = [q.strip() for q in args.qformat.split(",")]
     assert qformat_list, "No quantization formats provided"
-    # Check if all provided quantization formats are supported
-    assert all(qformat in _AUTO_QUANTIZE_QFORMATS for qformat in qformat_list), (
+    canonical_qformats = [
+        QUANT_CFG_CHOICES._canonical(qformat) if isinstance(QUANT_CFG_CHOICES, _PresetCfgChoices) else qformat
+        for qformat in qformat_list
+    ]
+    assert all(qformat is not None for qformat in canonical_qformats), (
+        "Unsupported quantization format provided"
+    )
+    assert all(qformat in _AUTO_QUANTIZE_QFORMATS for qformat in canonical_qformats), (
         "One or more quantization formats provided are not supported for unified checkpoint export"
     )

Also applies to: 465-465

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 403 - 407, The assertion is checking
raw CLI tokens in qformat_list against _AUTO_QUANTIZE_QFORMATS before they are
normalized later; update the validation to normalize each args.qformat token
using the same resolution used at Line 465 (QUANT_CFG_CHOICES/its alias mapping)
and then check the normalized canonical names against _AUTO_QUANTIZE_QFORMATS.
Concretely, transform qformat_list by mapping each entry through the
QUANT_CFG_CHOICES lookup (or its alias→canonical resolver) to produce
canonical_qformats, then assert canonical_qformats is non-empty and that all
entries are in _AUTO_QUANTIZE_QFORMATS (referencing qformat_list, args.qformat,
QUANT_CFG_CHOICES, and _AUTO_QUANTIZE_QFORMATS).
🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)

487-499: ⚡ Quick win

Drive KV-cache calibration skipping from the preset config, not the preset name.

Line 493 reintroduces a hardcoded name check after this refactor made YAML authoritative. If a future KV preset sets use_constant_amax, it will be CLI-exposed here and then immediately recalibrated anyway.

💡 Suggested fix
         kv_cache_quant_cfg = copy.deepcopy(KV_QUANT_CFG_CHOICES[args.kv_cache_qformat]["quant_cfg"])
         kv_cache_quant_cfg = [
             e for e in kv_cache_quant_cfg if e["quantizer_name"] != "*"
         ]  # keep other quantizers from auto_quantize

         mtq.set_quantizer_by_cfg(language_model, quant_cfg=kv_cache_quant_cfg)
-        if args.kv_cache_qformat not in _KV_CAST_FORMATS:
+        needs_kv_calibration = any(
+            not entry.get("use_constant_amax", False) for entry in kv_cache_quant_cfg
+        )
+        if needs_kv_calibration:
             # Calibrate only the KV cache quantizers; disable all others.
             with mtq.set_quantizer_by_cfg_context(
                 language_model,
                 [{"quantizer_name": "*", "enable": False}, *kv_cache_quant_cfg],
             ):
                 mtq.calibrate(language_model, algorithm="max", forward_loop=calibrate_loop)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 487 - 499, The code currently
decides whether to skip KV-cache calibration by checking the preset name
(args.kv_cache_qformat not in _KV_CAST_FORMATS); instead, inspect the actual
preset config (kv_cache_quant_cfg) and skip calibration when the preset
indicates constant amax behavior. Replace the name-based condition with a
config-based check (e.g., if not any(entry.get("use_constant_amax") for entry in
kv_cache_quant_cfg): ... ) so mtq.calibrate(...) runs only when none of the KV
quantizer entries specify use_constant_amax; reference kv_cache_quant_cfg,
KV_QUANT_CFG_CHOICES, args.kv_cache_qformat, and mtq.calibrate in your change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: a74f2b60-827e-4805-a9ff-ac9644c33ec5

📥 Commits

Reviewing files that changed from the base of the PR and between 7ff7c9e and 76fc552.

📒 Files selected for processing (1)
  • examples/llm_ptq/hf_ptq.py

Replace the hardcoded QUANT_CFG_CHOICES / KV_QUANT_CFG_CHOICES dicts in
examples/llm_ptq/hf_ptq.py with a lazy Mapping that discovers available
qformat names by listing modelopt_recipes/configs/ptq/presets/{model,kv}/
and loads each YAML on first access via the existing
load_config(..., schema_type=QuantizeConfig) path.

A small _QFORMAT_ALIASES table keeps the previously-supported short CLI
names (int8_sq, nvfp4_awq, fp8_pb_wo, ...) working as deprecation
shims; the table is documented as not-for-extension since new formats
should land as preset YAMLs (or, longer term, as full recipes).

Also add presets/kv/fp8_cast.yaml and presets/kv/nvfp4_cast.yaml so
fp8_cast / nvfp4_cast become first-class KV presets composed from the
existing kv_fp8_cast / kv_nvfp4_cast unit fragments. This drops the
KV alias entries and lets us delete the runtime _set_kv_cache_constant_amax
helper and all three of its call sites; use_constant_amax is now
authoritative in the YAML.

Side effect: every preset YAML under presets/model/ (mxfp4, mxfp6,
mxint8, nvfp4_awq_full, nvfp4_fp8_mha, mamba_moe_*, ...) is now
automatically exposed as a valid --qformat value with no further
code change.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

- Deepcopy in _PresetCfgChoices.__getitem__ so callers can freely mutate
  the returned quant_cfg without poisoning the cache.
- Assert that _KV_NONE does not collide with any discovered KV preset.
- Expand the comment on _AUTO_QUANTIZE_QFORMATS explaining why it stays
  hardcoded (auto_quantize compatibility is an export-path property, not
  a YAML-derivable one).
- Add CHANGELOG entry for the qformat discovery refactor and the
  fp8_cast / nvfp4_cast preset promotion (including the note that
  out-of-tree recipes targeting cast KV must set use_constant_amax
  themselves now that the runtime override is gone).

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

Codify the policy that the preset directory listing IS the CLI vocabulary —
there is intentionally no separate allow-list. New presets are CLI-visible
the moment they land in the directory; this is a feature, not an oversight.

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

1. Auto-quantize validation: _AUTO_QUANTIZE_QFORMATS previously only listed the
   short alias names, so passing canonical preset basenames (e.g. int8_smoothquant
   instead of int8_sq) would be rejected even though the underlying configs are
   identical. Switch the set to canonical names and canonicalize incoming tokens
   via a new _canonical_qformat() helper so both forms are accepted.

2. KV cast detection: replace the hardcoded _KV_CAST_FORMATS = {fp8_cast,
   nvfp4_cast} name set with a semantic check (_kv_cfg_uses_constant_amax) that
   inspects the loaded KV cfg's *[kv]_bmm_quantizer entry for use_constant_amax.
   This makes "should we skip KV calibration?" YAML-driven: any future cast-style
   KV preset works without touching this script.
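
The semantic check in item 2 above might look roughly like the sketch below. The entry layout is an assumption: the review snippets in this thread show list entries keyed by `quantizer_name`, but whether `use_constant_amax` sits directly on the entry or inside a nested `cfg` dict is not confirmed here, so the hypothetical helper checks both.

```python
from typing import Any


def kv_cfg_uses_constant_amax(quant_cfg: list[dict[str, Any]]) -> bool:
    """Hypothetical check: does any KV bmm-quantizer entry pin use_constant_amax?"""
    for entry in quant_cfg:
        if not entry.get("quantizer_name", "").endswith("_bmm_quantizer"):
            continue  # ignore entries that do not target the KV quantizers
        # Assumed layout: the flag sits on the entry or in a nested "cfg" dict.
        nested = entry.get("cfg")
        nested_flag = nested.get("use_constant_amax") if isinstance(nested, dict) else None
        if entry.get("use_constant_amax") or nested_flag:
            return True
    return False
```

A caller would then skip KV calibration when this returns True, instead of comparing the preset name against a fixed set.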

Signed-off-by: Shengliang Xu <shengliangx@nvidia.com>

@shengliangxu shengliangxu force-pushed the shengliangx/hf-ptq-dereference-hardcoded-configs branch from 29f03fa to 6204de4 Compare May 27, 2026 00:00
Contributor

@coderabbitai coderabbitai Bot left a comment

Warning

CodeRabbit couldn't request changes on this pull request because it doesn't have sufficient GitHub permissions.

Please grant CodeRabbit Pull requests: Read and write permission and re-run the review.

👉 Steps to fix this

Actionable comments posted: 1

🧹 Nitpick comments (1)
examples/llm_ptq/hf_ptq.py (1)

210-215: ⚡ Quick win

Validate the KV default against the dynamic preset registry.

--kv_cache_qformat now derives its valid values from KV_QUANT_CFG_CHOICES, but the default is still the literal "fp8_cast". If that preset is renamed or removed, the parser will still inject an invalid default and the first lookup later will fail. Hoist the default into a constant and check it next to the _KV_NONE collision guard.

💡 Consistency check
 _KV_NONE = "none"
+_DEFAULT_KV_QFORMAT = "fp8_cast"
 ...
 assert _KV_NONE not in KV_QUANT_CFG_CHOICES, (
     f"_KV_NONE sentinel {_KV_NONE!r} collides with a KV preset; rename the preset."
 )
+assert _DEFAULT_KV_QFORMAT in KV_QUANT_CFG_CHOICES, (
+    f"Default KV preset {_DEFAULT_KV_QFORMAT!r} is not present; update the default."
+)
 ...
     parser.add_argument(
         "--kv_cache_qformat",
         required=False,
-        default="fp8_cast",
+        default=_DEFAULT_KV_QFORMAT,
         choices=[_KV_NONE, *KV_QUANT_CFG_CHOICES],

Also applies to: 1365-1368

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 210 - 215, The default value for
--kv_cache_qformat is hardcoded as "fp8_cast" and may become invalid if the
dynamic preset registry KV_QUANT_CFG_CHOICES changes; hoist that literal into a
named constant (e.g., DEFAULT_KV_QFORMAT = "fp8_cast") and add an assertion that
DEFAULT_KV_QFORMAT is present in KV_QUANT_CFG_CHOICES immediately alongside the
existing _KV_NONE collision guard (which checks _KV_NONE not in
KV_QUANT_CFG_CHOICES). Repeat the same hoist-and-assert change for the other
preset-usage site referenced in the file (the earlier block that currently uses
the literal "fp8_cast") so both parser defaults are validated against
KV_QUANT_CFG_CHOICES.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@examples/llm_ptq/hf_ptq.py`:
- Around line 544-552: When running in low-memory mode the code currently seeds
init_quantized_weights() from args.qformat/args.kv_cache_qformat
(QUANT_CFG_CHOICES, _KV_NONE, mtq.utils.update_quant_cfg_with_kv_cache_quant),
which conflicts with later quantize_main() when a --recipe is provided; add an
explicit reject: if args.low_memory_mode and args.recipe is set, raise/exit with
a clear message that recipes are not supported in low-memory mode (or
alternatively ensure this branch reads/uses the recipe quant config instead), so
the pre-instrumented quantizer layout cannot diverge from the recipe-driven
layout.


ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 4095c6fc-e78b-4a31-a837-a01c308d9261

📥 Commits

Reviewing files that changed from the base of the PR and between 29f03fa and 6204de4.

📒 Files selected for processing (4)
  • CHANGELOG.rst
  • examples/llm_ptq/hf_ptq.py
  • modelopt_recipes/configs/ptq/presets/kv/fp8_cast.yaml
  • modelopt_recipes/configs/ptq/presets/kv/nvfp4_cast.yaml
✅ Files skipped from review due to trivial changes (2)
  • CHANGELOG.rst
  • modelopt_recipes/configs/ptq/presets/kv/nvfp4_cast.yaml
🚧 Files skipped from review as they are similar to previous changes (1)
  • modelopt_recipes/configs/ptq/presets/kv/fp8_cast.yaml

Comment on lines 544 to 552
         assert args.qformat in QUANT_CFG_CHOICES, (
-            f"Quantization format is not supported for low memory mode. Supported formats: {QUANT_CFG_CHOICES.keys()}"
+            f"Quantization format is not supported for low memory mode. Supported formats: {list(QUANT_CFG_CHOICES)}"
         )
         quant_cfg = QUANT_CFG_CHOICES[args.qformat]
-        if args.kv_cache_qformat != "none":
+        if args.kv_cache_qformat != _KV_NONE:
             quant_cfg = mtq.utils.update_quant_cfg_with_kv_cache_quant(
                 quant_cfg,
-                getattr(mtq, KV_QUANT_CFG_CHOICES[args.kv_cache_qformat])["quant_cfg"],
+                KV_QUANT_CFG_CHOICES[args.kv_cache_qformat]["quant_cfg"],
             )
Contributor

⚠️ Potential issue | 🟠 Major | ⚡ Quick win

Reject --recipe in low-memory mode until this path uses the recipe config.

This branch still seeds init_quantized_weights() from args.qformat / args.kv_cache_qformat. With --low_memory_mode --recipe ..., the model is pre-instrumented with a different quantizer layout than quantize_main() later applies, so the new “recipe is authoritative” contract is broken before quantization even starts.

💡 Minimal safe fix
     else:
+        if args.recipe is not None:
+            raise ValueError(
+                "--low_memory_mode does not yet support --recipe; the low-memory "
+                "loader still initializes quantizers from --qformat/--kv_cache_qformat."
+            )
         assert args.qformat in QUANT_CFG_CHOICES, (
             f"Quantization format is not supported for low memory mode. Supported formats: {list(QUANT_CFG_CHOICES)}"
         )
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@examples/llm_ptq/hf_ptq.py` around lines 544 - 552, When running in
low-memory mode the code currently seeds init_quantized_weights() from
args.qformat/args.kv_cache_qformat (QUANT_CFG_CHOICES, _KV_NONE,
mtq.utils.update_quant_cfg_with_kv_cache_quant), which conflicts with later
quantize_main() when a --recipe is provided; add an explicit reject: if
args.low_memory_mode and args.recipe is set, raise/exit with a clear message
that recipes are not supported in low-memory mode (or alternatively ensure this
branch reads/uses the recipe quant config instead), so the pre-instrumented
quantizer layout cannot diverge from the recipe-driven layout.


1 participant