fix: handle accelerate CPU-offloaded models in FakeQuant export#1194

Open
sungsooha wants to merge 4 commits into NVIDIA:main from sungsooha:sungsooh/fix-offload-export
Conversation

@sungsooha
Contributor

@sungsooha sungsooha commented Apr 8, 2026

What does this PR do?

Type of change: Bug fix

When models are loaded with device_map="auto" and layers are offloaded to CPU via AlignDevicesHook, two issues arise during FakeQuant export:

  1. model.state_dict() returns meta tensors for offloaded layers (no actual data)
  2. model.save_pretrained(state_dict=clean_sd) is ignored by accelerate — it saves from internal state instead, leaking quantizer keys (input_quantizer._amax) into safetensors and preserving auto_map in config.json

This causes vLLM's weight loader to crash with KeyError on the unknown quantizer keys, and OSError when auto_map references custom Python files not present in the export directory.
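Issue 1 can be reproduced in miniature (`find_meta_keys` and the toy layer below are illustrative, not part of the PR): any parameter living on the meta device appears in `state_dict()` with no backing data.

```python
import torch

def find_meta_keys(model: torch.nn.Module) -> list[str]:
    """Return state_dict keys whose tensors are meta (offloaded, no real data)."""
    return [k for k, v in model.state_dict().items() if v.is_meta]

# Simulate an offloaded layer by placing its weight on the meta device.
layer = torch.nn.Linear(4, 4)
layer.weight = torch.nn.Parameter(torch.empty(4, 4, device="meta"))
print(find_meta_keys(layer))  # ['weight']; the bias still holds real data
```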

Affected models: Any model large enough to trigger CPU offloading during PTQ (e.g., DeepSeek-R1 671B on 8xH100).

Four fixes:

  1. _materialize_offloaded_weights(): Walk accelerate's AlignDevicesHook.weights_map to resolve meta tensors to actual CPU data before export.

  2. GPU hop in weight processing: Move CPU tensors to the quantizer's device before calling quantizer kernels (e.g., fp4_fake_quant_block) which require CUDA. Uses quantizer buffers (_amax) for device detection since NVFP4 quantizers have no parameters, only buffers.

  3. _save_clean_checkpoint(): Bypass save_pretrained() entirely for weight saving. Write safetensors directly from clean_sd via safetensors.torch.save_file() + split_torch_state_dict_into_shards(). Also strips auto_map from config.json (custom code files are not present in the export directory).

  4. FakeQuantWorker.compile_or_warm_up_model(): Fix return type from None to float and add missing return before super() call. Without this, the multiproc executor gets [None, None, ...] from collective_rpc and crashes with TypeError in max(compilation_times).
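The device detection in fix 2 can be sketched as below (illustrative names; the PR's actual helper may differ). NVFP4 quantizers carry only buffers such as `_amax`, so buffers are scanned before parameters, and a missing CUDA device raises explicitly rather than failing deep inside a kernel.

```python
import torch

def find_quantizer_cuda_device(quantizer: torch.nn.Module) -> torch.device:
    """Find a CUDA device among the quantizer's buffers/parameters."""
    for t in list(quantizer.buffers()) + list(quantizer.parameters()):
        if t.is_cuda:
            return t.device
    raise RuntimeError(
        "No CUDA tensor found on quantizer; fake-quant kernels such as "
        "fp4_fake_quant_block require a CUDA device."
    )

# A CPU-only quantizer stand-in triggers the explicit error path.
q = torch.nn.Module()
q.register_buffer("_amax", torch.tensor(448.0))
try:
    find_quantizer_cuda_device(q)
except RuntimeError as e:
    print("raised:", e)
```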

Usage

No API changes. The fixes are internal to export_hf_vllm_fq_checkpoint() and FakeQuantWorker. Existing callers work without modification.

# Export works the same — offloaded models now produce clean checkpoints
from modelopt.torch.export.plugins.vllm_fakequant_hf import export_hf_vllm_fq_checkpoint

export_hf_vllm_fq_checkpoint(model, export_dir="/path/to/export")
# Output: clean safetensors (no quantizer keys), config.json (no auto_map),
#         vllm_fq_modelopt_state.pth (quantizer state)

Testing

Tested on H100 (dlcluster) with forced CPU offloading:

  • Model: Qwen3-0.6B with max_memory={0: "500MiB", "cpu": "32GiB"}
  • 254/311 parameters offloaded (meta tensors in state_dict())
  • Verified: no quantizer keys in safetensor files, no auto_map in config.json, valid vllm_fq_modelopt_state.pth (257KB)

Also validated end-to-end on DeepSeek-R1 671B (8xH100, v7.1 image) — PTQ export + vLLM FakeQuant serving successful.
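The verification steps above can be mirrored by a post-export sanity check along these lines (file names follow HF conventions; the `"_quantizer"` key pattern is an assumption for illustration, not taken from the PR's code):

```python
import json
import tempfile
from pathlib import Path

def leaked_quantizer_keys(export_dir: Path) -> list[str]:
    """Return weight-index keys that look like quantizer state."""
    index = json.loads((export_dir / "model.safetensors.index.json").read_text())
    return [k for k in index["weight_map"] if "_quantizer" in k]

def config_has_auto_map(export_dir: Path) -> bool:
    """True if config.json still references remote custom code."""
    return "auto_map" in json.loads((export_dir / "config.json").read_text())

# Demo on a synthetic "clean" export directory.
d = Path(tempfile.mkdtemp())
(d / "model.safetensors.index.json").write_text(
    json.dumps({"weight_map": {"model.layers.0.weight": "model-00001.safetensors"}})
)
(d / "config.json").write_text(json.dumps({"model_type": "qwen3"}))
print(leaked_quantizer_keys(d), config_has_auto_map(d))  # [] False
```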

Before your PR is "Ready for review"

Make sure you read and follow Contributor guidelines and your commits are signed (git commit -s -S).

Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded trust_remote_code=True, torch.load(..., weights_only=False), pickle, etc.).

  • Is this change backward compatible?: ✅ (no API changes, internal fix only)
  • If you copied code from any other sources or added a new PIP dependency, did you follow guidance in CONTRIBUTING.md: ✅ (huggingface_hub.split_torch_state_dict_into_shards and safetensors.torch.save_file are existing dependencies)
  • Did you write any new necessary tests?: ❌ (manual GPU test — automated test requires multi-GPU with forced offloading, hard to set up in CI)
  • Did you update Changelog?: N/A (bug fix, not a new feature or API change)

Additional Information

  • Related to PR #1177 (feat: parallelize fakequant export across GPUs via ThreadPoolExecutor), which addresses the performance side of FakeQuant export
  • Root cause discovered during P0 NVFP4 sweep on DeepSeek-R1 671B
  • The compile_or_warm_up_model fix (item 4) is needed for single-node multiproc executor (TP>1) — Ray executor has a different code path that doesn't hit this bug

Summary by CodeRabbit

  • Bug Fixes

    • Improved handling and materialization of offloaded (meta) weights during fake‑quant export to avoid missing tensors.
    • More robust export flow to reliably write clean, sharded checkpoints and preserve configuration.
    • Ensures quantized tensors are placed on appropriate CUDA devices and surfaces clear errors if no device is found.
  • New Features

    • Worker warm‑up now returns a numeric compile/warm‑up metric for reporting.
  • Tests

    • Added unit tests covering offloaded weight materialization and export behaviors.

When models are loaded with device_map="auto" and layers are offloaded
to CPU via AlignDevicesHook, model.state_dict() returns meta tensors
and model.save_pretrained(state_dict=clean_sd) is ignored by accelerate.

Four fixes:
1. _materialize_offloaded_weights(): resolve meta tensors from
   accelerate's AlignDevicesHook.weights_map before export.
2. GPU hop in weight processing: move CPU tensors to quantizer's
   device (quantizer kernels like fp4_fake_quant_block require CUDA).
   Uses quantizer buffers (amax) for device detection.
3. _save_clean_checkpoint(): bypass save_pretrained entirely, write
   safetensors directly via save_file() + split_torch_state_dict_into_shards().
   Also strips auto_map from config.json (custom code files not in export).
4. FakeQuantWorker.compile_or_warm_up_model: return float (not None)
   to fix multiproc executor TypeError in max(compilation_times).

Tested: Qwen3-0.6B on H100 with forced CPU offloading (500MiB GPU limit,
254/311 meta tensors). All checks passed — no quantizer keys in safetensors,
no auto_map in config.json.

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
@sungsooha sungsooha requested review from a team as code owners April 8, 2026 05:16
@sungsooha sungsooha requested a review from meenchen April 8, 2026 05:16
@copy-pr-bot

copy-pr-bot bot commented Apr 8, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Contributor

coderabbitai bot commented Apr 8, 2026

📝 Walkthrough

Walkthrough

Updated FakeQuantWorker to return a float from its compile/warm-up flow; added helpers to materialize accelerate-offloaded (meta) tensors and to write clean sharded safetensors/config during VLLM fake-quant checkpoint export; added tests for the new export behaviors.

Changes

Cohort / File(s) Summary
FakeQuant Worker
examples/vllm_serve/fakequant_worker.py
Changed FakeQuantWorker.compile_or_warm_up_model() return type to float; now optionally runs _fakequant_run_prolog_worker() then returns the numeric result from BaseWorker.compile_or_warm_up_model().
Weight Materialization & Checkpoint Export
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Added logger, _materialize_offloaded_weights() to replace meta tensors using Accelerate offload hooks, _save_clean_checkpoint() to write sharded safetensors and config.json; updated export_hf_vllm_fq_checkpoint() to detect/materialize offloaded weights, move target tensors to CUDA when required, wrap quantizer-disable/save in try/finally, and conditionally use the new clean-save flow.
Tests
tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py
New unit tests covering materialization selection logic, export failure when meta tensors remain, branch behavior for offloaded vs non-offloaded models (clean checkpoint vs save_pretrained), and error when CUDA device for quantizer kernels cannot be found.

Sequence Diagram(s)

sequenceDiagram
    participant Caller as Caller
    participant Export as export_hf_vllm_fq_checkpoint
    participant Detector as MetaTensorDetector
    participant Materialize as _materialize_offloaded_weights
    participant Quantizer as Quantizer/DeviceManager
    participant Saver as _save_clean_checkpoint
    participant FS as FileSystem

    Caller->>Export: invoke export_hf_vllm_fq_checkpoint(model, ...)
    Export->>Export: state_dict = model.state_dict()
    Export->>Detector: scan state_dict for meta/offloaded tensors
    alt meta tensors found
        Export->>Materialize: request materialization via accelerate weights_map
        Materialize->>Materialize: resolve keys, replace meta with real tensors
        Materialize->>Export: return materialized state_dict (or error)
    end
    Export->>Quantizer: disable weight quantizers and build modelopt state
    Quantizer->>Export: may require CUDA device for target tensors
    alt CUDA device required and found
        Export->>Quantizer: move tensors to discovered CUDA device
    else CUDA device required but not found
        Quantizer->>Export: raise RuntimeError
    end
    Export->>Saver: save via _save_clean_checkpoint (if offloaded) or model.save_pretrained(state_dict=clean_sd)
    Saver->>FS: write shards, safetensors and config.json
    FS->>Saver: write complete
    Export->>Caller: return / complete

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

🚥 Pre-merge checks | ✅ 3 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 17.86% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (3 passed)
Check name Status Explanation
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
Title check ✅ Passed The title accurately summarizes the main change: handling CPU-offloaded models in FakeQuant export, which is the primary focus of the PR.
Security Anti-Patterns ✅ Passed No instances of torch.load with weights_only=False, no use of numpy.load(..., allow_pickle=True), no hardcoded trust_remote_code=True, no eval/exec on external input, and no # nosec comments were introduced in the modified files.


Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

248-252: ⚠️ Potential issue | 🟠 Major

Restore the quantizer state in a finally.

Once this block disables the weight quantizers, any exception from torch.save() or _save_clean_checkpoint() leaves the in-memory model partially mutated. Wrapping the disable/save/restore path in try/finally keeps a failed export from poisoning subsequent use of the model.

Also applies to: 282-291

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 248 - 252,
The disable/modify sequence for weight quantizers (calls to quantizer.disable(),
setting quantizer._rotate via disable_rotate(quantizer), and appending to
wqs_to_restore) must be wrapped with a try/finally so the original quantizer
state (orig_rotate and enabled state) is always restored even if torch.save() or
_save_clean_checkpoint() throws; update both the block around
quantizer.disable()/orig_rotate/quantizer._rotate and the similar block at
282-291 to perform the disable and then run save inside try, and restore
quantizer._rotate and re-enable the quantizer in the finally regardless of
exceptions, referencing quantizer, disable_rotate, wqs_to_restore, torch.save
and _save_clean_checkpoint to locate the code to change.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 203-208: The code currently only moves weight w to a CUDA device
if quantizer.parameters()/buffers() contains a CUDA tensor; when TensorQuantizer
has no qtensors this misses a CUDA-only quantizer path and can still call
quantizer(w.float()) on CPU. Update the block around w and quantizer (symbols:
w, quantizer, TensorQuantizer) to, if qtensors is empty, find a sibling tensor
on the same module (e.g., iterate module.parameters()/buffers() for the first
is_cuda tensor) and use its device, or fall back to an explicit export device
variable (e.g., export_device or torch.device('cuda')) before calling quantizer;
ensure w = w.to(found_device) is applied only when needed.
- Around line 112-115: The code calls split_torch_state_dict_into_shards() and
save_file() on cpu_sd without resolving tied/shared tensors, which can cause
failures or duplicated large tensors; before sharding, detect aliasing in cpu_sd
(e.g., by comparing tensor.data_ptr()/storage()/is) and replace aliased entries
with a single canonical tensor reference (or remove duplicate keys mapping to
the same storage) so split_torch_state_dict_into_shards() and subsequent
save_file() operate on a deduplicated state dict; alternatively, delegate to the
HF helper used by save_pretrained() that already performs this alias
cleanup—perform this deduplication on cpu_sd prior to calling
split_torch_state_dict_into_shards() and use the resulting filename_to_tensors
mapping to build shard dicts for save_file().
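The alias cleanup described above can be sketched as follows, a minimal version assuming `data_ptr`-based aliasing detection (HF's own helper used by save_pretrained() is more thorough):

```python
import torch

def dedup_tied_tensors(sd: dict[str, torch.Tensor]) -> dict[str, torch.Tensor]:
    """Clone any tensor that shares storage with an earlier entry, since
    safetensors refuses to serialize aliased (tied) storage."""
    seen: set[int] = set()
    out: dict[str, torch.Tensor] = {}
    for key, tensor in sd.items():
        ptr = tensor.untyped_storage().data_ptr()
        out[key] = tensor.clone() if ptr in seen else tensor
        seen.add(ptr)
    return out

# Tied-embedding pair: both keys alias the same storage before dedup.
w = torch.ones(4)
deduped = dedup_tied_tensors({"embed.weight": w, "lm_head.weight": w})
```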

---

Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 248-252: The disable/modify sequence for weight quantizers (calls
to quantizer.disable(), setting quantizer._rotate via disable_rotate(quantizer),
and appending to wqs_to_restore) must be wrapped with a try/finally so the
original quantizer state (orig_rotate and enabled state) is always restored even
if torch.save() or _save_clean_checkpoint() throws; update both the block around
quantizer.disable()/orig_rotate/quantizer._rotate and the similar block at
282-291 to perform the disable and then run save inside try, and restore
quantizer._rotate and re-enable the quantizer in the finally regardless of
exceptions, referencing quantizer, disable_rotate, wqs_to_restore, torch.save
and _save_clean_checkpoint to locate the code to change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 22c6dd83-d33b-430f-8cbf-6526445753cf

📥 Commits

Reviewing files that changed from the base of the PR and between 82d96a6 and 3895d21.

📒 Files selected for processing (2)
  • examples/vllm_serve/fakequant_worker.py
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

@codecov

codecov bot commented Apr 8, 2026

Codecov Report

❌ Patch coverage is 52.34375% with 61 lines in your changes missing coverage. Please review.
✅ Project coverage is 57.56%. Comparing base (82d96a6) to head (46776ad).
⚠️ Report is 33 commits behind head on main.

Files with missing lines Patch % Lines
modelopt/torch/export/plugins/vllm_fakequant_hf.py 52.34% 61 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1194      +/-   ##
==========================================
- Coverage   59.45%   57.56%   -1.90%     
==========================================
  Files         352      353       +1     
  Lines       40343    43428    +3085     
==========================================
+ Hits        23987    25000    +1013     
- Misses      16356    18428    +2072     
Flag Coverage Δ
unit 55.69% <52.34%> (+0.66%) ⬆️

Flags with carried forward coverage won't be shown.


1. Fix ruff import ordering: move logger after all imports
2. Clone tensors in _save_clean_checkpoint to handle tied weights
   (safetensors rejects shared storage)
3. Robust GPU device fallback: check quantizer params/buffers,
   then parent module params (handles uninitialized quantizers)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
Contributor

@coderabbitai coderabbitai bot left a comment


Actionable comments posted: 2

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 167-178: After calling _materialize_offloaded_weights(model,
state_dict, meta_keys) recompute unresolved_meta = [k for k,v in
state_dict.items() if v.is_meta]; if unresolved_meta is non-empty and contains
any keys that are not quantizer-related (e.g. not containing "quant" or
"quantizer"), raise a RuntimeError listing unresolved_meta and a short message
mentioning that materialization failed and will break subsequent fake-quant
folding or _save_clean_checkpoint; reference the symbols meta_keys,
_materialize_offloaded_weights, state_dict, and _save_clean_checkpoint so the
error helps locate the problem.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: fccfaa7b-0635-4d84-8e1d-3888a156a198

📥 Commits

Reviewing files that changed from the base of the PR and between b85c4e0 and 8c189ac.

📒 Files selected for processing (1)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py

Collaborator

@cjluo-nv cjluo-nv left a comment


The PR addresses real and well-documented bugs for CPU-offloaded models during FakeQuant export. The four fixes are logically sound and the code is well-commented. However, there are several issues worth addressing:

  1. Code duplication with existing accelerate plugin: _materialize_offloaded_weights re-implements offload hook traversal logic that already exists in modelopt/torch/quantization/plugins/accelerate.py (_get_cpu_offload_hook and weight_access_and_writeback_context). Consider reusing those utilities.

  2. Potential regression for non-offloaded models: _save_clean_checkpoint replaces model.save_pretrained() for ALL models, not just offloaded ones. This may lose behaviors like saving tokenizer_config.json, generation_config.json, or other files that save_pretrained() handles. The existing test should still pass (it only checks weights and config), but it's a behavioral change for non-offloaded users.

  3. Missing test coverage — No automated tests are added. While the PR notes multi-GPU testing is hard in CI, a unit test for _materialize_offloaded_weights with mocked hooks and a test for _save_clean_checkpoint verifying shard output + auto_map stripping would be feasible.

  4. Ambiguity in hook_map key matching — The prefix matching uses key.startswith(prefix) with longest-prefix-first not guaranteed, which could cause wrong matches with nested modules sharing prefixes.

continue
hooks = [hook]
if hasattr(hook, "hooks"):
    hooks = hook.hooks
Collaborator


Code duplication: The hook traversal logic here (checking _hf_hook, handling SequentialHook-like hooks with .hooks attribute, and PrefixedDataset access via .dataset.state_dict) is largely duplicated from modelopt/torch/quantization/plugins/accelerate.py (_get_cpu_offload_hook and weight_access_and_writeback_context). Consider extracting a shared utility, or at minimum importing/reusing _get_cpu_offload_hook to find the relevant AlignDevicesHook for each module.

Contributor Author


updated to reuse accelerate plugin helper

local_key = key[len(prefix) :]
wmap = hook.weights_map
if hasattr(wmap, "dataset"):
    lookup_key = wmap.prefix + local_key
Collaborator


Potential prefix ordering issue: hook_map is a regular dict iterated by insertion order (module tree traversal order). For nested modules, a key like model.layers.0.self_attn.q_proj.weight could match prefix model. before it matches the more specific prefix model.layers.0.self_attn.q_proj.. The first match wins due to break, which could pick the wrong hook/weights_map.

Consider sorting hook_map by prefix length (descending) so longest (most specific) prefixes are checked first, or using a different lookup strategy.

Contributor Author


switched to longest-prefix-first matching

"""Save clean weights + config directly, bypassing model.save_pretrained().

For accelerate-offloaded models, ``save_pretrained(state_dict=clean_sd)``
ignores the provided state_dict and saves from internal state, leaking
Collaborator


Behavioral change for ALL models: _save_clean_checkpoint now replaces model.save_pretrained() unconditionally — not just for offloaded models. The old save_pretrained() also saved generation_config.json, tokenizer files (if applicable), and ran any save hooks. The new code only saves safetensors + config.json.

For the vLLM FakeQuant use case this is probably fine (vLLM doesn't need generation_config.json from the export dir). But it's worth documenting this behavioral change, or alternatively only using _save_clean_checkpoint when offloading is detected (i.e., when meta_keys is non-empty) and falling back to model.save_pretrained() otherwise.

Contributor Author


_save_clean_checkpoint is now used only when offloaded/meta tensors are detected.

(export_dir / "model.safetensors.index.json").write_text(json.dumps(index, indent=2))

if hasattr(model, "config"):
    model.config.save_pretrained(export_dir)
Collaborator


Minor: model.config.save_pretrained(export_dir) writes config.json then we immediately read it back, parse JSON, possibly modify, and write again. This is a small inefficiency. You could do:

import json
config_dict = model.config.to_dict()
config_dict.pop("auto_map", None)
(export_dir / "config.json").write_text(json.dumps(config_dict, indent=2))

This avoids the read-modify-write cycle and the conditional. Though the current approach is functionally correct.

Contributor Author


replaced read-modify-write flow

if t.is_cuda:
    cuda_dev = t.device
    break
if cuda_dev is None:
Collaborator


Potential silent failure: If no CUDA device is found (e.g., all quantizer buffers/params and module params happen to be on CPU/meta), cuda_dev remains None and w stays on CPU. The subsequent quantizer(w.float()) call will likely fail with a cryptic CUDA error deep in the kernel. Consider raising a clear error:

if cuda_dev is None:
    raise RuntimeError(
        f"Cannot find CUDA device for quantizer kernel on offloaded weight '{sd_key}'. "
        "Ensure at least one quantizer buffer or module parameter is on CUDA."
    )

Contributor Author


added explicit error


state_dict_split = split_torch_state_dict_into_shards(cpu_sd, max_shard_size="5GB")
for shard_file, tensor_keys in state_dict_split.filename_to_tensors.items():
    shard = {k: cpu_sd[k] for k in tensor_keys}
Collaborator


Missing export_dir.mkdir() call: The export_dir directory must exist before save_file() is called. The caller export_hf_vllm_fq_checkpoint does create it (export_dir.mkdir(parents=True, exist_ok=True)), but _save_clean_checkpoint as a standalone function doesn't ensure this. Consider adding a defensive export_dir.mkdir(parents=True, exist_ok=True) or documenting the precondition.

Contributor Author


added export_dir.mkdir(parents=True, exist_ok=True)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
@sungsooha sungsooha requested a review from cjluo-nv April 14, 2026 17:16
Collaborator

@cjluo-nv cjluo-nv left a comment


The PR addresses a real and important bug with accelerate-offloaded models during FakeQuant export. The four fixes are well-motivated and the approach is sound. The code is well-structured with good logging and error messages. However, I found a few issues worth addressing:

  1. Missing auto_map stripping: The PR description prominently lists "strips auto_map from config.json" as a fix, but the _save_clean_checkpoint function just calls model.config.to_dict() and writes it as-is. HuggingFace's to_dict() preserves auto_map if set. This means the described fix is not actually implemented, and the vLLM OSError from auto_map referencing missing custom Python files will still occur.

  2. Missing tokenizer save: The _save_clean_checkpoint bypass of save_pretrained saves config and generation_config but doesn't save the tokenizer. save_pretrained normally saves the tokenizer too. Downstream users may expect the tokenizer to be in the export directory.

  3. GPU hop doesn't move result back to original device: When a CPU weight gets moved to CUDA for quantization, w_quant is moved back to CPU via .cpu(), which is correct. But the original w device context is lost — this is fine since .cpu() is explicitly called, just noting for clarity.

  4. Test coverage is reasonable for a bug that's hard to test in CI (requires multi-GPU with forced offloading). The unit tests cover the branching logic and error paths well.


if hasattr(model, "config"):
    config = model.config.to_dict()
    config_path = export_dir / "config.json"
Collaborator


Bug: auto_map is not stripped from config.json. The PR description explicitly lists 'strips auto_map from config.json' as one of the four fixes, noting that auto_map references custom Python files not present in the export directory, causing OSError in vLLM. However, model.config.to_dict() will preserve auto_map if the model config has it set. You need to explicitly remove it:

config = model.config.to_dict()
config.pop("auto_map", None)  # Custom code files are not in the export dir

generation_config = getattr(model, "generation_config", None)
if generation_config is not None:
    generation_config.save_pretrained(export_dir)

Collaborator


Missing tokenizer save. model.save_pretrained() typically also saves the tokenizer (if the model has one). This bypass only saves weights, config, and generation_config. Consider whether the tokenizer should be saved here as well, or document why it's intentionally omitted. If the user expects a complete self-contained checkpoint, the missing tokenizer could be a problem.

Also, some models have additional files saved by save_pretrained (e.g., preprocessor_config.json for multimodal models). This manual approach may miss them.

    align_hook = _get_cpu_offload_hook(hook)
except AssertionError:
    # Some accelerate hook variants do not expose a plain "weight" key.
    # Fall back to generic weights_map extraction for export-time readout.
Collaborator


Nit: The guard if _get_cpu_offload_hook is None is checked at the top of the loop with continue, so the inner if _get_cpu_offload_hook is not None: on line 88 is always True and can be removed.

# This is dead code — _get_cpu_offload_hook is guaranteed non-None here
if _get_cpu_offload_hook is not None:

for name, module in model.named_modules():
    hook = getattr(module, "_hf_hook", None)
    if hook is None or _get_cpu_offload_hook is None:
        continue
Collaborator


Minor: When _get_cpu_offload_hook is None (ImportError), the function will skip all modules with _hf_hook, meaning no weights get materialized. This will then trigger the RuntimeError downstream for any non-quantizer meta keys, which is fine. But a more informative early error or warning when accelerate plugin import fails would help debugging:

if _get_cpu_offload_hook is None and meta_keys:
    logger.warning(
        "Could not import accelerate plugin (_get_cpu_offload_hook). "
        "Cannot materialize %d offloaded weights.", len(meta_keys)
    )
    return

if unresolved_meta_keys:
    shown = ", ".join(unresolved_meta_keys[:10])
    suffix = " ..." if len(unresolved_meta_keys) > 10 else ""
    raise RuntimeError(
Collaborator


The unresolved meta key check filters out keys containing "quantizer" or "quant". This heuristic could accidentally suppress real errors — a weight key like "dequant_proj.weight" or "quantize_proj.weight" (if such names ever exist) would be silently ignored. Consider being more precise, e.g., checking if the key corresponds to a TensorQuantizer module rather than using substring matching.
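A module-type check along the lines suggested could look like this (TensorQuantizerStub stands in for modelopt's TensorQuantizer; a real implementation would import the actual class):

```python
import torch

class TensorQuantizerStub(torch.nn.Module):
    """Stand-in for modelopt's TensorQuantizer (illustrative only)."""
    def __init__(self) -> None:
        super().__init__()
        self.register_buffer("_amax", torch.tensor(1.0))

def is_quantizer_key(model: torch.nn.Module, key: str) -> bool:
    """True if the state_dict key belongs to a quantizer submodule,
    judged by the owning module's type rather than substring matching."""
    owner = key.rsplit(".", 1)[0] if "." in key else ""
    modules = dict(model.named_modules())
    return isinstance(modules.get(owner), TensorQuantizerStub)

m = torch.nn.Module()
m.input_quantizer = TensorQuantizerStub()
m.dequant_proj = torch.nn.Linear(2, 2)  # quant-like name, not a quantizer
```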

# Cloning also breaks shared storage, which safetensors rejects.
shard = {k: clean_sd[k].cpu().clone() for k in tensor_keys}
save_file(shard, str(export_dir / shard_file))
logger.info("Saved shard: %s (%d tensors)", shard_file, len(shard))
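A self-contained sketch of the module-type check suggested above. TensorQuantizer here is a stand-in class (the real one lives in modelopt), and `modules` is a plain dict standing in for dict(model.named_modules()).

```python
class TensorQuantizer:  # stand-in for modelopt's TensorQuantizer
    pass

class Linear:  # stand-in for an ordinary weight-bearing module
    pass

def is_quantizer_key(key, named_modules):
    """True only when the key's parent module is a TensorQuantizer."""
    parent = key.rsplit(".", 1)[0] if "." in key else ""
    return isinstance(named_modules.get(parent), TensorQuantizer)

modules = {
    "layer.input_quantizer": TensorQuantizer(),
    "layer.dequant_proj": Linear(),
}
print(is_quantizer_key("layer.input_quantizer._amax", modules))  # True
print(is_quantizer_key("layer.dequant_proj.weight", modules))    # False
```

This way a key is classified by what its owning module actually is, not by what its name happens to contain.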
Collaborator


Consider adding "total_size" to the metadata in the index JSON. The HuggingFace convention for model.safetensors.index.json includes {"total_size": ...} in metadata. state_dict_split.metadata may already carry this, but worth verifying: some tools expect total_size in the index metadata.
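A hand-built example of the index layout those tools expect; the byte sizes and shard names are made up for illustration. In the real export path these values would come from the result of split_torch_state_dict_into_shards (assumed to expose metadata and a tensor-to-filename mapping; verify against the huggingface_hub version in use).

```python
import json

weight_map = {
    "model.embed.weight": "model-00001-of-00002.safetensors",
    "model.head.weight": "model-00002-of-00002.safetensors",
}
total_size = 4096 + 8192  # sum of all tensor byte sizes, in bytes

index = {"metadata": {"total_size": total_size}, "weight_map": weight_map}
print(json.dumps(index, indent=2))
```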

Contributor

@coderabbitai coderabbitai bot left a comment


♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

157-160: ⚠️ Potential issue | 🟠 Major

config.json still preserves auto_map in the clean-save path.

At Line 158–Line 160, config is written verbatim from model.config.to_dict(). This can reintroduce the same downstream failure mode (auto_map referencing missing custom code) the PR is fixing.

💡 Suggested fix
     if hasattr(model, "config"):
         config = model.config.to_dict()
+        # Avoid remote-code remapping in exported clean checkpoint.
+        config.pop("auto_map", None)
         config_path = export_dir / "config.json"
         config_path.write_text(json.dumps(config, indent=2) + "\n")
         logger.info("Saved config.json")
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 157 - 160,
The code writes model.config.to_dict() directly to config.json which can
preserve auto_map and reintroduce failures; update the block around
model.config.to_dict() (the variables config and config_path in
vllm_fakequant_hf.py) to remove any "auto_map" entry from the config dict (e.g.,
if "auto_map" in config: del config["auto_map"]) before calling
config_path.write_text(json.dumps(config, indent=2) + "\n"), ensuring the
exported config.json no longer contains auto_map references.
🧹 Nitpick comments (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

183-185: Docstring is slightly outdated for the save path.

Line 183–Line 185 says export writes via save_pretrained(...), but offloaded models now go through _save_clean_checkpoint(...).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 183 - 185,
The docstring incorrectly states that the export writes the model via
save_pretrained(..., save_modelopt_state=False); update the text to reflect that
offloaded models are now written using _save_clean_checkpoint(...) (and still
ensure ModelOpt/quantizer state is re-enabled as described). Replace the mention
of save_pretrained with a brief description that offloaded models are saved
through _save_clean_checkpoint(export_dir, ...) while preserving the existing
note about re-enabling ModelOpt/quantizer state and not saving modelopt state.
tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py (1)

99-119: Add one regression check for config.json sanitization on the offloaded path.

This branch currently asserts _save_clean_checkpoint() is called, but it doesn’t verify that exported config.json drops auto_map (the downstream OSError trigger described in the PR).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py` around lines
99 - 119, The test test_export_uses_clean_checkpoint_when_offloaded is missing a
regression assertion that the exported config.json is sanitized (i.e., has
auto_map removed) when exporting via vllm_fq.export_hf_vllm_fq_checkpoint for
offloaded models; after calling export_hf_vllm_fq_checkpoint and before
asserting _save_clean_checkpoint was called, read the produced config.json from
the tmp_path export directory and assert that the "auto_map" key is absent (or
set to a safe value), ensuring the code path that writes config.json in the
offloaded-export branch drops auto_map just like the non-offloaded path that
uses _save_clean_checkpoint.
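A standalone sketch of the suggested regression assertion. The real test would call export_hf_vllm_fq_checkpoint(...) into tmp_path; here a sanitized config.json is written by hand so the snippet runs on its own.

```python
import json
import tempfile
from pathlib import Path

# Stand-in for the export step: write a config.json with auto_map stripped.
export_dir = Path(tempfile.mkdtemp())
(export_dir / "config.json").write_text(
    json.dumps({"model_type": "llama", "hidden_size": 4096}, indent=2)
)

# The regression check to add after the export call in the test.
config = json.loads((export_dir / "config.json").read_text())
assert "auto_map" not in config
print("auto_map absent:", "auto_map" not in config)
```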

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro Plus

Run ID: 5b87f424-348c-4bed-ba94-bbc10bacc01e

📥 Commits

Reviewing files that changed from the base of the PR and between 8c189ac and 46776ad.

📒 Files selected for processing (2)
  • modelopt/torch/export/plugins/vllm_fakequant_hf.py
  • tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py

