fix: handle accelerate CPU-offloaded models in FakeQuant export #1194
sungsooha wants to merge 4 commits into NVIDIA:main from
Conversation
When models are loaded with `device_map="auto"` and layers are offloaded to CPU via `AlignDevicesHook`, `model.state_dict()` returns meta tensors and `model.save_pretrained(state_dict=clean_sd)` is ignored by accelerate. Four fixes:

1. `_materialize_offloaded_weights()`: resolve meta tensors from accelerate's `AlignDevicesHook.weights_map` before export.
2. GPU hop in weight processing: move CPU tensors to the quantizer's device (quantizer kernels like `fp4_fake_quant_block` require CUDA). Uses quantizer buffers (amax) for device detection.
3. `_save_clean_checkpoint()`: bypass `save_pretrained` entirely and write safetensors directly via `save_file()` + `split_torch_state_dict_into_shards()`. Also strips `auto_map` from config.json (custom code files are not in the export).
4. `FakeQuantWorker.compile_or_warm_up_model`: return a float (not None) to fix a multiproc executor `TypeError` in `max(compilation_times)`.

Tested: Qwen3-0.6B on H100 with forced CPU offloading (500MiB GPU limit, 254/311 meta tensors). All checks passed: no quantizer keys in safetensors, no `auto_map` in config.json.

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
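The lookup behind fix 1 can be sketched in isolation. This is a hypothetical, simplified model of the materialization step: plain dicts stand in for tensors and for accelerate's `weights_map` store, and `("meta", None)` tuples stand in for meta tensors; it is not the actual ModelOpt implementation.

```python
# Hedged sketch of meta-tensor materialization, assuming a hook_map of
# {module_prefix: offload_store_dict}; real code walks AlignDevicesHook objects.

def materialize_offloaded(state_dict, hook_map):
    """Replace ("meta", None) placeholders with values from each hook's store.

    state_dict values are ("meta", None) or ("cpu", value) tuples.
    """
    resolved = {}
    for key, entry in state_dict.items():
        if entry[0] != "meta":
            resolved[key] = entry
            continue
        # Try the most specific module prefix first.
        for prefix, wmap in sorted(hook_map.items(), key=lambda kv: -len(kv[0])):
            if key.startswith(prefix + "."):
                local_key = key[len(prefix) + 1 :]
                if local_key in wmap:
                    resolved[key] = ("cpu", wmap[local_key])
                    break
        else:
            # Left unresolved; the caller decides whether to raise.
            resolved[key] = entry
    return resolved
```

The real implementation additionally has to handle `PrefixedDataset`-style stores and nested hooks, which this sketch omits.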
📝 Walkthrough
Updated FakeQuantWorker to return a float from its compile/warm-up flow; added helpers to materialize accelerate-offloaded (meta) tensors and to write clean sharded safetensors/config during vLLM fake-quant checkpoint export; added tests for the new export behaviors.

Changes
Sequence Diagram(s)
sequenceDiagram
participant Caller as Caller
participant Export as export_hf_vllm_fq_checkpoint
participant Detector as MetaTensorDetector
participant Materialize as _materialize_offloaded_weights
participant Quantizer as Quantizer/DeviceManager
participant Saver as _save_clean_checkpoint
participant FS as FileSystem
Caller->>Export: invoke export_hf_vllm_fq_checkpoint(model, ...)
Export->>Export: state_dict = model.state_dict()
Export->>Detector: scan state_dict for meta/offloaded tensors
alt meta tensors found
Export->>Materialize: request materialization via accelerate weights_map
Materialize->>Materialize: resolve keys, replace meta with real tensors
Materialize->>Export: return materialized state_dict (or error)
end
Export->>Quantizer: disable weight quantizers and build modelopt state
Quantizer->>Export: may require CUDA device for target tensors
alt CUDA device required and found
Export->>Quantizer: move tensors to discovered CUDA device
else CUDA device required but not found
Quantizer->>Export: raise RuntimeError
end
Export->>Saver: save via _save_clean_checkpoint (if offloaded) or model.save_pretrained(state_dict=clean_sd)
Saver->>FS: write shards, safetensors and config.json
FS->>Saver: write complete
Export->>Caller: return / complete
Estimated code review effort: 🎯 4 (Complex) | ⏱️ ~45 minutes
🚥 Pre-merge checks: ✅ 3 | ❌ 1
❌ Failed checks (1 warning)
✅ Passed checks (3 passed)
Actionable comments posted: 2
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
248-252: ⚠️ Potential issue | 🟠 Major: Restore the quantizer state in a `finally`.

Once this block disables the weight quantizers, any exception from `torch.save()` or `_save_clean_checkpoint()` leaves the in-memory model partially mutated. Wrapping the disable/save/restore path in `try`/`finally` keeps a failed export from poisoning subsequent use of the model. Also applies to: 282-291
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 248 - 252, The disable/modify sequence for weight quantizers (calls to quantizer.disable(), setting quantizer._rotate via disable_rotate(quantizer), and appending to wqs_to_restore) must be wrapped with a try/finally so the original quantizer state (orig_rotate and enabled state) is always restored even if torch.save() or _save_clean_checkpoint() throws; update both the block around quantizer.disable()/orig_rotate/quantizer._rotate and the similar block at 282-291 to perform the disable and then run save inside try, and restore quantizer._rotate and re-enable the quantizer in the finally regardless of exceptions, referencing quantizer, disable_rotate, wqs_to_restore, torch.save and _save_clean_checkpoint to locate the code to change.
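The try/finally restore pattern this comment asks for can be sketched with a stand-in class; the real code manipulates ModelOpt `TensorQuantizer` objects and `disable_rotate`, so the names below are illustrative only.

```python
# Hedged sketch: restore quantizer state even when the save step raises.

class Quantizer:
    """Stand-in for a weight quantizer with an enable flag and a rotate flag."""

    def __init__(self):
        self.enabled = True
        self.rotate = True

    def disable(self):
        self.enabled = False

def export_with_restore(quantizers, save_fn):
    # Snapshot original state before mutating anything.
    saved_state = [(q, q.enabled, q.rotate) for q in quantizers]
    try:
        for q in quantizers:
            q.disable()
            q.rotate = False
        save_fn()
    finally:
        # Runs even if save_fn raises, so the model is never left mutated.
        for q, enabled, rotate in saved_state:
            q.enabled = enabled
            q.rotate = rotate
```

A failed save then propagates its exception, but the quantizers come back enabled with their original rotate settings.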
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 203-208: The code currently only moves weight w to a CUDA device
if quantizer.parameters()/buffers() contains a CUDA tensor; when TensorQuantizer
has no qtensors this misses a CUDA-only quantizer path and can still call
quantizer(w.float()) on CPU. Update the block around w and quantizer (symbols:
w, quantizer, TensorQuantizer) to, if qtensors is empty, find a sibling tensor
on the same module (e.g., iterate module.parameters()/buffers() for the first
is_cuda tensor) and use its device, or fall back to an explicit export device
variable (e.g., export_device or torch.device('cuda')) before calling quantizer;
ensure w = w.to(found_device) is applied only when needed.
- Around line 112-115: The code calls split_torch_state_dict_into_shards() and
save_file() on cpu_sd without resolving tied/shared tensors, which can cause
failures or duplicated large tensors; before sharding, detect aliasing in cpu_sd
(e.g., by comparing tensor.data_ptr()/storage()/is) and replace aliased entries
with a single canonical tensor reference (or remove duplicate keys mapping to
the same storage) so split_torch_state_dict_into_shards() and subsequent
save_file() operate on a deduplicated state dict; alternatively, delegate to the
HF helper used by save_pretrained() that already performs this alias
cleanup—perform this deduplication on cpu_sd prior to calling
split_torch_state_dict_into_shards() and use the resulting filename_to_tensors
mapping to build shard dicts for save_file().
---
Outside diff comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 248-252: The disable/modify sequence for weight quantizers (calls
to quantizer.disable(), setting quantizer._rotate via disable_rotate(quantizer),
and appending to wqs_to_restore) must be wrapped with a try/finally so the
original quantizer state (orig_rotate and enabled state) is always restored even
if torch.save() or _save_clean_checkpoint() throws; update both the block around
quantizer.disable()/orig_rotate/quantizer._rotate and the similar block at
282-291 to perform the disable and then run save inside try, and restore
quantizer._rotate and re-enable the quantizer in the finally regardless of
exceptions, referencing quantizer, disable_rotate, wqs_to_restore, torch.save
and _save_clean_checkpoint to locate the code to change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 22c6dd83-d33b-430f-8cbf-6526445753cf
📒 Files selected for processing (2)
examples/vllm_serve/fakequant_worker.py
modelopt/torch/export/plugins/vllm_fakequant_hf.py
Codecov Report: ❌ Patch coverage is
Additional details and impacted files@@ Coverage Diff @@
## main #1194 +/- ##
==========================================
- Coverage 59.45% 57.56% -1.90%
==========================================
Files 352 353 +1
Lines 40343 43428 +3085
==========================================
+ Hits 23987 25000 +1013
- Misses 16356 18428 +2072
Flags with carried forward coverage won't be shown. ☔ View full report in Codecov by Sentry.
1. Fix ruff import ordering: move logger after all imports
2. Clone tensors in `_save_clean_checkpoint` to handle tied weights (safetensors rejects shared storage)
3. Robust GPU device fallback: check quantizer params/buffers, then parent module params (handles uninitialized quantizers)

Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
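The fallback chain in item 3 can be sketched with lightweight tensor stand-ins (objects exposing `.is_cuda` and `.device`); the real code iterates torch parameters and buffers, so this is an illustrative model, not the actual implementation.

```python
# Hedged sketch of the device-detection fallback: prefer a CUDA tensor on the
# quantizer itself, then fall back to the parent module's tensors.

class FakeTensor:
    """Stand-in for a torch tensor for the purpose of device discovery."""

    def __init__(self, device):
        self.device = device
        self.is_cuda = device.startswith("cuda")

def find_cuda_device(quantizer_tensors, parent_tensors):
    """Return the first CUDA device found, or raise a clear error."""
    for t in list(quantizer_tensors) + list(parent_tensors):
        if t.is_cuda:
            return t.device
    raise RuntimeError("No CUDA device found for quantizer kernel")
```

Raising explicitly when nothing is on CUDA avoids the cryptic kernel-level failure the reviewers flagged.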
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
Actionable comments posted: 2
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 167-178: After calling _materialize_offloaded_weights(model,
state_dict, meta_keys) recompute unresolved_meta = [k for k,v in
state_dict.items() if v.is_meta]; if unresolved_meta is non-empty and contains
any keys that are not quantizer-related (e.g. not containing "quant" or
"quantizer"), raise a RuntimeError listing unresolved_meta and a short message
mentioning that materialization failed and will break subsequent fake-quant
folding or _save_clean_checkpoint; reference the symbols meta_keys,
_materialize_offloaded_weights, state_dict, and _save_clean_checkpoint so the
error helps locate the problem.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: fccfaa7b-0635-4d84-8e1d-3888a156a198
📒 Files selected for processing (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
cjluo-nv left a comment
The PR addresses real and well-documented bugs for CPU-offloaded models during FakeQuant export. The four fixes are logically sound and the code is well-commented. However, there are several issues worth addressing:

- Code duplication with existing accelerate plugin: `_materialize_offloaded_weights` re-implements offload hook traversal logic that already exists in `modelopt/torch/quantization/plugins/accelerate.py` (`_get_cpu_offload_hook` and `weight_access_and_writeback_context`). Consider reusing those utilities.
- Potential regression for non-offloaded models: `_save_clean_checkpoint` replaces `model.save_pretrained()` for ALL models, not just offloaded ones. This may lose behaviors like saving `tokenizer_config.json`, `generation_config.json`, or other files that `save_pretrained()` handles. The existing test should still pass (it only checks weights and config), but it's a behavioral change for non-offloaded users.
- Missing test coverage: no automated tests are added. While the PR notes multi-GPU testing is hard in CI, a unit test for `_materialize_offloaded_weights` with mocked hooks and a test for `_save_clean_checkpoint` verifying shard output + auto_map stripping would be feasible.
- Ambiguity in hook_map key matching: the prefix matching uses `key.startswith(prefix)` with longest-prefix-first not guaranteed, which could cause wrong matches with nested modules sharing prefixes.
continue
hooks = [hook]
if hasattr(hook, "hooks"):
    hooks = hook.hooks
Code duplication: The hook traversal logic here (checking _hf_hook, handling SequentialHook-like hooks with .hooks attribute, and PrefixedDataset access via .dataset.state_dict) is largely duplicated from modelopt/torch/quantization/plugins/accelerate.py (_get_cpu_offload_hook and weight_access_and_writeback_context). Consider extracting a shared utility, or at minimum importing/reusing _get_cpu_offload_hook to find the relevant AlignDevicesHook for each module.
updated to reuse accelerate plugin helper
local_key = key[len(prefix) :]
wmap = hook.weights_map
if hasattr(wmap, "dataset"):
    lookup_key = wmap.prefix + local_key
Potential prefix ordering issue: hook_map is a regular dict iterated by insertion order (module tree traversal order). For nested modules, a key like model.layers.0.self_attn.q_proj.weight could match prefix model. before it matches the more specific prefix model.layers.0.self_attn.q_proj.. The first match wins due to break, which could pick the wrong hook/weights_map.
Consider sorting hook_map by prefix length (descending) so longest (most specific) prefixes are checked first, or using a different lookup strategy.
switched to longest-prefix-first matching
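The longest-prefix-first matching adopted here can be sketched in a few lines; the prefixes and hook names below are illustrative, not the actual ModelOpt data structures.

```python
# Hedged sketch: sort candidate module prefixes by length (descending) so the
# most specific hook wins, avoiding the insertion-order pitfall the reviewer
# describes for nested modules sharing a prefix.

def match_hook(key, hook_map):
    """Return (prefix, hook) whose prefix is the longest match for key."""
    for prefix in sorted(hook_map, key=len, reverse=True):
        if key.startswith(prefix):
            return prefix, hook_map[prefix]
    return None, None
```

With insertion-order matching, a coarse prefix like `model.` inserted first would shadow the more specific hook; sorting by length removes that dependence on traversal order.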
| """Save clean weights + config directly, bypassing model.save_pretrained(). | ||
|
|
||
| For accelerate-offloaded models, ``save_pretrained(state_dict=clean_sd)`` | ||
| ignores the provided state_dict and saves from internal state, leaking |
Behavioral change for ALL models: _save_clean_checkpoint now replaces model.save_pretrained() unconditionally — not just for offloaded models. The old save_pretrained() also saved generation_config.json, tokenizer files (if applicable), and ran any save hooks. The new code only saves safetensors + config.json.
For the vLLM FakeQuant use case this is probably fine (vLLM doesn't need generation_config.json from the export dir). But it's worth documenting this behavioral change, or alternatively only using _save_clean_checkpoint when offloading is detected (i.e., when meta_keys is non-empty) and falling back to model.save_pretrained() otherwise.
_save_clean_checkpoint is now used only when offloaded/meta tensors are detected.
(export_dir / "model.safetensors.index.json").write_text(json.dumps(index, indent=2))

if hasattr(model, "config"):
    model.config.save_pretrained(export_dir)
Minor: `model.config.save_pretrained(export_dir)` writes config.json, then we immediately read it back, parse JSON, possibly modify, and write again. This is a small inefficiency. You could do:

import json
config_dict = model.config.to_dict()
config_dict.pop("auto_map", None)
(export_dir / "config.json").write_text(json.dumps(config_dict, indent=2))

This avoids the read-modify-write cycle and the conditional, though the current approach is functionally correct.
replaced read-modify-write flow
if t.is_cuda:
    cuda_dev = t.device
    break
if cuda_dev is None:
Potential silent failure: If no CUDA device is found (e.g., all quantizer buffers/params and module params happen to be on CPU/meta), cuda_dev remains None and w stays on CPU. The subsequent quantizer(w.float()) call will likely fail with a cryptic CUDA error deep in the kernel. Consider raising a clear error:
if cuda_dev is None:
raise RuntimeError(
f"Cannot find CUDA device for quantizer kernel on offloaded weight '{sd_key}'. "
"Ensure at least one quantizer buffer or module parameter is on CUDA."
added explicit error
state_dict_split = split_torch_state_dict_into_shards(cpu_sd, max_shard_size="5GB")
for shard_file, tensor_keys in state_dict_split.filename_to_tensors.items():
    shard = {k: cpu_sd[k] for k in tensor_keys}
Missing export_dir.mkdir() call: The export_dir directory must exist before save_file() is called. The caller export_hf_vllm_fq_checkpoint does create it (export_dir.mkdir(parents=True, exist_ok=True)), but _save_clean_checkpoint as a standalone function doesn't ensure this. Consider adding a defensive export_dir.mkdir(parents=True, exist_ok=True) or documenting the precondition.
added export_dir.mkdir(parents=True, exist_ok=True)
Signed-off-by: Sungsoo Ha <sungsooh@nvidia.com>
cjluo-nv left a comment
The PR addresses a real and important bug with accelerate-offloaded models during FakeQuant export. The four fixes are well-motivated and the approach is sound. The code is well-structured with good logging and error messages. However, I found a few issues worth addressing:

- Missing `auto_map` stripping: The PR description prominently lists "strips `auto_map` from `config.json`" as a fix, but the `_save_clean_checkpoint` function just calls `model.config.to_dict()` and writes it as-is. HuggingFace's `to_dict()` preserves `auto_map` if set. This means the described fix is not actually implemented, and the vLLM `OSError` from `auto_map` referencing missing custom Python files will still occur.
- Missing tokenizer save: The `_save_clean_checkpoint` bypass of `save_pretrained` saves config and generation_config but doesn't save the tokenizer. `save_pretrained` normally saves the tokenizer too. Downstream users may expect the tokenizer to be in the export directory.
- GPU hop doesn't move the result back to the original device: when a CPU weight gets moved to CUDA for quantization, `w_quant` is moved back to CPU via `.cpu()`, which is correct. But the original `w` device context is lost; this is fine since `.cpu()` is explicitly called, just noting for clarity.
- Test coverage is reasonable for a bug that's hard to test in CI (requires multi-GPU with forced offloading). The unit tests cover the branching logic and error paths well.
if hasattr(model, "config"):
    config = model.config.to_dict()
    config_path = export_dir / "config.json"
Bug: `auto_map` is not stripped from config.json. The PR description explicitly lists 'strips auto_map from config.json' as one of the four fixes, noting that auto_map references custom Python files not present in the export directory, causing OSError in vLLM. However, `model.config.to_dict()` will preserve `auto_map` if the model config has it set. You need to explicitly remove it:

config = model.config.to_dict()
config.pop("auto_map", None)  # Custom code files are not in the export dir

generation_config = getattr(model, "generation_config", None)
if generation_config is not None:
    generation_config.save_pretrained(export_dir)
Missing tokenizer save. model.save_pretrained() typically also saves the tokenizer (if the model has one). This bypass only saves weights, config, and generation_config. Consider whether the tokenizer should be saved here as well, or document why it's intentionally omitted. If the user expects a complete self-contained checkpoint, the missing tokenizer could be a problem.
Also, some models have additional files saved by save_pretrained (e.g., preprocessor_config.json for multimodal models). This manual approach may miss them.
    align_hook = _get_cpu_offload_hook(hook)
except AssertionError:
    # Some accelerate hook variants do not expose a plain "weight" key.
    # Fall back to generic weights_map extraction for export-time readout.
Nit: The guard if _get_cpu_offload_hook is None is checked at the top of the loop with continue, so the inner if _get_cpu_offload_hook is not None: on line 88 is always True and can be removed.
# This is dead code — _get_cpu_offload_hook is guaranteed non-None here
if _get_cpu_offload_hook is not None:

for name, module in model.named_modules():
    hook = getattr(module, "_hf_hook", None)
    if hook is None or _get_cpu_offload_hook is None:
        continue
Minor: When _get_cpu_offload_hook is None (ImportError), the function will skip all modules with _hf_hook, meaning no weights get materialized. This will then trigger the RuntimeError downstream for any non-quantizer meta keys, which is fine. But a more informative early error or warning when accelerate plugin import fails would help debugging:
if _get_cpu_offload_hook is None and meta_keys:
logger.warning(
"Could not import accelerate plugin (_get_cpu_offload_hook). "
"Cannot materialize %d offloaded weights.", len(meta_keys)
)
    return

if unresolved_meta_keys:
    shown = ", ".join(unresolved_meta_keys[:10])
    suffix = " ..." if len(unresolved_meta_keys) > 10 else ""
    raise RuntimeError(
The unresolved meta key check filters out keys containing "quantizer" or "quant". This heuristic could accidentally suppress real errors — a weight key like "dequant_proj.weight" or "quantize_proj.weight" (if such names ever exist) would be silently ignored. Consider being more precise, e.g., checking if the key corresponds to a TensorQuantizer module rather than using substring matching.
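The more precise filter this comment suggests can be sketched as follows. The module-type lookup is simulated with a plain dict; the real check would use `isinstance(module, TensorQuantizer)` on the model's module tree.

```python
# Hedged sketch: decide whether a state-dict key belongs to a quantizer module
# by looking up its owning module's type, instead of substring-matching "quant".

def is_quantizer_key(key, module_types):
    """module_types maps dotted module names to a type tag (illustrative)."""
    module_name = key.rsplit(".", 1)[0]  # strip the parameter/buffer name
    return module_types.get(module_name) == "TensorQuantizer"
```

Unlike the substring heuristic, this does not misclassify a hypothetical weight key like `dequant_proj.weight`, because the decision rests on the module's type rather than its name.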
# Cloning also breaks shared storage, which safetensors rejects.
shard = {k: clean_sd[k].cpu().clone() for k in tensor_keys}
save_file(shard, str(export_dir / shard_file))
logger.info("Saved shard: %s (%d tensors)", shard_file, len(shard))
Consider adding "model_type" to the metadata in the index JSON. The HuggingFace convention for model.safetensors.index.json typically includes {"total_size": ...} in metadata. state_dict_split.metadata may already have this, but worth verifying — some tools expect total_size in the index metadata.
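For reference, the HF-style index shape mentioned here can be built roughly like this. The shard filenames and byte sizes are illustrative, and this helper is a sketch, not the actual huggingface_hub API.

```python
# Hedged sketch: assemble a model.safetensors.index.json-style dict with
# {"metadata": {"total_size": ...}, "weight_map": {...}}, which is the
# convention many loaders expect.

def build_index(filename_to_tensors, tensor_sizes):
    """filename_to_tensors: shard file -> list of tensor keys.
    tensor_sizes: tensor key -> size in bytes (illustrative values)."""
    weight_map = {
        key: shard
        for shard, keys in filename_to_tensors.items()
        for key in keys
    }
    total = sum(tensor_sizes[k] for k in weight_map)
    return {"metadata": {"total_size": total}, "weight_map": weight_map}
```

In the PR's code path, `split_torch_state_dict_into_shards()` already returns a `metadata` attribute; the point of the review comment is to verify it actually carries `total_size` before writing the index.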
♻️ Duplicate comments (1)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)

157-160: ⚠️ Potential issue | 🟠 Major: `config.json` still preserves `auto_map` in the clean-save path.

At lines 158-160, config is written verbatim from `model.config.to_dict()`. This can reintroduce the same downstream failure mode (`auto_map` referencing missing custom code) the PR is fixing.

💡 Suggested fix

  if hasattr(model, "config"):
      config = model.config.to_dict()
+     # Avoid remote-code remapping in exported clean checkpoint.
+     config.pop("auto_map", None)
      config_path = export_dir / "config.json"
      config_path.write_text(json.dumps(config, indent=2) + "\n")
      logger.info("Saved config.json")

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 157 - 160, The code writes model.config.to_dict() directly to config.json which can preserve auto_map and reintroduce failures; update the block around model.config.to_dict() (the variables config and config_path in vllm_fakequant_hf.py) to remove any "auto_map" entry from the config dict (e.g., if "auto_map" in config: del config["auto_map"]) before calling config_path.write_text(json.dumps(config, indent=2) + "\n"), ensuring the exported config.json no longer contains auto_map references.
🧹 Nitpick comments (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.py (1)
183-185: Docstring is slightly outdated for the save path. Lines 183-185 say export writes via `save_pretrained(...)`, but offloaded models now go through `_save_clean_checkpoint(...)`.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py` around lines 183 - 185, The docstring incorrectly states that the export writes the model via save_pretrained(..., save_modelopt_state=False); update the text to reflect that offloaded models are now written using _save_clean_checkpoint(...) (and still ensure ModelOpt/quantizer state is re-enabled as described). Replace the mention of save_pretrained with a brief description that offloaded models are saved through _save_clean_checkpoint(export_dir, ...) while preserving the existing note about re-enabling ModelOpt/quantizer state and not saving modelopt state.

tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py (1)

99-119: Add one regression check for `config.json` sanitization on the offloaded path. This branch currently asserts `_save_clean_checkpoint()` is called, but it doesn't verify that the exported `config.json` drops `auto_map` (the downstream OSError trigger described in the PR).

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py` around lines 99 - 119, The test test_export_uses_clean_checkpoint_when_offloaded is missing a regression assertion that the exported config.json is sanitized (i.e., has auto_map removed) when exporting via vllm_fq.export_hf_vllm_fq_checkpoint for offloaded models; after calling export_hf_vllm_fq_checkpoint and before asserting _save_clean_checkpoint was called, read the produced config.json from the tmp_path export directory and assert that the "auto_map" key is absent (or set to a safe value), ensuring the code path that writes config.json in the offloaded-export branch drops auto_map just like the non-offloaded path that uses _save_clean_checkpoint.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Duplicate comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 157-160: The code writes model.config.to_dict() directly to
config.json which can preserve auto_map and reintroduce failures; update the
block around model.config.to_dict() (the variables config and config_path in
vllm_fakequant_hf.py) to remove any "auto_map" entry from the config dict (e.g.,
if "auto_map" in config: del config["auto_map"]) before calling
config_path.write_text(json.dumps(config, indent=2) + "\n"), ensuring the
exported config.json no longer contains auto_map references.
---
Nitpick comments:
In `@modelopt/torch/export/plugins/vllm_fakequant_hf.py`:
- Around line 183-185: The docstring incorrectly states that the export writes
the model via save_pretrained(..., save_modelopt_state=False); update the text
to reflect that offloaded models are now written using
_save_clean_checkpoint(...) (and still ensure ModelOpt/quantizer state is
re-enabled as described). Replace the mention of save_pretrained with a brief
description that offloaded models are saved through
_save_clean_checkpoint(export_dir, ...) while preserving the existing note about
re-enabling ModelOpt/quantizer state and not saving modelopt state.
In `@tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py`:
- Around line 99-119: The test test_export_uses_clean_checkpoint_when_offloaded
is missing a regression assertion that the exported config.json is sanitized
(i.e., has auto_map removed) when exporting via
vllm_fq.export_hf_vllm_fq_checkpoint for offloaded models; after calling
export_hf_vllm_fq_checkpoint and before asserting _save_clean_checkpoint was
called, read the produced config.json from the tmp_path export directory and
assert that the "auto_map" key is absent (or set to a safe value), ensuring the
code path that writes config.json in the offloaded-export branch drops auto_map
just like the non-offloaded path that uses _save_clean_checkpoint.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro Plus
Run ID: 5b87f424-348c-4bed-ba94-bbc10bacc01e
📒 Files selected for processing (2)
modelopt/torch/export/plugins/vllm_fakequant_hf.py
tests/unit/torch/export/test_vllm_fakequant_hf_export_utils.py
What does this PR do?
Type of change: Bug fix
When models are loaded with `device_map="auto"` and layers are offloaded to CPU via `AlignDevicesHook`, two issues arise during FakeQuant export:

- `model.state_dict()` returns meta tensors for offloaded layers (no actual data)
- `model.save_pretrained(state_dict=clean_sd)` is ignored by accelerate; it saves from internal state instead, leaking quantizer keys (`input_quantizer._amax`) into safetensors and preserving `auto_map` in `config.json`

This causes vLLM's weight loader to crash with `KeyError` on the unknown quantizer keys, and `OSError` when `auto_map` references custom Python files not present in the export directory.

Affected models: any model large enough to trigger CPU offloading during PTQ (e.g., DeepSeek-R1 671B on 8xH100).
Four fixes:

1. `_materialize_offloaded_weights()`: Walk accelerate's `AlignDevicesHook.weights_map` to resolve meta tensors to actual CPU data before export.
2. GPU hop in weight processing: Move CPU tensors to the quantizer's device before calling quantizer kernels (e.g., `fp4_fake_quant_block`) which require CUDA. Uses quantizer buffers (`_amax`) for device detection since NVFP4 quantizers have no parameters, only buffers.
3. `_save_clean_checkpoint()`: Bypass `save_pretrained()` entirely for weight saving. Write safetensors directly from `clean_sd` via `safetensors.torch.save_file()` + `split_torch_state_dict_into_shards()`. Also strips `auto_map` from `config.json` (custom code files are not present in the export directory).
4. `FakeQuantWorker.compile_or_warm_up_model()`: Fix return type from `None` to `float` and add a missing `return` before the `super()` call. Without this, the multiproc executor gets `[None, None, ...]` from `collective_rpc` and crashes with `TypeError` in `max(compilation_times)`.

Usage
No API changes. The fixes are internal to `export_hf_vllm_fq_checkpoint()` and `FakeQuantWorker`. Existing callers work without modification.

Testing
Tested on H100 (dlcluster) with forced CPU offloading:

- `max_memory={0: "500MiB", "cpu": "32GiB"}`
- offloaded meta tensors observed in `state_dict()`
- no `auto_map` in config.json, valid `vllm_fq_modelopt_state.pth` (257KB)

Also validated end-to-end on DeepSeek-R1 671B (8xH100, v7.1 image): PTQ export + vLLM FakeQuant serving successful.
Before your PR is "Ready for review"

- Make sure you read and follow Contributor guidelines and your commits are signed (`git commit -s -S`).
- Make sure you read and follow the Security Best Practices (e.g. avoiding hardcoded `trust_remote_code=True`, `torch.load(..., weights_only=False)`, `pickle`, etc.).
- CONTRIBUTING.md: ✅ (`huggingface_hub.split_torch_state_dict_into_shards` and `safetensors.torch.save_file` are existing dependencies)

Additional Information
The `compile_or_warm_up_model` fix (item 4) is needed for the single-node multiproc executor (TP>1); the Ray executor has a different code path that doesn't hit this bug.

Summary by CodeRabbit
Bug Fixes
New Features
Tests