Scratch: NVFP4 activation input_scale calibration study by cjluo-nv · Pull Request #1545 · NVIDIA/Model-Optimizer

cjluo-nv · 2026-05-26T18:17:34Z

What does this PR do?

Type of change: research artifact (in scratch/, not for merge)

Studies how the choice of input_scale affects NVFP4 activation
quantization MSE, on:

- synthetic distributions, including two pathological cases where
  more calibration hurts (a rare 1e6 spike in 1e-7 of samples; a
  log-normal σ=3 distribution whose amax never converges); and
- real Qwen3.5-9B MLP-input activations captured under the
  Nemotron-family calibration datasets used by modelopt's hf_ptq.py
  (chat-only, cnn_nemotron_v2_mix, nemotron-post-training-v3).
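For readers unfamiliar with the mechanism, the following is a minimal pure-Python sketch of how an oversized input_scale can hurt. It is not the code in scratch/. It assumes the usual NVFP4 layout (E2M1 elements, blocks of 16, an FP8 E4M3 block scale) and the convention input_scale = amax / (6 * 448); all function names are hypothetical, and the handling of a block scale that underflows to zero is a simplification.

```python
import math
import random

# Magnitudes representable by the 4-bit E2M1 element format.
E2M1_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
E2M1_MAX = 6.0
E4M3_MAX = 448.0  # largest finite FP8 E4M3 value
BLOCK = 16        # elements sharing one block scale


def round_to_e2m1(v):
    """Round v to the nearest E2M1 value, saturating at +/-6.

    Ties go to the smaller magnitude here, which is a simplification.
    """
    mag = min(abs(v), E2M1_MAX)
    nearest = min(E2M1_GRID, key=lambda g: abs(g - mag))
    return math.copysign(nearest, v)


def round_to_e4m3(v):
    """Round a non-negative value to FP8 E4M3 (3 mantissa bits, saturating)."""
    if v <= 0.0:
        return 0.0
    v = min(v, E4M3_MAX)
    _, exp = math.frexp(v)   # v = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)     # -6 is the smallest normal exponent
    step = 2.0 ** (e - 3)    # spacing between neighbouring values
    return round(v / step) * step


def fake_quant_nvfp4(x, global_amax):
    """Quantize then dequantize x with a two-level (tensor + block) scale."""
    # Per-tensor scale derived from the calibrated amax (assumed convention).
    input_scale = global_amax / (E2M1_MAX * E4M3_MAX)
    out = []
    for i in range(0, len(x), BLOCK):
        block = x[i:i + BLOCK]
        block_amax = max(abs(v) for v in block)
        # Block scale is stored in FP8, expressed in units of input_scale.
        scale = round_to_e4m3(block_amax / E2M1_MAX / input_scale) * input_scale
        if scale == 0.0:
            # Scale underflowed: this sketch flushes the whole block to zero.
            out.extend(0.0 for _ in block)
        else:
            out.extend(round_to_e2m1(v / scale) * scale for v in block)
    return out


def snr_db(x, xq):
    """Signal-to-quantization-noise ratio in dB."""
    signal = sum(v * v for v in x)
    noise = sum((a - b) ** 2 for a, b in zip(x, xq))
    return math.inf if noise == 0.0 else 10.0 * math.log10(signal / noise)


rng = random.Random(0)
acts = [rng.gauss(0.0, 1.0) for _ in range(4096)]
true_amax = max(abs(v) for v in acts)

snr_good = snr_db(acts, fake_quant_nvfp4(acts, true_amax))
# Pretend calibration saw one 1e6 spike that never shows up at inference.
snr_spike = snr_db(acts, fake_quant_nvfp4(acts, 1e6))
print(f"amax from data: {snr_good:.2f} dB, spike-inflated amax: {snr_spike:.2f} dB")
```

In this toy setup the damage comes from the FP8 block scale: when input_scale is inflated by the spike, ordinary blocks need block scales near or below the smallest E4M3 value, so they are rounded coarsely or lost. The numbers it prints are not comparable to the figures in the report.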

Three questions answered

1. Does more calibration always help?
   In principle, no: synthetic stress tests show 2.4 dB / 0.8 dB SNR
   losses when calibration captures rare outliers that inference
   doesn't see. In practice on Qwen3.5-9B + Nemotron-family data,
   calibration converges by ~500 sequences and bigger calibration is
   harmless (≤ 0.01 dB change up to N=2048).
2. Should we always use amax to derive input_scale?
   Yes: amax is within 0.01–0.06 dB of the MSE-optimal oracle on
   every layer × dataset combination tested. Percentile calibration
   (p99 / p99.9 / p99.99) under-shoots inference-time outliers and
   loses 1–25 dB SNR; only p99.999 matches amax.
3. Does calibration dataset choice matter?
   No: with a fixed held-out test tensor, cnn_nemotron_v2_mix and
   nemotron-post-training-v3 produce input_scale values within ~5%
   of each other, translating to ≤ 0.013 dB SNR spread on both
   measured layers.
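As a rough illustration of why percentile policies under-shoot, the sketch below derives a calibrated range from amax and from two percentiles, then counts how many held-out values fall outside it. It is an assumption-laden toy: the helper names are hypothetical, a unit Gaussian stands in for real activations (which are heavier-tailed), and it measures clipping rather than NVFP4 SNR.

```python
import random


def amax_stat(samples):
    """Largest absolute value seen during calibration."""
    return max(abs(v) for v in samples)


def percentile_stat(samples, pct):
    """Absolute-value percentile (nearest-rank style), e.g. pct=99.9."""
    mags = sorted(abs(v) for v in samples)
    idx = min(len(mags) - 1, int(len(mags) * pct / 100.0))
    return mags[idx]


def clipped_fraction(samples, limit):
    """Share of values whose magnitude exceeds the calibrated range."""
    return sum(1 for v in samples if abs(v) > limit) / len(samples)


rng = random.Random(0)
calib = [rng.gauss(0.0, 1.0) for _ in range(20000)]  # stand-in calibration pool
test = [rng.gauss(0.0, 1.0) for _ in range(20000)]   # stand-in held-out pool

for name, limit in [
    ("p99", percentile_stat(calib, 99.0)),
    ("p99.9", percentile_stat(calib, 99.9)),
    ("amax", amax_stat(calib)),
]:
    frac = clipped_fraction(test, limit)
    print(f"{name:>6}: range={limit:.3f}, clipped at test time={frac:.5f}")
```

Every clipped value saturates at the top of the quantization grid, so a policy that clips more of the tail pays for it in error on exactly the largest activations.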

Usage

# Synthetic distributions
python scratch/nvfp4_activation_calib_mse.py

# Real activations (modelopt PTQ default combo)
python scratch/capture_qwen35_mlp_activations.py \
    --n_seqs 2600 --max_tokens 512 --dataset cnn_nemotron_v2_mix
python scratch/nvfp4_real_activation_calib_mse.py

Testing

All scripts run end-to-end on a single GPU (synthetic ~3 min,
capture ~12 min, sweep ~1 min on RTX 6000 Ada).
Pre-commit hooks (ruff, mypy, markdownlint, license headers) pass.
No changes to library code — strictly additive under scratch/.

Notes

This is exploratory. The captured .pt activation tensors (~17 GB)
are excluded from the commit; everything in the report is
reproducible via the included scripts. Full discussion of methodology,
per-dataset sweep tables, fixed-test cross-combo comparison, and
limitations are in scratch/nvfp4_activation_calib_report.md.

Before your PR is "Ready for review"

Is this change backward compatible?: ✅ (strictly additive under scratch/)
If you copied code from any other sources or added a new PIP dependency: N/A
Did you write any new necessary tests?: N/A (research artifact)
Did you update Changelog?: N/A (scratch/ — not part of the library)
Did you get Claude approval on this PR?: N/A (this is a research draft)

🤖 Generated with Claude Code

Research artifact (in scratch/) studying how the choice of input_scale affects NVFP4 activation quantization MSE, both on synthetic distributions and on real Qwen3.5-9B MLP-input activations captured through the Nemotron-family calibration datasets used by modelopt's hf_ptq.py.

Three questions answered:

- Does more calibration always help? Not always (synthetic rare-giant-spike and log-normal sigma=3 distributions show 0.8-2.4 dB SNR losses with too much calibration), but for realistic LLM activations on Qwen3.5-9B, calibration is converged at ~500 sequences and more is harmless.
- Should we always use amax? Yes: amax is within 0.01-0.06 dB of the MSE-optimal oracle on every layer x dataset combination tested. No percentile policy below p99.999 matches it.
- Does calibration dataset choice matter? No: with a fixed test tensor, cnn_nemotron_v2_mix and nemotron-post-training-v3 produce input_scale values within ~5% of each other, translating to <= 0.013 dB SNR spread on layer 0 / layer 31.

Not for merge. Opening for visibility and discussion of the findings, strictly additive under scratch/. The 17 GB of captured .pt activation tensors are excluded from the commit; capture is reproducible via the included script (~12 minutes on a single GPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>

copy-pr-bot · 2026-05-26T18:17:38Z

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.

Contributors can view more details about this message here.

coderabbitai · 2026-05-26T18:17:41Z

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Enterprise

Run ID: 8ced1814-b5c8-46ef-9789-4e0ec8c1ae51

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.


codecov · 2026-05-26T18:31:33Z

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.49%. Comparing base (c9098b6) to head (c32f97f).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #1545      +/-   ##
==========================================
- Coverage   76.75%   76.49%   -0.27%     
==========================================
  Files         476      477       +1     
  Lines       51811    52819    +1008     
==========================================
+ Hits        39767    40403     +636     
- Misses      12044    12416     +372

Flag	Coverage Δ
unit	`52.76% <ø> (+0.13%)`	⬆️


Adds a controlled experiment that decouples calibration from the test-tensor confound in the earlier per-dataset sweeps:

- capture_calib_and_test_split.py runs the model once and captures separate calibration + test activation pools for both cnn_nemotron_v2_mix and nemotron-post-training-v3. Test sequences are drawn from positions strictly after each combo's calibration range, so no test sample appears in any calibration set.
- nvfp4_shared_test_sweep.py concatenates both combos' test pools into one shared held-out test tensor and applies each combo's amax-derived input_scale to it. This is the apples-to-apples experiment.

Result on Qwen3.5-9B MLP inputs: spread between the two combos on the shared test tensor is 0.002 dB (layer 0) and 0.009 dB (layer 31). Default amax is within 0.008-0.018 dB of the MSE-optimal oracle on both layers. Confirms the conclusion of the earlier (legacy) fixed-test comparison under a clean experimental design.

Report updated with a new "Clean shared-test comparison" section and a sharper answer to the original "does calibration dataset matter" question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
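The split-then-share design described in this commit can be sketched as follows. This is only an illustration under stated assumptions: the helper names and the toy pools are hypothetical, each "sequence" is a short list of floats rather than a captured activation tensor, and the real scripts may organise the data differently.

```python
def split_calib_test(sequences, n_calib, n_test):
    """Calibration = first n_calib items; test = the n_test items right after."""
    if n_calib + n_test > len(sequences):
        raise ValueError("pool too small for the requested split")
    return sequences[:n_calib], sequences[n_calib:n_calib + n_test]


def amax(pool):
    """Largest absolute activation across all sequences in a pool."""
    return max(abs(v) for seq in pool for v in seq)


# Hypothetical toy pools standing in for the two dataset combos.
combo_a = [[0.1 * i, -0.2 * i] for i in range(1, 11)]
combo_b = [[0.15 * i, -0.1 * i] for i in range(1, 11)]

calib_a, test_a = split_calib_test(combo_a, n_calib=6, n_test=4)
calib_b, test_b = split_calib_test(combo_b, n_calib=6, n_test=4)

# One shared held-out tensor: both combos are scored on identical data.
shared_test = test_a + test_b
amax_by_combo = {"combo_a": amax(calib_a), "combo_b": amax(calib_b)}

for name, value in amax_by_combo.items():
    # Each combo's calibrated range is evaluated against the same tensor.
    overflow = sum(1 for seq in shared_test for v in seq if abs(v) > value)
    print(f"{name}: calib amax={value:.2f}, shared-test values above it={overflow}")
```

Because the test slice starts where the calibration slice ends, no sequence can be in both, and because the test pools are concatenated before scoring, any difference between combos comes from calibration alone.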

- SESSION_HANDOFF.md: focused handoff doc summarising findings, artifacts, environment, and follow-up options (Option B vLLM hooks recommended) so a future session, possibly against a different debugger / Docker server, can resume the study without re-reading the whole chat.
- nvfp4_activation_calib_report.md: shared-test results refreshed with N ∈ {128, 256, 512} (sizes that fit both combo pools; the v3 pool is 881 after the Agentic-v2 streaming dropout) and a percentile-baselines subsection added.
- nvfp4_shared_test_sweep.py: n_seqs_list adjusted to {128, 256, 512}.
- nvfp4_shared_test_sweep_results.json: curves regenerated.

Conclusion is unchanged: combo-to-combo SNR spread on the shared test is 0.002 dB (layer 0) and 0.009 dB (layer 31); default amax is within 0.017-0.018 dB of the oracle on both layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
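To put sub-0.02 dB gaps in perspective, a small conversion helps. The helper below is a hypothetical convenience, not part of the scripts. It assumes SNR is defined as 10·log10(signal power / MSE) and that the signal power is the same on both sides of the comparison, which holds when both scales are evaluated on one shared test tensor.

```python
def snr_gap_to_mse_ratio(gap_db):
    """MSE ratio implied by an SNR gap in dB (same signal power assumed)."""
    return 10.0 ** (gap_db / 10.0)


for gap in (0.002, 0.009, 0.018):
    extra_pct = (snr_gap_to_mse_ratio(gap) - 1.0) * 100.0
    print(f"{gap} dB SNR gap -> about {extra_pct:.2f}% higher MSE")
```

Under that definition, gaps of this size correspond to MSE differences well under one percent.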

cjluo-nv and others added 2 commits May 26, 2026 12:08

cjluo-nv wants to merge 3 commits into main from chenjiel/nvfp4-activation-calib-study
