Scratch: NVFP4 activation input_scale calibration study #1545
Research artifact (in scratch/) studying how the choice of input_scale affects NVFP4 activation quantization MSE, both on synthetic distributions and on real Qwen3.5-9B MLP-input activations captured through the Nemotron-family calibration datasets used by modelopt's hf_ptq.py.

Three questions answered:
- Does more calibration always help? Not always (synthetic rare-giant-spike and log-normal sigma=3 distributions show 0.8-2.4 dB SNR losses with too much calibration), but for realistic LLM activations on Qwen3.5-9B, calibration is converged at ~500 sequences and more is harmless.
- Should we always use amax? Yes: amax is within 0.01-0.06 dB of the MSE-optimal oracle on every layer x dataset combination tested. No percentile policy below p99.999 matches it.
- Does calibration dataset choice matter? No: with a fixed test tensor, cnn_nemotron_v2_mix and nemotron-post-training-v3 produce input_scale values within ~5% of each other, translating to <= 0.013 dB SNR spread on layer 0 / layer 31.

Not for merge; opening for visibility and discussion of the findings, strictly additive under scratch/. The 17 GB of captured .pt activation tensors are excluded from the commit; capture is reproducible via the included script (~12 minutes on a single GPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
Codecov Report
✅ All modified and coverable lines are covered by tests.

Additional details and impacted files
@@ Coverage Diff @@
## main #1545 +/- ##
==========================================
- Coverage 76.75% 76.49% -0.27%
==========================================
Files 476 477 +1
Lines 51811 52819 +1008
==========================================
+ Hits 39767 40403 +636
- Misses 12044 12416 +372
Adds a controlled experiment that decouples calibration from the test-tensor confound in the earlier per-dataset sweeps:
- capture_calib_and_test_split.py runs the model once and captures separate calibration + test activation pools for both cnn_nemotron_v2_mix and nemotron-post-training-v3. Test sequences are drawn from positions strictly after each combo's calibration range, so no test sample appears in any calibration set.
- nvfp4_shared_test_sweep.py concatenates both combos' test pools into one shared held-out test tensor and applies each combo's amax-derived input_scale to it. This is the apples-to-apples experiment.

Result on Qwen3.5-9B MLP inputs: spread between the two combos on the shared test tensor is 0.002 dB (layer 0) and 0.009 dB (layer 31). Default amax is within 0.008-0.018 dB of the MSE-optimal oracle on both layers. Confirms the conclusion of the earlier (legacy) fixed-test comparison under a clean experimental design.

Report updated with a new "Clean shared-test comparison" section and a sharper answer to the original "does calibration dataset matter" question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
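The split-then-share design described in this commit can be sketched roughly as follows. This is a minimal illustration on random stand-in data, not the code in capture_calib_and_test_split.py or nvfp4_shared_test_sweep.py: the helper names, pool shapes, and the input_scale = amax / (6 * 448) convention are all assumptions made for the example.

```python
import numpy as np

FP4_MAX = 6.0    # largest E2M1 magnitude
FP8_MAX = 448.0  # largest E4M3 magnitude


def split_calib_test(pool, n_calib, n_test):
    """Calibration takes the first n_calib rows; test rows start strictly after."""
    if n_calib + n_test > len(pool):
        raise ValueError("pool too small for a disjoint split")
    return pool[:n_calib], pool[n_calib:n_calib + n_test]


def amax_input_scale(calib):
    # Assumed convention: global scale = amax / (FP4 max * FP8 max).
    return float(np.abs(calib).max()) / (FP4_MAX * FP8_MAX)


rng = np.random.default_rng(0)
# Hypothetical stand-ins for two captured pools, shaped (sequences, hidden).
pools = {
    "combo_a": rng.standard_normal((1024, 64)),
    "combo_b": rng.standard_normal((881, 64)) * 1.02,
}

scales, tests = {}, []
for name, pool in pools.items():
    calib, test = split_calib_test(pool, n_calib=512, n_test=256)
    scales[name] = amax_input_scale(calib)
    tests.append(test)

# One shared held-out tensor; every combo's scale is evaluated on the same data.
shared_test = np.concatenate(tests, axis=0)
rel_diff = abs(scales["combo_a"] - scales["combo_b"]) / max(scales.values())
print(shared_test.shape, f"relative input_scale difference: {rel_diff:.3%}")
```

Because the test rows are taken by position after the calibration range, disjointness holds by construction and does not need a separate deduplication pass.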
- SESSION_HANDOFF.md: focused handoff doc summarising findings,
artifacts, environment, and follow-up options (Option B vLLM hooks
recommended) so a future session — possibly against a different
debugger / Docker server — can resume the study without re-reading
the whole chat.
- nvfp4_activation_calib_report.md: shared-test results refreshed with
N ∈ {128, 256, 512} (sizes that fit both combo pools — v3 pool is
881 after the Agentic-v2 streaming dropout) and a percentile-baselines
subsection added.
- nvfp4_shared_test_sweep.py: n_seqs_list adjusted to {128, 256, 512}.
- nvfp4_shared_test_sweep_results.json: curves regenerated.
Conclusion is unchanged: combo-to-combo SNR spread on the shared test
is 0.002 dB (layer 0) and 0.009 dB (layer 31); default amax is within
0.017-0.018 dB of the oracle on both layers.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
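The calibration-size sweep over N comes down to how the amax statistic evolves as more sequences are added. A minimal sketch on synthetic stand-in data (the pool shapes and the helper name running_amax are assumptions for illustration, not taken from the scripts in this PR):

```python
import numpy as np


def running_amax(acts):
    """acts: (sequences, features). Returns amax after the first 1..N sequences."""
    per_seq = np.abs(acts).max(axis=1)
    return np.maximum.accumulate(per_seq)


rng = np.random.default_rng(0)
# Hypothetical stand-ins, not captured activations.
pools = {
    "gaussian": rng.standard_normal((2048, 256)),
    "lognormal_sigma3": rng.lognormal(mean=0.0, sigma=3.0, size=(2048, 256)),
}

curves = {name: running_amax(acts) for name, acts in pools.items()}
for name, curve in curves.items():
    picks = {n: float(curve[n - 1]) for n in (128, 256, 512, 2048)}
    print(name, picks)
```

Since amax is a running maximum, it can only stay flat or grow with N. For light-tailed data it flattens quickly, while for a heavy-tailed distribution such as log-normal with sigma=3 it tends to keep growing as new extremes arrive.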
What does this PR do?
Type of change: research artifact (in scratch/, not for merge)

Studies how the choice of input_scale affects NVFP4 activation quantization MSE, on:
- synthetic distributions where more calibration hurts (a rare 1e6 spike in 1e-7 of samples; a log-normal σ=3 distribution whose amax never converges); and
- real Qwen3.5-9B MLP-input activations captured through the Nemotron-family calibration datasets used by modelopt's hf_ptq.py (chat-only, cnn_nemotron_v2_mix, nemotron-post-training-v3).

Three questions answered
Does more calibration always help?
In principle, no: synthetic stress tests show 2.4 dB / 0.8 dB SNR losses when calibration captures rare outliers that inference doesn't see. In practice on Qwen3.5-9B + Nemotron-family data, calibration converges by ~500 sequences and bigger calibration is harmless (≤ 0.01 dB change up to N=2048).

Should we always use amax to derive input_scale?
Yes: amax is within 0.01–0.06 dB of the MSE-optimal oracle on every layer × dataset combination tested. Percentile calibration (p99 / p99.9 / p99.99) under-shoots inference-time outliers and loses 1–25 dB SNR; only p99.999 matches amax.

Does calibration dataset choice matter?
No: with a fixed held-out test tensor, cnn_nemotron_v2_mix and nemotron-post-training-v3 produce input_scale values within ~5% of each other, translating to ≤ 0.013 dB SNR spread on both measured layers.
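To make the amax-versus-percentile comparison concrete, here is a rough NumPy simulation of NVFP4 quantize-dequantize and its SNR in dB. It is a sketch under stated assumptions rather than modelopt's implementation: it assumes 16-element blocks, E2M1 values, per-block scales stored as FP8 E4M3 in units of a global input_scale, and input_scale = amax / (6 * 448). All function names are hypothetical and the data is synthetic Gaussian.

```python
import numpy as np

FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0])  # E2M1 magnitudes
FP4_MAX, FP8_MAX, BLOCK = 6.0, 448.0, 16


def round_e4m3(x):
    """Round non-negative values to the nearest E4M3 value, saturating at 448."""
    x = np.clip(x, 0.0, FP8_MAX)
    exp = np.floor(np.log2(np.maximum(x, 2.0 ** -6)))  # 2**-6: smallest normal
    step = 2.0 ** (exp - 3)                            # 3 mantissa bits
    return np.round(x / step) * step


def round_fp4(x):
    """Round to the nearest E2M1 value, saturating at +-6."""
    mag = np.minimum(np.abs(x), FP4_MAX)
    idx = np.abs(mag[..., None] - FP4_GRID).argmin(axis=-1)
    return np.sign(x) * FP4_GRID[idx]


def nvfp4_fake_quant(x, input_scale):
    """Quantize-dequantize a 1-D array whose length is a multiple of BLOCK."""
    blocks = x.reshape(-1, BLOCK)
    block_amax = np.abs(blocks).max(axis=1, keepdims=True)
    # Per-block scale, stored in FP8 in units of the global input_scale.
    block_scale = round_e4m3(block_amax / FP4_MAX / input_scale) * input_scale
    safe = np.where(block_scale > 0, block_scale, 1.0)  # avoid divide-by-zero
    return (round_fp4(blocks / safe) * block_scale).reshape(x.shape)


def snr_db(x, x_hat):
    return 10.0 * np.log10(np.sum(x ** 2) / np.sum((x - x_hat) ** 2))


def scale_from_amax(amax):
    # Assumed convention for the global scale.
    return amax / (FP4_MAX * FP8_MAX)


rng = np.random.default_rng(0)
calib = rng.standard_normal(1 << 14)  # synthetic stand-in for calibration data
test = rng.standard_normal(1 << 14)   # synthetic stand-in for held-out data
policies = {
    "amax": np.abs(calib).max(),
    "p99": np.percentile(np.abs(calib), 99),
}
for label, amax in policies.items():
    x_hat = nvfp4_fake_quant(test, scale_from_amax(amax))
    print(label, f"{snr_db(test, x_hat):.2f} dB")
```

In this model a too-small input_scale makes the FP8 block scale saturate at 448, which clips the largest blocks, while a too-large one pushes small block scales toward FP8 underflow. Gaussian data has light tails, so the gap between policies here will be much smaller than the figures reported for real activations.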
Usage
Testing
- Runtime: capture ~12 min, sweep ~1 min on RTX 6000 Ada.
- Changes are strictly additive under scratch/.

Notes
This is exploratory. The captured .pt activation tensors (~17 GB) are excluded from the commit; everything in the report is reproducible via the included scripts. Full discussion of methodology, per-dataset sweep tables, fixed-test cross-combo comparison, and limitations is in scratch/nvfp4_activation_calib_report.md.

Before your PR is "Ready for review"
scratch/)

🤖 Generated with Claude Code