
Scratch: NVFP4 activation input_scale calibration study#1545

Draft
cjluo-nv wants to merge 3 commits into main from chenjiel/nvfp4-activation-calib-study

Conversation

@cjluo-nv
Collaborator

What does this PR do?

Type of change: research artifact (in scratch/, not for merge)

Studies how the choice of input_scale affects NVFP4 activation
quantization MSE, on:

  • synthetic distributions, including two pathological cases where
    more calibration hurts (a rare spike of magnitude 1e6 in a 1e-7
    fraction of samples; a log-normal σ=3 distribution whose amax
    never converges); and
  • real Qwen3.5-9B MLP-input activations captured under the
    Nemotron-family calibration datasets used by modelopt's hf_ptq.py
    (chat-only, cnn_nemotron_v2_mix, nemotron-post-training-v3).
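
For readers who want to see where input_scale enters the quantizer, here is a minimal pure-Python sketch of NVFP4 quantize-dequantize and the SNR metric. It assumes the commonly described layout (16-element blocks, E2M1 elements, E4M3 block scales stored relative to a per-tensor scale) and the convention input_scale = amax / (6 * 448). All function names are illustrative and are not taken from the scripts in this PR.

```python
import math
import random

E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)  # FP4 magnitudes
E2M1_MAX, E4M3_MAX = 6.0, 448.0


def round_e4m3(x):
    """Round a non-negative float to the nearest FP8 E4M3 magnitude."""
    if x <= 0.0:
        return 0.0
    x = min(x, E4M3_MAX)
    exp = max(math.floor(math.log2(x)), -6)  # subnormals share exponent -6
    step = 2.0 ** (exp - 3)                  # 3 mantissa bits
    return min(round(x / step) * step, E4M3_MAX)


def amax_input_scale(values):
    """Assumed convention: input_scale = amax / (6 * 448)."""
    return max(abs(v) for v in values) / (E2M1_MAX * E4M3_MAX)


def fake_quant_nvfp4(values, input_scale, block=16):
    """Quantize-dequantize with per-block E4M3 scales and E2M1 elements."""
    if input_scale <= 0.0:
        raise ValueError("input_scale must be positive")
    out = []
    for start in range(0, len(values), block):
        chunk = values[start:start + block]
        block_amax = max(abs(v) for v in chunk)
        # The block scale is stored in E4M3 relative to the per-tensor scale.
        scale = round_e4m3(block_amax / E2M1_MAX / input_scale) * input_scale
        for v in chunk:
            if scale == 0.0:
                out.append(0.0)
                continue
            mag = min(abs(v) / scale, E2M1_MAX)  # saturate at the FP4 max
            q = min(E2M1_GRID, key=lambda g: abs(g - mag))
            out.append(math.copysign(q * scale, v))
    return out


def snr_db(values, quantized):
    """Signal-to-quantization-noise ratio in dB."""
    signal = sum(v * v for v in values)
    noise = sum((v - q) ** 2 for v, q in zip(values, quantized))
    return float("inf") if noise == 0.0 else 10.0 * math.log10(signal / noise)


# Synthetic stand-in for an activation tensor (unit Gaussian, fixed seed).
rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(4096)]
scale = amax_input_scale(x)
snr_amax = snr_db(x, fake_quant_nvfp4(x, scale))
snr_small = snr_db(x, fake_quant_nvfp4(x, scale / 100.0))
print(f"amax-derived scale: {snr_amax:.2f} dB")
print(f"scale 100x too small: {snr_small:.2f} dB")
```

A scale that is far too small forces every block scale to saturate at the E4M3 maximum, so most values clip; that is the failure mode an under-shooting calibration policy risks.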

Three questions answered

  1. Does more calibration always help?
    In principle, no — synthetic stress tests show 2.4 dB / 0.8 dB SNR
    losses when calibration captures rare outliers that inference
    doesn't see. In practice on Qwen3.5-9B + Nemotron-family data:
    calibration converges by ~500 sequences and bigger calibration is
    harmless (≤ 0.01 dB change up to N=2048).

  2. Should we always use amax to derive input_scale?
    Yes — amax is within 0.01–0.06 dB of the MSE-optimal oracle on
    every layer × dataset combination tested. Percentile calibration
    (p99 / p99.9 / p99.99) under-shoots inference-time outliers and
    loses 1–25 dB SNR; only p99.999 matches amax.

  3. Does calibration dataset choice matter?
    No — with a fixed held-out test tensor, cnn_nemotron_v2_mix and
    nemotron-post-training-v3 produce input_scale values within ~5%
    of each other, translating to ≤ 0.013 dB SNR spread on both
    measured layers.
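
To make the amax-versus-percentile comparison in question 2 concrete, the sketch below derives input_scale from a calibration sample under each policy and counts how many held-out values exceed the largest magnitude that scale can represent. It assumes the same input_scale = amax / (6 * 448) convention, a nearest-rank percentile, and a synthetic heavy-tailed distribution; none of these choices is confirmed to match the scripts in this PR, and the helper names are hypothetical.

```python
import math
import random

FP4_MAX, E4M3_MAX = 6.0, 448.0


def percentile_abs(values, pct):
    """Nearest-rank percentile of |values|, with pct in (0, 100]."""
    mags = sorted(abs(v) for v in values)
    rank = max(1, math.ceil(pct / 100.0 * len(mags)))
    return mags[rank - 1]


def input_scale_from(ref_magnitude):
    """Assumed convention: the reference magnitude maps to 6 * 448."""
    return ref_magnitude / (FP4_MAX * E4M3_MAX)


def clipped_fraction(test_values, input_scale):
    """Share of test values above the largest representable magnitude."""
    # Tiny tolerance so the reference magnitude itself never counts as clipped.
    limit = FP4_MAX * E4M3_MAX * input_scale * (1.0 + 1e-9)
    return sum(abs(v) > limit for v in test_values) / len(test_values)


def heavy_tailed(rng, n):
    """Synthetic heavy-tailed samples (Gaussian times log-normal)."""
    return [rng.gauss(0.0, 1.0) * math.exp(rng.gauss(0.0, 1.0)) for _ in range(n)]


rng = random.Random(0)
calib, test = heavy_tailed(rng, 20000), heavy_tailed(rng, 20000)
for pct in (99.0, 99.9, 99.99, 100.0):  # 100.0 is the amax policy
    s = input_scale_from(percentile_abs(calib, pct))
    print(f"p{pct:g}: input_scale={s:.3e} clipped={clipped_fraction(test, s):.4%}")
```

Clipped fraction is only a proxy: the SNR cost depends on how far the clipped values lie beyond the limit, which is what the sweep in the report measures directly.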

Usage

# Synthetic distributions
python scratch/nvfp4_activation_calib_mse.py

# Real activations (modelopt PTQ default combo)
python scratch/capture_qwen35_mlp_activations.py \
    --n_seqs 2600 --max_tokens 512 --dataset cnn_nemotron_v2_mix
python scratch/nvfp4_real_activation_calib_mse.py

Testing

  • All scripts run end-to-end on a single GPU (synthetic ~3 min,
    capture ~12 min, sweep ~1 min on RTX 6000 Ada).
  • Pre-commit hooks (ruff, mypy, markdownlint, license headers) pass.
  • No changes to library code — strictly additive under scratch/.

Notes

This is exploratory. The captured .pt activation tensors (~17 GB)
are excluded from the commit; everything in the report is
reproducible via the included scripts. The methodology, per-dataset
sweep tables, fixed-test cross-combo comparison, and limitations are
discussed in full in scratch/nvfp4_activation_calib_report.md.

Before your PR is "Ready for review"

  • Is this change backward compatible?: ✅ (strictly additive under scratch/)
  • If you copied code from any other sources or added a new PIP dependency: N/A
  • Did you write any new necessary tests?: N/A (research artifact)
  • Did you update Changelog?: N/A (scratch/ — not part of the library)
  • Did you get Claude approval on this PR?: N/A (this is a research draft)

🤖 Generated with Claude Code

Research artifact (in scratch/) studying how the choice of input_scale
affects NVFP4 activation quantization MSE — both on synthetic
distributions and on real Qwen3.5-9B MLP-input activations captured
through the Nemotron-family calibration datasets used by modelopt's
hf_ptq.py.

Three questions answered:
- Does more calibration always help? Not always (synthetic
  rare-giant-spike and log-normal sigma=3 distributions show 0.8-2.4 dB
  SNR losses with too much calibration), but for realistic LLM
  activations on Qwen3.5-9B, calibration is converged at ~500 sequences
  and more is harmless.
- Should we always use amax? Yes — amax is within 0.01-0.06 dB of the
  MSE-optimal oracle on every layer x dataset combination tested. No
  percentile policy below p99.999 matches it.
- Does calibration dataset choice matter? No — with a fixed test
  tensor, cnn_nemotron_v2_mix and nemotron-post-training-v3 produce
  input_scale values within ~5% of each other, translating to <= 0.013
  dB SNR spread on layer 0 / layer 31.

Not for merge — opening for visibility and discussion of the findings,
strictly additive under scratch/. The 17 GB of captured .pt activation
tensors are excluded from the commit; capture is reproducible via the
included script (~12 minutes on a single GPU).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
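
The convergence question can be illustrated with a running-amax sketch: track the input_scale implied after N calibration sequences and watch how it moves as N grows. The data below are synthetic (unit-Gaussian sequences with one injected 1e6 spike at a hypothetical position), and the function name and the amax / (6 * 448) convention are assumptions, not the PR's code.

```python
import random


def running_input_scale(sequences, checkpoints):
    """input_scale implied by the running amax after N calibration sequences."""
    targets = set(checkpoints)
    scales, amax, seen = {}, 0.0, 0
    for seq in sequences:
        amax = max(amax, max(abs(v) for v in seq))
        seen += 1
        if seen in targets:
            scales[seen] = amax / (6.0 * 448.0)  # assumed NVFP4 convention
    return scales


rng = random.Random(0)
sequences = [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(2048)]
sequences[1500][0] = 1e6  # rare giant spike, only seen by large calibration sets

checkpoints = (128, 256, 512, 1024, 2048)
scales = running_input_scale(sequences, checkpoints)
previous = None
for n in checkpoints:
    change = "" if previous is None else f" ({scales[n] / previous:.2f}x previous)"
    print(f"N={n:5d} input_scale={scales[n]:.4e}{change}")
    previous = scales[n]
```

Because amax is monotone in N, a single outlier that inference never sees permanently inflates the scale, which is the mechanism behind the synthetic stress cases; on well-behaved data the ratio between successive checkpoints settles near 1.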
@copy-pr-bot

copy-pr-bot Bot commented May 26, 2026

Auto-sync is disabled for draft pull requests in this repository. Workflows must be run manually.


@coderabbitai
Contributor

coderabbitai Bot commented May 26, 2026

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.


@codecov

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 76.49%. Comparing base (c9098b6) to head (c32f97f).
⚠️ Report is 11 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1545      +/-   ##
==========================================
- Coverage   76.75%   76.49%   -0.27%     
==========================================
  Files         476      477       +1     
  Lines       51811    52819    +1008     
==========================================
+ Hits        39767    40403     +636     
- Misses      12044    12416     +372     
Flag Coverage Δ
unit 52.76% <ø> (+0.13%) ⬆️


cjluo-nv and others added 2 commits May 26, 2026 12:08
Adds a controlled experiment that decouples calibration from the
test-tensor confound in the earlier per-dataset sweeps:

- capture_calib_and_test_split.py runs the model once and captures
  separate calibration + test activation pools for both
  cnn_nemotron_v2_mix and nemotron-post-training-v3. Test sequences
  are drawn from positions strictly after each combo's calibration
  range, so no test sample appears in any calibration set.

- nvfp4_shared_test_sweep.py concatenates both combos' test pools
  into one shared held-out test tensor and applies each combo's
  amax-derived input_scale to it. This is the apples-to-apples
  experiment.

Result on Qwen3.5-9B MLP inputs: spread between the two combos on
the shared test tensor is 0.002 dB (layer 0) and 0.009 dB (layer 31).
Default amax is within 0.008-0.018 dB of the MSE-optimal oracle on
both layers. Confirms the conclusion of the earlier (legacy)
fixed-test comparison under a clean experimental design.

Report updated with a new "Clean shared-test comparison" section
and a sharper answer to the original "does calibration dataset
matter" question.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>
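
The shared-test design described in this commit can be sketched as follows: split each combo's pool so test sequences come strictly after the calibration range, concatenate the test pools into one shared tensor, and apply each combo's amax-derived input_scale to it. This is an illustration under assumptions, not the contents of capture_calib_and_test_split.py or nvfp4_shared_test_sweep.py: the pools are synthetic, the combo names are placeholders, and the quantizer is a stand-in uniform quantizer rather than NVFP4.

```python
import math
import random


def split_calib_test(pool, n_calib, n_test):
    """Calibration = first n_calib sequences; test = the n_test right after."""
    if n_calib + n_test > len(pool):
        raise ValueError("pool too small for a disjoint calib/test split")
    return pool[:n_calib], pool[n_calib:n_calib + n_test]


def amax_input_scale(sequences):
    """Assumed convention: input_scale = amax / (6 * 448)."""
    return max(abs(v) for seq in sequences for v in seq) / (6.0 * 448.0)


def standin_fake_quant(values, input_scale, levels=7):
    """Stand-in symmetric uniform quantizer (NOT NVFP4), clipped at the amax
    implied by input_scale. Swap in a real NVFP4 fake-quant to reproduce."""
    step = input_scale * 6.0 * 448.0 / levels
    return [max(-levels, min(levels, round(v / step))) * step for v in values]


def snr_db(values, quantized):
    signal = sum(v * v for v in values)
    noise = sum((v - q) ** 2 for v, q in zip(values, quantized))
    return float("inf") if noise == 0.0 else 10.0 * math.log10(signal / noise)


def shared_test_snr(calib_by_combo, test_by_combo, quantize=standin_fake_quant):
    """SNR of each combo's scale on the concatenation of all test pools."""
    shared = [v for seqs in test_by_combo.values() for seq in seqs for v in seq]
    return {
        name: snr_db(shared, quantize(shared, amax_input_scale(calib)))
        for name, calib in calib_by_combo.items()
    }


rng = random.Random(0)
pools = {  # synthetic pools with slightly different spreads (placeholder names)
    "combo_a": [[rng.gauss(0.0, 1.0) for _ in range(64)] for _ in range(160)],
    "combo_b": [[rng.gauss(0.0, 1.1) for _ in range(64)] for _ in range(160)],
}
calib, test = {}, {}
for name, pool in pools.items():
    calib[name], test[name] = split_calib_test(pool, n_calib=128, n_test=32)

snr = shared_test_snr(calib, test)
spread = max(snr.values()) - min(snr.values())
print({k: round(v, 3) for k, v in snr.items()}, f"spread={spread:.3f} dB")
```

Holding the test tensor fixed is what makes the comparison apples-to-apples: any SNR difference between combos can then only come from their input_scale values.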
- SESSION_HANDOFF.md: focused handoff doc summarising findings,
  artifacts, environment, and follow-up options (Option B vLLM hooks
  recommended) so a future session — possibly against a different
  debugger / Docker server — can resume the study without re-reading
  the whole chat.
- nvfp4_activation_calib_report.md: shared-test results refreshed with
  N ∈ {128, 256, 512} (sizes that fit both combo pools — v3 pool is
  881 after the Agentic-v2 streaming dropout) and a percentile-baselines
  subsection added.
- nvfp4_shared_test_sweep.py: n_seqs_list adjusted to {128, 256, 512}.
- nvfp4_shared_test_sweep_results.json: curves regenerated.

Conclusion is unchanged: combo-to-combo SNR spread on the shared test
is 0.002 dB (layer 0) and 0.009 dB (layer 31); default amax is within
0.017-0.018 dB of the oracle on both layers.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Signed-off-by: Chenjie Luo <chenjiel@nvidia.com>