
Add ZeroInflatedImputer: regime-aware wrapper around base imputers #186

Merged
MaxGhenis merged 2 commits into main from regime-aware-imputer
Apr 19, 2026

@MaxGhenis MaxGhenis commented Apr 19, 2026

Summary

Wraps any base Imputer with automatic regime detection and multi-regime gating for zero-inflated, sign-inflated, and three-sign target variables. Gives microimpute a first-class way to handle the zero-heavy / sign-mixed distributions that tabular microdata routinely has.

Motivation, accurately stated:

  1. microplex-us has a silent drop-negatives bug at ColumnwiseQRFDonorImputer:235 that applies y > 0 to every donor-imputed variable. Variables that can be negative (capital gains long & short, partnership/S-corp income, farm income, rental income) lose their entire negative tail because the QRF trains on positives only and produces zero-or-positive predictions. Fixing this in microplex-us requires either duplicating zero/sign-handling logic there or pulling it up into the shared imputation library — this PR is the latter.

  2. policyengine-us-data has one localized cps_ss > 0 filter in _qrf_ss_shares (SS-specific, correct-by-accident since SS is non-negative). Not a bug, but a pattern that belongs inside microimpute rather than duplicated in every consumer.

  3. Secondary concern: a single QRF trained on mixed-sign nonzero data interpolates across zero. Records that the gate marks "nonzero" can draw values from the interval between max(train_negatives) and min(train_positives), a region no actual record occupies. This shows up clearly on synthetic data with a designed gap; on real PUF data the gap may be softer, but the interpolation is still present.
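
The zero-crossing effect in item 3 can be illustrated with a toy leaf. This is a minimal sketch with made-up values, assuming the regressor returns an interpolated quantile over the training targets that share a leaf; whether a given QRF implementation interpolates this way depends on its quantile settings.

```python
import statistics

# Made-up nonzero training targets that ended up in the same leaf.
leaf = [-500.0, -120.0, 150.0, 900.0]

# Edges of the interior band: no nonzero training record lies strictly between them.
max_neg = max(v for v in leaf if v < 0)
min_pos = min(v for v in leaf if v > 0)

# An interpolated median averages the two middle values, -120 and 150.
draw = statistics.median(leaf)

print(draw)                      # 15.0
print(max_neg < draw < min_pos)  # True: the draw sits inside the empty band
```

Fitting separate regressors per sign removes the mixed leaf, so no interpolation across zero can occur.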

Regime detection

At fit time, each numeric target is classified into one of seven regimes based on which of {negative, zero, positive} are present (with configurable min_class_count / min_class_fraction to collapse tiny minorities into adjacent regimes):

| detected classes | regime | architecture |
|---|---|---|
| {+, 0, −} | THREE_SIGN | 3-way classifier + QRF on positives + QRF on negatives |
| {+, 0} | ZI_POSITIVE | binary gate + QRF on positives |
| {−, 0} | ZI_NEGATIVE | binary gate + QRF on negatives |
| {+, −} | SIGN_ONLY | binary sign gate + two QRFs |
| {+} | POSITIVE_ONLY | single QRF, no gate |
| {−} | NEGATIVE_ONLY | single QRF, no gate |
| {0} | DEGENERATE_ZERO | constant 0 |
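
The mapping above can be sketched as a small classifier over sign counts. This is an illustrative sketch, not the code in this PR: the function name `detect_regime`, the default thresholds, and the fallback when every class is below threshold are assumptions. It treats "collapse" as ignoring a sub-threshold class when choosing the regime.

```python
from enum import Enum


class Regime(Enum):
    THREE_SIGN = "three_sign"
    ZI_POSITIVE = "zi_positive"
    ZI_NEGATIVE = "zi_negative"
    SIGN_ONLY = "sign_only"
    POSITIVE_ONLY = "positive_only"
    NEGATIVE_ONLY = "negative_only"
    DEGENERATE_ZERO = "degenerate_zero"


# Surviving sign classes -> regime (mirrors the table above).
REGIME_BY_CLASSES = {
    frozenset({"pos", "zero", "neg"}): Regime.THREE_SIGN,
    frozenset({"pos", "zero"}): Regime.ZI_POSITIVE,
    frozenset({"neg", "zero"}): Regime.ZI_NEGATIVE,
    frozenset({"pos", "neg"}): Regime.SIGN_ONLY,
    frozenset({"pos"}): Regime.POSITIVE_ONLY,
    frozenset({"neg"}): Regime.NEGATIVE_ONLY,
    frozenset({"zero"}): Regime.DEGENERATE_ZERO,
}


def detect_regime(values, min_class_count=1, min_class_fraction=0.0):
    """Pick a regime from the sign classes that are adequately represented."""
    if not values:
        raise ValueError("cannot detect a regime from an empty target")
    n = len(values)
    counts = {
        "neg": sum(1 for v in values if v < 0),
        "zero": sum(1 for v in values if v == 0),
        "pos": sum(1 for v in values if v > 0),
    }
    # Keep a class only if it clears both thresholds (assumed semantics).
    kept = {
        name
        for name, c in counts.items()
        if c >= max(1, min_class_count) and c / n >= min_class_fraction
    }
    if not kept:
        # Assumed fallback: keep the most common class so a regime always exists.
        kept = {max(counts, key=counts.get)}
    return REGIME_BY_CLASSES[frozenset(kept)]


print(detect_regime([-5.0, 0.0, 0.0, 12.0]).name)  # THREE_SIGN

# Five negative outliers among 1,000 rows do not trigger a three-way split.
target = [-1.0] * 5 + [0.0] * 600 + [1.0] * 395
print(detect_regime(target, min_class_count=10).name)  # ZI_POSITIVE
```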

The gate classifier is HistGradientBoostingClassifier by default (the rf option is kept). On microplex-us's 26-target isolated log-loss benchmark, HistGB Pareto-dominates a 50-tree RF on log-loss, Brier, ECE, and ROC-AUC.

Exact zeros come from the gate path directly (never a QRF draw that rounds to zero). Tripartite predictions are structurally guaranteed to stay out of the positive/negative interior band.
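
A minimal sketch of the three-sign routing, to show where the structural guarantee comes from. The gate labels are passed in directly rather than predicted, and each sign-specific QRF is replaced by a stand-in that samples from the training targets of that sign. The function names are hypothetical. The stand-in relies on the assumption that a quantile forest's predictions stay within the range of the targets it was trained on.

```python
import random


def fit_tripartite(y_train):
    """Split training targets by sign; stands in for fitting two separate QRFs."""
    return {
        "neg": sorted(v for v in y_train if v < 0),
        "pos": sorted(v for v in y_train if v > 0),
    }


def predict_tripartite(gate_labels, branches, rng):
    """Route each record by its gate label: -1 (negative), 0 (zero), +1 (positive)."""
    predictions = []
    for label in gate_labels:
        if label == 0:
            # Exact zero comes from the gate path; no regressor draw is involved.
            predictions.append(0.0)
        elif label > 0:
            predictions.append(rng.choice(branches["pos"]))
        else:
            predictions.append(rng.choice(branches["neg"]))
    return predictions


rng = random.Random(0)
y_train = [-500.0, -120.0, 0.0, 0.0, 150.0, 900.0]
branches = fit_tripartite(y_train)
preds = predict_tripartite([1, 0, -1, 1, 0, -1], branches, rng)

max_neg, min_pos = branches["neg"][-1], branches["pos"][0]
# Every prediction is an exact zero or lies outside the (max_neg, min_pos) band.
assert all(p == 0.0 or p <= max_neg or p >= min_pos for p in preds)
```

Because each branch only ever sees one sign, the guarantee holds by construction rather than depending on how well the regressor fits.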

Holdout experiment

experiments/regime_aware_holdout.py (5 seeds, 10k records each, designed positive/negative gap):

| approach | pinball | zero_mae | sign_hit | ks | interior_violation |
|---|---|---|---|---|---|
| A Tripartite | 9.50 | 0.008 | 0.532 | 0.033 | 0.000 ± 0.000 |
| B Binary+mixed QRF | 9.46 | 0.007 | 0.532 | 0.026 | 0.006 |
| C Positive-only (the microplex-us bug) | 9.43 | 0.422 | 0.339 | 0.427 | 0.000 |
| D No gate | 9.43 | 0.014 | 0.529 | 0.026 | 0.014 |

The tripartite regime doesn't statistically beat "binary nonzero + mixed QRF" on marginal distributional metrics, but structurally guarantees no predictions in the impossible interior band. For tax variables where positive and negative regimes represent genuinely different populations (gainers vs loss-harvesters), the structural guarantee matters even when marginal metrics agree.
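
The PR text does not define the metric columns, so the following is a sketch under assumed definitions: zero_mae as the absolute gap between predicted and true zero rates, sign_hit as the share of records whose predicted sign matches the true sign, and interior_violation as the share of nonzero predictions that fall strictly inside the training band. The function names are hypothetical; the experiment script may compute these differently.

```python
def sign(v):
    """Return -1, 0, or +1."""
    return (v > 0) - (v < 0)


def zero_rate_abs_error(y_true, y_pred):
    """Absolute difference between the predicted and the true share of exact zeros."""
    true_rate = sum(1 for v in y_true if v == 0) / len(y_true)
    pred_rate = sum(1 for v in y_pred if v == 0) / len(y_pred)
    return abs(pred_rate - true_rate)


def sign_hit_rate(y_true, y_pred):
    """Share of records where the predicted sign equals the true sign."""
    hits = sum(1 for t, p in zip(y_true, y_pred) if sign(t) == sign(p))
    return hits / len(y_true)


def interior_violation_rate(y_pred, max_train_neg, min_train_pos):
    """Share of nonzero predictions strictly inside (max_train_neg, min_train_pos)."""
    inside = sum(1 for p in y_pred if p != 0 and max_train_neg < p < min_train_pos)
    return inside / len(y_pred)


# Made-up holdout rows with a training band of (-100, 100).
y_true = [-200.0, 0.0, 0.0, 300.0]
y_pred = [-150.0, 0.0, 40.0, 250.0]

print(zero_rate_abs_error(y_true, y_pred))           # 0.25
print(sign_hit_rate(y_true, y_pred))                 # 0.75
print(interior_violation_rate(y_pred, -100.0, 100.0))  # 0.25
```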

Test plan

  • pytest tests/test_models/test_zero_inflated.py — 13 new tests pass (regime detection × 7 + prediction correctness × 5 + parity).
  • pytest tests/ on origin/main at baseline — 8 pre-existing failures in test_autoimpute.py (missing Matching/OLS imports in the test file), unrelated to this change. Confirmed by stash-checkout.
  • Holdout experiment reproducible: uv run python experiments/regime_aware_holdout.py writes JSON with per-seed raw results.
  • make format clean.

Downstream wiring

microplex-us will adopt this in a follow-up PR to fix the silent drop-negatives bug. policyengine-us-data can optionally adopt it to remove the inline cps_ss > 0 filter, but that's a cleanup, not a bugfix.

🤖 Generated with Claude Code

Tabular microdata variables fall into distinct regimes based on which
of {negative, zero, positive} values appear. Imputing them with a
single regressor mixes regimes together, causing two bugs seen in
downstream ecosystems:

1. Negative-dropping. The common "fit QRF on y > 0" pattern drops
   negative training rows along with zeros. Variables like
   short_term_capital_gains lose their entire negative tail at
   prediction time. (Present today in microplex_us.pipelines.us
   ColumnwiseQRFDonorImputer:235.)

2. Zero-crossing interpolation. A single QRF fit on all nonzero values
   (both signs) learns leaf distributions that interpolate between
   positive and negative training rows. Gate-called-nonzero records
   can draw from an interval between max(train_neg) and min(train_pos)
   that no actual record occupies.

ZeroInflatedImputer wraps any base Imputer and:

- Detects the regime automatically at fit time from the training
  distribution. No per-variable hand configuration.
- Seven regimes: THREE_SIGN, ZI_POSITIVE, ZI_NEGATIVE, SIGN_ONLY,
  POSITIVE_ONLY, NEGATIVE_ONLY, DEGENERATE_ZERO. Minority classes
  below min_class_count or min_class_fraction collapse into the
  nearest adjacent regime (so a 5-sample negative outlier doesn't
  trigger a full three-way split).
- Composes a gate classifier (HistGradientBoosting default, RF option)
  with the base Imputer as appropriate for the detected regime.
- At predict time, routes each record through the gate to its
  regime-specific base imputer. Exact zeros come from the gate path
  directly (no QRF involvement). Tripartite predictions never cross
  the positive/negative boundary.

Generic over the base imputer — QRF is the tested default, but MDN,
OLS, Matching compose the same way.

Tests (tests/test_models/test_zero_inflated.py, 13 pass):

- Regime detection across seven cases including the minority-class
  collapse rule.
- Predictions respect the regime: ZI_POSITIVE draws are >= 0,
  ZI_NEGATIVE draws are <= 0, three-sign draws stay out of the
  (max_neg, min_pos) interior band, constant-zero draws are exactly 0.
- Parity: POSITIVE_ONLY wrapper produces distributions similar to
  bare QRF on the same data.

Holdout experiment (experiments/regime_aware_holdout.py, 5 seeds,
10k records each with a designed positive/negative gap):

- Current microplex-us bug (y > 0 + QRF on positives only): zero-rate
  MAE = 0.422, sign-match = 0.339. Catastrophic.
- Binary-nonzero gate + mixed QRF: zero-rate MAE = 0.007, sign-match
  = 0.532, interior-band violations = 0.6%. Distributionally fine
  but produces some impossible-value draws.
- Tripartite (this PR): zero-rate MAE = 0.008, sign-match = 0.532,
  interior-band violations = 0.000 ± 0.000 across all seeds.

Summary: tripartite isn't statistically better than binary-nonzero on
marginal distributional metrics (pinball, KS, zero-rate, sign-match
all tie), but structurally guarantees no zero-crossing artifacts. For
tax variables where positive and negative regimes represent genuinely
different populations (gainers vs loss-harvesters), the structural
guarantee matters even when the marginal metrics agree.

Not included in this PR (noted for follow-up):

- tests/test_autoimpute.py has 8 failures on origin/main from missing
  Matching/OLS imports — pre-existing, unrelated.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

make format reformatted the three new files (line-length, trailing
commas) — no semantic changes. 13/13 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit a13b1f4 into main Apr 19, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the regime-aware-imputer branch April 19, 2026 17:22
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request Apr 22, 2026
ColumnwiseQRFDonorImputer previously trained its zero-inflation
classifier with label `(y > 0).astype(int)` and filtered the
downstream QRF training set to `y > 0`. For any target that can be
negative (short_term_capital_gains, partnership_s_corp_income,
farm_income, rental_income, self_employment_income, etc.), the QRF
only ever saw positive training rows and could therefore never emit
a negative value at generate time — the entire negative tail of the
synthetic frame was blanked out.

Minimal fix:

- Label the classifier as `(y != 0).astype(int)` so the positive
  class is "nonzero (either sign)" rather than "positive only".
- Filter the QRF training set to `y != 0`, mixing positives and
  negatives so the QRF learns the full nonzero conditional
  distribution.
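
The effect of the two changes can be seen on a toy target. This is a sketch in plain Python with made-up values; the actual code applies the equivalent array expressions quoted above.

```python
# Made-up target column: two negatives, two zeros, two positives.
y = [-300.0, -50.0, 0.0, 0.0, 75.0, 400.0]

# Pre-fix: the gate's positive class is "y > 0" and the QRF trains on positives only.
gate_old = [int(v > 0) for v in y]
train_old = [v for v in y if v > 0]

# Post-fix: the gate's positive class is "nonzero" and the QRF trains on both signs.
gate_new = [int(v != 0) for v in y]
train_new = [v for v in y if v != 0]

print(gate_old, train_old)  # [0, 0, 0, 0, 1, 1] [75.0, 400.0]
print(gate_new, train_new)  # [1, 1, 0, 0, 1, 1] [-300.0, -50.0, 75.0, 400.0]
```

With the old labels, negatives are also grouped with zeros by the gate, so they are lost twice: once in the classifier and once in the regressor's training set.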

Test (TDD):

tests/pipelines/test_donor_imputer_negative_preservation.py fits on
a synthetic frame with ~40% negatives, ~20% zeros, ~40% positives,
generates 2000 synthetic rows, asserts at least 5% of the generated
values are negative. Pre-fix: 0 negatives produced. Post-fix: passes.

Scope:

This is the minimal fix. The full upgrade is to replace
`ColumnwiseQRFDonorImputer`'s ad-hoc gate entirely with
`microimpute.models.ZeroInflatedImputer` (PolicyEngine/microimpute#186,
merged), which auto-detects the three-sign regime on each target and
routes nonzero-positive and nonzero-negative predictions through
separate QRFs. That gives a structural guarantee against
interior-band leakage in addition to the drop-negatives fix — see
the holdout experiment in PolicyEngine/microimpute@a13b1f4 for the
quantitative comparison. Tracked for v9 as a standalone refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request Apr 22, 2026
…Imputer

Introduces a new donor_imputer_backend option, `regime_aware`, that
wraps microimpute.ZeroInflatedImputer (PolicyEngine/microimpute#186,
merged) per target column. ZeroInflatedImputer auto-detects the
three-sign regime on the training distribution and routes predictions
through sign-specific QRFs, giving a structural guarantee that no
prediction lands in the interior band between max(train_negatives)
and min(train_positives).

Differences from the existing backends:

- `qrf`: single QRF, no gate. Zeros come out as whatever the QRF
  happens to predict near zero. Interior-band violations typical.
- `zi_qrf`: ad-hoc gate (`y > 0` originally; `y != 0` since commit
  8c88277, which keeps negatives). Binary gate + single QRF on the
  mixed nonzero subset.
  Interior-band violations still possible because one QRF trained on
  both signs interpolates near zero.
- `regime_aware` (new): ZeroInflatedImputer auto-detects one of seven
  regimes (THREE_SIGN / ZI_POSITIVE / ZI_NEGATIVE / SIGN_ONLY /
  POSITIVE_ONLY / NEGATIVE_ONLY / DEGENERATE_ZERO) per target, and
  for three-sign variables routes to separate positive and negative
  QRFs. Interior-band violations structurally impossible.

Tests (6 pass):

- `tests/pipelines/test_regime_aware_donor_imputer.py`:
  - Class importable from microplex_us.pipelines.us
  - Factory dispatches `backend='regime_aware'` to the new class
  - Fit+generate preserves negatives, positives, and exact zeros
  - **Zero interior-band violations** on a three-sign fixture with a
    designed (-100, 100) empty band in training data — the structural
    guarantee the upstream PR provides

CLI flag `--donor-imputer-backend` now accepts `regime_aware` alongside
maf / qrf / zi_qrf. Ready to launch v9 once v8 completes.

Known upstream issue: microimpute 2.x's
ZeroInflatedImputer._fit_base_single hardcodes log_level="ERROR" and
conflicts with any caller that passes log_level via base_imputer_kwargs.
Worked around here by leaving base_imputer_kwargs={}. A follow-up PR to
microimpute will make the hardcode conditional.
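
The commit message does not say how the conflict arises. One plausible mechanism, assumed here, is a duplicate keyword argument when a hardcoded `log_level` meets the caller's kwargs. The sketch below uses a hypothetical `StubImputer` and hypothetical builder functions, not microimpute's code, to show that failure and a conditional default that avoids it.

```python
class StubImputer:
    """Hypothetical stand-in for a base imputer that accepts log_level."""

    def __init__(self, log_level="WARNING", **kwargs):
        self.log_level = log_level
        self.kwargs = kwargs


def build_hardcoded(base_imputer_kwargs):
    # Fixed log_level plus caller kwargs: fails if the caller also sets log_level.
    return StubImputer(log_level="ERROR", **base_imputer_kwargs)


def build_conditional(base_imputer_kwargs):
    # Default log_level to "ERROR" only when the caller did not supply one.
    kwargs = {"log_level": "ERROR", **base_imputer_kwargs}
    return StubImputer(**kwargs)


try:
    build_hardcoded({"log_level": "INFO"})
    conflict = False
except TypeError:
    # Python rejects the call: log_level was passed twice.
    conflict = True

print(conflict)                                            # True
print(build_conditional({"log_level": "INFO"}).log_level)  # INFO
print(build_conditional({}).log_level)                     # ERROR
```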

v8 pipeline unaffected: its in-memory process imported the pre-edit
modules at start and is still running on the `zi_qrf` backend with the
v7-era `ColumnwiseQRFDonorImputer`. This change lands cleanly for v9
without interfering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>