Add ZeroInflatedImputer: regime-aware wrapper around base imputers by MaxGhenis · Pull Request #186 · PolicyEngine/microimpute

MaxGhenis · 2026-04-19T16:41:27Z

Summary

Wraps any base Imputer with automatic regime detection and multi-regime gating for zero-inflated, sign-inflated, and three-sign target variables. Gives microimpute a first-class way to handle the zero-heavy / sign-mixed distributions that tabular microdata routinely has.

Motivation, accurately stated:

- microplex-us has a silent drop-negatives bug at ColumnwiseQRFDonorImputer:235 that applies y > 0 to every donor-imputed variable. Variables that can be negative (capital gains long & short, partnership/S-corp income, farm income, rental income) lose their entire negative tail because the QRF trains on positives only and produces zero-or-positive predictions. Fixing this in microplex-us requires either duplicating zero/sign-handling logic there or pulling it up into the shared imputation library; this PR is the latter.
- policyengine-us-data has one localized cps_ss > 0 filter in _qrf_ss_shares (SS-specific, correct-by-accident since SS is non-negative). Not a bug, but a pattern that belongs inside microimpute rather than duplicated in every consumer.
- Secondary concern: a single QRF trained on mixed-sign nonzero data interpolates across zero. Records the gate marks "nonzero" can draw from the interval between max(train_negatives) and min(train_positives), a region no actual record occupies. This shows up on synthetic data with a designed gap; on real PUF data the gap may be softer but the interpolation is still present.
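
The zero-crossing concern can be seen without any forest at all. The following is a minimal sketch with made-up values. It uses the standard library's interpolating median as a stand-in for what a quantile estimator might do inside a leaf that holds both signs. It is an illustration, not microimpute code.

```python
import statistics

# Hypothetical nonzero training values that landed in one leaf. Note the
# empty gap between the largest negative (-120) and smallest positive (150).
leaf_values = [-500.0, -120.0, 150.0, 900.0]

max_neg = max(v for v in leaf_values if v < 0)
min_pos = min(v for v in leaf_values if v > 0)

# With an even number of values, the median averages the two middle
# values, here -120 and 150.
interpolated_median = statistics.median(leaf_values)

# The estimate falls strictly inside the gap, where no training record lives.
in_gap = max_neg < interpolated_median < min_pos
print(interpolated_median, in_gap)
```

Fitting separate models per sign removes this failure mode by construction, since neither model ever sees values from the other side of zero.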

Regime detection

At fit time, each numeric target is classified into one of seven regimes based on which of {negative, zero, positive} are present (with configurable min_class_count / min_class_fraction to collapse tiny minorities into adjacent regimes):

| detected classes | regime | architecture |
|---|---|---|
| {+, 0, −} | `THREE_SIGN` | 3-way classifier + QRF on positives + QRF on negatives |
| {+, 0} | `ZI_POSITIVE` | binary gate + QRF on positives |
| {−, 0} | `ZI_NEGATIVE` | binary gate + QRF on negatives |
| {+, −} | `SIGN_ONLY` | binary sign gate + two QRFs |
| {+} | `POSITIVE_ONLY` | single QRF, no gate |
| {−} | `NEGATIVE_ONLY` | single QRF, no gate |
| {0} | `DEGENERATE_ZERO` | constant 0 |
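
The mapping from detected classes to regime can be sketched in a few lines. This is a simplified illustration, not the PR's implementation: the function name `detect_regime`, the default threshold values, and the rule "a class below either threshold is treated as absent" are all assumptions made for the example. The actual collapse rule in microimpute may differ.

```python
from enum import Enum


class Regime(Enum):
    THREE_SIGN = "three_sign"
    ZI_POSITIVE = "zi_positive"
    ZI_NEGATIVE = "zi_negative"
    SIGN_ONLY = "sign_only"
    POSITIVE_ONLY = "positive_only"
    NEGATIVE_ONLY = "negative_only"
    DEGENERATE_ZERO = "degenerate_zero"


# Map the set of sign classes that survive thresholding to a regime.
_REGIME_BY_CLASSES = {
    frozenset("+0-"): Regime.THREE_SIGN,
    frozenset("+0"): Regime.ZI_POSITIVE,
    frozenset("-0"): Regime.ZI_NEGATIVE,
    frozenset("+-"): Regime.SIGN_ONLY,
    frozenset("+"): Regime.POSITIVE_ONLY,
    frozenset("-"): Regime.NEGATIVE_ONLY,
    frozenset("0"): Regime.DEGENERATE_ZERO,
}


def detect_regime(values, min_class_count=1, min_class_fraction=0.0):
    """Classify a target by which sign classes are adequately present.

    Assumption: a class counts as present only if it meets both
    thresholds; classes that miss them are ignored.
    """
    n = len(values)
    counts = {
        "+": sum(v > 0 for v in values),
        "0": sum(v == 0 for v in values),
        "-": sum(v < 0 for v in values),
    }
    present = frozenset(
        sign
        for sign, count in counts.items()
        if count >= max(min_class_count, 1) and count / n >= min_class_fraction
    )
    if not present:
        raise ValueError("no sign class meets the minimum-size thresholds")
    return _REGIME_BY_CLASSES[present]
```

Under these assumptions, `detect_regime([0.0] * 50 + [5.0] * 50 + [-1.0] * 2, min_class_count=5)` returns `Regime.ZI_POSITIVE`, because two negative rows are too few to justify a three-way split.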

Gate classifier is HistGradientBoostingClassifier by default (rf option kept). On microplex-us's 26-target isolated log-loss benchmark, HistGB Pareto-dominates a 50-tree RF on log-loss, Brier, ECE, and ROC-AUC.

Exact zeros come from the gate path directly (never a QRF draw that rounds to zero). Tripartite predictions are structurally guaranteed to stay out of the positive/negative interior band.
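
The routing that produces this guarantee can be sketched with deliberately simple stand-ins. In the sketch below the "gate" is just the unconditional class frequencies (the PR uses a classifier on the predictors), and each sign-specific "imputer" resamples training values of that sign (the PR uses a QRF). The function names are hypothetical. The point is the control flow: zeros are emitted directly, and nonzero draws come only from the model for their own sign.

```python
import random


def fit_three_sign(y_train):
    """Store class counts (the stand-in gate) and per-sign donor pools."""
    negatives = [v for v in y_train if v < 0]
    positives = [v for v in y_train if v > 0]
    n_zero = len(y_train) - len(negatives) - len(positives)
    return {
        "weights": [len(negatives), n_zero, len(positives)],
        "negatives": negatives,
        "positives": positives,
    }


def predict_three_sign(model, n_draws, seed=0):
    """Route each draw through the gate, then to its sign-specific pool."""
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        label = rng.choices(["neg", "zero", "pos"], weights=model["weights"])[0]
        if label == "zero":
            draws.append(0.0)  # exact zero from the gate path, no regressor
        elif label == "neg":
            draws.append(rng.choice(model["negatives"]))
        else:
            draws.append(rng.choice(model["positives"]))
    return draws


# Hypothetical training target with an empty (-100, 100) band around zero.
y_train = [-900.0, -400.0, -100.0, 0.0, 0.0, 0.0, 100.0, 350.0, 800.0]
draws = predict_three_sign(fit_three_sign(y_train), n_draws=1000)
```

Every value in `draws` is either exactly 0.0, at most -100, or at least 100, so nothing lands inside the band regardless of the seed.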

Holdout experiment

experiments/regime_aware_holdout.py (5 seeds, 10k records each, designed positive/negative gap):

| approach | pinball | zero_mae | sign_hit | ks | interior_violation |
|---|---|---|---|---|---|
| A Tripartite | 9.50 | 0.008 | 0.532 | 0.033 | 0.000 ± 0.000 |
| B Binary+mixed QRF | 9.46 | 0.007 | 0.532 | 0.026 | 0.006 |
| C Positive-only (the microplex-us bug) | 9.43 | 0.422 | 0.339 | 0.427 | 0.000 |
| D No gate | 9.43 | 0.014 | 0.529 | 0.026 | 0.014 |

The tripartite regime doesn't statistically beat "binary nonzero + mixed QRF" on marginal distributional metrics, but structurally guarantees no predictions in the impossible interior band. For tax variables where positive and negative regimes represent genuinely different populations (gainers vs loss-harvesters), the structural guarantee matters even when marginal metrics agree.
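
For readers unfamiliar with the column names, here is a hedged sketch of how metrics of this kind are commonly computed. The pinball loss is the standard quantile loss. The interior-violation and zero-rate functions are plausible definitions assumed for illustration. The exact formulas and denominators used in experiments/regime_aware_holdout.py are not stated here and may differ.

```python
def pinball_loss(y_true, q_pred, tau):
    """Mean pinball (quantile) loss of predictions at quantile level tau."""
    losses = [
        max(tau * (y - q), (tau - 1.0) * (y - q))
        for y, q in zip(y_true, q_pred)
    ]
    return sum(losses) / len(losses)


def interior_violation_rate(predictions, y_train):
    """Share of predictions that are nonzero yet fall strictly between
    the largest negative and smallest positive training value.

    Assumption: the denominator is all predictions.
    """
    max_neg = max(v for v in y_train if v < 0)
    min_pos = min(v for v in y_train if v > 0)
    violations = sum(1 for p in predictions if p != 0 and max_neg < p < min_pos)
    return violations / len(predictions)


def zero_rate_error(predictions, y_true):
    """Absolute gap between predicted and observed share of exact zeros."""
    pred_rate = sum(p == 0 for p in predictions) / len(predictions)
    true_rate = sum(y == 0 for y in y_true) / len(y_true)
    return abs(pred_rate - true_rate)
```

With training values `[-100.0, 0.0, 100.0]`, the predictions `[-150.0, 0.0, 50.0, 200.0]` give an interior-violation rate of 0.25: only 50.0 is a nonzero value inside the band.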

Test plan

pytest tests/test_models/test_zero_inflated.py — 13 new tests pass (regime detection × 7 + prediction correctness × 5 + parity).
pytest tests/ on origin/main at baseline — 8 pre-existing failures in test_autoimpute.py (missing Matching/OLS imports in the test file), unrelated to this change. Confirmed by stash-checkout.
Holdout experiment reproducible: uv run python experiments/regime_aware_holdout.py writes JSON with per-seed raw results.
make format clean.

Downstream wiring

microplex-us will adopt this as a follow-up PR to fix the silent-drop-negatives bug. policyengine-us-data can optionally adopt it to remove the inline cps_ss > 0 filter, but that's a cleanup not a bugfix.

🤖 Generated with Claude Code

Tabular microdata variables fall into distinct regimes based on which of {negative, zero, positive} values appear. Imputing them with a single regressor mixes regimes together, causing two bugs seen in downstream ecosystems:

1. Negative-dropping. The common "fit QRF on y > 0" pattern drops negative training rows along with zeros. Variables like short_term_capital_gains lose their entire negative tail at prediction time. (Present today in microplex_us.pipelines.us ColumnwiseQRFDonorImputer:235.)
2. Zero-crossing interpolation. A single QRF fit on all nonzero values (both signs) learns leaf distributions that interpolate between positive and negative training rows. Gate-called-nonzero records can draw from an interval between max(train_neg) and min(train_pos) that no actual record occupies.

ZeroInflatedImputer wraps any base Imputer and:

- Detects the regime automatically at fit time from the training distribution. No per-variable hand configuration.
- Seven regimes: THREE_SIGN, ZI_POSITIVE, ZI_NEGATIVE, SIGN_ONLY, POSITIVE_ONLY, NEGATIVE_ONLY, DEGENERATE_ZERO. Minority classes below min_class_count or min_class_fraction collapse into the nearest adjacent regime (so a 5-sample negative outlier doesn't trigger a full three-way split).
- Composes a gate classifier (HistGradientBoosting default, RF option) with the base Imputer as appropriate for the detected regime.
- At predict time, routes each record through the gate to its regime-specific base imputer. Exact zeros come from the gate path directly (no QRF involvement). Tripartite predictions never cross the positive/negative boundary.

Generic over the base imputer: QRF is the tested default, but MDN, OLS, Matching compose the same way.

Tests (tests/test_models/test_zero_inflated.py, 13 pass):

- Regime detection across seven cases including the minority-class collapse rule.
- Predictions respect the regime: ZI_POSITIVE draws are >= 0, ZI_NEGATIVE draws are <= 0, three-sign draws stay out of the (max_neg, min_pos) interior band, constant-zero draws are exactly 0.
- Parity: POSITIVE_ONLY wrapper produces distributions similar to bare QRF on the same data.

Holdout experiment (experiments/regime_aware_holdout.py, 5 seeds, 10k records each with a designed positive/negative gap):

- Current microplex-us bug (y > 0 + QRF on positives only): zero-rate MAE = 0.422, sign-match = 0.339. Catastrophic.
- Binary-nonzero gate + mixed QRF: zero-rate MAE = 0.007, sign-match = 0.532, interior-band violations = 0.6 %. Distributionally fine but produces some impossible-value draws.
- Tripartite (this PR): zero-rate MAE = 0.008, sign-match = 0.532, interior-band violations = 0.000 ± 0.000 across all seeds.

Summary: tripartite isn't statistically better than binary-nonzero on marginal distributional metrics (pinball, KS, zero-rate, sign-match all tie), but structurally guarantees no zero-crossing artifacts. For tax variables where positive and negative regimes represent genuinely different populations (gainers vs loss-harvesters), the structural guarantee matters even when the marginal metrics agree.

Not included in this PR (noted for follow-up):

- tests/test_autoimpute.py has 8 failures on origin/main from missing Matching/OLS imports (pre-existing, unrelated).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

vercel · 2026-04-19T16:41:33Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
microimpute-dashboard	Ready	Preview, Comment	Apr 19, 2026 5:13pm

make format reformatted the three new files (line-length, trailing commas) — no semantic changes. 13/13 tests still pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

ColumnwiseQRFDonorImputer previously trained its zero-inflation classifier with label `(y > 0).astype(int)` and filtered the downstream QRF training set to `y > 0`. For any target that can be negative (short_term_capital_gains, partnership_s_corp_income, farm_income, rental_income, self_employment_income, etc.), the QRF only ever saw positive training rows and could therefore never emit a negative value at generate time; the entire negative tail of the synthetic frame was blanked out.

Minimal fix:

- Label the classifier as `(y != 0).astype(int)` so the positive class is "nonzero (either sign)" rather than "positive only".
- Filter the QRF training set to `y != 0`, mixing positives and negatives so the QRF learns the full nonzero conditional distribution.

Test (TDD): tests/pipelines/test_donor_imputer_negative_preservation.py fits on a synthetic frame with ~40% negatives, ~20% zeros, ~40% positives, generates 2000 synthetic rows, and asserts that at least 5% of the generated values are negative. Pre-fix: 0 negatives produced. Post-fix: passes.

Scope: This is the minimal fix. The full upgrade is to replace `ColumnwiseQRFDonorImputer`'s ad-hoc gate entirely with `microimpute.models.ZeroInflatedImputer` (PolicyEngine/microimpute#186, merged), which auto-detects the three-sign regime on each target and routes nonzero-positive and nonzero-negative predictions through separate QRFs. That gives a structural guarantee against interior-band leakage in addition to the drop-negatives fix; see the holdout experiment in PolicyEngine/microimpute@a13b1f4 for the quantitative comparison. Tracked for v9 as a standalone refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
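
The effect of the label and filter change can be shown on a toy column. This is a plain-Python analogue of the `(y > 0).astype(int)` and `(y != 0).astype(int)` expressions quoted in the commit message, using made-up values. It does not reproduce the microplex-us code.

```python
# Hypothetical target column with negatives, zeros, and positives.
y = [-300.0, -50.0, 0.0, 0.0, 75.0, 420.0]

# Pre-fix: the gate label means "positive", and the regressor's training
# subset excludes every negative row.
old_labels = [int(v > 0) for v in y]
old_training_subset = [v for v in y if v > 0]

# Post-fix: the gate label means "nonzero (either sign)", and negatives
# stay in the regressor's training subset.
new_labels = [int(v != 0) for v in y]
new_training_subset = [v for v in y if v != 0]

# Share of negative rows the regressor gets to learn from.
negative_share_old = sum(v < 0 for v in old_training_subset) / len(old_training_subset)
negative_share_new = sum(v < 0 for v in new_training_subset) / len(new_training_subset)
print(negative_share_old, negative_share_new)
```

A regressor that never sees a negative training row has nothing to draw negatives from, which is why the pre-fix test produced zero negative values.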

…Imputer

Introduces a new donor_imputer_backend option, `regime_aware`, that wraps microimpute.ZeroInflatedImputer (PolicyEngine/microimpute#186, merged) per target column. ZeroInflatedImputer auto-detects the three-sign regime on the training distribution and routes predictions through sign-specific QRFs, giving a structural guarantee that no prediction lands in the interior band between max(train_negatives) and min(train_positives).

Differences from the existing backends:

- `qrf`: single QRF, no gate. Zeros come out as whatever the QRF happens to predict near zero. Interior-band violations typical.
- `zi_qrf`: ad-hoc `y > 0` gate (since commit 8c88277, `y != 0`, which keeps negatives). Binary gate + single QRF on the mixed nonzero subset. Interior-band violations still possible because one QRF trained on both signs interpolates near zero.
- `regime_aware` (new): ZeroInflatedImputer auto-detects one of seven regimes (THREE_SIGN / ZI_POSITIVE / ZI_NEGATIVE / SIGN_ONLY / POSITIVE_ONLY / NEGATIVE_ONLY / DEGENERATE_ZERO) per target, and for three-sign variables routes to separate positive and negative QRFs. Interior-band violations structurally impossible.

Tests (6 pass):

- `tests/pipelines/test_regime_aware_donor_imputer.py`:
  - Class importable from microplex_us.pipelines.us
  - Factory dispatches `backend='regime_aware'` to the new class
  - Fit+generate preserves negatives, positives, and exact zeros
  - **Zero interior-band violations** on a three-sign fixture with a designed (-100, 100) empty band in training data, the structural guarantee the upstream PR provides

CLI flag `--donor-imputer-backend` now accepts `regime_aware` alongside maf / qrf / zi_qrf. Ready to launch v9 once v8 completes.

Known upstream issue: microimpute 2.x's ZeroInflatedImputer._fit_base_single hardcodes log_level="ERROR" and conflicts with any caller that passes log_level via base_imputer_kwargs. Worked around here by leaving base_imputer_kwargs={}. Will file follow-up PR to microimpute to make the hardcode conditional.

v8 pipeline unaffected: its in-memory process imported the pre-edit modules at start and is still running on the `zi_qrf` backend with the v7-era `ColumnwiseQRFDonorImputer`. This change lands cleanly for v9 without interfering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
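
One common way a hardcoded keyword can conflict with caller-supplied kwargs in Python is sketched below, together with the kind of conditional default the follow-up describes. All function names here are hypothetical stand-ins. Whether the microimpute conflict takes exactly this form is an assumption, not something the commit message states.

```python
def base_imputer_fit(log_level="INFO", **other_options):
    """Hypothetical stand-in for a base imputer's fit entry point."""
    return log_level


def fit_hardcoded(base_imputer_kwargs):
    # Passing log_level explicitly and again via ** raises TypeError
    # whenever the caller's kwargs also contain "log_level".
    return base_imputer_fit(log_level="ERROR", **base_imputer_kwargs)


def fit_conditional(base_imputer_kwargs):
    # "ERROR" is only a default: a caller-supplied log_level overrides it
    # because later keys win when dictionaries are merged.
    kwargs = {"log_level": "ERROR", **base_imputer_kwargs}
    return base_imputer_fit(**kwargs)


try:
    fit_hardcoded({"log_level": "DEBUG"})
    conflict_raised = False
except TypeError:
    conflict_raised = True
```

In this sketch `fit_hardcoded({})` works, which mirrors the workaround of leaving `base_imputer_kwargs={}`, while `fit_conditional({"log_level": "DEBUG"})` returns `"DEBUG"` without error.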

vercel Bot deployed to Preview April 19, 2026 16:41 View deployment

Apply ruff format per microimpute make check-format

6992ce2

vercel Bot deployed to Preview April 19, 2026 17:13 View deployment

MaxGhenis merged commit a13b1f4 into main Apr 19, 2026
7 checks passed

MaxGhenis deleted the regime-aware-imputer branch April 19, 2026 17:22

MaxGhenis merged 2 commits into main from regime-aware-imputer
