Add ZeroInflatedImputer: regime-aware wrapper around base imputers #186
Merged
Tabular microdata variables fall into distinct regimes based on which
of {negative, zero, positive} values appear. Imputing them with a
single regressor mixes regimes together, causing two bugs seen in
downstream ecosystems:
1. Negative-dropping. The common "fit QRF on y > 0" pattern drops
negative training rows along with zeros. Variables like
short_term_capital_gains lose their entire negative tail at
prediction time. (Present today in microplex_us.pipelines.us
ColumnwiseQRFDonorImputer:235.)
2. Zero-crossing interpolation. A single QRF fit on all nonzero values
(both signs) learns leaf distributions that interpolate between
positive and negative training rows. Records that the gate labels
nonzero can draw from the interval between max(train_neg) and
min(train_pos), which no actual record occupies.
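The zero-crossing effect in item 2 can be reproduced without a forest. The sketch below is a toy, assuming a single leaf that holds four nonzero training values and a linearly interpolated quantile as the draw rule; the actual QRF implementation may draw differently, and the leaf values and the `interpolated_quantile` helper are made up for illustration.

```python
def interpolated_quantile(values, q):
    """Linearly interpolated empirical quantile (toy stand-in for a QRF leaf draw)."""
    ordered = sorted(values)
    pos = q * (len(ordered) - 1)
    lo = int(pos)
    hi = min(lo + 1, len(ordered) - 1)
    frac = pos - lo
    return ordered[lo] + frac * (ordered[hi] - ordered[lo])


# Hypothetical leaf containing both signs, with an empty band in between.
leaf = [-300.0, -150.0, 200.0, 400.0]
max_neg = max(v for v in leaf if v < 0)  # -150.0
min_pos = min(v for v in leaf if v > 0)  # 200.0

# The median interpolates between -150 and 200 and lands inside the band.
draw = interpolated_quantile(leaf, 0.5)
in_band = max_neg < draw < min_pos
print(draw, in_band)  # prints: 25.0 True
```

No training row has the value 25.0, yet the mixed-sign leaf can emit it.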
ZeroInflatedImputer wraps any base Imputer and:
- Detects the regime automatically at fit time from the training
distribution. No per-variable hand configuration.
- Seven regimes: THREE_SIGN, ZI_POSITIVE, ZI_NEGATIVE, SIGN_ONLY,
POSITIVE_ONLY, NEGATIVE_ONLY, DEGENERATE_ZERO. Minority classes
below min_class_count or min_class_fraction collapse into the
nearest adjacent regime (so a 5-sample negative outlier doesn't
trigger a full three-way split).
- Composes a gate classifier (HistGradientBoosting default, RF option)
with the base Imputer as appropriate for the detected regime.
- At predict time, routes each record through the gate to its
regime-specific base imputer. Exact zeros come from the gate path
directly (no QRF involvement). Tripartite predictions never cross
the positive/negative boundary.
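The detection step can be sketched as follows. This is a minimal illustration, not the PR's code: the mapping from sign classes to regime names is inferred from the names themselves, the default thresholds are placeholders, and the collapse rule is simplified to "treat a class below either threshold as absent". The `detect_regime` function and the `Regime` enum are hypothetical names.

```python
from enum import Enum


class Regime(Enum):
    THREE_SIGN = "THREE_SIGN"
    ZI_POSITIVE = "ZI_POSITIVE"
    ZI_NEGATIVE = "ZI_NEGATIVE"
    SIGN_ONLY = "SIGN_ONLY"
    POSITIVE_ONLY = "POSITIVE_ONLY"
    NEGATIVE_ONLY = "NEGATIVE_ONLY"
    DEGENERATE_ZERO = "DEGENERATE_ZERO"


# Assumed mapping from the set of sign classes present to a regime.
_REGIME_BY_CLASSES = {
    frozenset({"neg", "zero", "pos"}): Regime.THREE_SIGN,
    frozenset({"zero", "pos"}): Regime.ZI_POSITIVE,
    frozenset({"neg", "zero"}): Regime.ZI_NEGATIVE,
    frozenset({"neg", "pos"}): Regime.SIGN_ONLY,
    frozenset({"pos"}): Regime.POSITIVE_ONLY,
    frozenset({"neg"}): Regime.NEGATIVE_ONLY,
    frozenset({"zero"}): Regime.DEGENERATE_ZERO,
}


def detect_regime(y, min_class_count=10, min_class_fraction=0.01):
    """Classify a target by which sign classes are present in sufficient numbers."""
    if not y:
        raise ValueError("y must be non-empty")
    n = len(y)
    counts = {
        "neg": sum(1 for v in y if v < 0),
        "zero": sum(1 for v in y if v == 0),
        "pos": sum(1 for v in y if v > 0),
    }
    # Simplified collapse rule: a class below either threshold is ignored.
    kept = {
        name
        for name, c in counts.items()
        if c > 0 and c >= min_class_count and c / n >= min_class_fraction
    }
    if not kept:
        # Fall back to the single largest class if nothing clears the thresholds.
        kept = {max(counts, key=counts.get)}
    return _REGIME_BY_CLASSES[frozenset(kept)]


# A 5-sample negative tail does not trigger a three-way split.
y = [-1.0] * 5 + [0.0] * 500 + [1.0] * 495
print(detect_regime(y).name)
```

With the placeholder thresholds relaxed to `min_class_count=1, min_class_fraction=0.0`, the same data would be classified as THREE_SIGN.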
Generic over the base imputer — QRF is the tested default, but MDN,
OLS, Matching compose the same way.
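The routing and the "generic over the base imputer" idea can be illustrated with a toy that ignores covariates entirely. Everything here is an assumption made for illustration: the gate is just the empirical class frequencies (the PR uses a trained classifier), and the base imputer is a resampler of observed values (the PR uses QRF by default). `resampling_base`, `fit_three_sign`, and `predict_three_sign` are hypothetical names.

```python
import random


def resampling_base(values):
    """Toy base imputer: returns a sampler that redraws observed training values."""
    pool = list(values)
    return lambda rng: rng.choice(pool)


def fit_three_sign(y, make_base=resampling_base):
    """Fit one base imputer per sign; class counts act as a featureless gate."""
    neg = [v for v in y if v < 0]
    pos = [v for v in y if v > 0]
    n_zero = len(y) - len(neg) - len(pos)
    return {
        "labels": ["neg", "zero", "pos"],
        "weights": [len(neg), n_zero, len(pos)],
        "base": {"neg": make_base(neg), "pos": make_base(pos)},
    }


def predict_three_sign(model, n_draws, seed=0):
    rng = random.Random(seed)
    draws = []
    for _ in range(n_draws):
        label = rng.choices(model["labels"], weights=model["weights"])[0]
        if label == "zero":
            draws.append(0.0)  # exact zero comes from the gate, no base imputer
        else:
            draws.append(model["base"][label](rng))  # sign-specific base imputer
    return draws


train = [-500.0, -300.0, -120.0, 0.0, 0.0, 0.0, 0.0, 150.0, 400.0, 900.0]
model = fit_three_sign(train)
draws = predict_three_sign(model, n_draws=1000)
violations = [d for d in draws if d != 0.0 and -120.0 < d < 150.0]
print(len(violations))  # prints: 0
```

Swapping `make_base` for another factory changes how values are drawn within a sign but not the routing, which is why the boundary guarantee does not depend on the choice of base imputer, provided each sign-specific imputer stays on its own side of zero.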
Tests (tests/test_models/test_zero_inflated.py, 13 pass):
- Regime detection across seven cases including the minority-class
collapse rule.
- Predictions respect the regime: ZI_POSITIVE draws are >= 0,
ZI_NEGATIVE draws are <= 0, three-sign draws stay out of the
(max_neg, min_pos) interior band, constant-zero draws are exactly 0.
- Parity: POSITIVE_ONLY wrapper produces distributions similar to
bare QRF on the same data.
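The interior-band check used in the prediction tests can be written as a small helper. This is a sketch under assumptions: the function name is hypothetical, and exact zeros are excluded from the violation count because they are legitimate gate outputs even though zero lies between max_neg and min_pos.

```python
def interior_band_violation_rate(train, preds):
    """Share of nonzero predictions strictly inside (max_neg, min_pos) of the training data."""
    negatives = [v for v in train if v < 0]
    positives = [v for v in train if v > 0]
    if not negatives or not positives:
        raise ValueError("band is only defined when both signs appear in training")
    max_neg, min_pos = max(negatives), min(positives)
    violations = sum(1 for p in preds if p != 0 and max_neg < p < min_pos)
    return violations / len(preds)


# Training data with a designed empty band (-100, 100), as in the test fixture idea.
train = [-300.0, -100.0, 0.0, 0.0, 100.0, 250.0]
preds = [-300.0, 0.0, 50.0, 100.0, -20.0]
rate = interior_band_violation_rate(train, preds)
print(rate)  # prints: 0.4 (50.0 and -20.0 fall inside the band)
```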
Holdout experiment (experiments/regime_aware_holdout.py, 5 seeds,
10k records each with a designed positive/negative gap):
- Current microplex-us bug (y > 0 + QRF on positives only): zero-rate
MAE = 0.422, sign-match = 0.339. Catastrophic.
- Binary-nonzero gate + mixed QRF: zero-rate MAE = 0.007, sign-match
= 0.532, interior-band violations = 0.6%. Distributionally fine
but produces some impossible-value draws.
- Tripartite (this PR): zero-rate MAE = 0.008, sign-match = 0.532,
interior-band violations = 0.000 ± 0.000 across all seeds.
Summary: tripartite isn't statistically better than binary-nonzero on
marginal distributional metrics (pinball, KS, zero-rate, sign-match
all tie), but structurally guarantees no zero-crossing artifacts. For
tax variables where positive and negative regimes represent genuinely
different populations (gainers vs loss-harvesters), the structural
guarantee matters even when the marginal metrics agree.
Not included in this PR (noted for follow-up):
- tests/test_autoimpute.py has 8 failures on origin/main from missing
Matching/OLS imports — pre-existing, unrelated.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
make format reformatted the three new files (line-length, trailing commas); no semantic changes. 13/13 tests still pass.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request (Apr 22, 2026)
ColumnwiseQRFDonorImputer previously trained its zero-inflation classifier with label `(y > 0).astype(int)` and filtered the downstream QRF training set to `y > 0`. For any target that can be negative (short_term_capital_gains, partnership_s_corp_income, farm_income, rental_income, self_employment_income, etc.), the QRF only ever saw positive training rows and could therefore never emit a negative value at generate time: the entire negative tail of the synthetic frame was blanked out.

Minimal fix:
- Label the classifier as `(y != 0).astype(int)` so the positive class is "nonzero (either sign)" rather than "positive only".
- Filter the QRF training set to `y != 0`, mixing positives and negatives so the QRF learns the full nonzero conditional distribution.

Test (TDD): tests/pipelines/test_donor_imputer_negative_preservation.py fits on a synthetic frame with ~40% negatives, ~20% zeros, ~40% positives, generates 2000 synthetic rows, asserts at least 5% of the generated values are negative. Pre-fix: 0 negatives produced. Post-fix: passes.

Scope: This is the minimal fix. The full upgrade is to replace `ColumnwiseQRFDonorImputer`'s ad-hoc gate entirely with `microimpute.models.ZeroInflatedImputer` (PolicyEngine/microimpute#186, merged), which auto-detects the three-sign regime on each target and routes nonzero-positive and nonzero-negative predictions through separate QRFs. That gives a structural guarantee against interior-band leakage in addition to the drop-negatives fix; see the holdout experiment in PolicyEngine/microimpute@a13b1f4 for the quantitative comparison. Tracked for v9 as a standalone refactor.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
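The effect of the label and filter change can be seen on a handful of values. The sketch below uses plain lists in place of the `(y > 0).astype(int)` array expressions quoted in the commit message; the sample values and variable names are made up for illustration.

```python
y = [-40.0, -5.0, 0.0, 0.0, 12.0, 30.0]

# Pre-fix: the gate's positive class is "y > 0", and the QRF trains on positives only.
gate_labels_old = [int(v > 0) for v in y]
qrf_train_old = [v for v in y if v > 0]

# Post-fix: the gate's positive class is "nonzero", and the QRF sees both signs.
gate_labels_new = [int(v != 0) for v in y]
qrf_train_new = [v for v in y if v != 0]

print(gate_labels_old, qrf_train_old)  # [0, 0, 0, 0, 1, 1] [12.0, 30.0]
print(gate_labels_new, qrf_train_new)  # [1, 1, 0, 0, 1, 1] [-40.0, -5.0, 12.0, 30.0]
```

Before the change, negative rows are labelled the same as zeros and never reach the regressor, so no negative value can be generated.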
MaxGhenis added a commit to CosilicoAI/microplex-us that referenced this pull request (Apr 22, 2026)
…Imputer

Introduces a new donor_imputer_backend option, `regime_aware`, that wraps microimpute.ZeroInflatedImputer (PolicyEngine/microimpute#186, merged) per target column. ZeroInflatedImputer auto-detects the three-sign regime on the training distribution and routes predictions through sign-specific QRFs, giving a structural guarantee that no prediction lands in the interior band between max(train_negatives) and min(train_positives).

Differences from the existing backends:
- `qrf`: single QRF, no gate. Zeros come out as whatever the QRF happens to predict near zero. Interior-band violations typical.
- `zi_qrf`: ad-hoc `y > 0` gate (since commit 8c88277, `y != 0`, which keeps negatives). Binary gate + single QRF on the mixed nonzero subset. Interior-band violations still possible because one QRF trained on both signs interpolates near zero.
- `regime_aware` (new): ZeroInflatedImputer auto-detects one of seven regimes (THREE_SIGN / ZI_POSITIVE / ZI_NEGATIVE / SIGN_ONLY / POSITIVE_ONLY / NEGATIVE_ONLY / DEGENERATE_ZERO) per target, and for three-sign variables routes to separate positive and negative QRFs. Interior-band violations structurally impossible.

Tests (6 pass):
- `tests/pipelines/test_regime_aware_donor_imputer.py`:
  - Class importable from microplex_us.pipelines.us
  - Factory dispatches `backend='regime_aware'` to the new class
  - Fit+generate preserves negatives, positives, and exact zeros
  - **Zero interior-band violations** on a three-sign fixture with a designed (-100, 100) empty band in training data. This is the structural guarantee the upstream PR provides.

CLI flag `--donor-imputer-backend` now accepts `regime_aware` alongside maf / qrf / zi_qrf. Ready to launch v9 once v8 completes.

Known upstream issue: microimpute 2.x's ZeroInflatedImputer._fit_base_single hardcodes log_level="ERROR" and conflicts with any caller that passes log_level via base_imputer_kwargs. Worked around here by leaving base_imputer_kwargs={}. Will file follow-up PR to microimpute to make the hardcode conditional.

v8 pipeline unaffected: its in-memory process imported the pre-edit modules at start and is still running on the `zi_qrf` backend with the v7-era `ColumnwiseQRFDonorImputer`. This change lands cleanly for v9 without interfering.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
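One plausible mechanism for the log_level conflict, and a conditional alternative, can be sketched in plain Python. This is an assumption: the commit message does not state the exact error, and all three functions below are hypothetical stand-ins rather than microimpute code.

```python
def base_imputer_fit(X, y, log_level="WARNING"):
    """Hypothetical stand-in for a base imputer's fit call; returns the level it received."""
    return log_level


def fit_base_hardcoded(X, y, **base_imputer_kwargs):
    # The wrapper always passes log_level itself, so a caller-supplied
    # log_level arrives twice and Python raises TypeError.
    return base_imputer_fit(X, y, log_level="ERROR", **base_imputer_kwargs)


def fit_base_conditional(X, y, **base_imputer_kwargs):
    # Only apply the default when the caller did not set log_level.
    kwargs = dict(base_imputer_kwargs)
    kwargs.setdefault("log_level", "ERROR")
    return base_imputer_fit(X, y, **kwargs)


try:
    fit_base_hardcoded([], [], log_level="INFO")
    conflict = False
except TypeError:
    conflict = True

print(conflict)                                          # prints: True
print(fit_base_conditional([], [], log_level="INFO"))    # prints: INFO
print(fit_base_conditional([], []))                      # prints: ERROR
```

Using `setdefault` keeps the quiet default for callers that do not care about logging while letting others override it.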
Summary
Wraps any base `Imputer` with automatic regime detection and multi-regime gating for zero-inflated, sign-inflated, and three-sign target variables. Gives microimpute a first-class way to handle the zero-heavy / sign-mixed distributions that tabular microdata routinely has.

Motivation, accurately stated:
- microplex-us has a silent drop-negatives bug at `ColumnwiseQRFDonorImputer:235` that applies `y > 0` to every donor-imputed variable. Variables that can be negative (capital gains long & short, partnership/S-corp income, farm income, rental income) lose their entire negative tail because the QRF trains on positives only and produces zero-or-positive predictions. Fixing this in microplex-us requires either duplicating zero/sign-handling logic there or pulling it up into the shared imputation library; this PR is the latter.
- policyengine-us-data has one localized `cps_ss > 0` filter in `_qrf_ss_shares` (SS-specific, correct-by-accident since SS is non-negative). Not a bug, but a pattern that belongs inside microimpute rather than duplicated in every consumer.
- Secondary concern: a single QRF trained on mixed-sign nonzero data interpolates across zero. Records the gate marks "nonzero" can draw from the interval between `max(train_negatives)` and `min(train_positives)`, a region no actual record occupies. This shows up on synthetic data with a designed gap; on real PUF data the gap may be softer but the interpolation is still present.

Regime detection
At fit time, each numeric target is classified into one of seven regimes based on which of {negative, zero, positive} are present (with configurable `min_class_count` / `min_class_fraction` to collapse tiny minorities into adjacent regimes):
- THREE_SIGN
- ZI_POSITIVE
- ZI_NEGATIVE
- SIGN_ONLY
- POSITIVE_ONLY
- NEGATIVE_ONLY
- DEGENERATE_ZERO

Gate classifier is `HistGradientBoostingClassifier` by default (`rf` option kept). On microplex-us's 26-target isolated log-loss benchmark, HistGB Pareto-dominates a 50-tree RF on log-loss, Brier, ECE, and ROC-AUC.

Exact zeros come from the gate path directly (never a QRF draw that rounds to zero). Tripartite predictions are structurally guaranteed to stay out of the positive/negative interior band.
Holdout experiment
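`experiments/regime_aware_holdout.py` (5 seeds, 10k records each, designed positive/negative gap):
- `y > 0` gate + QRF on positives only (current microplex-us bug): zero-rate MAE = 0.422, sign-match = 0.339.
- Binary-nonzero gate + mixed QRF: zero-rate MAE = 0.007, sign-match = 0.532, interior-band violations = 0.6%.
- Tripartite (this PR): zero-rate MAE = 0.008, sign-match = 0.532, interior-band violations = 0.000 ± 0.000 across all seeds.

The tripartite regime doesn't statistically beat "binary nonzero + mixed QRF" on marginal distributional metrics, but structurally guarantees no predictions in the impossible interior band. For tax variables where positive and negative regimes represent genuinely different populations (gainers vs loss-harvesters), the structural guarantee matters even when marginal metrics agree.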
Test plan
- `pytest tests/test_models/test_zero_inflated.py`: 13 new tests pass (regime detection × 7 + prediction correctness × 5 + parity).
- `pytest tests/` on `origin/main` at baseline: 8 pre-existing failures in `test_autoimpute.py` (missing `Matching`/`OLS` imports in the test file), unrelated to this change. Confirmed by stash-checkout.
- `uv run python experiments/regime_aware_holdout.py` writes JSON with per-seed raw results.
- `make format` clean.

Downstream wiring
- `microplex-us` will adopt this as a follow-up PR to fix the silent-drop-negatives bug.
- `policyengine-us-data` can optionally adopt it to remove the inline `cps_ss > 0` filter, but that's a cleanup, not a bugfix.

🤖 Generated with Claude Code