Fix Imputer.fit discarding weights and accepting NaN weights by MaxGhenis · Pull Request #178 · PolicyEngine/microimpute

MaxGhenis · 2026-04-17T12:39:01Z

Summary

Fixes finding #4 from the bug hunt. Imputer.fit(weight_col=...) previously used weights only to construct a bootstrap-resample probability over X_train, then fed the resampled dataset UNWEIGHTED into the underlying estimator. Effective sample size shrank, rare donors were dropped entirely, and variance was inflated relative to the correct weighted estimator. The weight validation (weights <= 0).any() also silently accepted NaN values.

Changes

Imputer.fit now threads sample_weight through to each model's _fit() so learners use their native weighted-fit API:
- QRF -> RandomForestQuantileRegressor.fit(sample_weight=...) and RandomForestClassifier.fit(sample_weight=...)
- OLS -> sm.WLS replaces sm.OLS when weights are provided; LogisticRegression.fit(sample_weight=...) for classification
- Matching -> donor_sample_weight flows into R StatMatch via weight.don
QuantReg and MDN (which don't support weighted fit) now raise NotImplementedError rather than silently discarding weights.
Weight validation now uses np.isnan(weights) | (weights <= 0) to catch both NaN and non-positive values.
QRF.max_train_samples subsampling switches to positional indexing so sample_weight stays aligned.

Test plan

test_imputer_rejects_nan_weights passes
test_imputer_rejects_zero_weights passes
test_weighted_fit_differs_from_unweighted confirms WLS differs from OLS on asymmetric data
Parametrized test_weighted_training passes (OLS/QRF fit weighted; QuantReg/MDN raise)
Full tests/test_models/ suite (100 passed, 2 skipped)

Previously, weight_col was used only to construct bootstrap-resample probabilities over X_train, and the resampled dataset was then fed UNWEIGHTED into the underlying estimator (QRF, OLS, statmatch, MDN). Effective sample size shrank, rare donors were dropped entirely, and variance was inflated relative to the correct weighted estimator. Worse, the "positive weight" guard used (weights <= 0).any() which returns False on NaN, allowing NaN weights to propagate into .sample() probabilities. Changes - Imputer.fit now threads sample_weight through to _fit() so each learner uses its native weighted-fit API: - QRF: sample_weight=... to RandomForestQuantileRegressor.fit and RandomForestClassifier.fit (for categorical/boolean targets). - OLS: sm.WLS replaces sm.OLS when weights are provided; LogisticRegression.fit gets sample_weight for classification targets. - Matching: donor_sample_weight is passed as weight.don into R StatMatch.NND.hotdeck. - Models that do NOT support weighted fit (QuantReg, MDN) now raise NotImplementedError when weights are provided, instead of silently ignoring them. - The weight-validation check now detects NaN AND non-positive weights with np.isnan(weights) | (weights <= 0) and raises a clear error. - QRF max_train_samples subsampling is switched to positional indexing so sample_weight stays aligned with X_train.reset_index. Tests - test_imputer_rejects_nan_weights / test_imputer_rejects_zero_weights - test_weighted_fit_differs_from_unweighted: WLS-fit OLS on an asymmetric dataset (block of weight 50 vs weight 1 with different slopes) must differ from unweighted OLS. - Updates test_weighted_training so QuantReg/MDN assert the expected NotImplementedError. - Fixes an unrelated pre-existing test_reproducibility parametrize that referenced Matching unconditionally (fails when rpy2 isn't installed).

vercel · 2026-04-17T12:39:08Z

The latest updates on your projects. Learn more about Vercel for GitHub.

Project	Deployment	Actions	Updated (UTC)
microimpute-dashboard	Ready	Preview, Comment	Apr 17, 2026 0:39am

MaxGhenis

Weights threaded to learners verified on all five paths:

QRF: RandomForestQuantileRegressor.fit(sample_weight=...) and RandomForestClassifier.fit(sample_weight=...).
OLS → WLS: sm.WLS(y, X_with_const, weights=weights).fit() when sample_weight supplied.
Logistic: LogisticRegression.fit(..., sample_weight=...).
StatMatch: donor_sample_weight → R weight_don via localconverter (statmatch_hotdeck.py).
QuantReg and MDN raise NotImplementedError — explicit failure beats silent discard.

Weight validation fixed: np.isnan(weights) | (weights <= 0) (previously only <=0 which missed NaN). weight_col now accepts pd.Series. max_train_samples subsampling switched to positional rng.choice so sample_weight[sel] stays aligned.

Tests cover each path: test_imputer_rejects_nan_weights, test_imputer_rejects_zero_weights, test_weighted_fit_differs_from_unweighted (WLS vs OLS on asymmetric weights/slopes), test_weighted_training parametrised with expected NotImplementedError for QuantReg/MDN. Pre-existing test_reproducibility import guard added for missing rpy2.

Minor nit (not blocking): weights_arr is defined inside the if weights is not None: block above and re-referenced in a second conditional block — both guarded by the same condition, so it's correct, just slightly noisy.

CI all green. Mergeable. LGTM.

vercel bot deployed to Preview April 17, 2026 12:39 View deployment

MaxGhenis commented Apr 17, 2026

View reviewed changes

MaxGhenis merged commit c757d98 into main Apr 17, 2026
7 checks passed

MaxGhenis deleted the fix/imputer-weights branch April 17, 2026 16:11

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Fix Imputer.fit discarding weights and accepting NaN weights#178

Fix Imputer.fit discarding weights and accepting NaN weights#178
MaxGhenis merged 1 commit intomainfrom
fix/imputer-weights

MaxGhenis commented Apr 17, 2026

Uh oh!

vercel bot commented Apr 17, 2026 •

edited

Loading

Uh oh!

MaxGhenis left a comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

MaxGhenis commented Apr 17, 2026

Summary

Changes

Test plan

Uh oh!

vercel bot commented Apr 17, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

MaxGhenis left a comment

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

vercel bot commented Apr 17, 2026 •

edited

Loading