
Fix Imputer.fit discarding weights and accepting NaN weights#178

Merged
MaxGhenis merged 1 commit into main from fix/imputer-weights on Apr 17, 2026

Conversation

@MaxGhenis
Contributor

Summary

Fixes finding #4 from the bug hunt. Imputer.fit(weight_col=...) previously used weights only to construct a bootstrap-resample probability over X_train, then fed the resampled dataset UNWEIGHTED into the underlying estimator. Effective sample size shrank, rare donors were dropped entirely, and variance was inflated relative to the correct weighted estimator. The weight validation (weights <= 0).any() also silently accepted NaN values.
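The NaN-acceptance half of the bug follows from NumPy comparison semantics: NaN compares False against everything, so a non-positivity check alone can never trigger on a missing weight. A standalone sketch (not the package's code) of both the old and the fixed guard:

```python
import numpy as np

# A weight vector containing a NaN that should be rejected.
weights = np.array([1.0, 2.0, np.nan])

# Old guard: NaN <= 0 evaluates to False, so NaN slips through.
old_guard_triggers = (weights <= 0).any()

# Fixed guard: explicitly test for NaN alongside non-positive values.
new_guard_triggers = (np.isnan(weights) | (weights <= 0)).any()

print(old_guard_triggers)  # False: the old check misses the NaN
print(new_guard_triggers)  # True: the combined check catches it
```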

Changes

  • Imputer.fit now threads sample_weight through to each model's _fit() so learners use their native weighted-fit API:
    • QRF -> RandomForestQuantileRegressor.fit(sample_weight=...) and RandomForestClassifier.fit(sample_weight=...)
    • OLS -> sm.WLS replaces sm.OLS when weights are provided; LogisticRegression.fit(sample_weight=...) for classification
    • Matching -> donor_sample_weight flows into R StatMatch via weight.don
  • QuantReg and MDN (which don't support weighted fit) now raise NotImplementedError rather than silently discarding weights.
  • Weight validation now uses np.isnan(weights) | (weights <= 0) to catch both NaN and non-positive values.
  • QRF.max_train_samples subsampling switches to positional indexing so sample_weight stays aligned.
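The raise-instead-of-discard behaviour for unsupported learners can be sketched as follows; the class name and method body are illustrative stand-ins, not the package's actual internals:

```python
import numpy as np

class QuantRegLike:
    """Stand-in for a learner without a native weighted-fit API."""

    def _fit(self, X, y, sample_weight=None):
        if sample_weight is not None:
            # Fail loudly rather than silently fitting unweighted.
            raise NotImplementedError(
                f"{type(self).__name__} does not support weighted fitting; "
                "drop weight_col or choose a model with a weighted-fit API."
            )
        # ... unweighted fit would happen here ...
        return self

try:
    QuantRegLike()._fit(np.zeros((3, 1)), np.zeros(3), sample_weight=np.ones(3))
except NotImplementedError as exc:
    print(exc)
```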

Test plan

  • test_imputer_rejects_nan_weights passes
  • test_imputer_rejects_zero_weights passes
  • test_weighted_fit_differs_from_unweighted confirms WLS differs from OLS on asymmetric data
  • Parametrized test_weighted_training passes (OLS/QRF fit weighted; QuantReg/MDN raise)
  • Full tests/test_models/ suite (100 passed, 2 skipped)
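The idea behind the asymmetric-data check can be reproduced with plain NumPy closed forms (a standalone illustration of the test's premise, not the test itself): two blocks with different slopes, one heavily upweighted, so the weighted fit is pulled toward the upweighted block and must differ from the unweighted fit.

```python
import numpy as np

# Two blocks with different slopes; heavily weight the slope-2 block.
x = np.linspace(0, 1, 100)
y = np.where(x < 0.5, 2.0 * x, 5.0 * x)
w = np.where(x < 0.5, 50.0, 1.0)

X = np.column_stack([np.ones_like(x), x])

# Unweighted least squares.
ols_beta = np.linalg.lstsq(X, y, rcond=None)[0]

# Weighted least squares via sqrt-weight rescaling (equivalent to WLS).
sw = np.sqrt(w)
wls_beta = np.linalg.lstsq(X * sw[:, None], y * sw, rcond=None)[0]

# The weighted slope sits much nearer 2 than the unweighted slope.
print(ols_beta[1], wls_beta[1])
```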

Previously, weight_col was used only to construct bootstrap-resample
probabilities over X_train, and the resampled dataset was then fed
UNWEIGHTED into the underlying estimator (QRF, OLS, statmatch, MDN).
Effective sample size shrank, rare donors were dropped entirely, and
variance was inflated relative to the correct weighted estimator. Worse,
the "positive weight" guard used (weights <= 0).any() which returns False
on NaN, allowing NaN weights to propagate into .sample() probabilities.

Changes
- Imputer.fit now threads sample_weight through to _fit() so each learner
  uses its native weighted-fit API:
  - QRF: sample_weight=... to RandomForestQuantileRegressor.fit and
    RandomForestClassifier.fit (for categorical/boolean targets).
  - OLS: sm.WLS replaces sm.OLS when weights are provided;
    LogisticRegression.fit gets sample_weight for classification targets.
  - Matching: donor_sample_weight is passed as weight.don into R
    StatMatch.NND.hotdeck.
- Models that do NOT support weighted fit (QuantReg, MDN) now raise
  NotImplementedError when weights are provided, instead of silently
  ignoring them.
- The weight-validation check now detects NaN AND non-positive weights
  with np.isnan(weights) | (weights <= 0) and raises a clear error.
- QRF max_train_samples subsampling is switched to positional indexing so
  sample_weight stays aligned with X_train after reset_index.

Tests
- test_imputer_rejects_nan_weights / test_imputer_rejects_zero_weights
- test_weighted_fit_differs_from_unweighted: WLS-fit OLS on an asymmetric
  dataset (block of weight 50 vs weight 1 with different slopes) must
  differ from unweighted OLS.
- Updates test_weighted_training so QuantReg/MDN assert the expected
  NotImplementedError.
- Fixes an unrelated pre-existing test_reproducibility parametrize that
  referenced Matching unconditionally (fails when rpy2 isn't installed).

Contributor Author

@MaxGhenis MaxGhenis left a comment

Weights threaded to learners verified on all five paths:

  • QRF: RandomForestQuantileRegressor.fit(sample_weight=...) and RandomForestClassifier.fit(sample_weight=...).
  • OLS → WLS: sm.WLS(y, X_with_const, weights=weights).fit() when sample_weight supplied.
  • Logistic: LogisticRegression.fit(..., sample_weight=...).
  • StatMatch: donor_sample_weight → R weight_don via localconverter (statmatch_hotdeck.py).
  • QuantReg and MDN raise NotImplementedError — explicit failure beats silent discard.

Weight validation fixed: np.isnan(weights) | (weights <= 0) (previously only <=0 which missed NaN). weight_col now accepts pd.Series. max_train_samples subsampling switched to positional rng.choice so sample_weight[sel] stays aligned.
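The alignment point is easy to see in isolation: with a non-default DataFrame index, positional selection keeps rows and weights paired, where label-based selection would not. A minimal sketch with illustrative data:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(42)

# Non-default index, as after filtering; weights are positionally paired
# with rows (row i has x = 10 * weight_i).
X_train = pd.DataFrame({"x": [10.0, 20.0, 30.0, 40.0]}, index=[7, 3, 9, 1])
sample_weight = np.array([1.0, 2.0, 3.0, 4.0])

sel = rng.choice(len(X_train), size=2, replace=False)  # positional indices

X_sub = X_train.iloc[sel]   # .iloc (position), not .loc (label)
w_sub = sample_weight[sel]  # same positions -> rows and weights stay paired
```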

Tests cover each path: test_imputer_rejects_nan_weights, test_imputer_rejects_zero_weights, test_weighted_fit_differs_from_unweighted (WLS vs OLS on asymmetric weights/slopes), test_weighted_training parametrised with expected NotImplementedError for QuantReg/MDN. Pre-existing test_reproducibility import guard added for missing rpy2.

Minor nit (not blocking): weights_arr is defined inside the if weights is not None: block above and re-referenced in a second conditional block — both guarded by the same condition, so it's correct, just slightly noisy.

CI all green. Mergeable. LGTM.

@MaxGhenis MaxGhenis merged commit c757d98 into main Apr 17, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the fix/imputer-weights branch April 17, 2026 16:11