Fix OLS prediction SE and logistic l1_ratio silent ignore #180
Conversation
Two correctness bugs in `microimpute/models/ols.py`:

1. **Prediction SE used residual std only (#6).** `_OLSModel` scaled normal quantiles by `se = sqrt(model.scale)`, ignoring the leverage term. The prediction SE for a new observation is `sqrt(scale * (1 + x'(X'X)^-1 x))` = `sqrt(var_pred_mean + scale)`. Test rows far from the training centroid were systematically under-dispersed; at extreme quantiles (0.01, 0.99) the under-dispersion is material. Switched to statsmodels' `model.get_prediction(X).var_pred_mean + model.scale` to obtain per-row prediction variance. Additionally clipped `mean_quantile` to `(1e-6, 1 - 1e-6)` so q=0 or q=1 no longer produce ±inf via `norm.ppf` or `a=inf` in the beta distribution alpha formula.

2. **Logistic l1_ratio silently ignored (#8).** `LogisticRegression` uses `l1_ratio` only when `penalty="elasticnet"` (and the solver supports it). Previously the classifier passed `l1_ratio` through with the default `penalty="l2"`, and sklearn warned "l1_ratio parameter is only used when penalty is 'elasticnet'". Callers tuning `l1_ratio` saw no change. Now, when a non-zero `l1_ratio` is supplied, `penalty` is set to `"elasticnet"` and `solver` defaults to `"saga"`.

Tests:

- `test_ols_quantile_uses_full_prediction_se`: widths of prediction intervals must grow with leverage (centroid vs. extrapolation).
- `test_ols_quantile_clips_q_away_from_zero_and_one`: q=0 and q=1 return finite predictions.
- `test_logistic_l1_ratio_activates_elasticnet`: l1_ratio=0.5 must result in `penalty="elasticnet"` and `solver="saga"`.
Follow-up to the prior commit. `_predict_quantile` returned a bare ndarray for numeric targets. When `OLSResults._predict` assembled the output DataFrame, assigning the first (numeric) column as an ndarray caused pandas to anchor the DataFrame to a default RangeIndex (0..N). A subsequent categorical prediction, a `pd.Series` indexed by the real `X_test` index, then failed index alignment and all rows came back NaN. Downstream `sklearn` `log_loss` then raised "Input contains NaN".

The symptom was specific to mixed-type imputation (numeric + binary or categorical) with a non-zero-based test index. CI on 3.14 runs the full suite and caught it via `tests/test_metrics.py::test_compare_metrics_mixed_types`; 3.12 only runs smoke tests, so it was hidden there.

Fix: `_predict_quantile` now returns `pd.Series(values, index=mean_preds.index, name=mean_preds.name)`. Added `test_ols_mixed_targets_preserve_test_index` as a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
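The alignment failure described above can be reproduced standalone. This sketch uses hypothetical column names, not the repository's `_predict` code; it shows why the first ndarray assignment poisons the frame's index:

```python
import numpy as np
import pandas as pd

# Non-zero-based test index, as in the failing case.
idx = pd.Index([160, 161, 162])

# Buggy assembly: the first (numeric) column is a bare ndarray, so the
# empty frame adopts a default RangeIndex(0..N-1).
buggy = pd.DataFrame()
buggy["numeric"] = np.array([1.0, 2.0, 3.0])
# The categorical prediction carries the real test index; alignment
# against the RangeIndex matches no labels, so every row becomes NaN.
buggy["binary"] = pd.Series([0, 1, 0], index=idx)

# Fixed assembly: the numeric column is itself a Series carrying the
# test index, so subsequent columns align label-for-label.
fixed = pd.DataFrame()
fixed["numeric"] = pd.Series([1.0, 2.0, 3.0], index=idx)
fixed["binary"] = pd.Series([0, 1, 0], index=idx)
```

In the buggy frame the binary column is all-NaN; in the fixed frame both columns line up on the real test index.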
MaxGhenis
left a comment
All four fixes verified (including follow-up commit 83c4a60):

- **SE math:** `model.get_prediction(X).var_pred_mean + model.scale` ≡ σ²·x'(X'X)⁻¹x + σ², the correct new-observation prediction variance (leverage + residual). Previously only `sqrt(model.scale)`, hence under-dispersed at extreme quantiles and high leverage. `test_ols_quantile_uses_full_prediction_se` tests a row at the centroid vs. x=10 and asserts the extrapolated prediction interval is wider than the centroid's (impossible with the old residual-std-only formulation).
- **q=0/1 clipping:** `q_clipped = np.clip(mean_quantile, 1e-6, 1 - 1e-6)` yields a finite `norm.ppf`. `test_ols_quantile_clips_q_away_from_zero_and_one` asserts finite output at q=0 and q=1.
- **l1_ratio:** When supplied non-zero, now sets `penalty='elasticnet'`, `solver='saga'`. `test_logistic_l1_ratio_activates_elasticnet` asserts `classifier.penalty == 'elasticnet'` and `classifier.solver == 'saga'` after fit with `l1_ratio=0.5`.
- **Follow-up 83c4a60:** `_predict_quantile` returns `pd.Series(values, index=mean_preds.index, name=mean_preds.name)` (per the follow-up note). `test_ols_mixed_targets_preserve_test_index` is present: it splits so test indices start at 160, asserts no NaN in either numeric or binary target, and asserts `list(out.index) == list(test_data.index)`. Confirmed the follow-up fixes the pandas 3.0 index-alignment bug that was masquerading as a 3.14-specific issue.

CI all green on 3.12 AND 3.14. Mergeable. LGTM.
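The l1_ratio activation the review checks can be sketched as follows. This is an illustrative reconstruction, not the repository's classifier code; the point is that sklearn's `LogisticRegression` only honors `l1_ratio` under `penalty="elasticnet"` with a compatible solver such as `"saga"`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

l1_ratio = 0.5  # caller-supplied elastic-net mixing parameter

# Only wire up l1_ratio when it is meaningful: with the default
# penalty="l2", sklearn warns and ignores l1_ratio entirely.
kwargs = {}
if l1_ratio:
    kwargs = {"penalty": "elasticnet", "solver": "saga", "l1_ratio": l1_ratio}

clf = LogisticRegression(max_iter=5000, **kwargs).fit(X, y)
```

After fitting, `clf.penalty == "elasticnet"` and `clf.solver == "saga"`, which is exactly what `test_logistic_l1_ratio_activates_elasticnet` asserts.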
Summary
Two OLS correctness fixes (findings #6 and #8):

- `_OLSModel.predict` scaled normal quantiles by `se = sqrt(model.scale)` (residual std), ignoring the leverage term. Switched to statsmodels' `get_prediction(X).var_pred_mean + model.scale` so the per-row prediction SE includes leverage. Also clipped `mean_quantile` to `(1e-6, 1 - 1e-6)` so `q=0`/`q=1` no longer return `inf` via `norm.ppf`.
- `l1_ratio` silently ignored. sklearn's `LogisticRegression` uses `l1_ratio` only when `penalty="elasticnet"`. Previously the classifier passed `l1_ratio` with the default `penalty="l2"`, and sklearn emitted "l1_ratio parameter is only used when penalty is 'elasticnet'". Now when a non-zero `l1_ratio` is supplied, `penalty="elasticnet"` and `solver="saga"` are set automatically.

Test plan

- `test_ols_quantile_uses_full_prediction_se`: interval width grows with leverage
- `test_ols_quantile_clips_q_away_from_zero_and_one`: q=0/q=1 produce finite predictions
- `test_logistic_l1_ratio_activates_elasticnet`: l1_ratio=0.5 sets `penalty="elasticnet"` and `solver="saga"`
- `tests/test_models/test_ols.py` passes (11/11)
- `tests/test_models/test_qrf.py`, `test_quantreg.py` pass
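The q=0/q=1 clipping behavior in the test plan can be checked in isolation (a standalone sketch, not the repository's test): without clipping, `norm.ppf(0.0)` is -inf and `norm.ppf(1.0)` is +inf, which then propagates into the predictions.

```python
import numpy as np
from scipy.stats import norm

eps = 1e-6
results = {}
for q in (0.0, 0.5, 1.0):
    q_clipped = np.clip(q, eps, 1 - eps)
    results[q] = norm.ppf(q_clipped)  # finite even at the endpoints
```

Unclipped, the endpoints would map to ±inf; clipped, they map to large but finite z-scores, so downstream quantile predictions stay finite.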