Skip to content

Fix OLS prediction SE and logistic l1_ratio silent ignore#180

Merged
MaxGhenis merged 2 commits intomainfrom
fix/ols-prediction-se-and-logistic-penalty
Apr 17, 2026
Merged

Fix OLS prediction SE and logistic l1_ratio silent ignore#180
MaxGhenis merged 2 commits intomainfrom
fix/ols-prediction-se-and-logistic-penalty

Conversation

@MaxGhenis
Copy link
Copy Markdown
Contributor

Summary

Two OLS correctness fixes (findings #6 and #8):

  • Fix inconsistent imports and improve module structure #6 Prediction SE used residual std only. _OLSModel.predict scaled normal quantiles by se = sqrt(model.scale) (residual std), ignoring the leverage term. Switched to statsmodels' get_prediction(X).var_pred_mean + model.scale so the per-row prediction SE includes leverage. Also clipped mean_quantile to (1e-6, 1-1e-6) so q=0 / q=1 no longer return inf via norm.ppf.
  • Create central configuration system #8 Logistic l1_ratio silently ignored. sklearn's LogisticRegression uses l1_ratio only when penalty="elasticnet". Previously the classifier passed l1_ratio with the default penalty="l2", and sklearn emitted "l1_ratio parameter is only used when penalty is 'elasticnet'". Now when a non-zero l1_ratio is supplied, penalty="elasticnet" and solver="saga" are set automatically.

Test plan

  • test_ols_quantile_uses_full_prediction_se - interval width grows with leverage
  • test_ols_quantile_clips_q_away_from_zero_and_one - q=0/q=1 produce finite predictions
  • test_logistic_l1_ratio_activates_elasticnet - l1_ratio=0.5 sets penalty="elasticnet" and solver="saga"
  • Full tests/test_models/test_ols.py passes (11/11)
  • tests/test_models/test_qrf.py, test_quantreg.py pass

Two correctness bugs in microimpute/models/ols.py:

1. Prediction SE used residual std only (#6). _OLSModel scaled normal
   quantiles by se = sqrt(model.scale), ignoring the leverage term. The
   prediction SE for a new observation is
   sqrt(scale * (1 + x'(X'X)^-1 x)) = sqrt(var_pred_mean + scale). Test
   rows far from the training centroid were systematically
   under-dispersed; at extreme quantiles (0.01, 0.99) the under-
   dispersion is material. Switched to statsmodels'
   model.get_prediction(X).var_pred_mean + model.scale to obtain
   per-row prediction variance.

   Additionally clipped the mean_quantile to (1e-6, 1-1e-6) so q=0 or
   q=1 no longer produce ±inf via norm.ppf or a=inf in the beta
   distribution alpha formula.

2. Logistic l1_ratio silently ignored (#8). LogisticRegression uses
   l1_ratio only when penalty="elasticnet" (and solver supports it).
   Previously the classifier passed l1_ratio through with the default
   penalty="l2", and sklearn warned "l1_ratio parameter is only used
   when penalty is 'elasticnet'". Callers tuning l1_ratio saw no
   change. Now: when a non-zero l1_ratio is supplied, penalty is set
   to "elasticnet" and solver defaults to "saga".

Tests
- test_ols_quantile_uses_full_prediction_se: widths of prediction
  intervals must grow with leverage (centroid vs extrapolation).
- test_ols_quantile_clips_q_away_from_zero_and_one: q=0 and q=1
  return finite predictions.
- test_logistic_l1_ratio_activates_elasticnet: l1_ratio=0.5 must
  result in penalty="elasticnet" and solver="saga".
@vercel
Copy link
Copy Markdown

vercel bot commented Apr 17, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

Project Deployment Actions Updated (UTC)
microimpute-dashboard Ready Ready Preview, Comment Apr 17, 2026 1:09pm

Followup to the prior commit. _predict_quantile returned a bare
ndarray for numeric targets. When OLSResults._predict assembled the
output DataFrame, assigning the first (numeric) column as an ndarray
caused pandas to anchor the DataFrame to a default RangeIndex (0..N).
A subsequent categorical prediction — a pd.Series indexed by the real
X_test index — then failed index alignment and all rows came back
NaN. Downstream sklearn.log_loss then raised "Input contains NaN".

The symptom was specific to mixed-type imputation (numeric + binary
or categorical) with a non-zero-based test index. CI on 3.14 runs
the full suite and caught it via
tests/test_metrics.py::test_compare_metrics_mixed_types; 3.12 only
runs smoke tests so it was hidden there.

Fix: _predict_quantile now returns pd.Series(values,
index=mean_preds.index, name=mean_preds.name). Added
test_ols_mixed_targets_preserve_test_index as a regression.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copy link
Copy Markdown
Contributor Author

@MaxGhenis MaxGhenis left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

All four fixes verified (including follow-up commit 83c4a60):

  • SE math: model.get_prediction(X).var_pred_mean + model.scale ≡ σ²·x'(X'X)⁻¹x + σ² = correct new-observation prediction variance (leverage + residual). Previously only sqrt(model.scale) → under-dispersed at extreme quantiles and high leverage. test_ols_quantile_uses_full_prediction_se tests a row at centroid vs x=10 and asserts the extrapolated prediction interval is wider than the centroid's (impossible with the old residual-std-only formulation).

  • q=0/1 clipping: q_clipped = np.clip(mean_quantile, 1e-6, 1 - 1e-6) → finite norm.ppf. test_ols_quantile_clips_q_away_from_zero_and_one asserts finite output at q=0 and q=1.

  • l1_ratio: When supplied non-zero, now sets penalty='elasticnet', solver='saga'. test_logistic_l1_ratio_activates_elasticnet asserts classifier.penalty == 'elasticnet' and classifier.solver == 'saga' after fit with l1_ratio=0.5.

  • Follow-up 83c4a60: _predict_quantile returns pd.Series(values, index=mean_preds.index, name=mean_preds.name) (per the follow-up note). test_ols_mixed_targets_preserve_test_index is present — splits so test indices start at 160, asserts no NaN in either numeric or binary target, asserts list(out.index) == list(test_data.index). Confirmed the follow-up fixes the pandas 3.0 index-alignment bug that was masquerading as a 3.14-specific issue.

CI all green on 3.12 AND 3.14. Mergeable. LGTM.

@MaxGhenis MaxGhenis merged commit fcd4b8a into main Apr 17, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the fix/ols-prediction-se-and-logistic-penalty branch April 17, 2026 16:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant