Fix OLS prediction SE and logistic l1_ratio silent ignore #180
Conversation
Two correctness bugs in `microimpute/models/ols.py`:

1. **Prediction SE used residual std only (#6).** `_OLSModel` scaled normal quantiles by `se = sqrt(model.scale)`, ignoring the leverage term. The prediction SE for a new observation is `sqrt(scale * (1 + x'(X'X)^-1 x))` = `sqrt(var_pred_mean + scale)`. Test rows far from the training centroid were systematically under-dispersed; at extreme quantiles (0.01, 0.99) the under-dispersion is material. Switched to statsmodels' `model.get_prediction(X).var_pred_mean + model.scale` to obtain per-row prediction variance. Additionally clipped `mean_quantile` to `(1e-6, 1 - 1e-6)` so q=0 or q=1 no longer produce ±inf via `norm.ppf` or `a=inf` in the beta distribution alpha formula.

2. **Logistic l1_ratio silently ignored (#8).** `LogisticRegression` uses `l1_ratio` only when `penalty="elasticnet"` (and the solver supports it). Previously the classifier passed `l1_ratio` through with the default `penalty="l2"`, and sklearn warned "l1_ratio parameter is only used when penalty is 'elasticnet'". Callers tuning `l1_ratio` saw no change. Now, when a non-zero `l1_ratio` is supplied, `penalty` is set to `"elasticnet"` and `solver` defaults to `"saga"`.

Tests:

- `test_ols_quantile_uses_full_prediction_se`: widths of prediction intervals must grow with leverage (centroid vs. extrapolation).
- `test_ols_quantile_clips_q_away_from_zero_and_one`: q=0 and q=1 return finite predictions.
- `test_logistic_l1_ratio_activates_elasticnet`: l1_ratio=0.5 must result in `penalty="elasticnet"` and `solver="saga"`.
Follow-up to the prior commit. `_predict_quantile` returned a bare ndarray for numeric targets. When `OLSResults._predict` assembled the output DataFrame, assigning the first (numeric) column as an ndarray caused pandas to anchor the DataFrame to a default RangeIndex (0..N). A subsequent categorical prediction, a `pd.Series` indexed by the real `X_test` index, then failed index alignment and all rows came back NaN. Downstream `sklearn` `log_loss` then raised "Input contains NaN".

The symptom was specific to mixed-type imputation (numeric + binary or categorical) with a non-zero-based test index. CI on 3.14 runs the full suite and caught it via `tests/test_metrics.py::test_compare_metrics_mixed_types`; 3.12 only runs smoke tests, so it was hidden there.

Fix: `_predict_quantile` now returns `pd.Series(values, index=mean_preds.index, name=mean_preds.name)`. Added `test_ols_mixed_targets_preserve_test_index` as a regression test.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
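The alignment failure described above can be reproduced standalone. This sketch uses hypothetical column names, not the repository's `_predict` code; it shows why the first ndarray assignment poisons the frame's index:

```python
import numpy as np
import pandas as pd

# Non-zero-based test index, as in the failing case.
idx = pd.Index([160, 161, 162])

# Buggy assembly: the first (numeric) column is a bare ndarray, so the
# empty frame adopts a default RangeIndex(0..N-1).
buggy = pd.DataFrame()
buggy["numeric"] = np.array([1.0, 2.0, 3.0])
# The categorical prediction carries the real test index; alignment
# against the RangeIndex matches no labels, so every row becomes NaN.
buggy["binary"] = pd.Series([0, 1, 0], index=idx)

# Fixed assembly: the numeric column is itself a Series carrying the
# test index, so subsequent columns align label-for-label.
fixed = pd.DataFrame()
fixed["numeric"] = pd.Series([1.0, 2.0, 3.0], index=idx)
fixed["binary"] = pd.Series([0, 1, 0], index=idx)
```

In the buggy frame the binary column is all-NaN; in the fixed frame both columns line up on the real test index.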
MaxGhenis
left a comment
All four fixes verified (including follow-up commit 83c4a60):

- **SE math:** `model.get_prediction(X).var_pred_mean + model.scale` ≡ σ²·x'(X'X)⁻¹x + σ², the correct new-observation prediction variance (leverage + residual). Previously only `sqrt(model.scale)`, hence under-dispersed at extreme quantiles and high leverage. `test_ols_quantile_uses_full_prediction_se` tests a row at the centroid vs. x=10 and asserts the extrapolated prediction interval is wider than the centroid's (impossible with the old residual-std-only formulation).
- **q=0/1 clipping:** `q_clipped = np.clip(mean_quantile, 1e-6, 1 - 1e-6)` yields a finite `norm.ppf`. `test_ols_quantile_clips_q_away_from_zero_and_one` asserts finite output at q=0 and q=1.
- **l1_ratio:** When supplied non-zero, now sets `penalty='elasticnet'`, `solver='saga'`. `test_logistic_l1_ratio_activates_elasticnet` asserts `classifier.penalty == 'elasticnet'` and `classifier.solver == 'saga'` after fit with `l1_ratio=0.5`.
- **Follow-up 83c4a60:** `_predict_quantile` returns `pd.Series(values, index=mean_preds.index, name=mean_preds.name)` (per the follow-up note). `test_ols_mixed_targets_preserve_test_index` is present: it splits so test indices start at 160, asserts no NaN in either numeric or binary target, and asserts `list(out.index) == list(test_data.index)`. Confirmed the follow-up fixes the pandas 3.0 index-alignment bug that was masquerading as a 3.14-specific issue.

CI all green on 3.12 AND 3.14. Mergeable. LGTM.
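The l1_ratio activation the review checks can be sketched as follows. This is an illustrative reconstruction, not the repository's classifier code; the point is that sklearn's `LogisticRegression` only honors `l1_ratio` under `penalty="elasticnet"` with a compatible solver such as `"saga"`:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

X, y = make_classification(n_samples=200, random_state=0)

l1_ratio = 0.5  # caller-supplied elastic-net mixing parameter

# Only wire up l1_ratio when it is meaningful: with the default
# penalty="l2", sklearn warns and ignores l1_ratio entirely.
kwargs = {}
if l1_ratio:
    kwargs = {"penalty": "elasticnet", "solver": "saga", "l1_ratio": l1_ratio}

clf = LogisticRegression(max_iter=5000, **kwargs).fit(X, y)
```

After fitting, `clf.penalty == "elasticnet"` and `clf.solver == "saga"`, which is exactly what `test_logistic_l1_ratio_activates_elasticnet` asserts.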
Summary
Two OLS correctness fixes (findings #6 and #8):

- `_OLSModel.predict` scaled normal quantiles by `se = sqrt(model.scale)` (residual std), ignoring the leverage term. Switched to statsmodels' `get_prediction(X).var_pred_mean + model.scale` so the per-row prediction SE includes leverage. Also clipped `mean_quantile` to `(1e-6, 1 - 1e-6)` so `q=0`/`q=1` no longer return `inf` via `norm.ppf`.
- `l1_ratio` silently ignored. sklearn's `LogisticRegression` uses `l1_ratio` only when `penalty="elasticnet"`. Previously the classifier passed `l1_ratio` with the default `penalty="l2"`, and sklearn emitted "l1_ratio parameter is only used when penalty is 'elasticnet'". Now when a non-zero `l1_ratio` is supplied, `penalty="elasticnet"` and `solver="saga"` are set automatically.

Test plan

- `test_ols_quantile_uses_full_prediction_se`: interval width grows with leverage
- `test_ols_quantile_clips_q_away_from_zero_and_one`: q=0/q=1 produce finite predictions
- `test_logistic_l1_ratio_activates_elasticnet`: l1_ratio=0.5 sets `penalty="elasticnet"` and `solver="saga"`
- `tests/test_models/test_ols.py` passes (11/11)
- `tests/test_models/test_qrf.py`, `test_quantreg.py` pass
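The q=0/q=1 clipping behavior in the test plan can be checked in isolation (a standalone sketch, not the repository's test): without clipping, `norm.ppf(0.0)` is -inf and `norm.ppf(1.0)` is +inf, which then propagates into the predictions.

```python
import numpy as np
from scipy.stats import norm

eps = 1e-6
results = {}
for q in (0.0, 0.5, 1.0):
    q_clipped = np.clip(q, eps, 1 - eps)
    results[q] = norm.ppf(q_clipped)  # finite even at the endpoints
```

Unclipped, the endpoints would map to ±inf; clipped, they map to large but finite z-scores, so downstream quantile predictions stay finite.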