Fix float 0/1 misclassification and warn on unseen categoricals by MaxGhenis · Pull Request #182 · PolicyEngine/microimpute

MaxGhenis · 2026-04-17T12:51:16Z

Summary

Two variable-type fixes (findings #9 and #10):

#9: Float {0.0, 1.0} misclassified as boolean. is_boolean_variable returned True for any float column whose unique values happened to be a subset of {0.0, 1.0} (probabilities, rescaled indicators, small-sample features). Such columns were silently routed to the classifier path in QRF/OLS/MDN, flipping the model type. Now only genuine bool dtypes and integer columns with values in {0, 1} are recognised as boolean.
#10: Unseen test-time categoricals silently collapsed to reference. apply_dummy_encoding_to_test mapped unseen categorical levels to all-zero dummies (indistinguishable from the dropped reference level) without any notification, causing silent imputation bias when receiver and donor surveys used different demographic codes. Now the full training level set is tracked during training so we can tell genuinely unseen levels apart from the reference, and a UserWarning lists the unknown levels so callers can retrain or filter.
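The stricter boolean rule from the first fix can be sketched roughly as follows. This is an illustration only, not the code in microimpute/utils/type_handling.py: the function name looks_boolean is hypothetical, and the check assumes pandas dtype helpers are an acceptable stand-in for whatever the library does internally.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype


def looks_boolean(series: pd.Series) -> bool:
    """Hypothetical stand-in for the stricter rule described in the PR.

    Only bool dtypes and integer columns restricted to {0, 1} count as
    boolean. Float columns never do, even if they contain only 0.0 and 1.0.
    """
    if is_bool_dtype(series):
        return True
    if is_integer_dtype(series):
        # Ignore missing values, then require every value to be 0 or 1.
        return bool(series.dropna().isin([0, 1]).all())
    # Floats, strings, and everything else fall through to "not boolean".
    return False


bool_col = pd.Series([True, False, True])
int_col = pd.Series([0, 1, 1, 0])
float_col = pd.Series([0.0, 1.0, 1.0])
prob_col = pd.Series([0.0, 0.25, 0.5, 0.75, 1.0])
```

Under this rule bool_col and int_col are treated as boolean, while float_col and prob_col stay on the regression path.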

Test plan

New tests/test_type_handling.py:

Genuine booleans (dtype bool, {0, 1} int) still detected
Float {0.0, 1.0} and probability columns NOT detected as boolean
Unseen test-time category -> UserWarning
Reference-level rows do NOT trigger false-positive warning
Fully known test data -> no warning
All existing tests/test_models/test_qrf.py, test_ols.py, test_quantreg.py, test_data_preprocessing.py pass (86 total)

Two bugs in microimpute/utils/type_handling.py:

1. is_boolean_variable returned True for any float column whose values happened to be a subset of {0.0, 1.0} (#9). A probability column, a rescaled indicator, or a small-sample feature that contained only 0.0 and 1.0 would be silently routed to the classifier path in QRF/OLS/MDN, flipping the model type and destroying regression behaviour. Now only genuine bool dtypes and integer columns with values in {0, 1} are recognised as boolean.

2. apply_dummy_encoding_to_test silently mapped unseen test-time categories to the (dropped) reference level via all-zero dummies (#10). A receiver survey with a demographic code not in the donor survey would have those rows silently assigned the reference-level prediction with no notification. Now:
   - The training level set for each categorical predictor is tracked explicitly (in DummyVariableProcessor.predictor_training_levels, populated during preprocess_predictors) so the reference level is distinguishable from genuinely unseen levels.
   - A UserWarning is emitted listing the unseen levels so callers can retrain on a superset or filter those rows before predicting.

Tests
- test_is_boolean_variable_false_for_float_0_1: float {0.0, 1.0} is not boolean.
- test_is_boolean_variable_false_for_float_probability: float [0, .25, .5, .75, 1] is not boolean.
- test_is_boolean_variable_true_for_bool_dtype / _true_for_int_0_1: genuine booleans still detected.
- test_unseen_category_warns_at_test_time: unseen test-time category emits UserWarning.
- test_reference_level_does_not_trigger_warning: reference-level rows don't emit false-positive warnings.
- test_all_known_levels_do_not_warn: fully in-sample test data is silent.
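The level-tracking idea behind the second fix can be sketched in plain Python. Everything here is an assumption for illustration: LevelTracker, its attributes, and the choice of the first sorted level as the reference are hypothetical and do not reproduce DummyVariableProcessor or its actual warning text.

```python
import warnings


class LevelTracker:
    """Hypothetical sketch: remember every training level, including the
    dropped reference, so unseen test levels can be reported."""

    def __init__(self):
        self.training_levels = {}  # column -> full set of training levels
        self.dummy_levels = {}     # column -> levels that get a dummy column

    def fit(self, column, values):
        levels = sorted(set(values))
        self.training_levels[column] = set(levels)
        # Assumption: the first sorted level is the dropped reference.
        self.dummy_levels[column] = levels[1:]

    def transform(self, column, values):
        unseen = sorted(set(values) - self.training_levels[column])
        if unseen:
            warnings.warn(
                f"Column {column!r} has levels not seen in training: {unseen}",
                UserWarning,
                stacklevel=2,
            )
        # Unseen levels still encode as all zeros; the warning is what
        # distinguishes them from the reference level.
        return [
            [int(value == level) for level in self.dummy_levels[column]]
            for value in values
        ]


tracker = LevelTracker()
tracker.fit("region", ["north", "south", "west"])

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    known_rows = tracker.transform("region", ["north", "south"])
    warnings_after_known = len(caught)
    unseen_rows = tracker.transform("region", ["east"])
    warnings_after_unseen = len(caught)
```

Here the reference level "north" encodes to all zeros without a warning, while "east" produces the same encoding plus a UserWarning naming it.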

MaxGhenis

Type handling fixes verified:

Float {0.0, 1.0} → regressor: is_boolean_variable now returns True only for genuine bool dtype or int {0,1}. Float column with values {0.0, 1.0} or probabilities returns False. Tests test_is_boolean_variable_false_for_float_0_1 and test_is_boolean_variable_false_for_float_probability.
Unseen categorical → UserWarning: DummyVariableProcessor now records the full training level set (including the dropped reference) in predictor_training_levels, so apply_dummy_encoding_to_test can emit a UserWarning only for genuinely-new test-time levels without false-positives on reference-level rows. Tests cover unseen-category warns, reference-level doesn't warn, fully-known data is silent.
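To see why the reference level has to be recorded explicitly, consider a minimal pandas reproduction of the collapse. This is a sketch under assumptions: it uses pd.get_dummies with drop_first=True plus a reindex to mimic aligning test columns to training columns, which may differ from how apply_dummy_encoding_to_test is actually implemented. The region values are made up for the example.

```python
import pandas as pd

train = pd.Series(["north", "south", "west"], name="region")
test = pd.Series(["north", "east"], name="region")  # "east" is unseen

# Training encoding drops the first sorted level ("north") as the reference.
train_dummies = pd.get_dummies(train, prefix="region", drop_first=True)

# Align the test encoding to the training columns, filling gaps with 0.
encoded = pd.get_dummies(test, prefix="region").reindex(
    columns=train_dummies.columns, fill_value=0
)

# Both rows are all zeros: the reference row and the unseen row look alike.
rows_identical = bool((encoded.iloc[0] == encoded.iloc[1]).all())

# Comparing against the full training level set is what separates them.
unseen_levels = sorted(set(test) - set(train))
```

The dummy columns alone cannot tell "north" from "east"; only the stored training levels can.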

The test fixtures deliberately use non-equally-spaced numeric values for x to avoid auto-detection as numeric_categorical — a pragmatic workaround that's noted in test comments.

CI all green. Mergeable. LGTM.

vercel bot deployed to Preview April 17, 2026 12:52

MaxGhenis merged commit 16c19a6 into main Apr 17, 2026
7 checks passed

MaxGhenis deleted the fix/type-handling branch April 17, 2026 16:11
