Fix float 0/1 misclassification and warn on unseen categoricals #182
Merged
Two bugs in microimpute/utils/type_handling.py:
1. is_boolean_variable returned True for any float column whose values
happened to be a subset of {0.0, 1.0} (#9). A probability column, a
rescaled indicator, or a small-sample feature that contained only
0.0 and 1.0 would be silently routed to the classifier path in
QRF/OLS/MDN — flipping the model type and destroying regression
behaviour. Now only genuine bool dtypes and integer columns with
values in {0, 1} are recognised as boolean.
2. apply_dummy_encoding_to_test silently mapped unseen test-time
categories to the (dropped) reference level via all-zero dummies
(#10). A receiver survey with a demographic code not in the donor
survey would have those rows silently assigned the reference-level
prediction with no notification. Now:
- The training level set for each categorical predictor is tracked
explicitly (in DummyVariableProcessor.predictor_training_levels,
populated during preprocess_predictors) so the reference level is
distinguishable from genuinely unseen levels.
- A UserWarning is emitted listing the unseen levels so callers can
retrain on a superset or filter those rows before predicting.
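A minimal sketch of the unseen-level check described in item 2, assuming pandas. The function `encode_test_column` and its parameters are hypothetical stand-ins for `apply_dummy_encoding_to_test` and the per-predictor entries of `predictor_training_levels`; the merged code in type_handling.py may be organised differently.

```python
import warnings

import pandas as pd


def encode_test_column(
    test_col: pd.Series, training_levels: list, reference_level
) -> pd.DataFrame:
    """Hypothetical sketch: dummy-encode one test column and warn on unseen levels.

    `training_levels` is assumed to hold every level seen in training,
    including the dropped reference level.
    """
    # Levels present at test time but never seen in training.
    unseen = sorted(
        set(test_col.dropna().unique()) - set(training_levels), key=str
    )
    if unseen:
        warnings.warn(
            f"Column {test_col.name!r} has levels not seen in training: {unseen}",
            UserWarning,
        )

    # Align to the training dummy columns (reference level dropped).
    # Unseen levels still become all-zero rows, but the caller has been warned.
    expected = [
        f"{test_col.name}_{lvl}"
        for lvl in training_levels
        if lvl != reference_level
    ]
    dummies = pd.get_dummies(test_col, prefix=test_col.name)
    return dummies.reindex(columns=expected, fill_value=0).astype(int)


# Usage: "z" was not in the training data, so a UserWarning is emitted.
encoded = encode_test_column(
    pd.Series(["a", "b", "z"], name="region"),
    training_levels=["a", "b", "c"],
    reference_level="a",
)
```

Because the reference level is part of `training_levels`, rows holding it are encoded as all zeros without triggering the warning, which is the distinction the fix relies on.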
Tests
- test_is_boolean_variable_false_for_float_0_1: float {0.0, 1.0} is
not boolean.
- test_is_boolean_variable_false_for_float_probability: float
[0, .25, .5, .75, 1] is not boolean.
- test_is_boolean_variable_true_for_bool_dtype / _true_for_int_0_1:
genuine booleans still detected.
- test_unseen_category_warns_at_test_time: unseen test-time category
emits UserWarning.
- test_reference_level_does_not_trigger_warning: reference-level
rows don't emit false-positive warnings.
- test_all_known_levels_do_not_warn: fully in-sample test data is
silent.
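The dtype rule exercised by the first three tests could look roughly like the following. This is a sketch under assumptions, not the library's code: `is_boolean_like` is a hypothetical name standing in for `is_boolean_variable`, and it assumes pandas Series input.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype


def is_boolean_like(series: pd.Series) -> bool:
    """Hypothetical stand-in for is_boolean_variable.

    Only bool dtypes and integer columns restricted to {0, 1} count as
    boolean; float columns never do, even if they contain only 0.0 and 1.0.
    """
    if is_bool_dtype(series):
        return True
    if is_integer_dtype(series):
        # Note: an empty integer column passes this subset check.
        return set(series.dropna().unique()) <= {0, 1}
    return False


# Usage: mirrors the cases listed in the tests above.
flags = pd.Series([True, False, True])
indicator = pd.Series([0, 1, 1, 0])
float_indicator = pd.Series([0.0, 1.0, 1.0])
probability = pd.Series([0, 0.25, 0.5, 0.75, 1])

results = {
    "bool": is_boolean_like(flags),
    "int01": is_boolean_like(indicator),
    "float01": is_boolean_like(float_indicator),
    "probability": is_boolean_like(probability),
}
```

Checking the dtype before the values is the design choice that matters here: a value-only check cannot tell a float indicator from a small-sample continuous feature.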
MaxGhenis (Contributor, Author) commented Apr 17, 2026
MaxGhenis left a comment
Type handling fixes verified:
- Float {0.0, 1.0} → regressor: is_boolean_variable now returns True only for genuine bool dtype or int {0, 1}. A float column with values {0.0, 1.0} or probabilities returns False. Tests: test_is_boolean_variable_false_for_float_0_1 and test_is_boolean_variable_false_for_float_probability.
- Unseen categorical → UserWarning: DummyVariableProcessor now records the full training level set (including the dropped reference) in predictor_training_levels, so apply_dummy_encoding_to_test can emit a UserWarning only for genuinely new test-time levels, without false positives on reference-level rows. Tests cover three cases: an unseen category warns, the reference level does not warn, and fully known data is silent.
The test fixtures deliberately use non-equally-spaced numeric values for x to avoid auto-detection as numeric_categorical — a pragmatic workaround that's noted in test comments.
CI all green. Mergeable. LGTM.
Summary
Two variable-type fixes (findings #9 and #10):
1. Float {0.0, 1.0} misclassified as boolean. is_boolean_variable returned True for any float column whose unique values happened to be a subset of {0.0, 1.0} (probabilities, rescaled indicators, small-sample features). Such columns were silently routed to the classifier path in QRF/OLS/MDN, flipping the model type. Now only genuine bool dtypes and integer columns with values in {0, 1} are recognised as boolean.
2. apply_dummy_encoding_to_test mapped unseen categorical levels to all-zero dummies (indistinguishable from the dropped reference level) without any notification, causing silent imputation bias when receiver and donor surveys used different demographic codes. Now the full training level set is tracked during training so we can tell genuinely unseen levels apart from the reference, and a UserWarning lists the unknown levels so callers can retrain or filter.
Test plan
New tests/test_type_handling.py:
- Genuine booleans (bool, {0, 1} int) still detected
- Float {0.0, 1.0} and probability columns NOT detected as boolean
- Unseen test-time categories emit UserWarning
- tests/test_models/test_qrf.py, test_ols.py, test_quantreg.py, test_data_preprocessing.py pass (86 total)