Fix float 0/1 misclassification and warn on unseen categoricals#182

Merged
MaxGhenis merged 1 commit into main from fix/type-handling on Apr 17, 2026

@MaxGhenis
Contributor

Summary

Two variable-type fixes (findings #9 and #10):

  • #9 Float {0.0, 1.0} misclassified as boolean. is_boolean_variable returned True for any float column whose unique values happened to be a subset of {0.0, 1.0} (probabilities, rescaled indicators, small-sample features). Such columns were silently routed to the classifier path in QRF/OLS/MDN, flipping the model type. Now only genuine bool dtypes and integer columns with values in {0, 1} are recognised as boolean.
  • #10 Unseen test-time categoricals silently collapsed to reference. apply_dummy_encoding_to_test mapped unseen categorical levels to all-zero dummies (indistinguishable from the dropped reference level) without any notification, causing silent imputation bias when receiver and donor surveys used different demographic codes. Now the full training level set is tracked during training so we can tell genuinely unseen levels apart from the reference, and a UserWarning lists the unknown levels so callers can retrain or filter.
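
The boolean rule in the first bullet can be sketched roughly as follows. This is an illustration under assumptions, not the code in microimpute/utils/type_handling.py: the name `is_boolean_like` is hypothetical, and the real is_boolean_variable may treat missing values or nullable dtypes differently.

```python
import pandas as pd
from pandas.api.types import is_bool_dtype, is_integer_dtype


def is_boolean_like(series: pd.Series) -> bool:
    """Hypothetical sketch: only bool dtype or integer {0, 1} counts as boolean."""
    if is_bool_dtype(series):
        return True
    if is_integer_dtype(series):
        # Integer columns qualify only when every observed value is 0 or 1.
        return set(series.dropna().unique()) <= {0, 1}
    # Float columns never qualify, even when they contain only 0.0 and 1.0.
    return False
```

Under this sketch, `pd.Series([0.0, 1.0])` and a probability column both stay on the regression path, while `pd.Series([0, 1, 1])` and `pd.Series([True, False])` are still treated as boolean.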

Test plan

New tests/test_type_handling.py:

  • Genuine booleans (dtype bool, {0, 1} int) still detected
  • Float {0.0, 1.0} and probability columns NOT detected as boolean
  • Unseen test-time category -> UserWarning
  • Reference-level rows do NOT trigger false-positive warning
  • Fully known test data -> no warning
  • All existing tests/test_models/test_qrf.py, test_ols.py, test_quantreg.py, test_data_preprocessing.py pass (86 total)

Two bugs in microimpute/utils/type_handling.py:

1. is_boolean_variable returned True for any float column whose values
   happened to be a subset of {0.0, 1.0} (#9). A probability column, a
   rescaled indicator, or a small-sample feature that contained only
   0.0 and 1.0 would be silently routed to the classifier path in
   QRF/OLS/MDN — flipping the model type and destroying regression
   behaviour. Now only genuine bool dtypes and integer columns with
   values in {0, 1} are recognised as boolean.

2. apply_dummy_encoding_to_test silently mapped unseen test-time
   categories to the (dropped) reference level via all-zero dummies
   (#10). A receiver survey with a demographic code not in the donor
   survey would have those rows silently assigned the reference-level
   prediction with no notification. Now:
   - The training level set for each categorical predictor is tracked
     explicitly (in DummyVariableProcessor.predictor_training_levels,
     populated during preprocess_predictors) so the reference level is
     distinguishable from genuinely unseen levels.
   - A UserWarning is emitted listing the unseen levels so callers can
     retrain on a superset or filter those rows before predicting.
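
The level-tracking idea in item 2 can be sketched without pandas. Everything here is illustrative: `fit_levels` and `encode_test` are hypothetical helpers rather than the DummyVariableProcessor API, and using the first sorted level as the dropped reference is an assumption.

```python
import warnings


def fit_levels(train_values):
    """Record every level seen in training, including the dropped reference."""
    levels = sorted(set(train_values))
    reference, dummy_levels = levels[0], levels[1:]
    return {"all": set(levels), "reference": reference, "dummies": dummy_levels}


def encode_test(test_values, fitted):
    """Dummy-encode test values, warning about levels absent from training."""
    unseen = sorted(set(test_values) - fitted["all"])
    if unseen:
        warnings.warn(
            f"Unseen categorical levels at test time: {unseen}. "
            "They are encoded as all-zero dummies, like the reference level.",
            UserWarning,
        )
    # Reference and unseen levels both yield all zeros; only the warning
    # above tells them apart.
    return [[int(v == level) for level in fitted["dummies"]] for v in test_values]
```

Because the full level set is stored alongside the dummy columns, a test row at the reference level is recognised as known and produces no warning, while a genuinely new level does.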

Tests
- test_is_boolean_variable_false_for_float_0_1: float {0.0, 1.0} is
  not boolean.
- test_is_boolean_variable_false_for_float_probability: float
  [0, .25, .5, .75, 1] is not boolean.
- test_is_boolean_variable_true_for_bool_dtype / _true_for_int_0_1:
  genuine booleans still detected.
- test_unseen_category_warns_at_test_time: unseen test-time category
  emits UserWarning.
- test_reference_level_does_not_trigger_warning: reference-level
  rows don't emit false-positive warnings.
- test_all_known_levels_do_not_warn: fully in-sample test data is
  silent.
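
One way to assert both the "warns" and the "stays silent" cases is to record warnings with the standard library. This is a sketch only: `encode` is a hypothetical stand-in for the function under test, and the actual tests in tests/test_type_handling.py may use pytest helpers instead.

```python
import warnings


def encode(value, known_levels):
    """Hypothetical stand-in for the encoder under test."""
    if value not in known_levels:
        warnings.warn(f"Unseen level: {value!r}", UserWarning)
    return value


def emits_user_warning(func, *args):
    """Return True if calling func(*args) emits at least one UserWarning."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")  # do not let default filters hide repeats
        func(*args)
    return any(issubclass(w.category, UserWarning) for w in caught)
```

With this helper, `emits_user_warning(encode, "d", {"a", "b"})` is true and `emits_user_warning(encode, "a", {"a", "b"})` is false, which mirrors the unseen-category and fully-known cases listed above.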
Contributor Author

@MaxGhenis MaxGhenis left a comment

Type handling fixes verified:

  • Float {0.0, 1.0} → regressor: is_boolean_variable now returns True only for genuine bool dtype or int {0,1}. Float column with values {0.0, 1.0} or probabilities returns False. Tests test_is_boolean_variable_false_for_float_0_1 and test_is_boolean_variable_false_for_float_probability.
  • Unseen categorical → UserWarning: DummyVariableProcessor now records the full training level set (including the dropped reference) in predictor_training_levels, so apply_dummy_encoding_to_test can emit a UserWarning only for genuinely-new test-time levels without false-positives on reference-level rows. Tests cover unseen-category warns, reference-level doesn't warn, fully-known data is silent.

The test fixtures deliberately use non-equally-spaced numeric values for x to avoid auto-detection as numeric_categorical — a pragmatic workaround that's noted in test comments.

CI all green. Mergeable. LGTM.

@MaxGhenis MaxGhenis merged commit 16c19a6 into main Apr 17, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the fix/type-handling branch April 17, 2026 16:11