Fix county over datasets: map stored county_fips instead of collapsing to first_county_in_state #8843

Merged
MaxGhenis merged 4 commits into main from fix-county-dataset-fips on Jul 2, 2026

@MaxGhenis MaxGhenis commented Jul 2, 2026

Summary

Over a dataset, the county formula read only stored county values and otherwise short-circuited to first_county_in_state. A dataset that carried the county_fips input but no stored county therefore computed every New York household as Albany County. That zeroed in_nyc and the entire defined_for="in_nyc" NYC tax subtree nationwide (the failure documented in PolicyEngine/populace#34: NYC income tax of $0B versus roughly $15–16B actual).

The dataset branch now falls through to the existing county_fips mapping when no county is stored. Full-FIPS, partial-FIPS, and no-FIPS behavior is unchanged from the non-dataset path: unknown or missing FIPS still fall back to first_county_in_state.
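
A minimal sketch of the fall-through order described above, not the actual county.formula code. The function name `resolve_county` and both lookup tables are hypothetical stand-ins, and partial-FIPS handling is left out because its rules are not spelled out here.

```python
import numpy as np

# Hypothetical lookup tables; the real mappings live in policyengine-us.
FIPS_TO_COUNTY = {
    "36061": "NEW_YORK_COUNTY_NY",
    "06037": "LOS_ANGELES_COUNTY_CA",
}
FIRST_COUNTY_IN_STATE = {"NY": "ALBANY_COUNTY_NY", "CA": "ALAMEDA_COUNTY_CA"}


def resolve_county(stored_county, county_fips, state_code):
    """Return one county name per household (illustrative only).

    stored_county and county_fips are sequences, or None when the
    dataset does not carry that variable.
    """
    # 1. A stored county always wins.
    if stored_county is not None:
        return np.asarray(stored_county)

    # 2. Otherwise prepare the per-state fallback.
    fallback = np.array([FIRST_COUNTY_IN_STATE[s] for s in state_code])
    if county_fips is None:
        return fallback

    # 3. Map stored FIPS codes; unknown codes keep the fallback.
    fips = [str(f) for f in county_fips]
    known = np.array([f in FIPS_TO_COUNTY for f in fips])
    mapped = np.array([FIPS_TO_COUNTY.get(f, "") for f in fips])
    return np.where(known, mapped, fallback)
```

With these toy tables, a New York household with FIPS 36061 resolves to New York County, while an unrecognized FIPS or a missing county_fips input still resolves to Albany County.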

This is the policyengine-us half of PolicyEngine/populace#275 (block-anchored geography ladder): populace exports county_fips (with block_geoid, tract_geoid, place_fips, sldu, sldl, cbsa_code) as household inputs, and county/county_str/in_nyc/nyc_income_tax recompute from them instead of persisting formula outputs.

Latent gaps exposed at full-microdata scale

Running the FIPS mapping over enhanced_cps_2024 (which stores county_fips for every household) surfaced three pre-existing defects, all fixed here:

  1. The County enum was missing 61 rows of county_fips_2020.csv.gz — 31 genuine state counties and independent cities (O'Brien County IA, five Georgia counties, Clark and Cumberland KY, four Mississippi counties, Deuel and Logan NE, Churchill NV, Los Alamos NM, six Texas counties, and nine Virginia independent cities) plus 30 territory rows. Households in those counties raised KeyError: 'O_BRIEN_COUNTY_IA' not in index. The members are appended, never inserted: datasets persist county as enum indices (the NYC city file stores int32 index arrays), so existing member positions must never shift.
  2. map_county_string_to_enum raised KeyError for any name outside the enum; it now returns UNKNOWN for unmappable names.
  3. three_digit_zip_code crashed on the "UNKNOWN" default when a dataset stores no zip codes (reached via the Medicaid rating-area county-lookup fallback once households carry real counties); non-numeric zip codes now pass through as failed lookups.
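
The append-only constraint in item 1 can be illustrated with a toy enum. This is a sketch under stated assumptions: the three classes below and the `decode` helper are hypothetical and far smaller than the real County enum, and the position of UNKNOWN is chosen only for illustration.

```python
from enum import Enum

import numpy as np


class CountyV1(Enum):
    # Hypothetical stand-in for the enum before the fix.
    UNKNOWN = "Unknown"
    ALBANY_COUNTY_NY = "Albany County, NY"
    NEW_YORK_COUNTY_NY = "New York County, NY"


class CountyAppended(Enum):
    UNKNOWN = "Unknown"
    ALBANY_COUNTY_NY = "Albany County, NY"
    NEW_YORK_COUNTY_NY = "New York County, NY"
    O_BRIEN_COUNTY_IA = "O'Brien County, IA"  # new member appended last


class CountyInserted(Enum):
    UNKNOWN = "Unknown"
    ALBANY_COUNTY_NY = "Albany County, NY"
    O_BRIEN_COUNTY_IA = "O'Brien County, IA"  # new member inserted mid-enum
    NEW_YORK_COUNTY_NY = "New York County, NY"


def decode(indices, enum_cls):
    """Turn persisted positional indices back into member names."""
    members = list(enum_cls)  # Enum iteration follows definition order
    return [members[i].name for i in indices]


# Indices written to disk while CountyV1 was current.
stored = np.array([2, 1], dtype=np.int32)
```

Decoding `stored` against `CountyAppended` gives the same names as `CountyV1`. Decoding against `CountyInserted` silently turns the New York County household into O'Brien County, which is the kind of shift the append-only rule prevents.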

Changes

  • county.formula: over a dataset with no stored county, map stored county_fips (previously: unconditional first_county_in_state).
  • county_enum.py: append the 61 missing members with an append-only comment documenting the index-stability constraint.
  • county_helpers.map_county_string_to_enum: reindex + fill UNKNOWN instead of raising.
  • three_digit_zip_code: guard non-numeric zip codes.
  • New tests: test_county_from_dataset.py (county mapping from stored FIPS incl. the O'Brien regression; in_nyc/nyc_income_tax recomputation — NYC household positive, LA household zero; no-FIPS fallback) and test_county_enum_coverage.py (every county FIPS dataset row maps to a member — locks the enum against future drift; unmapped names return UNKNOWN).
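
The "reindex + fill UNKNOWN" pattern can be sketched with pandas as follows. The table `COUNTY_MEMBERS` and the function `map_names` are hypothetical; the real helper's signature and return type may differ.

```python
import pandas as pd

# Hypothetical name-to-member table standing in for the County enum.
COUNTY_MEMBERS = pd.Series(
    {
        "Albany County, NY": "ALBANY_COUNTY_NY",
        "O'Brien County, IA": "O_BRIEN_COUNTY_IA",
    }
)


def map_names(names):
    """Map county names to member names; unmappable names become UNKNOWN."""
    # reindex yields a missing value for labels absent from the index,
    # where a strict label lookup would raise KeyError instead.
    return COUNTY_MEMBERS.reindex(list(names)).fillna("UNKNOWN").to_numpy()
```

For example, `map_names(["Albany County, NY", "Nowhere County, ZZ"])` yields the Albany member followed by `"UNKNOWN"` rather than raising.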

Testing

  • The dataset-path tests fail on main and pass with this change; fallback tests pass on both.
  • Locally green: test_county_from_dataset.py, test_county_enum_coverage.py, test_county_fips_fallback.py, test_local_employee_taxes.py, test_load_county_fips_dataset.py, test_microsim.py::test_county_persists_across_periods (NYC.h5 index-persisted county), and the previously failing test_microsim_runs[enhanced_cps_2024] for 2024 and 2025.

🤖 Generated with Claude Code

…g to first_county_in_state

Over a dataset, the county formula read only stored county values and
otherwise short-circuited to first_county_in_state — a dataset carrying
the county_fips input still computed every New York household as Albany
County, zeroing in_nyc and nyc_income_tax nationwide
(PolicyEngine/populace#34). The dataset branch now falls through to the
existing county_fips mapping (full, partial, and no-FIPS fallback
behavior unchanged), so datasets that store county_fips recompute
county, county_str, in_nyc, and NYC taxes correctly.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

codecov Bot commented Jul 2, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.69%. Comparing base (f9e58e7) to head (9ba8f80).
⚠️ Report is 42 commits behind head on main.

Additional details and impacted files
@@             Coverage Diff             @@
##              main    #8843      +/-   ##
===========================================
- Coverage   100.00%   99.69%   -0.31%     
===========================================
  Files            3        3              
  Lines           55     3282    +3227     
  Branches         0        3       +3     
===========================================
+ Hits            55     3272    +3217     
- Misses           0        9       +9     
- Partials         0        1       +1     
Flag        Coverage Δ
unittests   99.69% <100.00%> (-0.31%) ⬇️

MaxGhenis and others added 3 commits July 2, 2026 13:08
Mapping stored county_fips over datasets exposed three latent gaps at
full-microdata scale:

- The County enum was missing 61 rows of the county FIPS dataset: 31
  genuine state counties and independent cities (O'Brien County IA, five
  Georgia counties, Clark and Cumberland KY, four Mississippi counties,
  Deuel and Logan NE, Churchill NV, Los Alamos NM, six Texas counties,
  and nine Virginia independent cities) plus 30 territory rows. All are
  APPENDED, never inserted: datasets persist county as enum indices
  (the NYC city file stores int32 index arrays), so existing member
  positions must never shift.
- map_county_string_to_enum raised KeyError for any name outside the
  enum; it now returns UNKNOWN for unmappable names.
- three_digit_zip_code crashed on the "UNKNOWN" default when a dataset
  stores no zip codes (reached via the Medicaid rating-area fallback
  once households carry real counties); non-numeric zip codes now pass
  through as failed lookups.
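
A sketch of the zip-code guard in the last item, assuming the failure came from treating every value as numeric. The function name `three_digit_prefix` is hypothetical, and returning non-numeric values unchanged is an assumption about what "pass through as failed lookups" means.

```python
import numpy as np


def three_digit_prefix(zip_codes):
    """Return the first three digits of each zip code.

    Non-numeric values such as the "UNKNOWN" default are returned
    unchanged, so a later table lookup simply finds no match.
    """
    out = []
    for z in zip_codes:
        text = str(z)
        out.append(text[:3] if text.isdigit() else text)
    return np.array(out)
```

Under this sketch, `"10001"` becomes `"100"` and `"UNKNOWN"` stays `"UNKNOWN"` without raising.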

Adds an enum-coverage lock test (every county FIPS dataset row maps to
a member) and an O'Brien County regression case in the dataset test.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The tests/core copy shared a basename with the existing baseline test
module, which breaks pytest collection.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@MaxGhenis MaxGhenis merged commit 3cb4283 into main Jul 2, 2026
28 of 29 checks passed
@MaxGhenis MaxGhenis deleted the fix-county-dataset-fips branch July 2, 2026 13:16