Fix county over datasets: map stored county_fips instead of collapsing to first_county_in_state#8843
Merged
Merged
Conversation
…g to first_county_in_state Over a dataset, the county formula read only stored county values and otherwise short-circuited to first_county_in_state — a dataset carrying the county_fips input still computed every New York household as Albany County, zeroing in_nyc and nyc_income_tax nationwide (PolicyEngine/populace#34). The dataset branch now falls through to the existing county_fips mapping (full, partial, and no-FIPS fallback behavior unchanged), so datasets that store county_fips recompute county, county_str, in_nyc, and NYC taxes correctly. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## main #8843 +/- ##
===========================================
- Coverage 100.00% 99.69% -0.31%
===========================================
Files 3 3
Lines 55 3282 +3227
Branches 0 3 +3
===========================================
+ Hits 55 3272 +3217
- Misses 0 9 +9
- Partials 0 1 +1
Flags with carried forward coverage won't be shown. Click here to find out more. ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
Mapping stored county_fips over datasets exposed three latent gaps at full-microdata scale: - The County enum was missing 61 rows of the county FIPS dataset: 31 genuine state counties and independent cities (O'Brien County IA, five Georgia counties, Clark and Cumberland KY, four Mississippi counties, Deuel and Logan NE, Churchill NV, Los Alamos NM, six Texas counties, and nine Virginia independent cities) plus 30 territory rows. All are APPENDED, never inserted: datasets persist county as enum indices (the NYC city file stores int32 index arrays), so existing member positions must never shift. - map_county_string_to_enum raised KeyError for any name outside the enum; it now returns UNKNOWN for unmappable names. - three_digit_zip_code crashed on the "UNKNOWN" default when a dataset stores no zip codes (reached via the Medicaid rating-area fallback once households carry real counties); non-numeric zip codes now pass through as failed lookups. Adds an enum-coverage lock test (every county FIPS dataset row maps to a member) and an O'Brien County regression case in the dataset test. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The tests/core copy shared a basename with the existing baseline test module, which breaks pytest collection. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Over a dataset, the
countyformula read only storedcountyvalues and otherwise short-circuited tofirst_county_in_state. A dataset carrying thecounty_fipsinput — but no storedcounty— therefore computed every New York household as Albany County, which zeroesin_nycand the entiredefined_for="in_nyc"NYC tax subtree nationwide (the failure documented in PolicyEngine/populace#34: NYC income tax $0B vs ~$15–16B actual).The dataset branch now falls through to the existing
county_fipsmapping when nocountyis stored. Full-FIPS, partial-FIPS, and no-FIPS behavior is unchanged from the non-dataset path: unknown or missing FIPS still fall back tofirst_county_in_state.This is the policyengine-us half of PolicyEngine/populace#275 (block-anchored geography ladder): populace exports
county_fips(withblock_geoid,tract_geoid,place_fips,sldu,sldl,cbsa_code) as household inputs, andcounty/county_str/in_nyc/nyc_income_taxrecompute from them instead of persisting formula outputs.Latent gaps exposed at full-microdata scale
Running the FIPS mapping over enhanced_cps_2024 (which stores
county_fipsfor every household) surfaced three pre-existing defects, all fixed here:Countyenum was missing 61 rows ofcounty_fips_2020.csv.gz— 31 genuine state counties and independent cities (O'Brien County IA, five Georgia counties, Clark and Cumberland KY, four Mississippi counties, Deuel and Logan NE, Churchill NV, Los Alamos NM, six Texas counties, and nine Virginia independent cities) plus 30 territory rows. Households in those counties raisedKeyError: 'O_BRIEN_COUNTY_IA' not in index. The members are appended, never inserted: datasets persistcountyas enum indices (the NYC city file stores int32 index arrays), so existing member positions must never shift.map_county_string_to_enumraisedKeyErrorfor any name outside the enum; it now returnsUNKNOWNfor unmappable names.three_digit_zip_codecrashed on the"UNKNOWN"default when a dataset stores no zip codes (reached via the Medicaid rating-area county-lookup fallback once households carry real counties); non-numeric zip codes now pass through as failed lookups.Changes
county.formula: over a dataset with no storedcounty, map storedcounty_fips(previously: unconditionalfirst_county_in_state).county_enum.py: append the 61 missing members with an append-only comment documenting the index-stability constraint.county_helpers.map_county_string_to_enum:reindex+ fillUNKNOWNinstead of raising.three_digit_zip_code: guard non-numeric zip codes.test_county_from_dataset.py(county mapping from stored FIPS incl. the O'Brien regression;in_nyc/nyc_income_taxrecomputation — NYC household positive, LA household zero; no-FIPS fallback) andtest_county_enum_coverage.py(every county FIPS dataset row maps to a member — locks the enum against future drift; unmapped names returnUNKNOWN).Testing
mainand pass with this change; fallback tests pass on both.test_county_from_dataset.py,test_county_enum_coverage.py,test_county_fips_fallback.py,test_local_employee_taxes.py,test_load_county_fips_dataset.py,test_microsim.py::test_county_persists_across_periods(NYC.h5 index-persisted county), and the previously failingtest_microsim_runs[enhanced_cps_2024]for 2024 and 2025.🤖 Generated with Claude Code