Carry the block-anchored US geography ladder as spine columns #277
Merged
Anchor every household at a 2020 census tabulation block (sampled within its assigned congressional district proportional to block population) and derive the full geography ladder as household spine columns on the national artifact: block_geoid, tract_geoid, county_fips, place_fips, sldu, sldl, and cbsa_code alongside the existing state_fips and congressional_district_geoid. One dataset, filter by geography, at any grain; no per-area files.

- populace.build.us_runtime.geography_ladder: ladder artifact loader (refuses artifacts missing any layer vintage; vintage_policy: error), block assignment, Frame wrapper, provenance summary, and the us_geography_ladder gate whose NYC checks are the permanent form of the #34 in_nyc-collapse regression.
- populace.build.us_runtime.block_ladder_sources: pure parsers for the primary sources: 119th CD BEF (NationalCD119.txt), 2020 BAF SLDU/SLDL/INCPLACE_CDP layers, P.L. 94-171 geographic headers (block POP100, validated against each state row), and OMB 2023 CBSA delineations.
- tools/build_us_block_ladder_artifact.py: builds the national NPZ artifact from those sources with cached downloads, per-layer vintage metadata, sha256 provenance, and a load-back self-check.
- tools/build_us_puf_support_base.py: --block-ladder-artifact assigns the ladder after CD assignment (requires the CD vintage crosswalk so households carry current-vintage districts), runs the gate, records the assignment summary, and stamps populace_geography_ladder_* H5 attrs.

Column names follow the policyengine-us household input surface, so the county rung recomputes in_nyc/nyc_income_tax from inputs rather than persisting formula outputs.

Fixes #275

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
- Atomic download cache writes (write-then-rename) so a crash mid-write cannot leave a truncated file that a later run treats as a cache hit.
- CBSA delineation parser accepts numeric-typed spreadsheet cells for all three code columns and raises a clear error when a data row carries malformed FIPS cells instead of an opaque int() failure.
- Correct the P.L. 94-171 geoheader field count (97, not 93) in the parser docstring and test fixture.

Review validated the parsers against the real published files (8.17M-row NationalCD119.txt, DE/AK/NE/DC BAF zips, list1_2023.xlsx, degeo2020.pl) with no confirmed bugs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
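The write-then-rename pattern from the first bullet can be sketched as follows. This is a minimal illustration, not the code in tools/build_us_block_ladder_artifact.py; the helper name atomic_write_bytes and the ".part" suffix are hypothetical.

```python
import os
import tempfile
from pathlib import Path


def atomic_write_bytes(target: Path, payload: bytes) -> None:
    """Write payload to a temp file beside target, then rename it into place.

    Hypothetical helper: a reader sees either the old file or the complete
    new one, never a truncated file.
    """
    target.parent.mkdir(parents=True, exist_ok=True)
    # The temp file lives in the same directory so the rename stays on one
    # filesystem, which is what makes os.replace atomic.
    fd, tmp_name = tempfile.mkstemp(
        dir=target.parent, prefix=target.name + ".", suffix=".part"
    )
    try:
        with os.fdopen(fd, "wb") as handle:
            handle.write(payload)
            handle.flush()
            os.fsync(handle.fileno())  # push bytes to disk before renaming
        os.replace(tmp_name, target)
    except BaseException:
        # Clean up the partial temp file; the cache entry itself is untouched.
        try:
            os.unlink(tmp_name)
        except FileNotFoundError:
            pass
        raise
```

A crash before os.replace leaves only a stray ".part" file, which a cache lookup keyed on the final name never matches.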
MaxGhenis added a commit that referenced this pull request on Jul 2, 2026
The ladder shipped opt-in (#277): nothing stopped a base or release from being built without the geography spine, which is the silent-degradation family (#225 everyone-is-a-citizen, #34 NYC-zero) this repo legislates against.

- tools/build_us_puf_support_base.py: omitting --block-ladder-artifact is now an error; diagnostic builds opt out explicitly with --without-block-ladder (recorded in the summary).
- L0/refit release export: the geography spine (state_fips, congressional_district_geoid, and the seven ladder columns) joins the required release source columns (presence; value quality is the gate's job), and the us_geography_ladder gate runs on the selected support with its calibrated weights. A release whose spine is inconsistent or whose NYC mass collapsed fails by default. --allow-geography-ladder-gate-failures is the diagnostic escape hatch; the gate result is recorded in the export summary either way.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
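One way to picture the NYC-mass check that the gate runs on calibrated weights is the sketch below. It is an assumption-laden illustration, not the us_geography_ladder gate itself: the function names, the string-typed FIPS columns, and the 0.30 to 0.55 bounds are all hypothetical. The five county FIPS codes are the NYC boroughs.

```python
import numpy as np

# Bronx, Kings, New York, Queens, Richmond counties.
NYC_COUNTY_FIPS = ["36005", "36047", "36061", "36081", "36085"]


def nyc_share_of_new_york_state(county_fips, state_fips, weights) -> float:
    """Weighted share of New York State households assigned to an NYC county."""
    county = np.asarray(county_fips, dtype=str)
    state = np.asarray(state_fips, dtype=str)
    w = np.asarray(weights, dtype=float)
    in_ny = state == "36"
    ny_weight = w[in_ny].sum()
    if ny_weight <= 0:
        raise ValueError("no New York State weight in the frame")
    in_nyc = np.isin(county, NYC_COUNTY_FIPS)
    return float(w[in_ny & in_nyc].sum() / ny_weight)


def nyc_share_within_bounds(share: float, lower: float = 0.30, upper: float = 0.55) -> bool:
    """Hypothetical bounds; a collapsed county rung yields a share of 0 and fails."""
    return lower <= share <= upper
```

Because the check reads only stored columns and weights, it can run without the policy engine installed.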
Fixes #275.
What
Anchors every household's geography at a 2020 census tabulation block and carries the full ladder as household spine columns on the national artifact: block_geoid, tract_geoid, county_fips, place_fips, sldu, sldl, cbsa_code, alongside the existing state_fips and congressional_district_geoid. One dataset, filter by geography, at any grain; no per-area files.

Blocks are sampled within each household's already-assigned congressional district proportional to 2020 block population, so the calibrated CD and state distributions are preserved exactly; every other rung is a deterministic function of the block (tract/county by prefix, place/SLD by BAF crosswalk, CBSA by county → OMB delineation). Adding a future layer (school districts, PUMAs, ZCTAs) is a crosswalk-only change to the ladder artifact: no rebuild, no re-assignment.
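The CD-conditioned sampling and the prefix rungs can be sketched as below. This is a minimal sketch under stated assumptions, not the repository's implementation: assign_blocks and derive_prefix_rungs are hypothetical names, and all geoids are assumed to be zero-padded strings. The prefix rule itself follows the standard 15-digit block GEOID layout (state 2 + county 3 + tract 6 + block 4).

```python
import numpy as np


def assign_blocks(household_cd, block_geoid, block_cd, block_pop, seed=0):
    """Sample one block per household within its CD, proportional to population."""
    rng = np.random.default_rng(seed)  # fixed seed keeps the assignment deterministic
    household_cd = np.asarray(household_cd, dtype=str)
    block_geoid = np.asarray(block_geoid, dtype=str)
    block_cd = np.asarray(block_cd, dtype=str)
    block_pop = np.asarray(block_pop, dtype=float)

    assigned = np.empty(household_cd.shape, dtype=block_geoid.dtype)
    for cd in np.unique(household_cd):
        candidates = np.flatnonzero(block_cd == cd)
        if candidates.size == 0:
            raise ValueError(f"no blocks for congressional district {cd}")
        pop = block_pop[candidates]
        if pop.sum() <= 0:
            raise ValueError(f"district {cd} has no positive block population")
        rows = np.flatnonzero(household_cd == cd)
        picks = rng.choice(candidates, size=rows.size, p=pop / pop.sum())
        assigned[rows] = block_geoid[picks]
    return assigned


def derive_prefix_rungs(block: str) -> dict:
    """Tract, county, and state are fixed-width prefixes of the block GEOID."""
    return {
        "state_fips": block[:2],
        "county_fips": block[:5],
        "tract_geoid": block[:11],
    }
```

Because each household only ever draws from blocks inside its own district, the CD (and therefore state) weight totals are untouched by the assignment.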
Pieces
- populace.build.us_runtime.geography_ladder: artifact loader (refuses an artifact missing any layer vintage: vintage_policy: error, the lesson from #205, "Translate old-vintage CD targets to current district geography"), CD-conditioned block assignment with loud vintage-mismatch and state-prefix checks, Frame wrapper, provenance summary, and the us_geography_ladder gate whose NYC checks are the permanent form of the #34 regression, "Fix in_nyc before enforcing no-formula exports" (NYC weighted share within bounds nationally and within New York State; place/CBSA/SLD coverage anchors; block/tract/county prefix consistency).
- populace.build.us_runtime.block_ladder_sources: pure, unit-tested parsers for the primary sources: the 119th Congressional District BEF (NationalCD119.txt, columns GEOID,CDFP; 98 delegate → at-large 00, ZZ dropped), 2020 BAF SLDU/SLDL/INCPLACE_CDP layers (ZZZ → unassigned), P.L. 94-171 geographic headers (block POP100, validated per state against the summary-level-040 row), and the OMB 2023 CBSA delineation workbook.
- tools/build_us_block_ladder_artifact.py: builds the single national NPZ from those sources with cached downloads, per-layer {vintage, source, url} metadata, per-file sha256 provenance, and a load-back self-check. --states supports smoke subsets.
- tools/build_us_puf_support_base.py: --block-ladder-artifact assigns the ladder immediately after CD assignment (requires the CD vintage crosswalk so household districts are current-vintage; a ladder/district vintage mismatch is an error, never a partial join), runs the gate release-blocking (--allow-geography-ladder-gate-failures is the diagnostic escape hatch), records the assignment summary + weighted coverage shares, and stamps populace_geography_ladder_artifact_sha256 / populace_geography_ladder_vintages H5 root attrs next to the existing CD-vintage attrs.
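The BEF normalization rules listed above (98 delegate → at-large 00, ZZ dropped) might look like the sketch below. It is illustrative only: parse_cd_bef is a hypothetical name, the file is assumed to be comma-delimited with a GEOID,CDFP header, and building the district geoid as state prefix + CDFP is an assumption, not something the parsers in block_ladder_sources are confirmed to do.

```python
import csv
import io


def parse_cd_bef(text: str) -> dict:
    """Map 15-digit block GEOID -> 4-character congressional district geoid."""
    out = {}
    for row in csv.DictReader(io.StringIO(text)):
        block = row["GEOID"].strip()
        cd = row["CDFP"].strip()
        if cd == "ZZ":
            continue  # block not assigned to any district: dropped
        if cd == "98":
            cd = "00"  # delegate district recoded as at-large
        if len(block) != 15 or not block.isdigit():
            raise ValueError(f"malformed block GEOID: {block!r}")
        # Assumption: district geoid = 2-digit state FIPS prefix + 2-digit CDFP.
        out[block] = block[:2] + cd
    return out
```

Keeping the parser a pure function of its text input is what lets it be unit-tested against small fixtures and then run unchanged on the full national file.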
The fiscal-refresh/L0 pipeline propagates household columns and populace_* attrs unchanged, so the ladder flows to the release artifact with no further wiring.

Column names follow the policyengine-us household input surface, so the artifact carries engine inputs, never formula outputs. The issue sketched sldu_geoid/sldl_geoid; the artifact stores the policyengine-us inputs sldu/sldl (3-character within-state BAF codes). The full SLD geoid is state_fips + sldu.

Vintages (recorded per layer in artifact metadata, build summary, and H5 attrs)

- 2020_tabulation_blocks
- 119th_congress (cd119.zip), not the 116th-vintage baf2020 CD layer
- 2020_baf
- 2020_census
- INCPLACE_CDP
- omb_2023_delineations (list1_2023.xlsx)

A newer SLD plan (post-2020 redistricting BEF) is a data swap in the artifact builder, not a code change.
The #34 regression (in_nyc / nyc_income_tax)
The county rung restores computable NYC taxes, paired with PolicyEngine/policyengine-us#8843 (over a dataset, county now maps stored county_fips instead of collapsing to first_county_in_state). Verified end to end with that fix installed: a DE+NY+VT ladder build → H5 export → Microsimulation recomputes in_nyc exactly equal to the ladder's NYC-county assignment (all five boroughs present), nyc_income_tax > 0 for every NYC household and 0 elsewhere, and the gate passes with NYC at 43.5% of New York State weight (actual ≈ 42%). Until #8843 ships, the artifact still carries the county rung; the gate checks it structurally without needing the engine.

Verification
- Tests (test_us_geography_ladder.py, test_us_block_ladder_sources.py + loader round-trip): loader refusals (missing vintage, ZZZ markers, duplicate/mismatched blocks, nonpositive population), deterministic population-weighted assignment, vintage-mismatch refusal, frame mass/strata preservation, gate pass/fail behavior including the NYC-collapse case.
- Full populace-build suite green; ruff check clean.
- write_dataset round-trip: all nine spine columns survive USSingleYearDataset save/reload with dtypes intact.

🤖 Generated with Claude Code