Merged
1 change: 1 addition & 0 deletions changelog.d/populace-geoslices.changed.md
@@ -0,0 +1 @@
Use the certified national Populace US dataset for state and congressional-district regions via row filters, and stop vendoring derived Populace area H5 slices into the PolicyEngine bundle manifest.
17 changes: 6 additions & 11 deletions docs/bundles.md
@@ -88,26 +88,21 @@ python scripts/bundle.py certify-data \
--manifest-uri hf://dataset/policyengine/populace-uk-private@<release>/releases/<release>/release_manifest.json
```

For US Populace releases, include the inherited state datasets from
`policyengine-us-data`:
For US Populace releases, certify the Populace release manifest directly:

```bash
python scripts/bundle.py certify-data \
--country us \
--data-producer populace \
--manifest-uri hf://dataset/policyengine/populace-us@<release>/releases/<release>/release_manifest.json \
--regional-manifest-uri hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json \
--model-version <policyengine-us-version>
```

The regional manifest must include all 51 `states/{STATE}.h5` artifacts with
their original repo, revision, and sha256 pins. The resulting bundle manifest
certifies Populace as the US national default dataset and
`policyengine-us-data` as the state dataset source.
The regional manifest URI is recorded for traceability; the bundle does not
currently record the regional manifest's own sha256. The citable pins are the
artifact-level repo, revision, and sha256 values copied into
`data_releases.us.datasets`.
US state and congressional-district regions scope the certified national
Populace dataset with row filters. If a Populace release also publishes derived
`states/*.h5` or `districts/*.h5` area slices, the bundle certification omits
those slices from `data_releases.us.datasets`; they are not runtime dataset
dependencies.
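
As a rough illustration of the omission rule above, the sketch below drops derived area slices from a list of published artifact paths. The helper name `runtime_artifacts` and the flat list of path strings are assumptions made for this example; the actual manifest schema and the internals of `scripts/bundle.py` are not shown in this change.

```python
from fnmatch import fnmatch

# Patterns for the derived area slices named in the paragraph above.
DERIVED_SLICE_PATTERNS = ("states/*.h5", "districts/*.h5")


def runtime_artifacts(paths):
    """Return the paths that are not derived state or district slices.

    Hypothetical helper: the real certification code may operate on richer
    manifest entries rather than bare path strings.
    """
    return [
        path
        for path in paths
        if not any(fnmatch(path, pattern) for pattern in DERIVED_SLICE_PATTERNS)
    ]


# Illustrative file names only; not taken from a real release manifest.
published = ["populace_us_2024.h5", "states/CA.h5", "districts/CA-12.h5"]
kept = runtime_artifacts(published)
```

Under these assumptions only the national H5 remains in `kept`, which mirrors the statement that the slices are not runtime dataset dependencies.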

Use `python scripts/bundle.py generate` to regenerate derived bundle metadata,
and `python scripts/bundle.py generate --include-tros` when TRACE TRO sidecars
5 changes: 3 additions & 2 deletions docs/countries.md
@@ -32,12 +32,13 @@ Override in any output with `income_variable=`.

| | Dataset |
|---|---|
| US | Enhanced CPS 2024 (`enhanced_cps_2024.h5`) |
| US | Populace US 2024 (`populace_us_2024.h5`) |
| UK | Populace UK 2023 (`populace_uk_2023.h5`) |

## State / regional breakdown

US: `state_code` and `congressional_district` on every household.
US: Populace row scoping uses `state_fips` and `congressional_district_geoid`.
`state_code` remains the human-readable input for custom households.
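
To make the relationship between the two identifiers concrete, here is a minimal sketch that maps a two-letter `state_code` to the integer FIPS code used for row scoping. The `STATE_FIPS` table and `state_fips_for` helper are hypothetical and deliberately partial; whether the library exposes an equivalent lookup is not stated here.

```python
# Partial lookup for illustration only; a full table covers every state and DC.
STATE_FIPS = {"CA": 6, "DC": 11, "NY": 36, "TX": 48}


def state_fips_for(state_code):
    """Map a two-letter state code to its integer FIPS code."""
    return STATE_FIPS[state_code.upper()]


ca_fips = state_fips_for("CA")
```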

UK: constituency code and local authority code on every household where available.

6 changes: 3 additions & 3 deletions docs/data-publishing-design.md
@@ -168,7 +168,7 @@ Minimal. The existing `pe.us.ensure_datasets` takes a URI today:

```python
pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
datasets=["hf://policyengine/populace-us/populace_us_2024.h5@<release>"],
years=[2026],
)
```
@@ -178,13 +178,13 @@ Under the substrate, the URI scheme gains a new prefix:
```python
# The release manifest pins a specific artifact:
pe.us.ensure_datasets(
datasets=["pe-data://us/enhanced_cps_2024@sha256:4e92b340…"],
datasets=["pe-data://us/populace_us_2024@sha256:4e92b340…"],
years=[2026],
)

# A developer asking for operational newest:
pe.us.ensure_datasets(
datasets=["pe-data://us/enhanced_cps_2024@latest"], # resolves via channel
datasets=["pe-data://us/populace_us_2024@latest"], # resolves via channel
years=[2026],
)
```
25 changes: 7 additions & 18 deletions docs/engineering/skills/data-certification.md
@@ -26,37 +26,26 @@ python scripts/bundle.py certify-data --country uk --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-uk-private@<tag>/releases/<tag>/release_manifest.json"
```

For US Populace certification, include the inherited state datasets from the
certified `policyengine-us-data` release manifest:
For US Populace certification, certify the Populace release manifest directly:

```bash
python scripts/bundle.py certify-data --country us --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
--regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
--model-version "<policyengine-us-version>"
```

The regional manifest is required for US while the stack still serves
state-level datasets from `policyengine-us-data`. It must contain all 51
`states/{STATE}.h5` artifacts, including DC, and each state artifact must carry
its original `repo_id`, `revision`, and `sha256`. Certification preserves those
per-artifact pins in `data_releases.us.datasets` and writes:
US state and congressional-district regions are row filters over the certified
national Populace dataset. Certification writes:

```json
"region_datasets": {
"national": {"path_template": "populace_us_2024.h5"},
"state": {"path_template": "states/{state_code}.h5"}
"national": {"path_template": "populace_us_2024.h5"}
}
```

Do not move or rewrite state artifacts into the Populace repo. The certified
bundle is intentionally hybrid: Populace owns the national default dataset, and
`policyengine-us-data` owns the inherited state datasets until that path is
migrated.
The regional manifest URI is recorded for traceability, but the bundle does not
currently record the regional manifest's own sha256. Treat the copied
artifact-level repo, revision, and sha256 pins in `data_releases.us.datasets`
as the citable state dataset certification.
If the Populace release publishes derived `states/*.h5` or `districts/*.h5`
files for compatibility checks, certification omits them from the runtime
bundle. The national H5 is the canonical `.py` dataset.

The script fetches and validates the manifest (every artifact must carry a
revision pin; the certified dataset must be reachable), writes the canonical
7 changes: 2 additions & 5 deletions docs/getting-started.md
@@ -76,11 +76,8 @@ For population estimates — budget cost, distributional impact, poverty — mov
```python
from policyengine.core import Simulation

datasets = pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
)
dataset = datasets["enhanced_cps_2024_2026"]
datasets = pe.us.ensure_datasets(years=[2026])
dataset = next(iter(datasets.values()))

baseline = Simulation(
dataset=dataset,
7 changes: 2 additions & 5 deletions docs/impact-analysis.md
@@ -10,11 +10,8 @@ title: "Impact analysis"
import policyengine as pe
from policyengine.core import Simulation

datasets = pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
)
dataset = datasets["enhanced_cps_2024_2026"]
datasets = pe.us.ensure_datasets(years=[2026])
dataset = next(iter(datasets.values()))

baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
reformed = Simulation(
19 changes: 7 additions & 12 deletions docs/microsim.md
@@ -11,11 +11,8 @@ import policyengine as pe
from policyengine.core import Simulation
from policyengine.outputs import Aggregate, AggregateType

datasets = pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2026],
)
dataset = datasets["enhanced_cps_2024_2026"]
datasets = pe.us.ensure_datasets(years=[2026])
dataset = next(iter(datasets.values()))

baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
baseline.ensure()
@@ -37,15 +34,13 @@ Microdata is stored as HDF5 on Hugging Face. `ensure_datasets` downloads, caches

```python
datasets = pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
years=[2024, 2026],
data_folder="./data", # local cache directory
)
# Keys are "<dataset_stem>_<year>":
dataset = datasets["enhanced_cps_2024_2026"]
dataset = datasets["populace_us_2024_2026"]
```

The default US dataset is **Enhanced CPS 2024** — CPS ASEC fused with IRS SOI tax-return records and calibrated to IRS, CMS, SNAP, and other administrative totals. The UK default is **Populace UK 2023** — a Populace-built Family Resources Survey dataset calibrated to UK administrative targets.
The default US dataset is **Populace US 2024** — a Populace-built dataset calibrated to IRS, CMS, SNAP, Census, and other administrative totals. The UK default is **Populace UK 2023** — a Populace-built Family Resources Survey dataset calibrated to UK administrative targets.
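
The `"<dataset_stem>_<year>"` key convention mentioned in the snippet above can be sketched as follows. This is an illustration of the naming rule only, not the library's implementation: `dataset_key` is a hypothetical helper, the local path is made up, and URIs that carry a revision suffix after the file name are not handled.

```python
from pathlib import PurePosixPath


def dataset_key(dataset_path, year):
    """Build a "<dataset_stem>_<year>" key from a dataset file path.

    Hypothetical helper: assumes the path ends in the dataset file name.
    """
    stem = PurePosixPath(dataset_path).stem  # file name without ".h5"
    return f"{stem}_{year}"


key = dataset_key("./data/populace_us_2024.h5", 2026)
```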

List datasets already known to the country:

@@ -158,7 +153,7 @@ See [Outputs](outputs.md) for the full catalog.

## Memory and performance

A full Enhanced CPS microsimulation uses roughly 4 GB of memory and takes 15–30 seconds on a laptop. For parameter sweeps, reuse the baseline:
A full Populace US microsimulation uses roughly 4 GB of memory and takes 15-30 seconds on a laptop. For parameter sweeps, reuse the baseline:

```python
baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
@@ -171,11 +166,11 @@ for amount in [0, 1_000, 2_000, 3_000]:
# each iteration runs only the reform
```

Downsampled datasets are available for testing:
Smaller custom H5 datasets can be passed explicitly for testing:

```python
datasets = pe.us.ensure_datasets(
datasets=["hf://policyengine/policyengine-us-data/cps_small_2024.h5"],
datasets=["/path/to/smoke_test_populace_us_2024.h5"],
years=[2026],
)
```
27 changes: 15 additions & 12 deletions docs/regions.md
@@ -6,7 +6,9 @@ Sub-national breakdowns: state / district filters on any output, plus dedicated

## US states

`state_code` is an Enum variable on every household (values `"CA"`, `"TX"`, ...). Pass it as a filter on any `Aggregate` or `ChangeAggregate`:
For custom households, `state_code` remains the public input (values `"CA"`,
`"TX"`, ...). Pass it as a filter on any `Aggregate` or `ChangeAggregate` when
working with simulated outputs that expose that variable:

```python
from policyengine.outputs import Aggregate, AggregateType
@@ -21,15 +23,18 @@ ca_snap = Aggregate(
ca_snap.run()
```

Each state is a region in the US registry, with its own dataset:
Each state is a region in the US registry. State regions scope the certified
national Populace dataset by `state_fips`; they do not require separate state
H5 files:

```python
states = pe.us.model.region_registry.get_by_type("state")
for region in states:
print(region.code, region.label, region.dataset_path)
print(region.code, region.label, region.scoping_strategy)
```

For state-specific datasets (rather than filtering a national one), pass `scoping_strategy=region.scoping_strategy` or resolve the dataset path directly.
For state-specific simulations, pass `scoping_strategy=region.scoping_strategy`
with the certified national dataset.
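
Conceptually, scoping a national dataset to one state keeps only the household rows whose `state_fips` matches. The sketch below shows that idea on plain dictionaries; it is not the library's code. The real path goes through `filter_dataset_by_household_variable`, whose internals (including how linked person and group-entity rows are kept in sync) are not shown here, and `filter_households` is a hypothetical stand-in.

```python
def filter_households(households, variable_name, variable_value):
    """Keep only household records whose variable equals the given value."""
    return [h for h in households if h[variable_name] == variable_value]


# Toy household table; IDs and values are made up for illustration.
households = [
    {"household_id": 1, "state_fips": 6},
    {"household_id": 2, "state_fips": 48},
    {"household_id": 3, "state_fips": 6},
]
california = filter_households(households, "state_fips", 6)
```

Because the filter runs against the national file, no per-state H5 needs to exist for the region to resolve.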

## US congressional districts

@@ -44,7 +49,7 @@ for row in impacts.district_results:
print(row["district_geoid"], row["avg_change"], row["winner_percentage"])
```

`district_geoid` is the SSDD integer (state FIPS × 100 + district number). Requires a dataset with `congressional_district_geoid` populated — the default enhanced CPS does.
`district_geoid` is the SSDD integer (state FIPS × 100 + district number; at-large districts use `00`). Congressional district regions scope the certified national Populace dataset by `congressional_district_geoid`.
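
The SSDD encoding described above is simple arithmetic, sketched here for reference. Both function names are hypothetical helpers for this example rather than part of the documented API.

```python
def district_geoid(state_fips, district_number):
    """Encode a congressional district as the SSDD integer."""
    if not 0 <= district_number < 100:
        raise ValueError("district number must fit in two digits")
    return state_fips * 100 + district_number


def split_district_geoid(geoid):
    """Recover (state_fips, district_number) from an SSDD integer."""
    return divmod(geoid, 100)


ca_12 = district_geoid(6, 12)      # California (FIPS 6), district 12
wy_at_large = district_geoid(56, 0)  # Wyoming (FIPS 56), at-large uses 00
```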

## UK parliamentary constituencies

@@ -136,21 +141,19 @@ baseline = Simulation(
dataset=dataset,
tax_benefit_model_version=pe.us.model,
scoping_strategy=RowFilterStrategy(
variable_name="state_code",
variable_value="CA",
variable_name="state_fips",
variable_value=6,
),
)
```

Regions that filter (US places, UK countries, and any region with `region.requires_filter == True`) carry their own `scoping_strategy`. Pull it off the region object rather than reconstructing it:
Regions that filter (US states and congressional districts, UK countries, and any region with `region.requires_filter == True`) carry their own `scoping_strategy`. Pull it off the region object rather than reconstructing it. US place regions are present as hierarchy metadata, but current Populace datasets do not carry `place_fips`, so they do not expose runtime scoping yet:

```python
nyc = pe.us.model.region_registry.get("place/NY-51000")
ca = pe.us.model.region_registry.get("state/ca")
baseline = Simulation(
dataset=dataset,
tax_benefit_model_version=pe.us.model,
scoping_strategy=nyc.scoping_strategy,
scoping_strategy=ca.scoping_strategy,
)
```

US states and congressional districts don't use a scoping strategy — they point to dedicated state- or district-specific datasets via `region.dataset_path`. Pass that dataset to `Simulation` instead.
16 changes: 5 additions & 11 deletions docs/release-bundles.md
@@ -96,7 +96,7 @@ It does not own final runtime bundle certification.

### Country data package

Examples: `policyengine-uk-data`, `policyengine-us-data`
Examples: `populace-data`, `policyengine-uk-data`

The country data package owns:

@@ -128,24 +128,18 @@ python scripts/bundle.py certify-data --country us \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json"
```

US Populace certification currently also needs the inherited state-level
datasets from the certified `policyengine-us-data` release manifest:
US Populace certification uses the Populace release manifest directly:

```bash
python scripts/bundle.py certify-data --country us --data-producer populace \
--manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
--regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
--model-version "<policyengine-us-version>"
```

That produces one US bundle manifest entry containing the Populace national
default dataset plus all 51 `states/{STATE}.h5` artifacts pinned to
`policyengine-us-data`. The resulting `region_datasets.state` template lets
runtime code resolve a state region to the exact certified state artifact.
The regional manifest URI is retained for traceability, but the bundle does not
currently store the regional manifest's own sha256. For inherited state data,
the citable pins are the copied artifact-level repo, revision, and sha256
values in `data_releases.us.datasets`.
default dataset. State and congressional-district regions are runtime row
filters over that national dataset, so derived `states/*.h5` or
`districts/*.h5` files are not vendored into `data_releases.us.datasets`.

Earlier releases (policyengine 4.15.x–4.16.x) were certified through the
`PolicyEngine/policyengine-bundles` archive flow; those bundles remain the
1 change: 1 addition & 0 deletions scripts/generate_trace_tros.py
@@ -61,6 +61,7 @@ def generated_tros() -> list[tuple[Path, bytes]]:
certification=country_manifest.certification,
model_wheel_sha256=country_manifest.model_package.sha256,
model_wheel_url=country_manifest.model_package.wheel_url,
emission_context={"pe:emittedIn": "repository-bundle"},
)
payloads.append((tro_path, serialize_trace_tro(tro)))
return payloads
6 changes: 3 additions & 3 deletions src/policyengine/core/region.py
@@ -2,9 +2,9 @@

This module provides the Region and RegionRegistry classes for defining
geographic regions that a tax-benefit model supports. Regions can have:
1. A dedicated dataset (e.g., US states, congressional districts)
1. A dedicated dataset, usually for the national default.
2. A scoping strategy that derives the region from a parent dataset
(row filter or weight replacement)
(row filter or weight replacement).
"""

from typing import Literal, Optional, Union
@@ -56,7 +56,7 @@ class Region(BaseModel):
# Dataset configuration
dataset_path: Optional[str] = Field(
default=None,
description="GCS path to dedicated dataset (e.g., 'gs://policyengine-us-data/states/CA.h5')",
description="URI to a dedicated dataset when the region has one.",
)

# Scoping strategy for regions that derive from a parent dataset
16 changes: 12 additions & 4 deletions src/policyengine/core/scoping_strategy.py
@@ -3,7 +3,8 @@
Provides two concrete strategies for scoping datasets to sub-national regions:

1. RowFilterStrategy: Filters dataset rows where a household variable matches
a specific value (e.g., UK countries by 'country' field, US places by 'place_fips').
a specific value (e.g., US states by 'state_fips', US congressional districts
by 'congressional_district_geoid').

2. WeightReplacementStrategy: Legacy strategy that replaces household weights from
a pre-computed weight matrix resolved locally or from GCS.
@@ -16,7 +17,7 @@
import numpy as np
import pandas as pd
from microdf import MicroDataFrame
from pydantic import BaseModel, Discriminator
from pydantic import BaseModel, Discriminator, Field

from policyengine.utils.entity_utils import (
filter_dataset_by_household_variable,
@@ -62,12 +63,13 @@ class RowFilterStrategy(RegionScopingStrategy):
"""Scoping strategy that filters dataset rows by a household variable.

Used for regions where we want to keep only households matching a
specific variable value (e.g., UK countries, US places/cities).
specific variable value (e.g., US states or congressional districts).
"""

strategy_type: Literal["row_filter"] = "row_filter"
variable_name: str
variable_value: Union[str, int, float]
additional_filters: dict[str, Union[str, int, float]] = Field(default_factory=dict)

def apply(
self,
@@ -80,11 +82,17 @@ def apply(
group_entities=group_entities,
variable_name=self.variable_name,
variable_value=self.variable_value,
additional_filters=self.additional_filters,
)

@property
def cache_key(self) -> str:
return f"row_filter:{self.variable_name}={self.variable_value}"
filters = [
(self.variable_name, self.variable_value),
*self.additional_filters.items(),
]
filter_key = ",".join(f"{name}={value}" for name, value in sorted(filters))
return f"row_filter:{filter_key}"


class WeightReplacementStrategy(RegionScopingStrategy):