PolicyEngine · MaxGhenis · Jun 28, 2026 · Jun 28, 2026
diff --git a/changelog.d/populace-geoslices.changed.md b/changelog.d/populace-geoslices.changed.md
@@ -0,0 +1 @@
+Use the certified national Populace US dataset for state and congressional-district regions via row filters, and stop vendoring derived Populace area H5 slices into the PolicyEngine bundle manifest.
diff --git a/docs/bundles.md b/docs/bundles.md
@@ -88,26 +88,21 @@ python scripts/bundle.py certify-data \
   --manifest-uri hf://dataset/policyengine/populace-uk-private@<release>/releases/<release>/release_manifest.json
 ```
 
-For US Populace releases, include the inherited state datasets from
-`policyengine-us-data`:
+For US Populace releases, certify the Populace release manifest directly:
 
 ```bash
 python scripts/bundle.py certify-data \
   --country us \
   --data-producer populace \
   --manifest-uri hf://dataset/policyengine/populace-us@<release>/releases/<release>/release_manifest.json \
-  --regional-manifest-uri hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json \
   --model-version <policyengine-us-version>
 ```
 
-The regional manifest must include all 51 `states/{STATE}.h5` artifacts with
-their original repo, revision, and sha256 pins. The resulting bundle manifest
-certifies Populace as the US national default dataset and
-`policyengine-us-data` as the state dataset source.
-The regional manifest URI is recorded for traceability; the bundle does not
-currently record the regional manifest's own sha256. The citable pins are the
-artifact-level repo, revision, and sha256 values copied into
-`data_releases.us.datasets`.
+US state and congressional-district regions scope the certified national
+Populace dataset with row filters. If a Populace release also publishes derived
+`states/*.h5` or `districts/*.h5` area slices, the bundle certification omits
+those slices from `data_releases.us.datasets`; they are not runtime dataset
+dependencies.
 
 Use `python scripts/bundle.py generate` to regenerate derived bundle metadata,
 and `python scripts/bundle.py generate --include-tros` when TRACE TRO sidecars

diff --git a/docs/countries.md b/docs/countries.md
@@ -32,12 +32,13 @@ Override in any output with `income_variable=`.
 
 | | Dataset |
 |---|---|
-| US | Enhanced CPS 2024 (`enhanced_cps_2024.h5`) |
+| US | Populace US 2024 (`populace_us_2024.h5`) |
 | UK | Populace UK 2023 (`populace_uk_2023.h5`) |
 
 ## State / regional breakdown
 
-US: `state_code` and `congressional_district` on every household.
+US: Populace row scoping uses `state_fips` and `congressional_district_geoid`.
+`state_code` remains the human-readable input for custom households.
 
 UK: constituency code and local authority code on every household where available.
 

diff --git a/docs/data-publishing-design.md b/docs/data-publishing-design.md
@@ -168,7 +168,7 @@ Minimal. The existing `pe.us.ensure_datasets` takes a URI today:
 
 ```python
 pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
+    datasets=["hf://policyengine/populace-us/populace_us_2024.h5@<release>"],
     years=[2026],
 )
 ```
@@ -178,13 +178,13 @@ Under the substrate, the URI scheme gains a new prefix:
 ```python
 # The release manifest pins a specific artifact:
 pe.us.ensure_datasets(
-    datasets=["pe-data://us/enhanced_cps_2024@sha256:4e92b340…"],
+    datasets=["pe-data://us/populace_us_2024@sha256:4e92b340…"],
     years=[2026],
 )
 
 # A developer asking for operational newest:
 pe.us.ensure_datasets(
-    datasets=["pe-data://us/enhanced_cps_2024@latest"],  # resolves via channel
+    datasets=["pe-data://us/populace_us_2024@latest"],  # resolves via channel
     years=[2026],
 )
 ```

diff --git a/docs/engineering/skills/data-certification.md b/docs/engineering/skills/data-certification.md
@@ -26,37 +26,26 @@ python scripts/bundle.py certify-data --country uk --data-producer populace \
   --manifest-uri "hf://dataset/policyengine/populace-uk-private@<tag>/releases/<tag>/release_manifest.json"
 ```
 
-For US Populace certification, include the inherited state datasets from the
-certified `policyengine-us-data` release manifest:
+For US Populace certification, certify the Populace release manifest directly:
 
 ```bash
 python scripts/bundle.py certify-data --country us --data-producer populace \
   --manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
-  --regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
   --model-version "<policyengine-us-version>"
 ```
 
-The regional manifest is required for US while the stack still serves
-state-level datasets from `policyengine-us-data`. It must contain all 51
-`states/{STATE}.h5` artifacts, including DC, and each state artifact must carry
-its original `repo_id`, `revision`, and `sha256`. Certification preserves those
-per-artifact pins in `data_releases.us.datasets` and writes:
+US state and congressional-district regions are row filters over the certified
+national Populace dataset. Certification writes:
 
 ```json
 "region_datasets": {
-  "national": {"path_template": "populace_us_2024.h5"},
-  "state": {"path_template": "states/{state_code}.h5"}
+  "national": {"path_template": "populace_us_2024.h5"}
 }
 ```
 
-Do not move or rewrite state artifacts into the Populace repo. The certified
-bundle is intentionally hybrid: Populace owns the national default dataset, and
-`policyengine-us-data` owns the inherited state datasets until that path is
-migrated.
-The regional manifest URI is recorded for traceability, but the bundle does not
-currently record the regional manifest's own sha256. Treat the copied
-artifact-level repo, revision, and sha256 pins in `data_releases.us.datasets`
-as the citable state dataset certification.
+If the Populace release publishes derived `states/*.h5` or `districts/*.h5`
+files for compatibility checks, certification omits them from the runtime
+bundle. The national H5 is the canonical `.py` dataset.
 
 The script fetches and validates the manifest (every artifact must carry a
 revision pin; the certified dataset must be reachable), writes the canonical

diff --git a/docs/getting-started.md b/docs/getting-started.md
@@ -76,11 +76,8 @@ For population estimates — budget cost, distributional impact, poverty — mov
 ```python
 from policyengine.core import Simulation
 
-datasets = pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
-    years=[2026],
-)
-dataset = datasets["enhanced_cps_2024_2026"]
+datasets = pe.us.ensure_datasets(years=[2026])
+dataset = next(iter(datasets.values()))
 
 baseline = Simulation(
     dataset=dataset,

diff --git a/docs/impact-analysis.md b/docs/impact-analysis.md
@@ -10,11 +10,8 @@ title: "Impact analysis"
 import policyengine as pe
 from policyengine.core import Simulation
 
-datasets = pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
-    years=[2026],
-)
-dataset = datasets["enhanced_cps_2024_2026"]
+datasets = pe.us.ensure_datasets(years=[2026])
+dataset = next(iter(datasets.values()))
 
 baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
 reformed = Simulation(

diff --git a/docs/microsim.md b/docs/microsim.md
@@ -11,11 +11,8 @@ import policyengine as pe
 from policyengine.core import Simulation
 from policyengine.outputs import Aggregate, AggregateType
 
-datasets = pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
-    years=[2026],
-)
-dataset = datasets["enhanced_cps_2024_2026"]
+datasets = pe.us.ensure_datasets(years=[2026])
+dataset = next(iter(datasets.values()))
 
 baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
 baseline.ensure()
@@ -37,15 +34,13 @@ Microdata is stored as HDF5 on Hugging Face. `ensure_datasets` downloads, caches
 
 ```python
 datasets = pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/enhanced_cps_2024.h5"],
     years=[2024, 2026],
     data_folder="./data",        # local cache directory
 )
-# Keys are "<dataset_stem>_<year>":
-dataset = datasets["enhanced_cps_2024_2026"]
+dataset = datasets["populace_us_2024_2026"]
 ```
 
-The default US dataset is **Enhanced CPS 2024** — CPS ASEC fused with IRS SOI tax-return records and calibrated to IRS, CMS, SNAP, and other administrative totals. The UK default is **Populace UK 2023** — a Populace-built Family Resources Survey dataset calibrated to UK administrative targets.
+The default US dataset is **Populace US 2024** — a Populace-built dataset calibrated to IRS, CMS, SNAP, Census, and other administrative totals. The UK default is **Populace UK 2023** — a Populace-built Family Resources Survey dataset calibrated to UK administrative targets.
 
 List datasets already known to the country:
 
@@ -158,7 +153,7 @@ See [Outputs](outputs.md) for the full catalog.
 
 ## Memory and performance
 
-A full Enhanced CPS microsimulation uses roughly 4 GB of memory and takes 15–30 seconds on a laptop. For parameter sweeps, reuse the baseline:
+A full Populace US microsimulation uses roughly 4 GB of memory and takes 15-30 seconds on a laptop. For parameter sweeps, reuse the baseline:
 
 ```python
 baseline = Simulation(dataset=dataset, tax_benefit_model_version=pe.us.model)
@@ -171,11 +166,11 @@ for amount in [0, 1_000, 2_000, 3_000]:
     # each iteration runs only the reform
 ```
 
-Downsampled datasets are available for testing:
+Smaller custom H5 datasets can be passed explicitly for testing:
 
 ```python
 datasets = pe.us.ensure_datasets(
-    datasets=["hf://policyengine/policyengine-us-data/cps_small_2024.h5"],
+    datasets=["/path/to/smoke_test_populace_us_2024.h5"],
     years=[2026],
 )
 ```

diff --git a/docs/regions.md b/docs/regions.md
@@ -6,7 +6,9 @@ Sub-national breakdowns: state / district filters on any output, plus dedicated
 
 ## US states
 
-`state_code` is an Enum variable on every household (values `"CA"`, `"TX"`, ...). Pass it as a filter on any `Aggregate` or `ChangeAggregate`:
+For custom households, `state_code` remains the public input (values `"CA"`,
+`"TX"`, ...). Pass it as a filter on any `Aggregate` or `ChangeAggregate` when
+working with simulated outputs that expose that variable:
 
 ```python
 from policyengine.outputs import Aggregate, AggregateType
@@ -21,15 +23,18 @@ ca_snap = Aggregate(
 ca_snap.run()
 ```
 
-Each state is a region in the US registry, with its own dataset:
+Each state is a region in the US registry. State regions scope the certified
+national Populace dataset by `state_fips`; they do not require separate state
+H5 files:
 
 ```python
 states = pe.us.model.region_registry.get_by_type("state")
 for region in states:
-    print(region.code, region.label, region.dataset_path)
+    print(region.code, region.label, region.scoping_strategy)
 ```
 
-For state-specific datasets (rather than filtering a national one), pass `scoping_strategy=region.scoping_strategy` or resolve the dataset path directly.
+For state-specific simulations, pass `scoping_strategy=region.scoping_strategy`
+with the certified national dataset.
 
 ## US congressional districts
 
@@ -44,7 +49,7 @@ for row in impacts.district_results:
     print(row["district_geoid"], row["avg_change"], row["winner_percentage"])
 ```
 
-`district_geoid` is the SSDD integer (state FIPS × 100 + district number). Requires a dataset with `congressional_district_geoid` populated — the default enhanced CPS does.
+`district_geoid` is the SSDD integer (state FIPS × 100 + district number; at-large districts use `00`). Congressional district regions scope the certified national Populace dataset by `congressional_district_geoid`.
 
 ## UK parliamentary constituencies
 
@@ -136,21 +141,19 @@ baseline = Simulation(
     dataset=dataset,
     tax_benefit_model_version=pe.us.model,
     scoping_strategy=RowFilterStrategy(
-        variable_name="state_code",
-        variable_value="CA",
+        variable_name="state_fips",
+        variable_value=6,
     ),
 )
 ```
 
-Regions that filter (US places, UK countries, and any region with `region.requires_filter == True`) carry their own `scoping_strategy`. Pull it off the region object rather than reconstructing it:
+Regions that filter (US states and congressional districts, UK countries, and any region with `region.requires_filter == True`) carry their own `scoping_strategy`. Pull it off the region object rather than reconstructing it. US place regions are present as hierarchy metadata, but current Populace datasets do not carry `place_fips`, so they do not expose runtime scoping yet:
 
 ```python
-nyc = pe.us.model.region_registry.get("place/NY-51000")
+ca = pe.us.model.region_registry.get("state/ca")
 baseline = Simulation(
     dataset=dataset,
     tax_benefit_model_version=pe.us.model,
-    scoping_strategy=nyc.scoping_strategy,
+    scoping_strategy=ca.scoping_strategy,
 )
 ```
-
-US states and congressional districts don't use a scoping strategy — they point to dedicated state- or district-specific datasets via `region.dataset_path`. Pass that dataset to `Simulation` instead.
diff --git a/docs/release-bundles.md b/docs/release-bundles.md
@@ -96,7 +96,7 @@ It does not own final runtime bundle certification.
 
 ### Country data package
 
-Examples: `policyengine-uk-data`, `policyengine-us-data`
+Examples: `populace-data`, `policyengine-uk-data`
 
 The country data package owns:
 
@@ -128,24 +128,18 @@ python scripts/bundle.py certify-data --country us \
   --manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json"
 ```
 
-US Populace certification currently also needs the inherited state-level
-datasets from the certified `policyengine-us-data` release manifest:
+US Populace certification uses the Populace release manifest directly:
 
 ```bash
 python scripts/bundle.py certify-data --country us --data-producer populace \
   --manifest-uri "hf://dataset/policyengine/populace-us@<tag>/releases/<tag>/release_manifest.json" \
-  --regional-manifest-uri "hf://model/policyengine/policyengine-us-data@<version>/releases/<version>/release_manifest.json" \
   --model-version "<policyengine-us-version>"
 ```
 
 That produces one US bundle manifest entry containing the Populace national
-default dataset plus all 51 `states/{STATE}.h5` artifacts pinned to
-`policyengine-us-data`. The resulting `region_datasets.state` template lets
-runtime code resolve a state region to the exact certified state artifact.
-The regional manifest URI is retained for traceability, but the bundle does not
-currently store the regional manifest's own sha256. For inherited state data,
-the citable pins are the copied artifact-level repo, revision, and sha256
-values in `data_releases.us.datasets`.
+default dataset. State and congressional-district regions are runtime row
+filters over that national dataset, so derived `states/*.h5` or
+`districts/*.h5` files are not vendored into `data_releases.us.datasets`.
 
 Earlier releases (policyengine 4.15.x–4.16.x) were certified through the
 `PolicyEngine/policyengine-bundles` archive flow; those bundles remain the

diff --git a/scripts/generate_trace_tros.py b/scripts/generate_trace_tros.py
@@ -61,6 +61,7 @@ def generated_tros() -> list[tuple[Path, bytes]]:
             certification=country_manifest.certification,
             model_wheel_sha256=country_manifest.model_package.sha256,
             model_wheel_url=country_manifest.model_package.wheel_url,
+            emission_context={"pe:emittedIn": "repository-bundle"},
         )
         payloads.append((tro_path, serialize_trace_tro(tro)))
     return payloads

diff --git a/src/policyengine/core/region.py b/src/policyengine/core/region.py
@@ -2,9 +2,9 @@
 
 This module provides the Region and RegionRegistry classes for defining
 geographic regions that a tax-benefit model supports. Regions can have:
-1. A dedicated dataset (e.g., US states, congressional districts)
+1. A dedicated dataset, usually for the national default.
 2. A scoping strategy that derives the region from a parent dataset
-   (row filter or weight replacement)
+   (row filter or weight replacement).
 """
 
 from typing import Literal, Optional, Union
@@ -56,7 +56,7 @@ class Region(BaseModel):
     # Dataset configuration
     dataset_path: Optional[str] = Field(
         default=None,
-        description="GCS path to dedicated dataset (e.g., 'gs://policyengine-us-data/states/CA.h5')",
+        description="URI to a dedicated dataset when the region has one.",
     )
 
     # Scoping strategy for regions that derive from a parent dataset

diff --git a/src/policyengine/core/scoping_strategy.py b/src/policyengine/core/scoping_strategy.py
@@ -3,7 +3,8 @@
 Provides two concrete strategies for scoping datasets to sub-national regions:
 
 1. RowFilterStrategy: Filters dataset rows where a household variable matches
-   a specific value (e.g., UK countries by 'country' field, US places by 'place_fips').
+   a specific value (e.g., US states by 'state_fips', US congressional districts
+   by 'congressional_district_geoid').
 
 2. WeightReplacementStrategy: Legacy strategy that replaces household weights from
    a pre-computed weight matrix resolved locally or from GCS.
@@ -16,7 +17,7 @@
 import numpy as np
 import pandas as pd
 from microdf import MicroDataFrame
-from pydantic import BaseModel, Discriminator
+from pydantic import BaseModel, Discriminator, Field
 
 from policyengine.utils.entity_utils import (
     filter_dataset_by_household_variable,
@@ -62,12 +63,13 @@ class RowFilterStrategy(RegionScopingStrategy):
     """Scoping strategy that filters dataset rows by a household variable.
 
     Used for regions where we want to keep only households matching a
-    specific variable value (e.g., UK countries, US places/cities).
+    specific variable value (e.g., US states or congressional districts).
     """
 
     strategy_type: Literal["row_filter"] = "row_filter"
     variable_name: str
     variable_value: Union[str, int, float]
+    additional_filters: dict[str, Union[str, int, float]] = Field(default_factory=dict)
 
     def apply(
         self,
@@ -80,11 +82,17 @@ def apply(
             group_entities=group_entities,
             variable_name=self.variable_name,
             variable_value=self.variable_value,
+            additional_filters=self.additional_filters,
         )
 
     @property
     def cache_key(self) -> str:
-        return f"row_filter:{self.variable_name}={self.variable_value}"
+        filters = [
+            (self.variable_name, self.variable_value),
+            *self.additional_filters.items(),
+        ]
+        filter_key = ",".join(f"{name}={value}" for name, value in sorted(filters))
+        return f"row_filter:{filter_key}"
 
 
 class WeightReplacementStrategy(RegionScopingStrategy):
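
Note on the `cache_key` change above: the new key sorts the primary filter together with `additional_filters`, so the key should not depend on the order in which extra filters were supplied. The following is a minimal standalone sketch of that behavior, not the library class: `RowFilterSketch` is a hypothetical stand-in built on a stdlib dataclass instead of the pydantic model, and the district value `612` is an illustrative assumption (state FIPS 6 × 100 + district 12).

```python
from dataclasses import dataclass, field
from typing import Dict, Union

FilterValue = Union[str, int, float]


@dataclass(frozen=True)
class RowFilterSketch:
    """Hypothetical stand-in mirroring only the cache_key logic in the diff."""

    variable_name: str
    variable_value: FilterValue
    additional_filters: Dict[str, FilterValue] = field(default_factory=dict)

    @property
    def cache_key(self) -> str:
        # Combine the primary filter with any extra filters, then sort by
        # name so the key is stable regardless of insertion order.
        filters = [
            (self.variable_name, self.variable_value),
            *self.additional_filters.items(),
        ]
        filter_key = ",".join(f"{name}={value}" for name, value in sorted(filters))
        return f"row_filter:{filter_key}"


# A state-level filter with no extra filters.
state_only = RowFilterSketch("state_fips", 6)

# The same two extra filters supplied in different orders.
first = RowFilterSketch("state_fips", 6, {"b_var": 1, "a_var": 2})
second = RowFilterSketch("state_fips", 6, {"a_var": 2, "b_var": 1})

# A district filter combined with a state filter (illustrative values).
district = RowFilterSketch(
    "state_fips", 6, {"congressional_district_geoid": 612}
)

print(state_only.cache_key)
print(district.cache_key)
```

One thing that may be worth checking in review: `sorted(filters)` compares `(name, value)` tuples, so if `additional_filters` ever repeats the primary `variable_name` with a value of a different type (for example `6` and `"CA"`), the comparison would fall through to the values and raise a `TypeError`. Sorting by name only would avoid that, if such inputs are possible.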