
Panel pipeline: full plumbing for per-year snapshots (closes #345) #346

Draft
vahid-ahmadi wants to merge 11 commits into main from feat/panel-persistent-ids-345

Conversation

@vahid-ahmadi

@vahid-ahmadi vahid-ahmadi commented Apr 17, 2026

Summary

Lands all six steps of the panel-data plan in #345. No change to the default behaviour of the existing single-year pipeline — the new code is plumbing and new helpers that consumers opt into. No data sourcing and no modification of audited upload paths.

Step 1 — panel ID contract

  • policyengine_uk_data/utils/panel_ids.py: PANEL_ID_COLUMNS, get_panel_ids, assert_panel_id_consistency, classify_panel_ids + PanelIDTransition.
  • README.md section documenting that household_id, benunit_id, person_id are the panel keys.
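A minimal sketch of what the ID contract amounts to, assuming each table is reduced to a plain list of IDs. The function name, the dict-of-lists dataset shape and the error message are illustrative assumptions, not the code in panel_ids.py.

```python
# Hypothetical illustration of the panel ID contract; not the real helper.
PANEL_ID_COLUMNS = ("household_id", "benunit_id", "person_id")


def assert_panel_id_consistency_sketch(base: dict, other: dict) -> None:
    """Raise AssertionError if any panel key set differs between snapshots."""
    for column in PANEL_ID_COLUMNS:
        base_ids, other_ids = set(base[column]), set(other[column])
        if base_ids != other_ids:
            missing = sorted(base_ids - other_ids)
            added = sorted(other_ids - base_ids)
            raise AssertionError(
                f"{column}: missing {missing}, unexpected {added}"
            )


base = {
    "household_id": [1, 2],
    "benunit_id": [10, 20],
    "person_id": [100, 101, 200],
}
# Uprating changes monetary values, never the ID sets.
uprated = {column: list(ids) for column, ids in base.items()}
assert_panel_id_consistency_sketch(base, uprated)  # passes silently
```

Dropping or adding a single person_id in `uprated` would make the check raise, which is the invariant later steps rely on.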

Step 2 — create_yearly_snapshots helper

  • policyengine_uk_data/datasets/yearly_snapshots.py with create_yearly_snapshots(base, years, output_dir, *, filename_template=...).
  • Uprates per year, asserts panel IDs, saves enhanced_frs_<year>.h5.
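The per-year loop can be sketched as follows. This is a stand-in under stated assumptions: `uprate_stub` replaces the real uprating, JSON replaces the HDF5 save, and the `{year}` format-string syntax for the filename template is a guess at the convention rather than the helper's actual default.

```python
import json
import tempfile
from pathlib import Path


def uprate_stub(base: dict, year: int) -> dict:
    """Hypothetical stand-in for uprating: scales income, keeps IDs."""
    factor = 1.03 ** (year - base["year"])
    return {**base, "year": year, "income": [x * factor for x in base["income"]]}


def create_yearly_snapshots_sketch(base, years, output_dir, *, filename_template):
    """Write one uprated snapshot per year and return the written paths."""
    output_dir = Path(output_dir)
    output_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for year in years:
        snapshot = uprate_stub(base, year)
        # Panel-key check at save time: the ID list must be unchanged.
        assert snapshot["person_id"] == base["person_id"], f"IDs changed in {year}"
        path = output_dir / filename_template.format(year=year)
        path.write_text(json.dumps(snapshot))  # JSON stands in for the HDF5 save
        paths.append(path)
    return paths


base = {"year": 2023, "person_id": [100, 101], "income": [20_000.0, 35_000.0]}
with tempfile.TemporaryDirectory() as tmp:
    written = create_yearly_snapshots_sketch(
        base, [2024, 2025], tmp, filename_template="enhanced_frs_{year}.h5"
    )
    names = [p.name for p in written]
```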

Step 3 — demographic ageing

  • policyengine_uk_data/utils/demographic_ageing.py with age_dataset(base, years, *, seed, mortality_rates, fertility_rates).
  • Per-year loop: mortality (row removal) → fertility (fresh person_ids attached to mother's benunit + household) → age increment.
  • Deterministic via seed. Ships _PLACEHOLDER rates — explicitly not ONS data; real rates are follow-up work.
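The per-year loop above can be sketched in plain Python. Everything here is illustrative: persons are dicts, the rate tables are toy values rather than the shipped placeholders, and keeping newborns at age 0 in the year they are born is an assumption about the ordering.

```python
import random


def age_dataset_sketch(people, n_years, *, seed, mortality, fertility):
    """Toy ageing loop: mortality, then fertility, then age increment."""
    rng = random.Random(seed)                 # all randomness flows from the seed
    people = [dict(p) for p in people]        # never mutate the caller's rows
    next_id = max(p["person_id"] for p in people) + 1
    for _ in range(n_years):
        # 1. Mortality: rows sampled as dying are removed.
        survivors = [
            p for p in people if rng.random() >= mortality.get(p["age"], 0.0)
        ]
        # 2. Fertility: newborns get fresh IDs and the mother's benunit/household.
        babies = []
        for mother in survivors:
            is_female = mother["sex"] == "FEMALE"
            if is_female and rng.random() < fertility.get(mother["age"], 0.0):
                babies.append({
                    "person_id": next_id,
                    "age": 0,
                    "sex": rng.choice(["MALE", "FEMALE"]),
                    "benunit_id": mother["benunit_id"],
                    "household_id": mother["household_id"],
                })
                next_id += 1
        # 3. Age increment for survivors (assumption: newborns stay at age 0).
        for person in survivors:
            person["age"] += 1
        people = survivors + babies
    return people


people = [
    {"person_id": 1, "age": 30, "sex": "FEMALE", "benunit_id": 10, "household_id": 5},
    {"person_id": 2, "age": 80, "sex": "MALE", "benunit_id": 10, "household_id": 5},
]
aged = age_dataset_sketch(people, 1, seed=0, mortality={80: 1.0}, fertility={30: 1.0})
```

With these extreme toy rates the 80-year-old is always removed and the 30-year-old always has a child, so the outcome does not depend on the seed; with intermediate rates the same seed reproduces the same draw.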

Step 4 — year-aware calibration targets

  • Two real bug fixes: local_authorities/loss.py read household_weight at a hard-coded 2025; constituencies/loss.py used dataset.time_period instead of the time_period argument in two places. Both now honour the argument.
  • Extracts year-resolution into resolve_target_value(target, year, *, tolerance=3), names YEAR_FALLBACK_TOLERANCE = 3 as a constant, documents the policy (exact → nearest past year within tolerance → None; no backwards extrapolation; VOA population scaling preserved).
  • docs/targets_coverage.md documents year coverage across every target category and surfaces the real remaining gaps (DWP 2026+ forecasts, annual local-area CSV refreshes).
  • No new data sourced in this PR — plumbing + documentation only.
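The fallback policy (exact, then nearest past year within tolerance, then None) can be sketched as below. The real function takes a target object; this sketch assumes a plain `{year: value}` mapping, uses made-up values, and omits the VOA population scaling.

```python
YEAR_FALLBACK_TOLERANCE = 3


def resolve_target_value_sketch(values_by_year, year, *, tolerance=YEAR_FALLBACK_TOLERANCE):
    """Return the value for `year`, a recent past value, or None."""
    if year in values_by_year:
        return values_by_year[year]                      # exact match wins
    # Only look backwards in time: a later observation is never used
    # for an earlier year.
    past = [y for y in values_by_year if year - tolerance <= y < year]
    if past:
        return values_by_year[max(past)]                 # nearest past year
    return None                                          # outside tolerance


values = {2022: 1.0, 2025: 2.0}                          # illustrative numbers only
resolved = {y: resolve_target_value_sketch(values, y) for y in range(2021, 2030)}
```

In this toy mapping 2024 resolves to the 2022 value rather than the nearer 2025 one, and 2029 resolves to None because 2025 is more than three years back.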

Step 5 — cross-year weight smoothness

  • utils/calibrate.py gains compute_log_weight_smoothness_penalty (pure helper) plus two keyword-only kwargs on calibrate_local_areas: prior_weights and smoothness_penalty.
  • When both are supplied, the training loss is augmented with a log-space L2 penalty pulling the current weights towards the prior year's final weights. Zero-prior entries (households outside an area's country) are excluded from the mean.
  • Defaults reproduce pre-step-5 behaviour exactly. Shape mismatch raises a clear ValueError.
  • The penalty is computed from the underlying log-space weights, not the dropout-augmented tensor fed into the fit loss, so the regulariser does not double-count the dropout noise.
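A pure-Python sketch of the penalty described above, assuming it is the mean squared difference between the current log-weights and the log of the prior weights over entries with a positive prior. The real helper presumably operates on tensors inside the training loop; the function name and the zero return for an all-masked input are assumptions.

```python
import math


def log_weight_smoothness_penalty_sketch(log_weights, prior_weights):
    """Mean squared log-deviation from the prior, ignoring zero-prior entries."""
    if len(log_weights) != len(prior_weights):
        raise ValueError(
            f"shape mismatch: {len(log_weights)} log-weights "
            f"vs {len(prior_weights)} prior weights"
        )
    squared = [
        (lw - math.log(pw)) ** 2
        for lw, pw in zip(log_weights, prior_weights)
        if pw > 0                      # zero prior: household outside the area's country
    ]
    return sum(squared) / len(squared) if squared else 0.0


prior = [100.0, 100.0, 0.0]
same = log_weight_smoothness_penalty_sketch([math.log(100.0), math.log(100.0), 3.0], prior)
doubled = log_weight_smoothness_penalty_sketch([math.log(200.0), math.log(100.0), 3.0], prior)
halved = log_weight_smoothness_penalty_sketch([math.log(50.0), math.log(100.0), 3.0], prior)
```

Working in log space makes doubling and halving a weight equally costly. A household moving from 500 units to 50 contributes (ln 10)² ≈ 5.3 before averaging, so under this formulation a large coefficient would penalise such a jump heavily.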

Step 6 — downstream & consumer changes

  • storage/upload_yearly_snapshots.py — parallel uploader to upload_completed_datasets.py. Same private destination; the existing audited path is not modified. Destination constants are not exposed as function arguments — redirecting the upload requires a code edit reviewed under CLAUDE.md's data-protection rules.
  • tests/conftest.py — new enhanced_frs_for_year factory fixture that resolves enhanced_frs_<year>.h5 and falls back to the legacy enhanced_frs_2023_24.h5 for 2023. Skips (rather than errors) when the year's file is missing so partial panel builds run cleanly.
  • docs/panel_downstream.md — coordination note for the sibling policyengine-uk repo: runtime-uprating skip options, fixture migration pattern, default year set, out-of-scope items.
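The fixture's file resolution can be sketched as a plain function. Returning None stands in for the skip; the real fixture presumably calls pytest.skip at that point, and the function name here is hypothetical.

```python
import tempfile
from pathlib import Path

LEGACY_NAME = "enhanced_frs_2023_24.h5"


def resolve_snapshot_path(storage_folder, year):
    """Prefer enhanced_frs_<year>.h5; fall back to the legacy file for 2023."""
    year = int(year)                                  # accept "2023" as well as 2023
    folder = Path(storage_folder)
    candidate = folder / f"enhanced_frs_{year}.h5"
    if candidate.exists():
        return candidate
    legacy = folder / LEGACY_NAME
    if year == 2023 and legacy.exists():
        return legacy
    return None                                       # caller would skip the test


with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / LEGACY_NAME).touch()
    legacy_hit = resolve_snapshot_path(tmp, "2023").name
    missing_year = resolve_snapshot_path(tmp, 2024)
```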

Together these six pieces give a callable, tested pipeline for going from a single imputed base to a set of uprated + demographically-aged per-year snapshots with year-aware targets, smooth cross-year weights, a safe upload path, and test fixtures that consumers can use without rewrites.

What's not in this PR

Explicit non-goals, all deferred so this change stays reviewable:

  • Real ONS life tables and fertility rates — step 3 ships placeholders.
  • DWP 2026+ benefit caseloads, annual local-area CSV refreshes, NTS 2025+ — data-sourcing work, policy-dependent for DWP.
  • Marriage, separation, leaving-home, migration — step 3 extension.
  • Refactoring create_datasets.py:main() to invoke create_yearly_snapshots automatically — kept as a follow-up so that the change-of-default conversation happens in its own PR.
  • Changes to policyengine-uk — documented in docs/panel_downstream.md, left to the sibling repo.
  • Tuning the smoothness-penalty coefficient on real data — empirical question best answered against full loss matrices.

Test plan

All 80 new tests green locally; formatting and lint clean.

  • pytest policyengine_uk_data/tests/test_panel_ids.py — 10/10
  • pytest policyengine_uk_data/tests/test_yearly_snapshots.py — 7/7
  • pytest policyengine_uk_data/tests/test_demographic_ageing.py — 22/22
  • pytest policyengine_uk_data/tests/test_resolve_target_value.py — 12/12
  • pytest policyengine_uk_data/tests/test_smoothness_penalty.py — 10/10
  • pytest policyengine_uk_data/tests/test_calibrate_smoothness_integration.py — 5/5
  • pytest policyengine_uk_data/tests/test_upload_yearly_snapshots.py — 7/7
  • pytest policyengine_uk_data/tests/test_conftest_fixtures.py — 7/7

Coverage summary

Eight new test files, combining pure-function tests and integration harnesses for each subsystem. All utility/ageing/calibration/upload tests use small in-memory fixtures or monkeypatched dependencies so they run without any real FRS data or network access.

Notable coverage:

  • Upload safety: the private HF repo, HF repo type and GCS bucket are locked by test; the function signature is asserted to only allow years and storage_folder as kwargs (no way to redirect the upload).
  • No-partial-upload invariant: a missing file aborts the upload call before any network traffic.
  • Fixture factory: resolves both the legacy 2023_24 name and the new per-year name, prefers the new one when both exist, coerces str/int years.
  • Smoothness regulariser: gradient masking on zero-prior entries, log-space symmetry, large-penalty integration test that proves pull-towards-prior.
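The two upload-safety checks can be sketched with the standard library. The uploader below is a stub with a fake network layer, and its internals are assumptions; only the parameter names years and storage_folder come from this PR.

```python
import inspect
import tempfile
from pathlib import Path

UPLOAD_CALLS = []  # records what the fake network layer was asked to send


def _fake_upload(paths):
    UPLOAD_CALLS.append([p.name for p in paths])


def upload_yearly_snapshots_stub(years, storage_folder):
    """Stub uploader: validates every file before touching the network."""
    years = list(years)
    if not years:
        raise ValueError("years must not be empty")
    paths = [Path(storage_folder) / f"enhanced_frs_{int(y)}.h5" for y in years]
    missing = [p.name for p in paths if not p.exists()]
    if missing:
        # Abort before any upload so a partial panel is never published.
        raise FileNotFoundError(f"missing snapshots: {missing}")
    _fake_upload(paths)


# Signature lock: no destination parameter and no **kwargs escape hatch.
params = inspect.signature(upload_yearly_snapshots_stub).parameters
signature_is_locked = set(params) == {"years", "storage_folder"} and all(
    p.kind is not inspect.Parameter.VAR_KEYWORD for p in params.values()
)
```

Because the destination never appears in the signature, a test of this shape fails as soon as someone adds a repo or bucket argument.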

Files touched

policyengine_uk_data/utils/panel_ids.py                        (new + classify_panel_ids)
policyengine_uk_data/utils/demographic_ageing.py               (new)
policyengine_uk_data/utils/calibrate.py                        (+ helper + two kwargs)
policyengine_uk_data/datasets/yearly_snapshots.py              (new)
policyengine_uk_data/targets/build_loss_matrix.py              (extract resolve_target_value)
policyengine_uk_data/datasets/local_areas/constituencies/loss.py    (honour time_period)
policyengine_uk_data/datasets/local_areas/local_authorities/loss.py (honour time_period)
policyengine_uk_data/storage/upload_yearly_snapshots.py        (new)
policyengine_uk_data/tests/conftest.py                         (+ enhanced_frs_for_year)
policyengine_uk_data/tests/test_panel_ids.py                   (new, 10 tests)
policyengine_uk_data/tests/test_yearly_snapshots.py            (new,  7 tests)
policyengine_uk_data/tests/test_demographic_ageing.py          (new, 22 tests)
policyengine_uk_data/tests/test_resolve_target_value.py        (new, 12 tests)
policyengine_uk_data/tests/test_smoothness_penalty.py          (new, 10 tests)
policyengine_uk_data/tests/test_calibrate_smoothness_integration.py (new, 5 tests)
policyengine_uk_data/tests/test_upload_yearly_snapshots.py     (new,  7 tests)
policyengine_uk_data/tests/test_conftest_fixtures.py           (new,  7 tests)
docs/targets_coverage.md                                       (new)
docs/panel_downstream.md                                       (new)
README.md                                                      (new section)
changelog.d/345.md                                             (new)

Tracks and closes: #345.

vahid-ahmadi and others added 3 commits April 17, 2026 11:11
First step towards the per-year panel pipeline described in #345: document
that household_id, benunit_id and person_id are the panel keys that must
be preserved across yearly snapshots, and add a reusable
`assert_panel_id_consistency` utility so future year-loop code can enforce
the invariant at save time and in tests.

No behaviour change to the current single-year pipeline.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Unblocks the Lint check on #346.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds a standalone helper that takes an already-imputed base dataset and
produces one `enhanced_frs_<year>.h5` file per requested year by calling
`uprate_dataset` and saving. Every snapshot is verified against the base
with `assert_panel_id_consistency` at save time, so any future step that
mutates the person/benunit/household tables (e.g. demographic ageing in
step 3) cannot silently break the panel key contract.

Deliberately out of scope for this PR — tracked in #345:
- per-year calibration (needs year-specific targets, step 4)
- demographic ageing (step 3)
- restructuring `create_datasets.py:main()` to call this helper

The existing single-year pipeline is untouched; callers opt in to panel
output by invoking `create_yearly_snapshots` directly.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the title Add panel ID contract and consistency utility (step 1 of #345) Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345) Apr 17, 2026
Introduces `age_dataset(base, years, *, seed, mortality_rates, fertility_rates)`
— the minimum-viable demographic ageing described in the plan. Per year
step:

- every surviving person's `age` column is incremented,
- persons sampled as dying are removed,
- new babies are appended with fresh, non-colliding `person_id` values and
  attached to the mother's existing benefit unit and household.

Deterministic via the `seed` argument. Placeholder mortality and fertility
tables ship with the module so it runs end-to-end in tests — they are
explicitly named `_PLACEHOLDER` and are due to be replaced by real ONS
life tables and fertility rates in a follow-up.

Also extends `utils/panel_ids` with `classify_panel_ids(base, other)`
and a `PanelIDTransition` dataclass so tests and diagnostics can describe
which IDs survived, died or were born without tripping the strict
`assert_panel_id_consistency` check (which remains the right tool for
uprating-style transforms that must not change ID sets).
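As a rough illustration, the classification reduces to three set operations. The dataclass fields and function name below are hypothetical; the real `PanelIDTransition` may carry different attributes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PanelIDTransitionSketch:
    """Hypothetical shape: how an ID set moved between two snapshots."""
    survivors: frozenset
    deaths: frozenset
    births: frozenset


def classify_ids_sketch(base_ids, other_ids) -> PanelIDTransitionSketch:
    base_set, other_set = frozenset(base_ids), frozenset(other_ids)
    return PanelIDTransitionSketch(
        survivors=base_set & other_set,   # present in both years
        deaths=base_set - other_set,      # dropped since the base year
        births=other_set - base_set,      # new in the later year
    )


transition = classify_ids_sketch([100, 101, 200], [100, 200, 201])
```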

Out of scope, tracked in #345:
- real ONS life tables and fertility rates,
- marriage, separation and leaving-home dynamics,
- migration,
- integration into `create_yearly_snapshots` — callers chain `age_dataset`
  and `uprate_dataset` themselves for now.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the title Panel pipeline: ID contract + yearly snapshot helper (steps 1-2 of #345) Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345) Apr 17, 2026
Fixes two concrete bugs that would have prevented calibrating the same
base dataset at a year other than its stored `time_period`:

- `local_authorities/loss.py` read household weights at a hard-coded 2025
  when computing the national-total fallbacks used for LAs missing ONS
  data. Now uses the explicit `time_period` argument.
- `constituencies/loss.py` passed `dataset.time_period` to
  `get_national_income_projections` and `sim.default_calculation_period`
  even when the caller supplied a different `time_period`. Same fix.

Also extracts the year-resolution logic from `build_loss_matrix._resolve_value`
into a documented public function `resolve_target_value`, names the
three-year tolerance as a constant, and adds 12 unit tests covering the
fallback policy (exact match, nearest past year, tolerance limit, no
backwards extrapolation, VOA population scaling).

Ships `docs/targets_coverage.md` documenting year coverage across every
target category and where the real gaps are (DWP 2026+, local-area CSV
refreshes). No new data sourced in this PR — sourcing is deferred.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the title Panel pipeline: ID contract, yearly snapshots, demographic ageing (steps 1-3 of #345) Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345) Apr 17, 2026
Adds an opt-in log-space L2 penalty to the training loss in
`calibrate_local_areas` that pulls the optimised weights towards a prior
year's weights. This is the regulariser that makes a sequence of
per-year calibrations statistically coherent as a panel — without it,
the same household can represent, say, 500 units in 2024 and 50 in 2025.

Design choices:

- The penalty is factored out into a pure helper
  `compute_log_weight_smoothness_penalty(log_weights, prior_weights)`
  so it can be unit-tested thoroughly. Entries where the prior is zero
  (households outside an area's country) are excluded from the mean so
  they neither pull nor inflate the penalty.
- `calibrate_local_areas` gains two keyword-only kwargs, `prior_weights`
  and `smoothness_penalty`, both defaulting to values that reproduce the
  pre-step-5 training loop exactly.
- Shape mismatches raise a clear `ValueError` rather than failing
  deep inside the optimiser.
- The penalty is computed from the underlying log-space weights (not
  the dropout-augmented tensor fed into the fit loss) so the regulariser
  does not double-count the dropout noise.

Tests (15 new, all in two files):

- 10 unit tests on the helper covering zero-when-equal, quadratic
  scaling, masking of zero-prior entries, gradient masking, shape
  validation, symmetric log deviation, differentiability, dtype
  round-trip and a hand-computed heterogeneous case.
- 5 integration tests on `calibrate_local_areas` with a three-household
  fake dataset: default kwargs reproduce pre-step-5 behaviour, shape
  mismatch raises, `None` prior + penalty is a no-op, zero penalty +
  prior is a no-op, and a large penalty measurably pulls weights
  towards the prior versus a no-smoothness run.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the title Panel pipeline: IDs, yearly snapshots, ageing, year-aware targets (steps 1-4 of #345) Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345) Apr 17, 2026
@vahid-ahmadi vahid-ahmadi self-assigned this Apr 17, 2026
Final step of the plan in #345: the consumer-facing plumbing that lets
downstream code actually use per-year panel snapshots.

- `policyengine_uk_data/storage/upload_yearly_snapshots.py` — a
  deliberately parallel uploader to `upload_completed_datasets.py`.
  Pointed at the same private HuggingFace repo and GCS bucket; the
  destination constants are not exposed as function arguments, so
  redirecting the upload requires a code edit reviewed under
  CLAUDE.md's data-protection rules. The existing
  `upload_completed_datasets.py` is untouched.
- `policyengine_uk_data/tests/conftest.py` — adds `enhanced_frs_for_year`
  factory fixture. Resolves `enhanced_frs_<year>.h5` and falls back to
  the legacy `enhanced_frs_2023_24.h5` for the 2023 base year so existing
  tests keep passing without modification. Skips (rather than errors) if
  the requested year's file is missing.
- `docs/panel_downstream.md` — coordination note for sibling
  `policyengine-uk` repo: runtime-uprating skip options, fixture
  migration pattern, sensible default year set.

Tests (14 new):

- 7 on the uploader: pure path construction, iterable acceptance,
  empty-list rejection, missing-file rejection with no partial upload,
  upload-arguments lock to the private destination, destination
  constants locked to private repo, function signature does not allow
  redirect via kwargs.
- 7 on the fixture factory: resolves `enhanced_frs_<year>.h5`, skips
  cleanly when year is missing, falls back to legacy filename for 2023,
  prefers the new filename when both exist, accepts int and str years,
  existing `enhanced_frs` fixture still points at legacy name,
  `STORAGE_FOLDER` export is not accidentally shadowed.

Out of scope, flagged in `docs/panel_downstream.md`:

- Modifying `policyengine-uk` itself (separate repo).
- Changing the audited upload destinations.
- Actual decision on skip-vs-always-uprate at simulation time — the doc
  presents the two options and the tradeoffs.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@vahid-ahmadi vahid-ahmadi changed the title Panel pipeline: IDs, snapshots, ageing, year-aware targets, smoothness (steps 1-5 of #345) Panel pipeline: full plumbing for per-year snapshots (closes #345) Apr 17, 2026
CI's pandas returned the panel gender column as StringDtype rather than
the plain object dtype I saw locally. `numpy.ndarray.astype` only
accepts numpy-compatible dtypes, so passing `StringDtype` on line 186
raised TypeError: Cannot interpret '<StringDtype(...)>' as a data type.

Fix: use object arrays when building newborn rows and let pandas coerce
them back to the template's extension dtype during `pd.concat`. Same
pattern applied to other non-numeric template columns in
`_build_newborn_rows`. Adds a regression test that explicitly casts
`gender` to pandas StringDtype before calling `age_dataset`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
vahid-ahmadi and others added 3 commits April 20, 2026 12:43
Adds `targets/sources/ons_mortality.py`, which parses the UK National
Life Tables workbook (~600 KB, shipped alongside the other ONS xlsx
fixtures in `policyengine_uk_data/storage/`) into a long-format frame
keyed by period / sex / age / qx. Exposes `get_mortality_rates(year)`
returning `{sex: {age: qx}}` for the 3-year rolling period covering a
calendar year (with nearest-past fallback), plus a unisex helper.

Extends `age_dataset` in `utils/demographic_ageing.py` to accept the
sex-specific mapping shape in addition to the existing age-only
mapping. Detection is by key type, so the existing placeholder rates
and every current test continue to work unchanged.
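A sketch of key-type detection, assuming sex keys are strings and age keys are integers. The "MALE"/"FEMALE" labels, the function name and the choice to return the default when a sex is missing are all assumptions; the real fallback may differ.

```python
def lookup_mortality_sketch(rates, sex, age, default=0.0):
    """Read qx from either {age: qx} or {sex: {age: qx}}."""
    first_key = next(iter(rates), None)
    if isinstance(first_key, str):
        # Sex-specific shape: {sex: {age: qx}}
        return rates.get(sex, {}).get(age, default)
    # Age-only shape: {age: qx}
    return rates.get(age, default)


age_only = {80: 0.05}                                   # toy values, not ONS data
by_sex = {"MALE": {80: 0.06}, "FEMALE": {80: 0.04}}
male_80 = lookup_mortality_sketch(by_sex, "MALE", 80)
legacy_80 = lookup_mortality_sketch(age_only, "MALE", 80)
```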

Placeholder mortality rates are kept as the fallback default, but the
docstring now points callers at `get_mortality_rates` for real data.

Test coverage: 7 loader tests against a synthetic in-tree workbook
(period resolution, nearest-past fallback, unisex averaging, non-
period sheet filtering) plus 5 age_dataset integration tests
(backwards-compat, sex-specific kill/spare behaviour, missing-sex
fallback, missing-age default, real-rate shape sanity on a toy pop).
All 35 tests in the affected modules pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Adds `targets/sources/ons_fertility.py`, which parses Table 10 of the
ONS *Births in England and Wales: registrations* workbook (~840 KB,
shipped alongside the other ONS xlsx fixtures in
`policyengine_uk_data/storage/`). Exposes
`load_ons_fertility_rates()` returning a long-format frame keyed by
year / country / age_low / age_high / rate_per_1000, and
`get_fertility_rates(year, country=...)` returning a single-year
`{age: probability}` map that plugs straight into
`age_dataset(..., fertility_rates=...)`.

Handles the ONS band format:

- "Under 20" → ages 15-19 (conventional start of the fertility window).
- "20 to 24" ... "35 to 39" → ages 20-24 ... 35-39, uniform within band.
- "40 and over" → ages 40-44 only (5-year cap). Expanding an open band
  uniformly across the whole fertility window would otherwise overstate
  ASFR at ages 45+ by an order of magnitude, since the overwhelming
  majority of 40+ births happen at 40-44.
- Rates converted from births per 1,000 to per-woman-per-year
  probability.
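The band handling above can be sketched as a lookup from band label to single years of age. The rates in the example are made up for illustration and are not ONS figures; the "25 to 29" and "30 to 34" labels are inferred from the pattern, and the function name is hypothetical.

```python
BAND_TO_AGES = {
    "Under 20": range(15, 20),      # conventional start of the fertility window
    "20 to 24": range(20, 25),
    "25 to 29": range(25, 30),
    "30 to 34": range(30, 35),
    "35 to 39": range(35, 40),
    "40 and over": range(40, 45),   # open band capped at five years
}


def expand_fertility_bands(rates_per_1000):
    """Turn {band: births per 1,000 women} into {age: annual probability}."""
    return {
        age: rate / 1000.0          # per-1,000 rate to per-woman probability
        for band, rate in rates_per_1000.items()
        for age in BAND_TO_AGES[band]
    }


# Illustrative rates only.
by_age = expand_fertility_bands({"Under 20": 10.0, "30 to 34": 100.0, "40 and over": 16.0})
```

The resulting `{age: probability}` map has no entry at 45 or above, so a lookup with a zero default leaves those ages with no births.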

Year resolution: exact match preferred, nearest past year as fallback
(mirrors the mortality loader). Future years silently fall back to the
latest available; pre-1938 requests raise a clear KeyError.

Test coverage: 9 new loader/integration tests against a synthetic
in-tree workbook (year fallback, open-band cap, under-20 lower bound,
country filter, probability scaling, end-to-end age_dataset
integration). Zero network access in CI.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…345)

Introduces `utils/household_transitions.py` with five life-cycle
transitions that complement the existing mortality + fertility
mechanics:

- `apply_marriages`: pair single adults with an in-region opposite-sex
  partner (closest-age match), merge benunits and fold weights. Uses
  age × sex rates (ONS Marriage Statistics smoothed averages).
- `apply_separations`: split two-adult benunits with the couple's
  mean-age rate (ONS Divorce Statistics). Children attach to the
  mother by default; overridable. New benunit + household rows are
  minted for the mover and regions are preserved.
- `apply_children_leaving_home`: move adult dependents out of their
  parents' benunit + household. Handles both FRS shapes (dependent
  young adult on parents' benunit, or adult child with their own
  single benunit inside the parental household). Uses age-indexed
  rates (ONS LFS "Young adults living with parents").
- `apply_migration`: Poisson-distributed net inflow/outflow by age
  (ONS Long-Term International Migration estimates). Immigration
  clones donor rows at the same age; emigration randomly removes
  rows and cleans up orphaned benunit/household rows.
- `apply_employment_transitions`: rule-based placeholder for
  within-person labour-market moves — retirement at state-pension
  age, CPI-plus wage drift, and configurable job loss/gain rates
  with nearest-age income donor for gainers. Will be replaced with
  UKHLS-estimated rates in a follow-up.

All functions are pure (no mutation), deterministic given an explicit
RNG, and use only columns and ID shapes already present in the FRS /
pe-uk schema. `is_married`, `is_single`, `is_couple` etc. pick up the
changes automatically because they are derived from adult counts in
benunits.
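To illustrate the purity and determinism properties, here is a toy separation transition. It is a sketch only: benefit units are dicts with an `adults` list, a flat rate replaces the age-dependent one, and child and household handling are omitted.

```python
import random


def apply_separations_sketch(benunits, rate, rng):
    """Split two-adult benunits with probability `rate`; returns new rows."""
    result = []
    next_id = max(unit["benunit_id"] for unit in benunits) + 1
    for unit in benunits:
        adults = list(unit["adults"])               # copy: input is never mutated
        if len(adults) == 2 and rng.random() < rate:
            stayer, mover = adults
            result.append({"benunit_id": unit["benunit_id"], "adults": [stayer]})
            result.append({"benunit_id": next_id, "adults": [mover]})  # minted row
            next_id += 1
        else:
            result.append({"benunit_id": unit["benunit_id"], "adults": adults})
    return result


def is_couple(unit):
    """Derived flag: depends only on the adult count, as in the text above."""
    return len(unit["adults"]) == 2


benunits = [
    {"benunit_id": 1, "adults": [100, 101]},
    {"benunit_id": 2, "adults": [200]},
    {"benunit_id": 3, "adults": [300, 301]},
]
unchanged = apply_separations_sketch(benunits, 0.0, random.Random(0))
all_split = apply_separations_sketch(benunits, 1.0, random.Random(0))
```

Because `is_couple` is derived from the adult count, it flips for every split unit without any extra bookkeeping.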

44 new tests across the five transitions cover: rate-zero is a no-op,
rate-one produces maximal transitions, derived boolean flags flip
correctly, benunit/household rows stay consistent, deterministic
under a fixed seed, and the default rates produce sensible aggregates.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Development

Successfully merging this pull request may close these issues.

Generate panel data across years (synthetic panel on FRS 2023-24 base)