Bump us-data 1.73.0 → 1.78.2 + fix HF model/dataset repo detection #310

Closed
MaxGhenis wants to merge 1 commit into main from bump-us-data-1.78.2

@MaxGhenis
Contributor

Summary

Two changes:

  1. Bump us-data 1.73.0 → 1.78.2 using the refresh helper from #309 (Add release-bundle refresh helper + CLI wrapper).
  2. Fix HF model-vs-dataset repo detection in bundle._hf_dataset_sha256. PolicyEngine publishes country microdata under HF model repos (huggingface.co/policyengine/policyengine-us-data/...), not dataset repos, so the hardcoded /datasets/ path prefix returned 404. The refresh helper now tries the model URL first and falls back to /datasets/ on 404, so both shapes work.
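
The model-first, datasets-fallback lookup in change 2 can be sketched roughly as below. This is not the actual body of bundle._hf_dataset_sha256: the names `candidate_urls`, `http_get`, and `hf_file_sha256` are hypothetical, authentication is omitted, and the file is read into memory in one go, which a real helper would probably replace with streaming for a large .h5 file. The `resolve/{revision}/{filename}` URL shape is the standard Hugging Face download path.

```python
import hashlib
import urllib.error
import urllib.request
from typing import Callable, List, Optional

HF_BASE = "https://huggingface.co"


def candidate_urls(repo_id: str, filename: str, revision: str) -> List[str]:
    """Return the model-repo URL first, then the /datasets/ form."""
    return [
        f"{HF_BASE}/{repo_id}/resolve/{revision}/{filename}",
        f"{HF_BASE}/datasets/{repo_id}/resolve/{revision}/{filename}",
    ]


def http_get(url: str) -> Optional[bytes]:
    """Return the response body, or None on 404. Other HTTP errors propagate."""
    try:
        with urllib.request.urlopen(url) as response:
            return response.read()
    except urllib.error.HTTPError as err:
        if err.code == 404:
            return None
        raise


def hf_file_sha256(
    repo_id: str,
    filename: str,
    revision: str,
    fetch: Callable[[str], Optional[bytes]] = http_get,
) -> str:
    """Hash the first candidate URL that exists (hypothetical helper)."""
    for url in candidate_urls(repo_id, filename, revision):
        body = fetch(url)
        if body is not None:
            return hashlib.sha256(body).hexdigest()
    raise FileNotFoundError(
        f"{filename}@{revision} not found under model or dataset repo {repo_id}"
    )
```

Passing `fetch` as a parameter keeps the fallback order testable without network access: a stub that returns None for the model URL exercises the /datasets/ branch.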

What the refresh changed

| Field | Before | After |
| --- | --- | --- |
| data_package.version | 1.73.0 | 1.78.2 |
| certified_data_artifact.sha256 | 18cdc668… | 4e92b340… |
| certified_data_artifact.uri | …@1.73.0 | …@1.78.2 |
| certified_data_artifact.build_id | …-1.73.0 | …-1.78.2 |
| certification.data_build_id | …-1.73.0 | …-1.78.2 |
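
As an illustration of the field changes listed above, the retargeting could be sketched as follows. The nested dict layout is an assumption inferred from the dotted field names, `retarget_data_version` and `_swap_version_tail` are hypothetical names, and this is not the code in scripts/refresh_release_bundle.py.

```python
import copy


def _swap_version_tail(value: str, old: str, new: str) -> str:
    """Replace a trailing `old` version with `new`; refuse anything else."""
    if not value.endswith(old):
        raise ValueError(f"{value!r} does not end with version {old!r}")
    return value[: -len(old)] + new


def retarget_data_version(manifest: dict, old: str, new: str, new_sha256: str) -> dict:
    """Return a copy of `manifest` with every data-version field moved to `new`."""
    updated = copy.deepcopy(manifest)  # leave the caller's manifest untouched
    updated["data_package"]["version"] = new

    artifact = updated["certified_data_artifact"]
    artifact["sha256"] = new_sha256
    artifact["uri"] = _swap_version_tail(artifact["uri"], old, new)
    artifact["build_id"] = _swap_version_tail(artifact["build_id"], old, new)

    certification = updated["certification"]
    certification["data_build_id"] = _swap_version_tail(
        certification["data_build_id"], old, new
    )
    return updated


# Placeholder values only: the URI and hashes below are not the real ones.
before = {
    "data_package": {"version": "1.73.0"},
    "certified_data_artifact": {
        "sha256": "0" * 64,
        "uri": "hf://example-org/example-repo/data.h5@1.73.0",
        "build_id": "policyengine-us-data-1.73.0",
    },
    "certification": {"data_build_id": "policyengine-us-data-1.73.0"},
}
after = retarget_data_version(before, "1.73.0", "1.78.2", "f" * 64)
```

Refusing to rewrite a value that does not end in the old version is a deliberate choice in this sketch: a silent no-op would leave the manifest half-updated.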

us-data 1.78.2 is the latest HF tag on policyengine/policyengine-us-data as of today. (PyPI has 1.83.4, but HF tags stop at 1.78.2 — us-data publishes to both, with HF sometimes lagging.) Model pin stays at 1.653.3 (latest on PyPI and unchanged in the manifest).

Snapshot tests: unchanged

All 10 tests in tests/test_household_calculator_snapshot.py still pass. The household calculator uses policyengine_us.Simulation(situation=...), which synthesises a fresh simulation from the situation dict. It never touches enhanced_cps_2024.h5, so data-version bumps don't shift household-level numbers. No rebaseline needed.

Deferred: TRO regeneration

regenerate_trace_tro('us') raises DataReleaseManifestUnavailableError because the release_manifest.json isn't published at the 1.78.2 HF tag. The existing us.trace.tro.jsonld is stale by one data-version; worth filing upstream to have policyengine-us-data publish the manifest on future tags.

Test plan

  • pytest tests/test_bundle_refresh.py green (6/6)
  • pytest tests/test_household_calculator_snapshot.py green (10/10)
  • Full CI to confirm no downstream breakage

🤖 Generated with Claude Code

Two changes:

1. ``bundle._hf_dataset_sha256`` now tries the HF *model* URL before
   falling back to *datasets*. PolicyEngine publishes country
   microdata under model repos (``huggingface.co/policyengine/...``)
   not dataset repos (``huggingface.co/datasets/...``), so the
   old hardcoded ``/datasets/`` prefix always 404'd for us-data.
   Tests unchanged (mocks match both URL shapes via the same
   ``huggingface.co`` substring check).

2. Applied the first live refresh using
   ``scripts/refresh_release_bundle.py --country us --data-version 1.78.2``:
   - certified_data_artifact.version: 1.73.0 -> 1.78.2
   - certified_data_artifact.sha256: 18cdc668... -> 4e92b340...
   - data_build_id: policyengine-us-data-1.73.0 ->
     policyengine-us-data-1.78.2
   - URI revision tail retargeted to 1.78.2

us-data 1.78.2 is the latest tag on HF (PyPI has 1.83.4 but HF tags
stopped at 1.78.2 as of today). Model pin stays at 1.653.3 (latest
on both PyPI and the manifest; no change needed).

Snapshot tests unchanged: household calculator goes through
policyengine_us.Simulation(situation=...) which synthesises a fresh
sim from the situation dict, never touching ``enhanced_cps_2024.h5``,
so data-version bumps don't shift household-level numbers.

TRO regeneration deferred: the data_release_manifest.json isn't
published at the 1.78.2 HF tag, so
``regenerate_trace_tro('us')`` raises
DataReleaseManifestUnavailableError. Existing us.trace.tro.jsonld is
now stale by one data-version; worth filing upstream with us-data
to publish the manifest on future tags.
@MaxGhenis MaxGhenis force-pushed the bump-us-data-1.78.2 branch from 17e0c7c to cf68edc Compare April 20, 2026 17:43
MaxGhenis added a commit that referenced this pull request Apr 20, 2026
Subagent stress-test surfaced five scenarios the sketch handled
clumsily or not at all. Rewritten to:

- State upfront that the motivating pains (#310 HF repo-type bug,
  "is data stale?") do NOT require this architecture — they're
  solvable with a one-time HF repo-type fix + a 50-line CI job.
  The sketch is a "where do we want to be in a year?" question,
  not an immediate-fix proposal.

- Replace the single "Open questions" section with an explicit
  "Unresolved risks" section covering:
  * UK Data Service audit trail (today HF logs downloader identity;
    sketch loses this unless we explicitly gate manifests + log
    resolver hits)
  * Silent-promote attack (channel JSON has no signature; sketch
    is strictly weaker than PyPI/HF platform auth until
    channel-history signing ships)
  * Non-deterministic builds (today's Enhanced CPS pipeline uses
    torch+pandas imputation; v1 needs If-None-Match conditional
    writes or explicit first-writer-wins semantics)
  * Licence revocation vs immutability (tombstone build_ids with
    status=revoked, explicit licence-continuity qualifier on the
    replicability guarantee)
  * Cross-cloud replication (mirror story is payload-only; channels
    require proxy or consumer multi-mirror config)

- Revised cost estimate: the earlier "3 engineer-weeks" was ~3x
  optimistic. Realistic range is 8-12 engineer-weeks.
  Recategorised as v5 scope.

No change to the core three-concepts model (identity /
distribution / discovery separation). That part held up.
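
The "first-writer-wins" semantics named under non-deterministic builds might look like the sketch below. `InMemoryStore` and `put_if_absent` are hypothetical stand-ins for an object store that honours an `If-None-Match: *` precondition on writes; nothing here is taken from the design sketch itself.

```python
import threading
from typing import Dict


class InMemoryStore:
    """Toy object store whose writes succeed only if the key is absent."""

    def __init__(self) -> None:
        self._objects: Dict[str, bytes] = {}
        self._lock = threading.Lock()

    def put_if_absent(self, key: str, body: bytes) -> bool:
        """Store `body` under `key` unless the key exists; report whether we wrote.

        Mirrors an HTTP PUT with `If-None-Match: *`, where an existing
        object makes the write fail instead of being overwritten.
        """
        with self._lock:
            if key in self._objects:
                return False
            self._objects[key] = body
            return True

    def get(self, key: str) -> bytes:
        return self._objects[key]


store = InMemoryStore()
# Two runs of a non-deterministic build race to publish under one build_id.
first_wrote = store.put_if_absent("build-123", b"bytes from run A")
second_wrote = store.put_if_absent("build-123", b"bytes from run B")
```

In this sketch the second writer learns that it lost and can decide whether to discard its output or report the divergence, instead of silently replacing bytes that consumers may already have fetched.
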
MaxGhenis added a commit that referenced this pull request Apr 20, 2026
…-bundles

Codex review caught the main architectural overclaim in the earlier
sketch: it slid between recipe-addressed ("sha256 of inputs" — an
identifier derived from declared inputs) and content-addressed
("sha256 of output bytes" — an identifier derived from the bytes
themselves), and framed the whole design as a replacement for
release-bundles.md when release-bundles is load-bearing for the
scientific citation and certification surface.

Rewritten to:

- Scope the sketch explicitly to a storage substrate. release-bundles
  remains the authoritative citation + certification surface; the
  substrate is infrastructure underneath.

- Switch the primary identifier from `build_id = sha256(inputs)` to
  `artifact_sha256 = sha256(output bytes)`. Input digest becomes a
  derived queryable field in the manifest, not the primary key.
  This is how OCI/Nix actually work in the parts that deliver.

- Drop the `stable` and `lts-{quarter}` channel names. Their
  semantics for microdata are ambiguous (four meanings per codex:
  "latest official source vintage" vs "methodologically preferred
  reconstruction" vs "legally redistributable build" vs
  "paper-citation freeze"). Keep only `latest` (operational) and
  `next` (staging, feeds certification). Authoritative / stable
  stays on the release-bundle side.

- Drop claims of org-independent identity. `data_vintage:
  "cps_asec_2024"` is a label, not a raw-bytes hash; `built_at` /
  `built_by` break bitwise identity across orgs anyway. The current
  release-bundles schema records raw-input hashes, so regressing
  on that would be real.

- New section: "The release-bundle boundary (what doesn't change)"
  spelling out that certification, staged promotion, compatibility
  rules, `*.trace.tro.jsonld` sidecars, and the replicability
  guarantee all remain in release-bundles.md.

- Revised "whether to pursue" section leads with the honest
  conclusion: keep the storage idea, drop the "replace release
  bundles" framing, don't build it to fix #310 or "is our data
  stale?" (which have cheap targeted fixes), and revisit if the UK
  Data Service relationship gets stricter.

- Honest migration cost table (7-11 engineer-weeks, independent
  tracks), explicitly v5 scope.

Both review findings (general-purpose + codex) carried forward
under "Unresolved risks"; that section barely changed except that
"non-deterministic builds" is now actually *cleaner* under output-
hash identity — two runs produce two different sha256s, they don't
silently collide.
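
The distinction between the two identifiers can be made concrete with a small sketch. The function names, the manifest shape, and the `model_version` key are hypothetical; only `artifact_sha256`, the input digest as a derived field, and the `data_vintage` label come from the text above.

```python
import hashlib
import json


def input_digest(inputs: dict) -> str:
    """Recipe-addressed: hash a canonical JSON encoding of the declared inputs."""
    canonical = json.dumps(inputs, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def artifact_sha256(output_bytes: bytes) -> str:
    """Content-addressed: hash the bytes the build actually produced."""
    return hashlib.sha256(output_bytes).hexdigest()


def build_manifest(inputs: dict, output_bytes: bytes) -> dict:
    """Primary key is the output hash; the input digest is a queryable field."""
    return {
        "artifact_sha256": artifact_sha256(output_bytes),
        "input_digest": input_digest(inputs),
    }


inputs = {"data_vintage": "cps_asec_2024", "model_version": "1.653.3"}
# Two runs of a non-deterministic pipeline: same recipe, different bytes.
run_a = build_manifest(inputs, b"imputed output, run A")
run_b = build_manifest(inputs, b"imputed output, run B")
```

Here `run_a` and `run_b` share an `input_digest` but carry different `artifact_sha256` values, which is the sense in which two runs no longer collide under one identifier.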

Structure now: scope / motivating pains (and what's actually on the
critical path) / what the substrate provides / output-hash identity
/ narrow channel semantics / release-bundle boundary preservation /
consumer resolver changes / unresolved risks / what this fixes
vs. what it doesn't / honest cost / whether to pursue / open
questions.
@MaxGhenis MaxGhenis closed this in 4e7a602 Apr 20, 2026