62 changes: 57 additions & 5 deletions .github/CONTRIBUTING.md
# Contributing to policyengine-us-data

See the [shared PolicyEngine contribution guide](https://github.com/PolicyEngine/.github/blob/main/CONTRIBUTING.md) for cross-repo conventions (towncrier changelog fragments, `uv run`, PR description format, anti-patterns). This file covers policyengine-us-data specifics.

## Commands

```bash
make install # install deps (uv)
make format # format (required)
make test-unit # unit tests (synthetic / mocked, seconds)
make test-integration # integration tests (need built H5 datasets)
make test # both
make data # full dataset build (long)
make push-pr-branch # push to upstream with correct tracking (use before opening PRs)
uv run pytest tests/unit/datasets/ -v  # run a targeted test subset
```

Python 3.12–3.14. Default branch: `main`.

## Test organisation

- `tests/unit/` — self-contained (synthetic data, mocks, checked-in fixtures). Run in seconds with no external deps.
- `unit/datasets/` — dataset code
- `unit/calibration/` — calibration code
- `tests/integration/` — requires built H5 datasets, HuggingFace downloads, `Microsimulation` objects, or DB ETL. Named after the dataset under test (e.g. `test_cps.py` tests `cps_2024.h5`).

**Placement rules:**

- **Never** put tests that need H5 files or `Microsimulation` in `unit/`.
- **Never** put synthetic-only tests in `integration/`.
- Sanity checks (value ranges, population counts) go in the per-dataset integration file, not a separate sanity file.
- When adding an integration test, extend the existing per-dataset file if one exists.
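As a sketch of what belongs in `tests/unit/`, here is a fully self-contained test against synthetic data. The helper `top_code_ages` and its behaviour are illustrative, not a real repo API:

```python
# Hypothetical unit test: everything is synthetic, so it runs in
# milliseconds with no H5 files, downloads, or Microsimulation objects.
import numpy as np

def top_code_ages(ages, cap=85):
    # Illustrative dataset helper: top-code ages at a cap,
    # as survey-based datasets like the CPS commonly do.
    return np.minimum(ages, cap)

def test_top_code_ages():
    synthetic = np.array([12, 40, 90, 110])  # synthetic input, no fixtures
    result = top_code_ages(synthetic)
    assert result.max() <= 85                    # nothing above the cap
    assert (result[:2] == synthetic[:2]).all()   # values below the cap untouched

test_top_code_ages()
```

The same test would belong in `tests/integration/` the moment it loaded a built `cps_2024.h5` or constructed a `Microsimulation`.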

## Updating datasets

If your change is a non-bugfix update to a cloud-hosted dataset (CPS, enhanced CPS, PUF), bump both the filename and URL in the class definition and in `storage/upload_completed_datasets.py`. That lets us store historical dataset versions separately and reproducibly.
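Concretely, a bump of this kind touches the versioned name in two places at once. The class, attribute names, and version strings below are purely illustrative, not the repo's actual definitions:

```python
# Hypothetical sketch of a dataset class after a version bump.
# Bump the filename and the URL together so they always match;
# the same new name must also go into storage/upload_completed_datasets.py.
class EnhancedCPS_2024:
    # was: enhanced_cps_2024_v1_0_0.h5
    file_path = "enhanced_cps_2024_v1_1_0.h5"
    url = "hf://policyengine/policyengine-us-data/enhanced_cps_2024_v1_1_0.h5"
```

Because the old filename is never reused, every historical version remains downloadable and builds stay reproducible.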

## Opening PRs

**Always create branches on the upstream repo, not a fork.** Fork PRs can't access workflow secrets and will fail on data-download steps. The convenience target:

```bash
make push-pr-branch
```

pushes the current branch to `upstream` with the correct tracking so `gh pr create` just works.

## Repo-specific anti-patterns

- **Never fabricate data or results.** This is a research codebase; reproducible aggregates only. Use `[TO BE CALCULATED]` placeholders if a number isn't computed yet.
- **Don't** open PRs from personal forks (CI will fail on secrets).
- **Don't** add `[codex]` or other agent-label prefixes to PR titles.
- **Don't** skip full-build CI when touching the imputation or calibration pipeline.
- **Don't** commit large binary artefacts — HuggingFace storage only.

## CI workflows

Five workflow files in `.github/workflows/`:

- `pr.yaml` — fork check, lint, uv.lock freshness, towncrier fragment check, unit tests, smoke test, docs build. Integration tests trigger when files in `policyengine_us_data/`, `modal_app/`, or `tests/integration/` change. ~2–3 min for the unit path.
- `push.yaml` — on push to main: either version-bump + PyPI publish (on `Update package version` commits), or a full Modal data build with integration tests (on everything else).
- `pipeline.yaml` — dispatch only, spawns the H5 generation pipeline on Modal with configurable GPU/epochs/workers.
- `local_area_publish.yaml` / `local_area_promote.yaml` — manual dispatch to build/stage and then promote local-area H5 files.
1 change: 1 addition & 0 deletions changelog.d/migrate-to-towncrier.changed.md
Migrate the changelog tooling from `yaml-changelog` (`changelog_entry.yaml` + `changelog.yaml` + `build-changelog`) to towncrier (`changelog.d/<branch>.<type>.md` fragments). The repo's CI already ran `towncrier check` in `pr.yaml` and `bump_version.py` already read fragments from `changelog.d/`; this drops the leftover yaml-changelog artefacts (unused dep, unused reusable workflow, zero-byte `changelog_entry.yaml`, and duplicated `changelog.yaml` whose contents are already in `CHANGELOG.md`) so the tooling story matches the rest of the org.