From 39e49a25cc0d26e0124eecc8ff4f586443d15125 Mon Sep 17 00:00:00 2001
From: Max Ghenis
Date: Tue, 21 Apr 2026 11:35:21 -0400
Subject: [PATCH] Clarify SIPP is public-use; only IRS-PUF is access-restricted

John Sabelhaus corrected a licensing overclaim in the 2026-04-21
meeting: the SIPP vintage we consume (Census public-use SIPP) has no
per-user license, data-use agreement, or registration requirement. Of
the six upstream sources the pipeline ingests (CPS, ACS, SCF, ORG,
SIPP, IRS-PUF), only IRS-PUF has a genuine access restriction. The
HuggingFace mirror of pu2023.csv is a caching convenience, not an
access-restriction workaround.

This matters for TRACE / reproducibility writeups: overstating which
inputs are restricted distorts the institutional-certification story.

Fixes #808.

Co-Authored-By: Claude Opus 4.7 (1M context)
---
 .../correct-sipp-licensing-language.fixed.md  |  1 +
 policyengine_us_data/datasets/sipp/README.md  | 19 +++++++++++++++++++
 2 files changed, 20 insertions(+)
 create mode 100644 changelog.d/correct-sipp-licensing-language.fixed.md

diff --git a/changelog.d/correct-sipp-licensing-language.fixed.md b/changelog.d/correct-sipp-licensing-language.fixed.md
new file mode 100644
index 000000000..5859e0020
--- /dev/null
+++ b/changelog.d/correct-sipp-licensing-language.fixed.md
@@ -0,0 +1 @@
+Clarified SIPP licensing language in `policyengine_us_data/datasets/sipp/README.md`: SIPP public-use data is unrestricted (no per-user license, agreement, or registration). Of the six upstream microdata sources the Enhanced CPS pipeline ingests (CPS, ACS, SCF, ORG, SIPP, IRS-PUF), only IRS-PUF has a genuine access restriction. Fixes #808.
diff --git a/policyengine_us_data/datasets/sipp/README.md b/policyengine_us_data/datasets/sipp/README.md
index 39ba48825..c30316ae7 100644
--- a/policyengine_us_data/datasets/sipp/README.md
+++ b/policyengine_us_data/datasets/sipp/README.md
@@ -39,3 +39,22 @@ The raw SIPP CSVs (`pu2023.csv` and the slim variant `pu2023_slim.csv`) are
 mirrored on the `PolicyEngine/policyengine-us-data` HuggingFace model repo and
 downloaded on demand when a training run is needed. They are not vendored in
 this Git repository.
+
+## Licensing
+
+SIPP public-use files are, as the name implies, **public-use data** — no
+per-user license, data-use agreement, or registration is required to
+download or redistribute them. We mirror them on our HuggingFace model
+repo purely as a caching convenience (Census's own hosting is slow and
+occasionally unavailable), not to work around any access restriction.
+
+This matters because PolicyEngine's enhanced CPS pipeline ingests several
+different upstream microdata sources, and only **one** of them —
+**IRS Public Use File (PUF)** — has any genuine access restriction. PUF
+requires agreeing to IRS's terms of use before download, even though the
+file is itself intended for public release. CPS, ACS, SCF, ORG, and SIPP
+are all unrestricted public-use. If you are writing about the pipeline's
+licensing posture (for a paper, replication packet, or TRACE TRO), only
+IRS-PUF should appear in the restricted column.
+
+See issue #808 for the background on this correction.