Add ACA marketplace bronze-selection target ETL #618
Conversation
baogorek
left a comment
Hi @daphnehanse11, I'm going to let my Claude do the talking below, but the short of it is that there's a lot to do. I think Codex went for the quick win, and there's just not a quick win here.
The CMS data sourcing is thorough and the underlying goal of decomposing PTC into used vs. unused makes sense. However, I think the approach needs to be restructured. The matrix builder should stay generic and not contain variable-specific logic, and the variables you're deriving don't yet exist in the places they need to for calibration to actually work.
Here's the full path I'd suggest, roughly in dependency order:
1. policyengine-us: Add used_aca_ptc, unused_aca_ptc, and selects_bronze_marketplace_plan as real calculated variables with formulas and parameters. The
state-level bronze selection probabilities and price ratios from your CMS data become parameters there. Everything downstream depends on these existing
first.
2. ETL scripts (policy_data.db): Derive state-level calibration targets (e.g., total used PTC by state) from the CMS data and load them into the targets
database. That's where calibration targets live now.
3. enhanced_cps.py: Wire up the bronze plan selection so the legacy calibration pipeline has access to the new variables.
4. target_config.yaml: Add the new variable names so the unified matrix builder picks them up — no code changes to the builder itself, just config.
With this approach, the matrix builder never needs to know what these variables are. It just sees new names in the config and new rows in the database, same
as any other target.
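For step 4, the config change might look something like this (purely illustrative: the actual schema in target_config.yaml may differ, so follow whatever entry format it already uses):

```yaml
# Hypothetical entries; the builder stays generic and just sees new names.
targets:
  - variable: used_aca_ptc
    geography: state
  - variable: selects_bronze_marketplace_plan
    geography: state
```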
I'd suggest starting with step 1 since everything else depends on it.
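To make step 1 concrete, here is a minimal sketch of how the used/unused decomposition could work, assuming bronze selectors face a premium of roughly price ratio times the benchmark. Function and parameter names are illustrative, not the actual policyengine-us API:

```python
import numpy as np

def decompose_ptc(aca_ptc, benchmark_premium, selects_bronze, bronze_price_ratio):
    """Split the premium tax credit into used vs. unused portions.

    Illustrative logic only: a household that selects a bronze plan pays
    roughly bronze_price_ratio * benchmark_premium, so the PTC it can
    actually use is capped at that premium; the remainder goes unused.
    """
    plan_premium = np.where(
        selects_bronze,
        bronze_price_ratio * benchmark_premium,
        benchmark_premium,
    )
    used = np.minimum(aca_ptc, plan_premium)
    unused = aca_ptc - used
    return used, unused

used, unused = decompose_ptc(
    aca_ptc=np.array([6_000.0, 6_000.0]),
    benchmark_premium=np.array([5_000.0, 5_000.0]),
    selects_bronze=np.array([True, False]),
    bronze_price_ratio=0.7,
)
# Bronze household: used 3_500, unused 2_500; benchmark household: used 5_000.
```

The state-level `bronze_price_ratio` would come from the CMS-derived parameters described above.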
New "under construction" node type (amber dashed) for showing pipeline changes that are actively being developed.

US:
- PR #611: Pipeline orchestrator in Overview (Modal hardening)
- PR #540: Category takeup rerandomization in Stage 2; extracted puf_impute.py + source_impute.py modules in Stage 4
- PR #618: CMS marketplace data + plan selection in Stage 5

UK:
- PR #291: New Stage 9 — OA calibration pipeline (6 phases)
- PR #296: New Stage 10 — Adversarial weight regularisation
- PR #279: Modal GPU calibration nodes in Stages 6, 7, Overview

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
baogorek
left a comment
@daphnehanse11 I'm requesting that this PR be refocused to the targets ETL and perhaps the ECPS logic. Please note that this current PR will not affect the ECPS because it's not touching either loss.py or enhanced_cps.py. I don't think your coding agent was able to pick up on the two distinct paths.
I cannot approve the changes in unified_matrix_builder.py or publish_local_area.py, and I recommend that they be removed from the PR. Hard-coded variables in the matrix builder are what made the junkyard the junkyard. We need to do everything humanly (or codexly) possible to never, ever hard-code a variable in unified_matrix_builder.py.
It is possible that publish_local_area.py will need a small modification before this works in local area calibration. Once these targets are in, we can start building models locally and test out the changes. So, I really think this needs to be a two part process.
So if you want the ECPS to be improved, which will get you a benefit now, there needs to be a separate editing of loss.py or enhanced_cps.py in this PR. In that case, some CSVs are acceptable in the storage/calibration folder. If you only want better local area h5 calibration, then there should not be CSVs at all, with the exception of sources that are not available for download online (like our national "Tips" target). Please see etl_medicaid.py for reference.
Note: the meaning of "ETL" is
E: Extract from the original source
T: Transform the data
L: Load the data into the database.
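As a toy illustration of that pattern (table, column, and file contents here are invented for illustration, not the actual schema or data etl_medicaid.py uses):

```python
import csv
import io
import sqlite3

# E: extract — the real ETLs download the source CSV; here it's inlined.
RAW = """state_fips,bronze_aptc_consumers,total_aptc_consumers
01,120,300
02,50,200
"""

def extract(text):
    return list(csv.DictReader(io.StringIO(text)))

# T: transform — derive the target metric, validating the source data.
def transform(rows):
    targets = []
    for row in rows:
        bronze = int(row["bronze_aptc_consumers"])
        total = int(row["total_aptc_consumers"])
        if bronze > total:
            raise ValueError(f"bronze > total for state {row['state_fips']}")
        targets.append((row["state_fips"], bronze / total))
    return targets

# L: load — insert the derived targets into the targets database.
def load(conn, targets):
    conn.execute(
        "CREATE TABLE IF NOT EXISTS targets (state_fips TEXT, bronze_share REAL)"
    )
    conn.executemany("INSERT INTO targets VALUES (?, ?)", targets)

conn = sqlite3.connect(":memory:")
load(conn, transform(extract(RAW)))
```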
Forgive me for being tough on this PR: the target sourcing is excellent work. There is just a lot of risk in modifying some of these files.
baogorek
left a comment
Please check the comments, but there are no blockers here.
baogorek
left a comment
Oh @daphnehanse11 , I forgot! You have to put your etl python file in the Makefile so that make database will run it!
When I did this and tried to run it, I did get an error:
```
RuntimeError: ACA marketplace ETL requires policyengine-us variables that are not available in the current environment: selected_marketplace_plan_benchmark_ratio, used_aca_ptc
make: *** [Makefile:81: database] Error 1
```
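That error presumably comes from a fail-fast guard at the top of the ETL. A minimal sketch of the pattern (the actual check in etl_aca_marketplace.py may differ; the names below are stand-ins):

```python
def assert_required_variables(available, required):
    """Raise early if the environment lacks variables the ETL depends on."""
    missing = sorted(set(required) - set(available))
    if missing:
        raise RuntimeError(
            "ACA marketplace ETL requires policyengine-us variables that are "
            "not available in the current environment: " + ", ".join(missing)
        )

# Stand-in for the set of variables the installed policyengine-us exposes:
available = {"aca_ptc", "used_aca_ptc"}
required = ["selected_marketplace_plan_benchmark_ratio", "used_aca_ptc"]
```

Failing at ETL time, rather than letting the target silently drop out of calibration, is what surfaces the missing-variable problem in `make database`.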
Addressed the latest follow-up as well:
Also pushed the inline documentation/comment updates requested on the ETL file.
baogorek
left a comment
@daphnehanse11 thank you for addressing my comments. Everything looks good and make database runs fine locally for me. I understand the remaining issue is a problem with the uv.lock, which should be easy to fix.
The next step will be to try calibrating in an actual model run! You'll do that by editing the target config. I would suggest that you (or someone else) fit the model once locally to ensure it doesn't take down calibration.
Please rebase from upstream/main (there are conflicts) and fix the uv.lock issue flagged in the last review.
Status check from a rebase attempt against current main: the integration-tests failure (the red job) was a pre-existing Census API flake unrelated to this PR. The rebase conflicts involve the unified-matrix-builder additions.
If the intent of the refocused PR is ETL-only, those unified-matrix-builder additions should come out during the rebase. If they should stay, the PR description needs updating. Your call — I didn't force-push the rebase since it's an editorial decision on your work. Related follow-up: #800 (per-household …)
Force-pushed eb1d7bf to 6ad2911.
Rebased this PR onto current main. The diff is now ETL-only again:
I rebuilt the branch from a fresh …
Validation on the rebased branch:
Review fixes from the standing review of PR 618:

1. P0 bug: bronze stratum constraints were inserted in the order ``state_fips, used_aca_ptc, selected_marketplace_plan_benchmark_ratio``, for which SQLite's ``GROUP_CONCAT(DISTINCT ...)`` preserves insertion order. That produced ``domain_variable = "used_aca_ptc,selected_marketplace_plan_benchmark_ratio"``, but ``target_config.yaml:68`` expects the alphabetical form ``selected_marketplace_plan_benchmark_ratio,used_aca_ptc``. The rule didn't match, so the bronze target silently dropped out of the loss. Reorder the inserts and add a comment explaining why order matters.
2. Delete the now-dead ``etl_aca_agi_state_targets.py`` — it still used ``source="CMS Marketplace"`` (rejected by ``create_field_valid_values``) and the Makefile no longer invokes it. Redirect ``tests/integration/test_database_build.py`` to the new ``etl_aca_marketplace.py``.
3. Add a ValueError guard for corrupt source data (bronze APTC consumers exceeding total APTC consumers for any state).
4. Add the CMS Marketplace PUF URL to the ETL extract docstring so the input CSV is actually refetchable.
5. Expand the unit test file: add a real-CSV regression test (expects 27+ HC.gov states with bronze ≤ total and no SBM states leaking in) and a negative test for the new ValueError.
@codex please re-review. Addressed all prior feedback in commit 8fd8990:
Upstream variable landed in #801 (today), so the bronze stratum is now actually evaluable against current microdata.
… test

Codex review on 8fd8990 found two issues:

1. ``tests/integration/test_database_build.py::test_state_aca_and_agi_targets_loaded`` still asserted legacy ``aca_ptc`` / ``person_count`` / ``adjusted_gross_income`` state targets that the deleted ``etl_aca_agi_state_targets.py`` used to load, so it would fail against the rebuilt DB. Rename and rewrite it as ``test_state_marketplace_targets_loaded``, which asserts the new APTC and bronze-selection targets land with the canonical alphabetical ``domain_variable`` strings.
2. The previous constraint-insertion-order workaround relied on SQLite's ``GROUP_CONCAT(DISTINCT ...)`` preserving insertion order, which is undocumented. Add ``ORDER BY`` to the ``domain_variable`` aggregation in the ``stratum_domain`` view so the canonical form is enforced at the view level, regardless of how callers insert constraints. Drop the now-obsolete ordering comment in ``etl_aca_marketplace.py``.
…ity)

The prior ``GROUP_CONCAT(DISTINCT ... ORDER BY ...)`` form requires SQLite >= 3.44 and failed on the Modal integration runner with ``sqlite3.OperationalError: near "ORDER": syntax error``. Replace it with a correlated subquery that selects distinct constraint names ordered alphabetically and then concatenates them without an inner ORDER BY. This works on all supported SQLite versions and still produces the canonical form (e.g. ``selected_marketplace_plan_benchmark_ratio,used_aca_ptc``) regardless of constraint insertion order.

Verified by running the real view against in-memory SQLite with non-alphabetical insert order; the result matches the expected canonical string.
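A self-contained sketch of the version-portable idea (table and column names are guesses from this thread, not necessarily the real schema): ordering happens inside a DISTINCT subquery, and the aggregate concatenates the already-ordered rows, so no ``GROUP_CONCAT(... ORDER BY ...)`` syntax is needed.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE stratum_constraint (stratum_id INTEGER, constraint_variable TEXT)"
)
# Insert in non-alphabetical order, with a duplicate, to show the query
# still yields the canonical alphabetical form.
conn.executemany(
    "INSERT INTO stratum_constraint VALUES (?, ?)",
    [
        (1, "used_aca_ptc"),
        (1, "used_aca_ptc"),
        (1, "selected_marketplace_plan_benchmark_ratio"),
    ],
)
# GROUP_CONCAT(DISTINCT ... ORDER BY ...) needs SQLite >= 3.44; instead,
# deduplicate and sort in a subquery, then concatenate its ordered output.
domain_variable = conn.execute(
    """
    SELECT GROUP_CONCAT(v) FROM (
        SELECT DISTINCT constraint_variable AS v
        FROM stratum_constraint
        WHERE stratum_id = ?
        ORDER BY v
    )
    """,
    (1,),
).fetchone()[0]
```

Here `domain_variable` comes out as the canonical alphabetical string regardless of insertion order, matching what `target_config.yaml` expects.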
The deletion in 8fd8990 was too aggressive. That file loaded three distinct target families into the calibration DB:

1. state-level ``aca_ptc`` spending targets (sourced from ``aca_spending_and_enrollment_2024.csv``)
2. state-level ``person_count`` enrollment targets (same source)
3. state-level AGI bracket targets (sourced from ``agi_state.csv``)

This PR adds *new* marketplace APTC-count and bronze-count targets but does not replace the ACA spending/enrollment or AGI targets. Without them the calibrator has nothing to pin state-level ACA PTC spending to, and ``test_aca_calibration`` / ``test_sparse_aca_calibration`` fail with >500% state deviations.

Restore the file verbatim from the pre-deletion state, keep the ``CMS Marketplace`` source string it uses (re-added to the ``create_field_valid_values`` allowlist alongside the newer ``CMS 2024 OEP state metal status PUF`` source the marketplace ETL uses), re-add the Makefile invocation, and put its entry back in the integration-build script list ahead of the marketplace ETL. Keep the new ``test_state_marketplace_targets_loaded`` as a peer to the restored ``test_state_aca_and_agi_targets_loaded``.

The long-term cleanup (migrating the spending/enrollment targets into the marketplace ETL or deprecating them) is a follow-up.
Summary
This PR has been refocused to the ACA marketplace targets ETL path.
It now:
- Adds `policyengine_us_data/db/etl_aca_marketplace.py` to transform 2024 CMS state metal status data into state-level ACA marketplace targets
- Adds `policyengine_us_data/storage/calibration_targets/aca_marketplace_state_metal_selection_2024.csv`

Why
This follows review feedback to keep `unified_matrix_builder.py` generic and to avoid `publish_local_area.py` / calibration-plumbing changes in this PR.

Review fixes (commit 8fd8990)
- P0 bug: bronze stratum constraints were inserted in the order `state_fips, used_aca_ptc, selected_marketplace_plan_benchmark_ratio`, for which SQLite's `GROUP_CONCAT(DISTINCT ...)` preserves insertion order. That produced `domain_variable = "used_aca_ptc,selected_marketplace_plan_benchmark_ratio"`, but `target_config.yaml:68` expects alphabetical `selected_marketplace_plan_benchmark_ratio,used_aca_ptc`. The rule didn't match, so the bronze target silently dropped out of the loss. Now reordered and documented.
- Deleted the dead `etl_aca_agi_state_targets.py` (it still used the banned `source="CMS Marketplace"`, and the Makefile had already dropped its invocation); retargeted `tests/integration/test_database_build.py` to the new ETL.
- Added a `ValueError` guard for corrupt source data (bronze > total).

Upstream variable now exists
When this PR was opened, the `selected_marketplace_plan_benchmark_ratio` variable was not yet populated on CPS data. #801 (merged 2026-04-20) now imputes it from reported premiums, so the bronze stratum is actually evaluable against current microdata.
This PR does not:
- touch `unified_matrix_builder.py`
- touch `publish_local_area.py`
- modify `enhanced_cps.py` or `loss.py`

Those downstream pieces can follow separately.
Validation
- `pytest tests/unit/test_aca_marketplace_targets.py -v` — 3 tests pass (original happy path, real-CSV regression, negative bronze-exceeds-total case)
- `ruff format --check` and `ruff check` clean on touched files