Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich by igerber · Pull Request #450 · igerber/diff-diff

igerber · 2026-05-16T10:22:17Z

Summary

Wave 2 of multi-wave tech-debt paydown. Tier A row in the post-Wave-1 backlog. Converts the CR2 Bell-McCaffrey parity test from a "Python matches Python" self-reference regression test into a real "Python matches R clubSandwich" parity test.

benchmarks/R/generate_clubsandwich_golden.R: replace the broken Wald_test(..., test="Satterthwaite") block with coef_test(fit, vcov=vcov_cr2)$df_Satt. clubSandwich 0.7+ removed "Satterthwaite" from the Wald_test test-name set; the df_Satt column from coef_test() is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and is numerically identical to the old per-unit-contrast path. Also drops the stale readr requirement from the header (never library(readr)'d).
benchmarks/data/clubsandwich_cr2_golden.json: regenerated end-to-end. meta.source flips python_self_reference → clubSandwich; also captures clubSandwich_version (0.7.0), R_version (4.5.2), and generated_at timestamp for forensic traceability.
TODO.md: remove the Methodology/Correctness table row AND the Tier A bullet — both reference the now-resolved regen.

No diff_diff/*.py files touched — Python was already correct (verified at machine precision; see parity table below).

Parity demonstration

Python _compute_cr2_bm vs clubSandwich vcovCR(..., "CR2") + coef_test()$df_Satt on the new dataset — max abs Δ per dataset × field:

Dataset	`coef`	`vcov_cr2`	`dof_bm`
`balanced_small`	2.8e-16	4.6e-16	4.0e-15
`unbalanced_medium`	1.8e-15	1.8e-15	4.4e-15
`singletons_present`	5.0e-16	1.4e-16	7.1e-15

All ≤ 1e-14. Well under the 1e-6 test tolerance.

Dataset shift caveat

The dataset itself (x, y values) shifts under regeneration because R 4.5.2's RNG produces different output than whatever R version originally bootstrapped the python_self_reference values into the JSON. Same seeds, different RNG streams.

Old (python_self_reference) vs New (clubSandwich) JSON — max abs Δ per dataset × field:

Dataset	`coef`	`vcov_cr2`	`dof_bm`
`balanced_small`	3.5e-1	6.6e-2	2.9e-1
`unbalanced_medium`	7.4e-1	4.2e-2	2.3e+0
`singletons_present`	1.8e-1	1.1e-1	9.9e-1

That's a dataset change, not a methodology change — confirmed by the machine-precision Python↔clubSandwich agreement above on the new dataset. The new meta.R_version field locks today's RNG behavior so future regenerations can be diagnosed if R bumps RNG defaults again.

Test plan

Rscript -e 'packageVersion("clubSandwich")' returns '0.7.0'
Rscript benchmarks/R/generate_clubsandwich_golden.R runs to completion (writes the JSON)
jq '.meta.source' benchmarks/data/clubsandwich_cr2_golden.json returns "clubSandwich"
pytest tests/test_linalg_hc2_bm.py -v — 34/34 pass, including TestCR2BMCluster::test_cr2_parity_with_golden
grep -nE "Regenerate.*clubsandwich_cr2_golden" TODO.md returns no matches (both rows removed)

🤖 Generated with Claude Code

github-actions · 2026-05-16T10:28:03Z

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The substantive dCDH changes are documented rollbacks in REGISTRY.md/TODO.md, not undocumented methodology defects. I found one non-blocking P2 in the benchmark/parity harness.

Executive Summary

The clubSandwich CR2 golden regeneration looks fine; switching the DOF extraction to coef_test()$df_Satt is the right update for current clubSandwich.
The main estimator-impacting change is in dCDH heterogeneity, not the CR2 golden: placebo predict_het support is rolled back and non-survey heterogeneity inference is back to safe_inference(..., df=None).
Both dCDH deviations from R are explicitly documented in REGISTRY.md and tracked in TODO.md, so they are informational under this rubric.
One P2 remains in the R golden generator/parity harness: empty predict_het outputs are no longer type-stable, so a future empty fixture/path would break the Python parity consumers.
I could not run the targeted test slice here because pytest is not installed in the environment.

Methodology

P3 Impact: The PR intentionally removes placebo predict_het from the global and per-path dCDH heterogeneity surfaces. That is a capability rollback relative to the removed [Unreleased] notes, but it is now explicitly documented as a partial implementation and tracked in TODO, so it is not a blocker. Concrete fix: none required for approval; follow #422 only if the project wants R placebo parity back. References: diff_diff/chaisemartin_dhaultfoeuille.py:L980-L990, diff_diff/chaisemartin_dhaultfoeuille.py:L5035-L5123, diff_diff/chaisemartin_dhaultfoeuille_results.py:L1551-L1557, docs/methodology/REGISTRY.md:L637-L643, TODO.md:L81-L82.
P3 Impact: Non-survey heterogeneity inference now again uses normal-theory safe_inference(..., df=None), so p_value/CI intentionally lose finite-df R parity while beta/se/t_stat remain pinned. This is documented in the registry and TODO, so it is not a defect under the review rules. Concrete fix: none required for approval; pilot-412 is the tracked follow-up if t-based parity is wanted. References: diff_diff/chaisemartin_dhaultfoeuille.py:L5111-L5123, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1371-L1401, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1457-L1505, TODO.md:L82-L82.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No untracked tech-debt findings. The two dCDH rollbacks introduced here are explicitly tracked in TODO.md:L81-L82.

Security

No findings.

Documentation/Tests

P2 Impact: Empty predict_het extractions are no longer type-stable. The R generator now returns plain list() for empty global/per-path predict_het nodes, while the Python parity tests iterate those nodes as dicts. If a future fixture/path yields zero rows, the JSON shape will flip and the parity consumers will fail instead of handling the empty case gracefully. Concrete fix: restore named empty lists in the R generator (structure(list(), names = character(0))) or reinstate Python-side coercion before calling .items(). References: benchmarks/R/generate_dcdh_dynr_test_values.R:L625-L640, benchmarks/R/generate_dcdh_dynr_test_values.R:L647-L674, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1371-L1374, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1457-L1473.

Verification note: I could not execute pytest because it is not installed in this environment.

…wich Wave 2 of multi-wave tech-debt paydown. Tier A row in the post-Wave-1 backlog: convert the CR2 Bell-McCaffrey parity test from a "Python matches Python" self-reference regression test into a real "Python matches clubSandwich" parity test. Changes: - benchmarks/R/generate_clubsandwich_golden.R: replace the broken `Wald_test(..., test="Satterthwaite")` block with `coef_test(fit, vcov=vcov_cr2)$df_Satt`. clubSandwich 0.7+ removed the "Satterthwaite" test name from `Wald_test`; the `df_Satt` column from `coef_test()` is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and is numerically identical to the old per-unit-contrast path. - benchmarks/R/generate_clubsandwich_golden.R: drop stale `readr` requirement from the header comment (never `library(readr)`'d). - benchmarks/data/clubsandwich_cr2_golden.json: regenerated end-to-end. meta.source flips from "python_self_reference" → "clubSandwich"; also captures clubSandwich version (0.7.0), R version (4.5.2), and generated_at timestamp for forensic traceability. Parity: Python `_compute_cr2_bm` matches clubSandwich `vcovCR(..., "CR2")` + `coef_test()$df_Satt` at machine precision on the new dataset: Dataset max|Δ coef| max|Δ vcov| max|Δ dof_bm| balanced_small 2.8e-16 4.6e-16 4.0e-15 unbalanced_medium 1.8e-15 1.8e-15 4.4e-15 singletons_present 5.0e-16 1.4e-16 7.1e-15 Well under the 1e-6 test tolerance. The dataset itself (x, y values) shifts because R 4.5.2's RNG produces different output than whatever R version originally bootstrapped the `python_self_reference` values into the JSON. Same seeds, different RNG streams. coef/vcov/dof_bm differ by order 1e-1 to 1e+0 between old and new JSON — that's the dataset change, not a methodology change. The `meta.R_version` field locks today's RNG behavior so future regenerations can be diagnosed if R bumps its RNG defaults again. TODO.md: remove the row from the Methodology/Correctness table AND the Tier A bullet under Prioritized Tech-Debt Backlog (both reference the now-resolved regen). No `diff_diff/*.py` files touched — Python was already correct. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-16T10:33:43Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 6c8eee46f9bff851e926f13f1117bc6583dc3e7b

Overall Assessment

✅ Looks good

Executive Summary

The changed surface is the CR2/Bell-McCaffrey parity harness only; diff_diff/linalg.py is unchanged.
Methodology check passed: benchmarks/R/generate_clubsandwich_golden.R:L52-L67 now uses the clubSandwich single-coefficient Satterthwaite path, which matches the per-coefficient DOF contract in diff_diff/linalg.py:L1512-L1543. (jepusto.github.io)
Regenerating benchmarks/data/clubsandwich_cr2_golden.json and removing the resolved TODO.md item are directionally correct.
One non-blocking inconsistency remains: docs/methodology/REGISTRY.md:L2543 still says the checked-in JSON is a "python_self_reference" placeholder tracked in TODO.md.
The prior AI review’s only finding was in a different benchmark-generator path and is unchanged here, so it is out of scope for this re-review.

Methodology

No P0/P1 findings. clubSandwich’s current docs describe coef_test() as the per-coefficient t-test API with Satterthwaite correction, and Wald_test() as the multi-constraint API with HTZ/EDF-style options. That matches the PR’s move from Wald_test(..., test="Satterthwaite") to coef_test(... )$df_Satt in benchmarks/R/generate_clubsandwich_golden.R:L52-L67, and it matches the repo’s existing per-coefficient DOF contract in diff_diff/linalg.py:L1512-L1543. (jepusto.github.io)

P2 Impact: The methodology source of truth is now stale. benchmarks/data/clubsandwich_cr2_golden.json:L35-L40 says the committed fixture is authoritative clubSandwich output, but docs/methodology/REGISTRY.md:L2543 still says the checked-in JSON is a "python_self_reference" regression anchor whose authoritative regeneration remains tracked in TODO.md. That leaves future reviewers with contradictory guidance about whether this benchmark is authoritative. Concrete fix: update docs/methodology/REGISTRY.md:L2543 to state that the committed JSON is now the R-generated clubSandwich parity fixture and remove the obsolete TODO reference.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. Removing the resolved TODO.md item is appropriate; the remaining issue is the stale REGISTRY entry above, not missing debt tracking.

Security

No findings.

Documentation/Tests

P3 Impact: tests/test_linalg_hc2_bm.py:L525-L531 still says the committed JSON is “a self-reference anchor,” which is now false after this regeneration. Test execution is unaffected because the test only consumes vcov_cr2 and dof_bm, but the docstring now misstates the fixture contract. Concrete fix: update tests/test_linalg_hc2_bm.py:L525-L531 to describe the checked-in file as the authoritative clubSandwich fixture, matching benchmarks/data/clubsandwich_cr2_golden.json:L35-L40.

Verification note: I could not run pytest tests/test_linalg_hc2_bm.py -v or Rscript benchmarks/R/generate_clubsandwich_golden.R here because neither pytest nor Rscript is installed.

P2 — REGISTRY.md L2543 still described the committed JSON as a "python_self_reference" regression anchor with the authoritative R run tracked in TODO.md. After this PR the JSON IS the authoritative R run. Rewrote the bullet to reflect the new state: meta.source = clubSandwich, version + R_version + generated_at captured, parity verified at <= 7.1e-15 across all three datasets. P3 — tests/test_linalg_hc2_bm.py::test_cr2_parity_with_golden docstring L525-L531 still said "until then the JSON is a self-reference anchor". Rewrote to describe the JSON as the authoritative clubSandwich fixture with the empirical parity margin. Both are factual corrections; no methodology surface changes. Test still passes at 1e-6 tolerance. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-16T10:38:58Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 1798566506e603af3aa2943cb1db9a22c6721a5c

Overall Assessment

✅ Looks good

Executive Summary

This re-review is limited to the CR2/Bell-McCaffrey parity harness and its documentation/test fixtures; no estimator implementation in diff_diff/*.py changed.
The prior review’s two consistency findings are resolved: docs/methodology/REGISTRY.md:L2543 now describes the checked-in JSON as authoritative clubSandwich output, and tests/test_linalg_hc2_bm.py:L525-L532 now matches that contract.
Methodology cross-check passed. The generator’s new coef_test(... )$df_Satt path in benchmarks/R/generate_clubsandwich_golden.R:L52-L67 matches the per-coefficient Satterthwaite DOF contract documented in diff_diff/linalg.py:L1512-L1543.
The committed fixture metadata is internally consistent with the new contract: benchmarks/data/clubsandwich_cr2_golden.json:L35-L40 now records source, clubSandwich_version, R_version, and generated_at.
I found no unmitigated P0/P1 issues in methodology, variance/SE handling, edge-case behavior, or default behavior changes.
Verification was static only: pytest, numpy, and Rscript are unavailable in this environment, so I could not rerun the parity test or the R generator.

Methodology

No findings. Affected method: CR2 Bell-McCaffrey cluster-robust variance with per-coefficient Satterthwaite DOF. The changed generator path at benchmarks/R/generate_clubsandwich_golden.R:L52-L67 is aligned with the documented contract in diff_diff/linalg.py:L1512-L1543 and the Methodology Registry at docs/methodology/REGISTRY.md:L2539-L2543. I do not see an undocumented methodology deviation, missing assumption check, or incorrect variance/SE change in this diff.

Code Quality

No findings. The touched files are now internally consistent, and the generator header no longer advertises an unused readr dependency at benchmarks/R/generate_clubsandwich_golden.R:L10-L13.

Performance

No findings. The diff changes only benchmark/doc/test assets; no library hot path or estimator runtime path is modified.

Maintainability

No findings. The previous source-of-truth mismatch between the committed JSON and the registry is resolved by aligning docs/methodology/REGISTRY.md:L2543 with benchmarks/data/clubsandwich_cr2_golden.json:L35-L40.

Tech Debt

No findings. Removing the resolved regeneration item from TODO.md is appropriate for this PR.

Security

No findings.

Documentation/Tests

No findings. The stale test-docstring issue is fixed at tests/test_linalg_hc2_bm.py:L525-L532, and the registry now states the same fixture contract at docs/methodology/REGISTRY.md:L2543.
Verification note: I could not run python -m pytest tests/test_linalg_hc2_bm.py -q because pytest is not installed; a direct Python parity check failed because numpy is unavailable; and Rscript is not present in this environment.

igerber force-pushed the golden-clubsandwich-r-regen branch from 22a5c71 to 6c8eee4 Compare May 16, 2026 10:29

igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026

igerber merged commit 7b55db5 into main May 16, 2026
31 of 32 checks passed

igerber deleted the golden-clubsandwich-r-regen branch May 16, 2026 12:00

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich#450

Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich#450
igerber merged 2 commits into
mainfrom
golden-clubsandwich-r-regen

igerber commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 16, 2026

Summary

Parity demonstration

Dataset shift caveat

Test plan

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant