Skip to content

Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich#450

Merged
igerber merged 2 commits into
mainfrom
golden-clubsandwich-r-regen
May 16, 2026
Merged

Regenerate clubsandwich_cr2_golden.json from authoritative R clubSandwich#450
igerber merged 2 commits into
mainfrom
golden-clubsandwich-r-regen

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 16, 2026

Summary

Wave 2 of multi-wave tech-debt paydown. Tier A row in the post-Wave-1 backlog. Converts the CR2 Bell-McCaffrey parity test from a "Python matches Python" self-reference regression test into a real "Python matches R clubSandwich" parity test.

  • benchmarks/R/generate_clubsandwich_golden.R: replace the broken Wald_test(..., test="Satterthwaite") block with coef_test(fit, vcov=vcov_cr2)$df_Satt. clubSandwich 0.7+ removed "Satterthwaite" from the Wald_test test-name set; the df_Satt column from coef_test() is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and is numerically identical to the old per-unit-contrast path. Also drops the stale readr requirement from the header (never library(readr)'d).
  • benchmarks/data/clubsandwich_cr2_golden.json: regenerated end-to-end. meta.source flips python_self_referenceclubSandwich; also captures clubSandwich_version (0.7.0), R_version (4.5.2), and generated_at timestamp for forensic traceability.
  • TODO.md: remove the Methodology/Correctness table row AND the Tier A bullet — both reference the now-resolved regen.

No diff_diff/*.py files touched — Python was already correct (verified at machine precision; see parity table below).

Parity demonstration

Python _compute_cr2_bm vs clubSandwich vcovCR(..., "CR2") + coef_test()$df_Satt on the new dataset — max abs Δ per dataset × field:

Dataset coef vcov_cr2 dof_bm
balanced_small 2.8e-16 4.6e-16 4.0e-15
unbalanced_medium 1.8e-15 1.8e-15 4.4e-15
singletons_present 5.0e-16 1.4e-16 7.1e-15

All ≤ 1e-14. Well under the 1e-6 test tolerance.

Dataset shift caveat

The dataset itself (x, y values) shifts under regeneration because R 4.5.2's RNG produces different output than whatever R version originally bootstrapped the python_self_reference values into the JSON. Same seeds, different RNG streams.

Old (python_self_reference) vs New (clubSandwich) JSON — max abs Δ per dataset × field:

Dataset coef vcov_cr2 dof_bm
balanced_small 3.5e-1 6.6e-2 2.9e-1
unbalanced_medium 7.4e-1 4.2e-2 2.3e+0
singletons_present 1.8e-1 1.1e-1 9.9e-1

That's a dataset change, not a methodology change — confirmed by the machine-precision Python↔clubSandwich agreement above on the new dataset. The new meta.R_version field locks today's RNG behavior so future regenerations can be diagnosed if R bumps RNG defaults again.

Test plan

  • Rscript -e 'packageVersion("clubSandwich")' returns '0.7.0'
  • Rscript benchmarks/R/generate_clubsandwich_golden.R runs to completion (writes the JSON)
  • jq '.meta.source' benchmarks/data/clubsandwich_cr2_golden.json returns "clubSandwich"
  • pytest tests/test_linalg_hc2_bm.py -v — 34/34 pass, including TestCR2BMCluster::test_cr2_parity_with_golden
  • grep -nE "Regenerate.*clubsandwich_cr2_golden" TODO.md returns no matches (both rows removed)

🤖 Generated with Claude Code

@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

No unmitigated P0/P1 findings. The substantive dCDH changes are documented rollbacks in REGISTRY.md/TODO.md, not undocumented methodology defects. I found one non-blocking P2 in the benchmark/parity harness.

Executive Summary

  • The clubSandwich CR2 golden regeneration looks fine; switching the DOF extraction to coef_test()$df_Satt is the right update for current clubSandwich.
  • The main estimator-impacting change is in dCDH heterogeneity, not the CR2 golden: placebo predict_het support is rolled back and non-survey heterogeneity inference is back to safe_inference(..., df=None).
  • Both dCDH deviations from R are explicitly documented in REGISTRY.md and tracked in TODO.md, so they are informational under this rubric.
  • One P2 remains in the R golden generator/parity harness: empty predict_het outputs are no longer type-stable, so a future empty fixture/path would break the Python parity consumers.
  • I could not run the targeted test slice here because pytest is not installed in the environment.

Methodology

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No untracked tech-debt findings. The two dCDH rollbacks introduced here are explicitly tracked in TODO.md:L81-L82.

Security

No findings.

Documentation/Tests

Verification note: I could not execute pytest because it is not installed in this environment.

…wich

Wave 2 of multi-wave tech-debt paydown. Tier A row in the post-Wave-1
backlog: convert the CR2 Bell-McCaffrey parity test from a
"Python matches Python" self-reference regression test into a real
"Python matches clubSandwich" parity test.

Changes:
- benchmarks/R/generate_clubsandwich_golden.R: replace the broken
  `Wald_test(..., test="Satterthwaite")` block with `coef_test(fit,
  vcov=vcov_cr2)$df_Satt`. clubSandwich 0.7+ removed the "Satterthwaite"
  test name from `Wald_test`; the `df_Satt` column from `coef_test()`
  is the idiomatic per-coefficient Bell-McCaffrey Satterthwaite DOF and
  is numerically identical to the old per-unit-contrast path.
- benchmarks/R/generate_clubsandwich_golden.R: drop stale `readr`
  requirement from the header comment (never `library(readr)`'d).
- benchmarks/data/clubsandwich_cr2_golden.json: regenerated end-to-end.
  meta.source flips from "python_self_reference" → "clubSandwich"; also
  captures clubSandwich version (0.7.0), R version (4.5.2), and
  generated_at timestamp for forensic traceability.

Parity:
  Python `_compute_cr2_bm` matches clubSandwich `vcovCR(..., "CR2")` +
  `coef_test()$df_Satt` at machine precision on the new dataset:

    Dataset              max|Δ coef|  max|Δ vcov|  max|Δ dof_bm|
    balanced_small       2.8e-16      4.6e-16      4.0e-15
    unbalanced_medium    1.8e-15      1.8e-15      4.4e-15
    singletons_present   5.0e-16      1.4e-16      7.1e-15

  Well under the 1e-6 test tolerance.

The dataset itself (x, y values) shifts because R 4.5.2's RNG produces
different output than whatever R version originally bootstrapped the
`python_self_reference` values into the JSON. Same seeds, different
RNG streams. coef/vcov/dof_bm differ by order 1e-1 to 1e+0 between
old and new JSON — that's the dataset change, not a methodology change.
The `meta.R_version` field locks today's RNG behavior so future
regenerations can be diagnosed if R bumps its RNG defaults again.

TODO.md: remove the row from the Methodology/Correctness table AND the
Tier A bullet under Prioritized Tech-Debt Backlog (both reference the
now-resolved regen).

No `diff_diff/*.py` files touched — Python was already correct.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@igerber igerber force-pushed the golden-clubsandwich-r-regen branch from 22a5c71 to 6c8eee4 Compare May 16, 2026 10:29
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 6c8eee46f9bff851e926f13f1117bc6583dc3e7b


Overall Assessment

✅ Looks good

Executive Summary

  • The changed surface is the CR2/Bell-McCaffrey parity harness only; diff_diff/linalg.py is unchanged.
  • Methodology check passed: benchmarks/R/generate_clubsandwich_golden.R:L52-L67 now uses the clubSandwich single-coefficient Satterthwaite path, which matches the per-coefficient DOF contract in diff_diff/linalg.py:L1512-L1543. (jepusto.github.io)
  • Regenerating benchmarks/data/clubsandwich_cr2_golden.json and removing the resolved TODO.md item are directionally correct.
  • One non-blocking inconsistency remains: docs/methodology/REGISTRY.md:L2543 still says the checked-in JSON is a "python_self_reference" placeholder tracked in TODO.md.
  • The prior AI review’s only finding was in a different benchmark-generator path and is unchanged here, so it is out of scope for this re-review.

Methodology

No P0/P1 findings. clubSandwich’s current docs describe coef_test() as the per-coefficient t-test API with Satterthwaite correction, and Wald_test() as the multi-constraint API with HTZ/EDF-style options. That matches the PR’s move from Wald_test(..., test="Satterthwaite") to coef_test(... )$df_Satt in benchmarks/R/generate_clubsandwich_golden.R:L52-L67, and it matches the repo’s existing per-coefficient DOF contract in diff_diff/linalg.py:L1512-L1543. (jepusto.github.io)

  • P2 Impact: The methodology source of truth is now stale. benchmarks/data/clubsandwich_cr2_golden.json:L35-L40 says the committed fixture is authoritative clubSandwich output, but docs/methodology/REGISTRY.md:L2543 still says the checked-in JSON is a "python_self_reference" regression anchor whose authoritative regeneration remains tracked in TODO.md. That leaves future reviewers with contradictory guidance about whether this benchmark is authoritative. Concrete fix: update docs/methodology/REGISTRY.md:L2543 to state that the committed JSON is now the R-generated clubSandwich parity fixture and remove the obsolete TODO reference.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

No findings. Removing the resolved TODO.md item is appropriate; the remaining issue is the stale REGISTRY entry above, not missing debt tracking.

Security

No findings.

Documentation/Tests

  • P3 Impact: tests/test_linalg_hc2_bm.py:L525-L531 still says the committed JSON is “a self-reference anchor,” which is now false after this regeneration. Test execution is unaffected because the test only consumes vcov_cr2 and dof_bm, but the docstring now misstates the fixture contract. Concrete fix: update tests/test_linalg_hc2_bm.py:L525-L531 to describe the checked-in file as the authoritative clubSandwich fixture, matching benchmarks/data/clubsandwich_cr2_golden.json:L35-L40.

Verification note: I could not run pytest tests/test_linalg_hc2_bm.py -v or Rscript benchmarks/R/generate_clubsandwich_golden.R here because neither pytest nor Rscript is installed.

P2 — REGISTRY.md L2543 still described the committed JSON as a
  "python_self_reference" regression anchor with the authoritative R run
  tracked in TODO.md. After this PR the JSON IS the authoritative R run.
  Rewrote the bullet to reflect the new state: meta.source = clubSandwich,
  version + R_version + generated_at captured, parity verified at
  <= 7.1e-15 across all three datasets.

P3 — tests/test_linalg_hc2_bm.py::test_cr2_parity_with_golden docstring
  L525-L531 still said "until then the JSON is a self-reference anchor".
  Rewrote to describe the JSON as the authoritative clubSandwich fixture
  with the empirical parity margin.

Both are factual corrections; no methodology surface changes. Test
still passes at 1e-6 tolerance.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 1798566506e603af3aa2943cb1db9a22c6721a5c


Overall Assessment

✅ Looks good

Executive Summary

  • This re-review is limited to the CR2/Bell-McCaffrey parity harness and its documentation/test fixtures; no estimator implementation in diff_diff/*.py changed.
  • The prior review’s two consistency findings are resolved: docs/methodology/REGISTRY.md:L2543 now describes the checked-in JSON as authoritative clubSandwich output, and tests/test_linalg_hc2_bm.py:L525-L532 now matches that contract.
  • Methodology cross-check passed. The generator’s new coef_test(... )$df_Satt path in benchmarks/R/generate_clubsandwich_golden.R:L52-L67 matches the per-coefficient Satterthwaite DOF contract documented in diff_diff/linalg.py:L1512-L1543.
  • The committed fixture metadata is internally consistent with the new contract: benchmarks/data/clubsandwich_cr2_golden.json:L35-L40 now records source, clubSandwich_version, R_version, and generated_at.
  • I found no unmitigated P0/P1 issues in methodology, variance/SE handling, edge-case behavior, or default behavior changes.
  • Verification was static only: pytest, numpy, and Rscript are unavailable in this environment, so I could not rerun the parity test or the R generator.

Methodology

  • No findings. Affected method: CR2 Bell-McCaffrey cluster-robust variance with per-coefficient Satterthwaite DOF. The changed generator path at benchmarks/R/generate_clubsandwich_golden.R:L52-L67 is aligned with the documented contract in diff_diff/linalg.py:L1512-L1543 and the Methodology Registry at docs/methodology/REGISTRY.md:L2539-L2543. I do not see an undocumented methodology deviation, missing assumption check, or incorrect variance/SE change in this diff.

Code Quality

  • No findings. The touched files are now internally consistent, and the generator header no longer advertises an unused readr dependency at benchmarks/R/generate_clubsandwich_golden.R:L10-L13.

Performance

  • No findings. The diff changes only benchmark/doc/test assets; no library hot path or estimator runtime path is modified.

Maintainability

  • No findings. The previous source-of-truth mismatch between the committed JSON and the registry is resolved by aligning docs/methodology/REGISTRY.md:L2543 with benchmarks/data/clubsandwich_cr2_golden.json:L35-L40.

Tech Debt

  • No findings. Removing the resolved regeneration item from TODO.md is appropriate for this PR.

Security

  • No findings.

Documentation/Tests

  • No findings. The stale test-docstring issue is fixed at tests/test_linalg_hc2_bm.py:L525-L532, and the registry now states the same fixture contract at docs/methodology/REGISTRY.md:L2543.
  • Verification note: I could not run python -m pytest tests/test_linalg_hc2_bm.py -q because pytest is not installed; a direct Python parity check failed because numpy is unavailable; and Rscript is not present in this environment.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026
@igerber igerber merged commit 7b55db5 into main May 16, 2026
31 of 32 checks passed
@igerber igerber deleted the golden-clubsandwich-r-regen branch May 16, 2026 12:00
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant