BaconDecomposition methodology audit (Goodman-Bacon 2021) by igerber · Pull Request #454 · igerber/diff-diff

igerber · 2026-05-16T13:50:03Z

Summary

Promotes BaconDecomposition from In Progress → Complete (R parity goldens pending) in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in PR Add Goodman-Bacon (2021) paper review #451 against diff_diff/bacon.py.
Theorem 1 exact-weights rewrite (bacon.py:_recompute_exact_weights): the prior "exact" path was missing the (1 - n_kU) factor in subsample variance, did not square the sample share, and added an extraneous unit_share factor. The post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE. Rewrite implements Eqs. 7-9 + 10e-g exactly; TWFE-vs-weighted-sum identity now holds at atol=1e-10.
Default weights flipped from "approximate" to "exact" at three entry points (BaconDecomposition.__init__, bacon_decompose, TwoWayFixedEffects.decompose). Approximate remains opt-in. Survey-design behavior change: exact mode rejects time-varying within-unit survey columns; users with such inputs migrate to weights="approximate". Documented in CHANGELOG + the bacon_decompose() docstring.
Always-treated warn+remap per paper footnote 11 with ordered-time logic (first_treat <= min(time), excluding sentinels {0, np.inf}): works correctly on event-time panels with negative or zero-crossing labels. Remapping uses an internal column; user's first_treat column is preserved unchanged. Count surfaced as new BaconDecompositionResults.n_always_treated_remapped field. Loop gate for treated_vs_never comparisons uses POST-remap U count so panels whose U is entirely remapped always-treated still emit β̂_{kU} terms.
REGISTRY.md ## BaconDecomposition block replaced with paper-review-sourced content + four sub-notes (weight modes, always-treated remap, R parity status, unbalanced-panel deviation). Explicit removal: the prior block's "Weights may be negative for later-vs-earlier comparisons" claim was incorrect — Theorem 1 weights are strictly positive and sum to 1; negative weights are an estimand-level phenomenon at the ATT level, not estimator-level.
New tests/test_methodology_bacon.py (~28 tests across 6 classes): hand-checks of Eqs. 7-9 + 10b-d at atol=1e-10 on a hand-derived balanced panel, R parity (skip on missing goldens), always-treated remap mechanics, edge cases (negative time, unbalanced panel, no-untreated, single-cohort), weight modes, survey-design narrowing.
New R generator benchmarks/R/generate_bacon_golden.R covering three DGP fixtures. JSON goldens deferred until bacondecomp R package is installed (tracked in TODO.md).

Methodology references (required if estimator / math changes)

Method name(s): BaconDecomposition (bacon.py), TwoWayFixedEffects.decompose (twfe.py)
Paper / source link(s): Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. Journal of Econometrics 225(2), 254-277. DOI: 10.1016/j.jeconom.2021.03.014. Paper review on file: docs/methodology/papers/goodman-bacon-2021-review.md (PR Add Goodman-Bacon (2021) paper review #451).
Any intentional deviations from the source (and why): (a) Unbalanced panels accepted with a UserWarning (paper Appendix A's proof assumes balanced; library extension, Theorem 1 identity holds only approximately); (b) weights="approximate" opt-in fast path with simplified variance not present in R bacondecomp (numerical output may differ from R); (c) weights="exact" requires within-unit-constant survey columns (paper notation does not specify a survey design; library convention from the exact-path per-unit aggregation). All three documented in REGISTRY.md ## BaconDecomposition Notes / Deviation block.

Validation

Tests added/updated: tests/test_methodology_bacon.py (new, ~28 tests, 6 classes); tests/test_bacon.py updated (test_weighted_sum_equals_twfe tolerance tightened to < 1e-10; TestWeightsParameter renamed test_approximate_weights_default → test_exact_weights_default + test_approximate_weights_opt_in). Hand-calculable Theorem 1 DGP derived in-test with weights {0.3, 0.4, 0.1, 0.2} precomputed from Eqs. 10e-g.
Backtest / simulation / notebook evidence (if applicable): 60 tests pass in the methodology-bacon + test-bacon files (3 skipped for R parity). 96 tests pass across all bacon/decompose callers in the broader regression sweep (pytest -k 'bacon or decompose').

Security / privacy

Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

Promotes BaconDecomposition from In Progress → Complete (R parity pending) in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in PR #451 against diff_diff/bacon.py. Audit findings and corrections: 1. Theorem 1 exact-weights rewrite (bacon.py:_recompute_exact_weights). The prior "exact" mode did not actually compute Eqs. 7-9 / 10e-g — it was missing the (1 - n_kU) factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous unit_share factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE (0.007 absolute on a 3-cohort + never- treated DGP). Rewrote the function to compute exact numerators of Eqs. 10e/f/g and let post-hoc normalization handle the V_hat^D denominator (Theorem 1 guarantees V_hat^D = sum(numerators)). Now matches TWFE at atol=1e-10 on noisy and hand-calculable DGPs. 2. Default `weights` flipped from "approximate" to "exact" at 3 entry points: BaconDecomposition.__init__() (bacon.py:397), bacon_decompose() (bacon.py:1064), TwoWayFixedEffects.decompose() (twfe.py:684). The approximate path remains opt-in for speed-sensitive diagnostic loops. diff_diff/diagnostic_report.py:1740 updated to pass explicit weights="exact". The existing test_weighted_sum_equals_twfe tolerance was tightened from < 0.1 to < 1e-10 to lock the Theorem 1 algebraic identity contract. **Survey-design behavior change**: weights="exact" routes through _validate_unit_constant_survey, which rejects survey designs whose weights / strata / PSU / FPC columns vary within a unit across periods. The previous approximate default tolerated time-varying within-unit survey weights via observation-level weighted means. Migration: pass weights="approximate" explicitly to retain the legacy path. Documented in CHANGELOG Changed entry and the new bacon_decompose() docstring survey_design parameter block. 3. Always-treated warn+remap per paper footnote 11 (bacon.py:fit()). Units with first_treat <= min(time) (excluding never-treated sentinels 0 and np.inf) are auto-remapped to U via an internal column (__bacon_first_treat_internal__), preserving the user's first_treat column unchanged. The count is exposed on the result via a new BaconDecompositionResults.n_always_treated_remapped field and rendered in summary() when nonzero. **n_never_treated reports TRUE never-treated only**, computed from the original user column before remap — remapped always-treated units appear separately as n_always_treated_remapped, no double- counting. **Loop gate uses POST-remap U count**: the treated_vs_never comparison loop gates on n_units_in_U_bucket (post-remap) so panels whose U is composed entirely of remapped always-treated units still emit beta_kU^{2x2} terms. Without this distinction the loop would silently drop those terms and break the Theorem 1 identity. **Ordered-time logic**: detection uses first_treat <= min(time) (not positive-sign restriction), so event-time panels with negative or zero-crossing period labels (e.g. time ∈ [-2,..,3]) work correctly. A cohort at first_treat=-1 on such a panel is a valid timing group; a cohort at first_treat=-3 is remapped to U. Both timing_groups filters updated to exclude only the U sentinels, not positive values. REGISTRY.md replacement: - Replaced ## BaconDecomposition block with paper-review-sourced content plus four sub-notes (weight modes, always-treated remap with ordered- time logic, R parity status, unbalanced-panel deviation). - Explicitly removed the prior block's "Weights may be negative for later-vs-earlier comparisons" claim. Theorem 1 weights are strictly positive and sum to 1 (positivity is the headline of the theorem); negative weights are an estimand-level phenomenon (Borusyak-Jaravel 2017, de Chaisemartin-D'Haultfoeuille 2020) at the ATT level, not estimator-level. - Narrowed the machine-precision claim to balanced panels only; the unbalanced-panel library extension is documented as an explicit Deviation block (Goodman-Bacon Appendix A's proof assumes balanced panels; under unbalance, the Theorem 1 identity holds only approximately, though outputs remain finite). New artifacts: - tests/test_methodology_bacon.py (~1050 lines, ~28 tests across 6 classes): - TestBaconHandCalculation: hand-checks Eqs. 7-9 + 10b-d at atol=1e-10 on a minimal hand-derived balanced panel (weights {0.3, 0.4, 0.1, 0.2} hand-computed from sample shares and treatment-share variances). - TestBaconParityR: skips on missing R goldens. - TestBaconAlwaysTreatedRemap: regression-tests warn+remap mechanics including user-data-preservation, U-bucket-only-from- remap (the bug from the loop-gate fix), negative first_treat as a valid cohort (event-time encoding), and remap of negative first_treat below min(time). - TestBaconEdgeCases: no-untreated, single-cohort, unbalanced panel (finite but NOT machine precision), constant-ATT recovery. - TestBaconWeightModes: locks exact-is-default contract. - TestBaconSurveyDesignNarrowing: survey_design composes with exact mode + warn+remap; defaulted BaconDecomposition(), bacon_decompose(), and the survey-time-varying-weights rejection contract are pinned. - benchmarks/R/generate_bacon_golden.R (234 lines): R generator script for bacondecomp::bacon() parity goldens across 3 DGP fixtures. JSON goldens deferred until bacondecomp R package is installed in the local R library (TODO.md follow-up row). CHANGELOG entries: ### Changed (default flip + survey behavior change), TODO.md: prior BaconDecomposition row replaced by narrower R-parity goldens deferral row. Test results: 96 passed, 3 skipped (R parity) across all bacon/decompose callers (test_bacon.py, test_methodology_bacon.py, test_business_report, test_methodology_twfe, test_practitioner, test_target_parameter, test_survey_phase3, test_survey_phase6). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-16T14:13:53Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 61b30bb200739930d13aecfdb14a53dd0098ffc6

Overall Assessment
Needs changes

Executive Summary

I did not find an unmitigated P0/P1 methodology defect in the exact-weight rewrite itself. The Eqs. 7-9 / 10e-g implementation, always-treated remap, and documented unbalanced-panel extension are internally consistent with the updated registry and paper-review notes.
Severity P1: the reporting wrapper now hardcodes weights="exact" but does not expose the documented weights="approximate" fallback, so survey-backed DiagnosticReport/BusinessReport Bacon replays can now degrade to an error on panels with within-unit-varying survey columns.
Severity P3: the new zero-crossing support language is broader than the actual API contract, because first_treat == 0 is still reserved as the never-treated sentinel.
Severity P3: user-facing parity wording now overstates the validation state; R parity goldens are still tracked as pending in the methodology docs.

Methodology

Severity: P3. Impact: the PR now says negative or zero-crossing labels “work correctly,” but 0 is still hardcoded as a never-treated sentinel, so a real cohort with first_treat == 0 would be folded into U rather than treated as a timing group. That is contract/documentation drift rather than a proven flaw in the new Theorem 1 math, but it is still worth fixing because the PR explicitly broadens the supported labeling story. Concrete fix: either narrow the wording to exclude treatment cohorts at 0, or add explicit validation / configurable sentinel support before advertising zero-crossing support. diff_diff/bacon.py:L469-L481, diff_diff/bacon.py:L552-L559, diff_diff/bacon.py:L592-L600, docs/methodology/REGISTRY.md:L2674-L2675

Code Quality

No findings.

Performance

No findings.

Maintainability

Severity: P1. Impact: DiagnosticReport has no public way to select Bacon’s weight mode, yet _check_bacon() now hardcodes weights="exact". That breaks the PR’s own documented migration path for survey-backed panels whose weights / PSU / strata / FPC vary within unit: core Bacon still supports weights="approximate" for that case, but the report wrapper can only produce an "error" block instead of a usable replay. This is the “incomplete parameter propagation” anti-pattern on a changed public behavior. Concrete fix: add a bacon_weights (or similarly named) kwarg to DiagnosticReport, forward it through BusinessReport, and thread it into bacon_decompose(); alternatively, catch the exact-mode unit-constancy failure and downgrade to an explicit skip/fallback path. Add a regression for survey-backed staggered reports with within-unit-varying survey columns. diff_diff/diagnostic_report.py:L316-L330, diff_diff/diagnostic_report.py:L1740-L1747, diff_diff/bacon.py:L517-L524, diff_diff/bacon.py:L1256-L1265, docs/methodology/REPORTING.md:L223-L233

Tech Debt

No findings. The deferred R-golden work is already tracked in TODO.md, so I am not counting that as a blocker.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new example text says exact mode is “matching R bacondecomp::bacon() at atol=1e-6,” but the repo’s own methodology tracking still says parity goldens are pending and the parity tests skip when the JSON is absent. That overstates the level of shipped validation for a statistical contract. Concrete fix: soften the wording to “intended to match” until the goldens land, or commit benchmarks/data/r_bacondecomp_golden.json in the same PR. diff_diff/bacon.py:L1281-L1283, docs/methodology/REGISTRY.md:L2661-L2671, METHODOLOGY_REVIEW.md:L935-L936

Path to Approval

Expose Bacon’s weight mode on DiagnosticReport and its BusinessReport passthrough, or implement an explicit skip/fallback path instead of hard-failing the Bacon block on exact-mode survey validation.
Add DR/BR regression coverage for a survey-backed staggered panel with within-unit-varying survey columns, asserting the exact-mode failure surface and the approximate-mode success path.

Verification note: I could not execute tests/test_bacon.py or tests/test_methodology_bacon.py in this environment because pytest and the project’s Python deps (for example pandas) are not installed here.

P1 (Maintainability): DiagnosticReport._check_bacon now emits a structured `status="skipped"` block when bacon_decompose raises the within-unit-varying-survey ValueError from `_validate_unit_constant_survey`, instead of falling through to the opaque `status="error"` block. The skip reason names the precomputed={'bacon': ...} + explicit weights="approximate" escape hatch so survey-backed reports on panels with time-varying within- unit weights / strata / PSU / FPC retain a documented migration path. Other ValueErrors (and other exception types) still surface via the existing `error` block. New regression at TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_time_varying_survey. P3 (Methodology, zero-crossing wording): clarify that "negative or zero-crossing labels work correctly" applies to the **time axis** only (`time` column with event-time labels like [-2,..,3]), not to `first_treat`. `first_treat ∈ {0, np.inf}` remains reserved as the never-treated sentinel and is not configurable today; a real treatment cohort with `first_treat == 0` would be folded into U and should be re-labeled. Updated docstring, code comment, REGISTRY note, and CHANGELOG entry to surface this restriction explicitly. P3 (Documentation/Tests, R parity claim): soften the bacon_decompose docstring example from "matching R bacondecomp::bacon() at atol=1e-6" to "intended to match R bacondecomp::bacon() at atol=1e-6 (R parity goldens pending see TODO.md)" since the JSON goldens are still unshipped. REGISTRY weight-modes note refined to call out hand-calc / TWFE-identity validation at atol=1e-10 vs. the pending direct R parity. Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (one new regression), 3 skipped (R parity); 97 pass across the broader bacon/decompose surface.

github-actions · 2026-05-16T14:34:22Z

🔁 AI review rerun (requested by @igerber)

Head SHA: c88667e0b54b72c3da258829385dfe9950c01fbe

Overall Assessment
✅ Looks good

Executive Summary

No unmitigated P0/P1 findings in the changed Bacon estimator paths. The exact-weight rewrite in diff_diff/bacon.py:L812-L970 matches the Theorem 1 / Eqs. 7-9, 10e-g contract documented in docs/methodology/REGISTRY.md:L2613-L2648 and docs/methodology/papers/goodman-bacon-2021-review.md:L50-L79.
The always-treated handling now matches Goodman-Bacon footnote 11: remap happens before decomposition, the treated_vs_never loop now gates on the post-remap U bucket, and regression tests cover the previously missing U-only-remapped and negative-time cases. diff_diff/bacon.py:L545-L670, tests/test_methodology_bacon.py:L539-L707
The previous P1 on reporting wrappers is addressed. DiagnosticReport._check_bacon() now returns a structured skip with an explicit weights="approximate" / precomputed={'bacon': ...} escape hatch, and BusinessReport already forwards the required passthrough args. diff_diff/diagnostic_report.py:L1737-L1772, diff_diff/business_report.py:L404-L420
Remaining issues are P3-only documentation drift: some high-level prose still uses the old positive-only always-treated condition, and one CHANGELOG sentence still overstates direct R parity.
I could not run pytest here because pytest is not installed in this environment.

Methodology

Severity: P3. Impact: high-level methodology prose still says the always-treated remap applies to 0 < first_treat <= min(time), but the implementation and the detailed registry note now use ordered-time first_treat <= min(time) excluding sentinels. That can mislead users with negative-coded / event-time panels about what gets remapped to U. Concrete fix: align the summary wording in docs/methodology/REGISTRY.md:L2607-L2611, METHODOLOGY_REVIEW.md:L923-L941, and tests/test_methodology_bacon.py:L19-L20 / L491-L492 to the detailed wording already used in docs/methodology/REGISTRY.md:L2673-L2676 and diff_diff/bacon.py:L467-L487.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings. The prior re-review blocker on the report wrapper is resolved by the new structured-skip path in diff_diff/diagnostic_report.py:L1749-L1772.

Tech Debt

Severity: P3. Impact: direct R parity goldens are still outstanding, but this is explicitly tracked in TODO.md:L77-L77, and the PR ships both the generator and skip-aware parity tests, so it is mitigated and non-blocking under the stated rubric. Concrete fix: generate and commit benchmarks/data/r_bacondecomp_golden.json once bacondecomp is available.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: CHANGELOG.md:L16-L16 still says the new default exact path is “matching R bacondecomp::bacon(),” while the methodology review, registry, TODO, and the new parity tests all state that direct R parity is still pending until the golden JSON is generated. That overstates the shipped validation state. Concrete fix: soften that sentence to “intended to match” or “R parity pending” until benchmarks/data/r_bacondecomp_golden.json lands. METHODOLOGY_REVIEW.md:L929-L936, TODO.md:L77-L77, tests/test_methodology_bacon.py:L300-L309

R2 verdict was ✅ Looks good with three P3 informational items. Closes all three: 1. Prose alignment: high-level summary text in `REGISTRY.md`, `METHODOLOGY_REVIEW.md`, and the test module docstring/class docstring used the old positive-only `0 < first_treat <= min(time)` condition for the always-treated remap, while the implementation and the detailed registry note already use ordered-time logic excluding the `{0, np.inf}` sentinels. Aligned the summary wording across all four surfaces (REGISTRY.md:2609, METHODOLOGY_REVIEW.md:923 + Corrections Made entry, test_methodology_bacon.py header + class docstring) to match the detailed-note phrasing. 2. CHANGELOG R-parity claim: the Changed entry still said the new default is "matching R `bacondecomp::bacon()`" while the methodology review, registry, TODO, and parity tests all show direct R parity is pending until the goldens land. Softened to "intended to match" with the validation state spelled out (hand-calc + TWFE-vs-weighted- sum identity at atol=1e-10; direct R bit parity pending). 3. R parity goldens TODO row: non-blocking, already tracked in TODO.md; no action required this PR. Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (same as post-R1), 3 skipped (R parity).

github-actions · 2026-05-16T14:52:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 785c65f92f615c5654b625171b2e3e973dfea341

Overall Assessment
✅ Looks good

Executive Summary

No unmitigated P0/P1 findings in the changed Bacon estimator/reporting paths. The exact-weight rewrite now matches the merged Theorem 1 / Eqs. 7-9, 10e-g contract in diff_diff/bacon.py and docs/methodology/REGISTRY.md.
Default propagation is consistent across diff_diff/bacon.py, diff_diff/bacon.py, diff_diff/twfe.py, and the replay path in diff_diff/diagnostic_report.py.
The prior re-review’s public-documentation drift is mostly addressed: the user-facing prose now states R parity is still pending and uses the ordered-time remap wording in CHANGELOG.md, METHODOLOGY_REVIEW.md, and docs/methodology/REGISTRY.md.
Remaining items are P3-only: one misleading warning string, plus the already-tracked R-golden follow-up.
Static review only; I could not execute local Python checks here because the available shell Python lacks pandas.

Methodology

No findings.

Code Quality

No findings.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: direct bacondecomp parity is still pending until the golden JSON is generated, but this is explicitly tracked and the new parity tests skip cleanly, so it is mitigated and non-blocking under the stated rubric. Concrete fix: run benchmarks/R/generate_bacon_golden.R, commit benchmarks/data/r_bacondecomp_golden.json, and let tests/test_methodology_bacon.py execute in CI. TODO.md

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the new always-treated warning tells users to “pass first_treat=0 or first_treat=np.inf,” but first_treat is the column-name argument, not a value slot, so the remediation text is misleading when the warning fires. Concrete fix: rewrite the warning in diff_diff/bacon.py (lines 573-580) to tell users to recode the affected rows’ first_treat values to 0 or np.inf before fitting.
Severity: P3. Impact: the test prose still mixes the paper-review shorthand t_i < 1 with the merged implementation/registry contract first_treat <= min(time), which keeps the equality-at-first-observed-period case slightly ambiguous for future reviewers. Concrete fix: align the class docstring in tests/test_methodology_bacon.py with the merged registry wording in docs/methodology/REGISTRY.md, or explicitly label the == min(time) case as a library/R-alignment convention.

R3 verdict was ✅ Looks good with two P3 informational items: 1. Warning string accuracy: the always-treated remap warning told users to "pass first_treat=0 or first_treat=np.inf explicitly to silence this warning," but `first_treat` is the column-name argument to `fit()`, not a value slot. Rewrote to "recode the affected rows' first_treat values to 0 or np.inf in your input data before fitting" to point users at the actual remediation surface. 2. Test class docstring: clarified that the implementation contract `first_treat <= min(time)` includes the `== min(time)` boundary case that the paper's strict `t_i < 1` shorthand excludes, and explained the pragmatic rationale (units treated at the first observable period have no untreated cell and cannot contribute to any 2x2 DD as a treated cohort). Aligns with merged registry note wording and removes the shorthand-vs-implementation ambiguity that the R3 reviewer flagged. Tests: 61 pass in test_methodology_bacon.py + test_bacon.py, 3 skipped (R parity).

github-actions · 2026-05-16T15:01:12Z

🔁 AI review rerun (requested by @igerber)

Head SHA: d35fe915678bed2377a901d544fc516c91cbc251

Overall Assessment

Looks good. I found no unmitigated P0/P1 issues in the changed Bacon methodology, weighting, remap, or default-propagation paths.

Executive Summary

The exact-weight rewrite in diff_diff/bacon.py aligns with the Goodman-Bacon Theorem 1 formulas captured in the paper review and registry, and I did not find an undocumented methodology deviation in the changed estimator code.
The prior re-review P3 about the misleading always-treated warning is resolved: the warning now correctly tells users to recode affected first_treat values, not the first_treat argument name, in diff_diff/bacon.py.
Default propagation is consistent across BaconDecomposition, bacon_decompose, TwoWayFixedEffects.decompose, and the diagnostic-report replay path.
Remaining items are P3 only: one partial wrapper fix in DiagnosticReport._check_bacon(), and one public-docstring drift where wrapper docs still omit the new remap/sentinel contract.
R parity goldens are still pending, but that follow-up is explicitly tracked in TODO.md and the new parity tests skip cleanly until the JSON fixture exists.
Static review only: I could not run local pandas/R validation in this environment.

Methodology

No findings. The exact-weight implementation in diff_diff/bacon.py:L813-L971 matches the formulas documented in docs/methodology/papers/goodman-bacon-2021-review.md:L63-L82 and docs/methodology/REGISTRY.md:L2609-L2676. The first_treat <= min(time) remap and unbalanced-panel handling are documented Note/Deviation items, so they are mitigated under the stated rubric.

Code Quality

Severity: P3. Impact: DiagnosticReport._check_bacon() now downgrades the within-unit-varying-survey failure to status="skipped", but replicate-weight survey designs still fall through the generic exception path and surface as status="error", even though Bacon explicitly declares that design unsupported. This is not a correctness bug, but it leaves the new skip-vs-error fix partial in the same wrapper path. Concrete fix: add an except NotImplementedError as exc branch before the generic handler in diff_diff/diagnostic_report.py:L1749-L1780, and return a structured skip keyed to the rejection raised in diff_diff/bacon.py:L515-L520.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3. Impact: direct R parity against bacondecomp::bacon() is still deferred, so the new exact path is substantiated by hand-calculation and TWFE-identity tests rather than committed cross-language goldens. This is already tracked and therefore mitigated. Concrete fix: generate and commit benchmarks/data/r_bacondecomp_golden.json from benchmarks/R/generate_bacon_golden.R, then let tests/test_methodology_bacon.py::TestBaconParityR run in CI. See TODO.md:L77.

Security

No findings.

Documentation/Tests

Severity: P3. Impact: the public wrapper docstrings in bacon_decompose() and TwoWayFixedEffects.decompose() still describe only 0/np.inf as the never-treated contract, while the changed implementation now also remaps first_treat <= min(time) into U and reserves first_treat == 0 as a sentinel. Users reading wrapper help can miss the behavior change even though BaconDecomposition.fit() and the registry document it. Concrete fix: mirror the expanded first_treat contract from diff_diff/bacon.py:L467-L487 into diff_diff/bacon.py:L1252-L1277 and diff_diff/twfe.py:L703-L714, or explicitly point those docstrings at docs/methodology/REGISTRY.md:L2674-L2676.

R4 verdict was Looks good with two P3 polish items (plus the standing R-parity-goldens follow-up, which is tracked and non-blocking). Both addressed: 1. Replicate-weight skip path (P3 Code Quality): the R1 skip-vs-error fix on `DiagnosticReport._check_bacon` only handled the within-unit- varying-survey ValueError case. Replicate-weight survey designs raise NotImplementedError from `BaconDecomposition.fit()` and were still falling through to the generic exception handler as `status="error"`. Added an `except NotImplementedError` branch that returns a structured skip with a reason naming the supported alternative (TSL-based design via SurveyDesign + precomputed= escape hatch). New regression at `TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_replicate_weights`. 2. Wrapper docstring mirror (P3 Documentation/Tests): the public `bacon_decompose()` and `TwoWayFixedEffects.decompose()` wrappers had short `first_treat` parameter descriptions that mentioned only the 0/np.inf never-treated convention, omitting the new always-treated remap + sentinel reservation contract that lives in `BaconDecomposition.fit()`. Mirrored the expanded contract into both wrappers (with a pointer to the full docstring on the class) so users reading wrapper help see the behavior change. Tests: 62 pass in test_methodology_bacon.py + test_bacon.py (one new regression), 3 skipped (R parity); 98 pass across the broader bacon/decompose surface.

github-actions · 2026-05-16T15:09:16Z

🔁 AI review rerun (requested by @igerber)

Head SHA: 5a508343da40a18d3c3f3867571d3f87997e894c

Overall Assessment

✅ Looks good — no unmitigated P0/P1 findings.

Executive Summary

The exact-weight rewrite in diff_diff/bacon.py:L813-L971 matches the Goodman-Bacon Theorem 1 formulas documented in docs/methodology/papers/goodman-bacon-2021-review.md:L50-L79 and docs/methodology/REGISTRY.md:L2613-L2648; I did not find an undocumented estimator, weighting, or variance mismatch in the changed path.
Default propagation is consistent across BaconDecomposition, bacon_decompose, TwoWayFixedEffects.decompose, and diagnostic-report replay: diff_diff/bacon.py:L426-L438, diff_diff/bacon.py:L1226-L1343, diff_diff/twfe.py:L677-L770, diff_diff/diagnostic_report.py:L1737-L1807.
The prior re-review P3s are resolved: wrapper docstrings now carry the remap/sentinel/default contract, and DiagnosticReport._check_bacon() now converts replicate-weight unsupported designs into a structured skip instead of a generic error: diff_diff/bacon.py:L1252-L1288, diff_diff/twfe.py:L703-L726, diff_diff/diagnostic_report.py:L1777-L1801.
The always-treated remap, post-remap U-loop gating, and negative-time handling are covered by targeted regressions in tests/test_methodology_bacon.py:L489-L722.
Remaining follow-up is P3 only: R parity goldens are still pending, but that is explicitly tracked in TODO.md:L77-L77 and the parity tests skip cleanly until the fixture exists.
Static review only: I could not execute the test suite in this environment because pytest is not installed.

Methodology

No findings. The changed estimator math and default behavior are consistent with the current Methodology Registry contract and the Goodman-Bacon paper review.

Code Quality

No findings. The prior partial wrapper-handling issue is resolved in diff_diff/diagnostic_report.py:L1749-L1801.

Performance

No findings.

Maintainability

No findings.

Tech Debt

Severity: P3 (tracked in TODO.md). Impact: direct bacondecomp::bacon() parity is still deferred because benchmarks/data/r_bacondecomp_golden.json has not been committed yet, so CI currently validates the new path via hand-calculation and TWFE-identity tests rather than cross-language goldens. Concrete fix: install bacondecomp and jsonlite, run benchmarks/R/generate_bacon_golden.R, commit benchmarks/data/r_bacondecomp_golden.json, and let tests/test_methodology_bacon.py::TestBaconParityR run normally. Refs: TODO.md:L77-L77, benchmarks/R/generate_bacon_golden.R:L1-L234, tests/test_methodology_bacon.py:L301-L445.

Security

No findings.

Documentation/Tests

No findings. The previous docstring drift is resolved, and the new regression coverage is appropriately targeted at the methodology-risky paths, including remap behavior and diagnostic-report survey skips: diff_diff/bacon.py:L1252-L1288, diff_diff/twfe.py:L703-L726, tests/test_methodology_bacon.py:L1020-L1141.

Closes the PR #454 deferred R parity follow-up (TODO.md row removed). Generated `benchmarks/data/r_bacondecomp_golden.json` from the committed `benchmarks/R/generate_bacon_golden.R` script against `bacondecomp 0.1.1` on R 4.5.2. Three DGP fixtures: `uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`. Parity results at atol=1e-6 via `tests/test_methodology_bacon.py::TestBaconParityR`: - TWFE coefficient: ✅ matches across all 3 fixtures - Weights-sum: ✅ matches across all 3 fixtures - Per-component: ✅ on the 2 non-remap fixtures; **structural convention divergence** on `always_treated_remapped` (skipped per-component, kept aggregate). R keeps `first_treat=1` as a distinct timing cohort and emits `Later vs Always Treated` comparisons; Python's paper-footnote-11 convention remaps those units to `U` and folds them into a single `treated_vs_never` cell per treated cohort. The aggregate is invariant per Theorem 1 — the U bucket's weight is re-allocated across nested 2x2 cells but the total weight on {cohort_k vs U} is identical. Only the per-component breakdown differs structurally between conventions. Tracker promotions: - METHODOLOGY_REVIEW.md: BaconDecomposition status row → **Complete** (was `**Complete** (R parity pending)`); removed from In Progress prose mention; removed from Priority Order substantive-review list; Test Coverage count refreshed (24 → 33); R Comparison Results block rewritten as **Validated**. - docs/methodology/REGISTRY.md: Reference Implementations bullet + Verified Components checklist + Note (weight modes) updated; new Note (R parity convention divergence on always-treated) documents the convention. - TODO.md: BaconDecomposition R parity goldens row removed. - CHANGELOG.md: new `[Unreleased]` Added bullet for the close-out; PR-B Changed entry tightened ("intended to match" → "matching ... at atol=1e-6"). - diff_diff/bacon.py: `bacon_decompose` docstring example wording tightened from "intended to match" to "matches" with TestBaconParityR pointer. Tests: 33/33 pass in test_methodology_bacon.py (no skips; was 30+3 skipped); 32 pass in test_bacon.py; 101 pass across the broader bacon/decompose surface (was 98+3 skipped). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…e-out R6 verdict was Looks good with 1 P3 informational item: the older PR-B audit bullet at CHANGELOG.md:13 (added in PR #454) still described the pre-goldens deferral state ("JSON goldens deferred", "TestBaconParityR skips with a pointer when goldens missing", "status flipped to **Complete (R parity goldens pending)**"). That contradicts the new PR-457 bullet at CHANGELOG.md:11 (committed goldens + 4 active parity tests) within the same [Unreleased] section, so the release notes read as internally inconsistent. Updated 3 strings in the PR-B bullet to reflect the within-release close-out: - Status flip wording: now says the (R parity pending) caveat was closed by the parity-goldens bullet above in this same release. - TestBaconParityR description: 4 tests, all active post-release; skips only in partial-checkout scenarios. - (4) outcome: parity goldens deferral was closed within this release.

igerber force-pushed the feature/bacon-methodology-audit branch from 6b20a4d to 61b30bb Compare May 16, 2026 14:05

igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026

igerber merged commit 294b57c into main May 16, 2026
33 of 34 checks passed

igerber deleted the feature/bacon-methodology-audit branch May 16, 2026 17:01

igerber mentioned this pull request May 16, 2026

BaconDecomposition R parity goldens #457

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

BaconDecomposition methodology audit (Goodman-Bacon 2021)#454

BaconDecomposition methodology audit (Goodman-Bacon 2021)#454
igerber merged 5 commits into
mainfrom
feature/bacon-methodology-audit

igerber commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 16, 2026

Summary

Methodology references (required if estimator / math changes)

Validation

Security / privacy

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

github-actions Bot commented May 16, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant