BaconDecomposition methodology audit (Goodman-Bacon 2021) #454
Promotes BaconDecomposition from In Progress → Complete (R parity pending) in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in PR #451 against diff_diff/bacon.py.

Audit findings and corrections:

1. Theorem 1 exact-weights rewrite (bacon.py:_recompute_exact_weights). The prior "exact" mode did not actually compute Eqs. 7-9 / 10e-g: it was missing the (1 - n_kU) factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous unit_share factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE (0.007 absolute on a 3-cohort + never-treated DGP). Rewrote the function to compute exact numerators of Eqs. 10e/f/g and let post-hoc normalization handle the V_hat^D denominator (Theorem 1 guarantees V_hat^D = sum(numerators)). Now matches TWFE at atol=1e-10 on noisy and hand-calculable DGPs.

2. Default `weights` flipped from "approximate" to "exact" at 3 entry points: BaconDecomposition.__init__() (bacon.py:397), bacon_decompose() (bacon.py:1064), TwoWayFixedEffects.decompose() (twfe.py:684). The approximate path remains opt-in for speed-sensitive diagnostic loops. diff_diff/diagnostic_report.py:1740 updated to pass explicit weights="exact". The existing test_weighted_sum_equals_twfe tolerance was tightened from < 0.1 to < 1e-10 to lock the Theorem 1 algebraic identity contract.

   **Survey-design behavior change**: weights="exact" routes through _validate_unit_constant_survey, which rejects survey designs whose weights / strata / PSU / FPC columns vary within a unit across periods. The previous approximate default tolerated time-varying within-unit survey weights via observation-level weighted means. Migration: pass weights="approximate" explicitly to retain the legacy path. Documented in CHANGELOG Changed entry and the new bacon_decompose() docstring survey_design parameter block.

3. Always-treated warn+remap per paper footnote 11 (bacon.py:fit()). Units with first_treat <= min(time) (excluding never-treated sentinels 0 and np.inf) are auto-remapped to U via an internal column (__bacon_first_treat_internal__), preserving the user's first_treat column unchanged. The count is exposed on the result via a new BaconDecompositionResults.n_always_treated_remapped field and rendered in summary() when nonzero.

   **n_never_treated reports TRUE never-treated only**, computed from the original user column before remap; remapped always-treated units appear separately as n_always_treated_remapped, no double-counting.

   **Loop gate uses POST-remap U count**: the treated_vs_never comparison loop gates on n_units_in_U_bucket (post-remap) so panels whose U is composed entirely of remapped always-treated units still emit beta_kU^{2x2} terms. Without this distinction the loop would silently drop those terms and break the Theorem 1 identity.

   **Ordered-time logic**: detection uses first_treat <= min(time) (not positive-sign restriction), so event-time panels with negative or zero-crossing period labels (e.g. time ∈ [-2,..,3]) work correctly. A cohort at first_treat=-1 on such a panel is a valid timing group; a cohort at first_treat=-3 is remapped to U. Both timing_groups filters updated to exclude only the U sentinels, not positive values.

REGISTRY.md replacement:
- Replaced ## BaconDecomposition block with paper-review-sourced content plus four sub-notes (weight modes, always-treated remap with ordered-time logic, R parity status, unbalanced-panel deviation).
- Explicitly removed the prior block's "Weights may be negative for later-vs-earlier comparisons" claim. Theorem 1 weights are strictly positive and sum to 1 (positivity is the headline of the theorem); negative weights are an estimand-level phenomenon (Borusyak-Jaravel 2017, de Chaisemartin-D'Haultfoeuille 2020) at the ATT level, not estimator-level.
- Narrowed the machine-precision claim to balanced panels only; the unbalanced-panel library extension is documented as an explicit Deviation block (Goodman-Bacon Appendix A's proof assumes balanced panels; under unbalance, the Theorem 1 identity holds only approximately, though outputs remain finite).

New artifacts:
- tests/test_methodology_bacon.py (~1050 lines, ~28 tests across 6 classes):
  - TestBaconHandCalculation: hand-checks Eqs. 7-9 + 10b-d at atol=1e-10 on a minimal hand-derived balanced panel (weights {0.3, 0.4, 0.1, 0.2} hand-computed from sample shares and treatment-share variances).
  - TestBaconParityR: skips on missing R goldens.
  - TestBaconAlwaysTreatedRemap: regression-tests warn+remap mechanics including user-data-preservation, U-bucket-only-from-remap (the bug from the loop-gate fix), negative first_treat as a valid cohort (event-time encoding), and remap of negative first_treat below min(time).
  - TestBaconEdgeCases: no-untreated, single-cohort, unbalanced panel (finite but NOT machine precision), constant-ATT recovery.
  - TestBaconWeightModes: locks exact-is-default contract.
  - TestBaconSurveyDesignNarrowing: survey_design composes with exact mode + warn+remap; defaulted BaconDecomposition(), bacon_decompose(), and the survey-time-varying-weights rejection contract are pinned.
- benchmarks/R/generate_bacon_golden.R (234 lines): R generator script for bacondecomp::bacon() parity goldens across 3 DGP fixtures. JSON goldens deferred until bacondecomp R package is installed in the local R library (TODO.md follow-up row).

CHANGELOG entries: ### Changed (default flip + survey behavior change). TODO.md: prior BaconDecomposition row replaced by narrower R-parity goldens deferral row.

Test results: 96 passed, 3 skipped (R parity) across all bacon/decompose callers (test_bacon.py, test_methodology_bacon.py, test_business_report, test_methodology_twfe, test_practitioner, test_target_parameter, test_survey_phase3, test_survey_phase6).
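The exact-weights identity described in item 1 can be sketched in plain Python. This is a minimal illustration, not the code in `bacon.py`: the helper names (`twfe_coef`, `bacon_exact`), the 0-based period indexing, the use of `None` for never-treated units, and the simulated panel are all assumptions made for the example. It computes the numerators of Eqs. 10e/f/g (squared sample share, times n(1 - n), times the treatment-share variance term), normalizes them to sum to 1, and compares the weighted sum of 2x2 DDs against the TWFE coefficient on a balanced panel.

```python
import random
from itertools import combinations

def mean(values):
    values = list(values)
    return sum(values) / len(values)

def twfe_coef(y, d):
    """TWFE DD coefficient on a balanced panel via double-demeaned treatment."""
    n, T = len(y), len(y[0])
    grand = mean(d[i][t] for i in range(n) for t in range(T))
    row = [mean(d[i]) for i in range(n)]
    col = [mean(d[i][t] for i in range(n)) for t in range(T)]
    num = den = 0.0
    for i in range(n):
        for t in range(T):
            d_tilde = d[i][t] - row[i] - col[t] + grand
            num += y[i][t] * d_tilde
            den += d_tilde ** 2
    return num / den

def bacon_exact(y, first_treat, T):
    """Return [(label, weight, 2x2 DD)]; first_treat is a period index or None."""
    n = len(y)
    groups = {}
    for i, g in enumerate(first_treat):
        groups.setdefault(g, []).append(i)
    share = {g: len(units) / n for g, units in groups.items()}
    dbar = {g: (T - g) / T for g in groups if g is not None}  # treated share of periods

    def cell(g, periods):
        return mean(y[i][t] for i in groups[g] for t in periods)

    def dd(treat, ctrl, pre, post):
        return (cell(treat, post) - cell(treat, pre)) - (cell(ctrl, post) - cell(ctrl, pre))

    terms = []
    timing = sorted(dbar)
    if None in groups:  # treated vs never-treated (Eq. 10e numerator)
        for k in timing:
            pair = share[k] + share[None]
            n_kU = share[k] / pair
            num = pair ** 2 * n_kU * (1 - n_kU) * dbar[k] * (1 - dbar[k])
            terms.append(((k, "U"), num, dd(k, None, range(k), range(k, T))))
    for k, l in combinations(timing, 2):  # k is treated earlier than l
        pair = share[k] + share[l]
        v = (share[k] / pair) * (1 - share[k] / pair)
        Dk, Dl = dbar[k], dbar[l]
        # Eq. 10f numerator: early group treated, later group as control
        num_k = (pair * (1 - Dl)) ** 2 * v * (Dk - Dl) / (1 - Dl) * (1 - Dk) / (1 - Dl)
        # Eq. 10g numerator: later group treated, early group as control
        num_l = (pair * Dk) ** 2 * v * (Dl / Dk) * (Dk - Dl) / Dk
        terms.append(((k, l, "early"), num_k, dd(k, l, range(k), range(k, l))))
        terms.append(((k, l, "late"), num_l, dd(l, k, range(k, l), range(l, T))))
    total = sum(num for _, num, _ in terms)  # post-hoc normalization
    return [(label, num / total, est) for label, num, est in terms]

# Simulated balanced panel: two timing cohorts plus never-treated units.
rng = random.Random(0)
T = 6
first_treat = [2] * 3 + [4] * 2 + [None] * 4
d = [[1.0 if g is not None and t >= g else 0.0 for t in range(T)] for g in first_treat]
y = [[rng.gauss(0, 1) + 0.5 * t + (1.0 + 0.3 * t) * d[i][t] for t in range(T)]
     for i in range(len(first_treat))]

terms = bacon_exact(y, first_treat, T)
weighted_sum = sum(w * est for _, w, est in terms)
beta_twfe = twfe_coef(y, d)
```

On a balanced panel the two quantities agree up to floating-point error, and every weight is strictly positive, which is the contract the tightened `test_weighted_sum_equals_twfe` tolerance is meant to lock.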
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
6b20a4d to 61b30bb
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
Verification note: I could not execute |
P1 (Maintainability): DiagnosticReport._check_bacon now emits a
structured `status="skipped"` block when bacon_decompose raises the
within-unit-varying-survey ValueError from `_validate_unit_constant_survey`,
instead of falling through to the opaque `status="error"` block. The skip
reason names the precomputed={'bacon': ...} + explicit weights="approximate"
escape hatch so survey-backed reports on panels with time-varying within-
unit weights / strata / PSU / FPC retain a documented migration path.
Other ValueErrors (and other exception types) still surface via the
existing `error` block. New regression at
TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_time_varying_survey.
P3 (Methodology, zero-crossing wording): clarify that "negative or
zero-crossing labels work correctly" applies to the **time axis** only
(`time` column with event-time labels like [-2,..,3]), not to
`first_treat`. `first_treat ∈ {0, np.inf}` remains reserved as the
never-treated sentinel and is not configurable today; a real treatment
cohort with `first_treat == 0` would be folded into U and should be
re-labeled. Updated docstring, code comment, REGISTRY note, and
CHANGELOG entry to surface this restriction explicitly.
P3 (Documentation/Tests, R parity claim): soften the bacon_decompose
docstring example from "matching R bacondecomp::bacon() at atol=1e-6"
to "intended to match R bacondecomp::bacon() at atol=1e-6 (R parity
goldens pending; see TODO.md)" since the JSON goldens are still
unshipped. REGISTRY weight-modes note refined to call out hand-calc /
TWFE-identity validation at atol=1e-10 vs. the pending direct R parity.
Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (one new
regression), 3 skipped (R parity); 97 pass across the broader
bacon/decompose surface.
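The ordered-time remap rule and the sentinel restriction from the P3 note above can be sketched as follows. This is an illustration under stated assumptions, not the library code: `remap_always_treated` is a hypothetical helper, it works on plain lists rather than a DataFrame column, and it uses `math.inf` as the internal never-treated value (the source only says units are remapped to U via an internal column).

```python
import math

# Values reserved as never-treated sentinels (stated in the source).
NEVER_TREATED = (0, math.inf)

def remap_always_treated(first_treat, times):
    """Return (internal first_treat values, number of remapped units).

    A unit is always-treated when first_treat <= min(time) and the value is
    not a never-treated sentinel. The caller's list is left unchanged.
    """
    t_min = min(times)
    internal, n_remapped = [], 0
    for g in first_treat:
        if g not in NEVER_TREATED and g <= t_min:
            internal.append(math.inf)  # assumption: fold into U via the inf sentinel
            n_remapped += 1
        else:
            internal.append(g)
    return internal, n_remapped

# Event-time panel with a zero-crossing time axis: time in [-2, ..., 3].
times = range(-2, 4)
first_treat = [-3, -2, -1, 0, math.inf, 2]
internal, n_remapped = remap_always_treated(first_treat, times)
```

Here `-3` and `-2` (the `== min(time)` boundary) are remapped, `-1` and `2` stay as timing cohorts, and `0` is read as never-treated, which is why a real cohort treated at event time 0 has to be re-labeled before fitting.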
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
|
R2 verdict was ✅ Looks good with three P3 informational items. Closes
all three:
1. Prose alignment: high-level summary text in `REGISTRY.md`,
`METHODOLOGY_REVIEW.md`, and the test module docstring/class
docstring used the old positive-only `0 < first_treat <= min(time)`
condition for the always-treated remap, while the implementation and
the detailed registry note already use ordered-time logic excluding
the `{0, np.inf}` sentinels. Aligned the summary wording across all
four surfaces (REGISTRY.md:2609, METHODOLOGY_REVIEW.md:923 +
Corrections Made entry, test_methodology_bacon.py header + class
docstring) to match the detailed-note phrasing.
2. CHANGELOG R-parity claim: the Changed entry still said the new
default is "matching R `bacondecomp::bacon()`" while the methodology
review, registry, TODO, and parity tests all show direct R parity is
pending until the goldens land. Softened to "intended to match"
with the validation state spelled out (hand-calc + TWFE-vs-weighted-
sum identity at atol=1e-10; direct R bit parity pending).
3. R parity goldens TODO row: non-blocking, already tracked in TODO.md;
no action required this PR.
Tests: 61 pass in test_methodology_bacon.py + test_bacon.py (same as
post-R1), 3 skipped (R parity).
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
|
R3 verdict was ✅ Looks good with two P3 informational items:
1. Warning string accuracy: the always-treated remap warning told users to "pass first_treat=0 or first_treat=np.inf explicitly to silence this warning," but `first_treat` is the column-name argument to `fit()`, not a value slot. Rewrote to "recode the affected rows' first_treat values to 0 or np.inf in your input data before fitting" to point users at the actual remediation surface.
2. Test class docstring: clarified that the implementation contract `first_treat <= min(time)` includes the `== min(time)` boundary case that the paper's strict `t_i < 1` shorthand excludes, and explained the pragmatic rationale (units treated at the first observable period have no untreated cell and cannot contribute to any 2x2 DD as a treated cohort). Aligns with merged registry note wording and removes the shorthand-vs-implementation ambiguity that the R3 reviewer flagged.
Tests: 61 pass in test_methodology_bacon.py + test_bacon.py, 3 skipped (R parity).
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Looks good. I found no unmitigated P0/P1 issues in the changed Bacon methodology, weighting, remap, or default-propagation paths. Executive Summary
|
R4 verdict was Looks good with two P3 polish items (plus the standing R-parity-goldens follow-up, which is tracked and non-blocking). Both addressed:
1. Replicate-weight skip path (P3 Code Quality): the R1 skip-vs-error fix on `DiagnosticReport._check_bacon` only handled the within-unit-varying-survey ValueError case. Replicate-weight survey designs raise NotImplementedError from `BaconDecomposition.fit()` and were still falling through to the generic exception handler as `status="error"`. Added an `except NotImplementedError` branch that returns a structured skip with a reason naming the supported alternative (TSL-based design via SurveyDesign + precomputed= escape hatch). New regression at `TestBaconSurveyDesignNarrowing::test_diagnostic_report_skips_with_structured_reason_on_replicate_weights`.
2. Wrapper docstring mirror (P3 Documentation/Tests): the public `bacon_decompose()` and `TwoWayFixedEffects.decompose()` wrappers had short `first_treat` parameter descriptions that mentioned only the 0/np.inf never-treated convention, omitting the new always-treated remap + sentinel reservation contract that lives in `BaconDecomposition.fit()`. Mirrored the expanded contract into both wrappers (with a pointer to the full docstring on the class) so users reading wrapper help see the behavior change.
Tests: 62 pass in test_methodology_bacon.py + test_bacon.py (one new regression), 3 skipped (R parity); 98 pass across the broader bacon/decompose surface.
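The skip-vs-error routing from R1 and R4 can be sketched as a small exception-handling pattern. This is a hedged illustration, not the body of `DiagnosticReport._check_bacon`: the function `check_bacon`, the marker string used to recognize the survey ValueError, the reason texts, and the `"ok"` status are assumptions for the example; only the `"skipped"` and `"error"` statuses come from the notes above.

```python
# Hypothetical marker; how the real code recognizes this ValueError is not shown in the PR text.
UNIT_CONSTANT_MARKER = "vary within a unit"

def check_bacon(run_decomposition):
    """Run a decomposition callable and map known failures to structured blocks."""
    try:
        result = run_decomposition()
    except NotImplementedError as exc:
        # Replicate-weight designs: skip, and name the supported alternative.
        return {"status": "skipped",
                "reason": f"replicate weights unsupported ({exc}); "
                          "use a TSL-based design or pass precomputed={'bacon': ...}"}
    except ValueError as exc:
        if UNIT_CONSTANT_MARKER in str(exc):
            # Time-varying within-unit survey columns: skip with the escape hatch.
            return {"status": "skipped",
                    "reason": "survey columns vary within unit; pass "
                              "precomputed={'bacon': ...} with weights=\"approximate\""}
        return {"status": "error", "reason": str(exc)}  # other ValueErrors still surface
    except Exception as exc:
        return {"status": "error", "reason": str(exc)}
    return {"status": "ok", "result": result}

def _raises(exc):
    """Build a stub callable that raises the given exception."""
    def stub():
        raise exc
    return stub

varying = check_bacon(_raises(ValueError(f"survey weights {UNIT_CONSTANT_MARKER}")))
replicate = check_bacon(_raises(NotImplementedError("replicate weights")))
other = check_bacon(_raises(ValueError("bad column name")))
fine = check_bacon(lambda: 0.42)
```

Ordering the `NotImplementedError` and `ValueError` branches before the generic handler keeps the two documented skip cases from being swallowed as opaque errors.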
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good — no unmitigated P0/P1 findings. Executive Summary
|
Closes the PR #454 deferred R parity follow-up (TODO.md row removed).

Generated `benchmarks/data/r_bacondecomp_golden.json` from the committed `benchmarks/R/generate_bacon_golden.R` script against `bacondecomp 0.1.1` on R 4.5.2. Three DGP fixtures: `uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`.

Parity results at atol=1e-6 via `tests/test_methodology_bacon.py::TestBaconParityR`:
- TWFE coefficient: ✅ matches across all 3 fixtures
- Weights-sum: ✅ matches across all 3 fixtures
- Per-component: ✅ on the 2 non-remap fixtures; **structural convention divergence** on `always_treated_remapped` (skipped per-component, kept aggregate). R keeps `first_treat=1` as a distinct timing cohort and emits `Later vs Always Treated` comparisons; Python's paper-footnote-11 convention remaps those units to `U` and folds them into a single `treated_vs_never` cell per treated cohort. The aggregate is invariant per Theorem 1: the U bucket's weight is re-allocated across nested 2x2 cells but the total weight on {cohort_k vs U} is identical. Only the per-component breakdown differs structurally between conventions.

Tracker promotions:
- METHODOLOGY_REVIEW.md: BaconDecomposition status row → **Complete** (was `**Complete** (R parity pending)`); removed from In Progress prose mention; removed from Priority Order substantive-review list; Test Coverage count refreshed (24 → 33); R Comparison Results block rewritten as **Validated**.
- docs/methodology/REGISTRY.md: Reference Implementations bullet + Verified Components checklist + Note (weight modes) updated; new Note (R parity convention divergence on always-treated) documents the convention.
- TODO.md: BaconDecomposition R parity goldens row removed.
- CHANGELOG.md: new `[Unreleased]` Added bullet for the close-out; PR-B Changed entry tightened ("intended to match" → "matching ... at atol=1e-6").
- diff_diff/bacon.py: `bacon_decompose` docstring example wording tightened from "intended to match" to "matches" with TestBaconParityR pointer.

Tests: 33/33 pass in test_methodology_bacon.py (no skips; was 30+3 skipped); 32 pass in test_bacon.py; 101 pass across the broader bacon/decompose surface (was 98+3 skipped).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e-out

R6 verdict was Looks good with 1 P3 informational item: the older PR-B audit bullet at CHANGELOG.md:13 (added in PR #454) still described the pre-goldens deferral state ("JSON goldens deferred", "TestBaconParityR skips with a pointer when goldens missing", "status flipped to **Complete (R parity goldens pending)**"). That contradicts the new PR-457 bullet at CHANGELOG.md:11 (committed goldens + 4 active parity tests) within the same [Unreleased] section, so the release notes read as internally inconsistent.

Updated 3 strings in the PR-B bullet to reflect the within-release close-out:
- Status flip wording: now says the (R parity pending) caveat was closed by the parity-goldens bullet above in this same release.
- TestBaconParityR description: 4 tests, all active post-release; skips only in partial-checkout scenarios.
- (4) outcome: parity goldens deferral was closed within this release.
Summary
- Promotes BaconDecomposition from In Progress → Complete (R parity pending) in METHODOLOGY_REVIEW.md. Operationalizes the paper review landed in PR #451 (Add Goodman-Bacon (2021) paper review) against `diff_diff/bacon.py`.
- Theorem 1 exact-weights rewrite (`bacon.py:_recompute_exact_weights`): the prior "exact" path was missing the `(1 - n_kU)` factor in subsample variance, did not square the sample share, and added an extraneous `unit_share` factor. The post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE. Rewrite implements Eqs. 7-9 + 10e-g exactly; TWFE-vs-weighted-sum identity now holds at `atol=1e-10`.
- Default `weights` flipped from `"approximate"` to `"exact"` at three entry points (`BaconDecomposition.__init__`, `bacon_decompose`, `TwoWayFixedEffects.decompose`). Approximate remains opt-in. Survey-design behavior change: exact mode rejects time-varying within-unit survey columns; users with such inputs migrate to `weights="approximate"`. Documented in CHANGELOG + the `bacon_decompose()` docstring.
- Always-treated warn+remap (`first_treat <= min(time)`, excluding sentinels `{0, np.inf}`): works correctly on event-time panels with negative or zero-crossing labels. Remapping uses an internal column; user's `first_treat` column is preserved unchanged. Count surfaced as new `BaconDecompositionResults.n_always_treated_remapped` field. Loop gate for `treated_vs_never` comparisons uses POST-remap U count so panels whose U is entirely remapped always-treated still emit `β̂_{kU}` terms.
- REGISTRY.md `## BaconDecomposition` block replaced with paper-review-sourced content + four sub-notes (weight modes, always-treated remap, R parity status, unbalanced-panel deviation). Explicit removal: the prior block's "Weights may be negative for later-vs-earlier comparisons" claim was incorrect. Theorem 1 weights are strictly positive and sum to 1; negative weights are an estimand-level phenomenon at the ATT level, not estimator-level.
- `tests/test_methodology_bacon.py` (~28 tests across 6 classes): hand-checks of Eqs. 7-9 + 10b-d at `atol=1e-10` on a hand-derived balanced panel, R parity (skip on missing goldens), always-treated remap mechanics, edge cases (negative time, unbalanced panel, no-untreated, single-cohort), weight modes, survey-design narrowing.
- `benchmarks/R/generate_bacon_golden.R` covering three DGP fixtures. JSON goldens deferred until `bacondecomp` R package is installed (tracked in TODO.md).

Methodology references (required if estimator / math changes)
- … `bacon.py`), TwoWayFixedEffects.decompose (`twfe.py`)
- `docs/methodology/papers/goodman-bacon-2021-review.md` (PR #451, Add Goodman-Bacon (2021) paper review).
- (a) … `UserWarning` (paper Appendix A's proof assumes balanced; library extension, Theorem 1 identity holds only approximately); (b) `weights="approximate"` opt-in fast path with simplified variance not present in R `bacondecomp` (numerical output may differ from R); (c) `weights="exact"` requires within-unit-constant survey columns (paper notation does not specify a survey design; library convention from the exact-path per-unit aggregation). All three documented in REGISTRY.md `## BaconDecomposition` Notes / Deviation block.

Validation
- `tests/test_methodology_bacon.py` (new, ~28 tests, 6 classes); `tests/test_bacon.py` updated (`test_weighted_sum_equals_twfe` tolerance tightened to `< 1e-10`; `TestWeightsParameter` renamed `test_approximate_weights_default` → `test_exact_weights_default` + `test_approximate_weights_opt_in`). Hand-calculable Theorem 1 DGP derived in-test with weights `{0.3, 0.4, 0.1, 0.2}` precomputed from Eqs. 10e-g.
- … `pytest -k 'bacon or decompose'`).

Security / privacy
Generated with Claude Code