diff --git a/CHANGELOG.md b/CHANGELOG.md index 2e845ac3..964e35b9 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,10 +8,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **BaconDecomposition: Goodman-Bacon (2021) methodology audit (PR-B).** Closes the BaconDecomposition row in `METHODOLOGY_REVIEW.md` (status flipped from **In Progress** → **Complete (R parity goldens pending)**). Builds on the PR #451 paper review at `docs/methodology/papers/goodman-bacon-2021-review.md`. **Audit outcomes:** (1) Rewrote `_recompute_exact_weights` in `bacon.py` to actually implement Theorem 1 (Eqs. 7-9 + 10e-g) — the prior "exact" implementation was missing the `(1-n_kU)` factor in the subsample variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper; the post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE on a 3-cohort + never-treated DGP. The rewrite computes the exact numerators of Eqs. 10e/f/g and lets the post-hoc normalization handle the `V̂^D` denominator (Theorem 1's identity guarantees `V̂^D = Σ numerators`). The TWFE-vs-weighted-sum identity now holds at `atol=1e-10` on both noisy and hand-calculable DGPs. (2) Added always-treated warn+remap per paper footnote 11: units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` (untreated) bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning`. Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (event-time encodings) are handled correctly; the `0` sentinel restriction applies only to `first_treat`, not to `time`, and a real treatment cohort with `first_treat == 0` would still be folded into U today (re-label such cohorts to a non-sentinel value before fitting). The user's original `first_treat` column is preserved unchanged. The count is surfaced as a new `BaconDecompositionResults.n_always_treated_remapped` dataclass field, rendered in `summary()` output when nonzero. **`n_never_treated` reports TRUE never-treated only**, computed from the original user column before remap — remapped always-treated units appear separately as `n_always_treated_remapped`, no double-counting. (3) New methodology test file `tests/test_methodology_bacon.py` (~24 tests across 6 classes: `TestBaconHandCalculation` hand-checks Eqs. 7-9 + 10b-d on a minimal balanced panel at `atol=1e-10`; `TestBaconParityR` skips with a pointer when goldens missing; `TestBaconAlwaysTreatedRemap` regression-tests warn+remap mechanics including user-data-preservation; `TestBaconEdgeCases` exercises no-untreated, single-cohort, unbalanced panel, constant-ATT recovery; `TestBaconWeightModes` locks the new exact-is-default contract; `TestBaconSurveyDesignNarrowing` confirms survey_design composes with exact mode and warn+remap). (4) R `bacondecomp::bacon()` parity generator committed at `benchmarks/R/generate_bacon_golden.R` covering three DGP fixtures (3-groups-with-U, 2-groups-no-U, always-treated-remapped); JSON goldens deferred until `bacondecomp` R package is installed (parity tests skip cleanly with an explicit pointer). (5) `docs/methodology/REGISTRY.md` `## BaconDecomposition` block replaced with the paper-review-sourced entry plus three new sub-notes: weight modes (exact vs approximate), always-treated remap, R parity status. **Explicit removal:** the prior REGISTRY block's "Weights may be negative for later-vs-earlier comparisons" claim was incorrect per Theorem 1 (decomposition weights are strictly positive and sum to 1; negative weights are an estimand-level phenomenon, not estimator-level) and is dropped from the new entry. Closes the BaconDecomposition follow-up tracked at `TODO.md` (the prior row added in PR #451 is replaced by a narrower R-parity-goldens deferral row). - **`SpilloverDiD` — ring-indicator spillover-aware DiD (Butts 2021).** New standalone estimator at `diff_diff/spillover.py` implementing two-stage Gardner methodology with ring-indicator covariates that identify direct effect on treated (`tau_total`) alongside per-ring spillover effects on near-control units (`delta_j`). Documented synthesis of ingredients (no single published software covers the exact recipe — `did2s` implements Gardner two-stage without rings; the Butts ring estimator has no R/Stata package): Butts (2021) Section 5 / Table 2 identification, Gardner (2022) two-stage residualize-then-fit, and the Conley spatial-HAC vcov shipped in 3.3.3. Handles both panel non-staggered (Equations 5/6/8) and Section 5 staggered timing in one estimator — non-staggered is the special case where all treated units share an onset time. **API:** `SpilloverDiD(rings=[0, 50, 100, 200], conley_coords=("lat","lon"), ...).fit(data, outcome="y", unit="unit", time="t", treatment="D")` (binary D auto-converted to `first_treat`) or `.fit(..., first_treat="first_treat")` (Gardner convention). Result: `SpilloverDiDResults(DiDResults)` with `.att` = `tau_total`, `.spillover_effects` (per-ring `pd.DataFrame` with `coef`/`se`/`t_stat`/`p_value`/`ci_low`/`ci_high`), `.ring_breakpoints`, `.d_bar`, `.n_units_ever_in_ring`, `.n_far_away_obs`, `.is_staggered`. `.coefficients` exposes all `(1+K)` stage-2 entries (`"treatment"` + `"_spillover_"`) plus an `"ATT"` alias keyed to vcov columns. **Methodology spec (committed):** stage-2 regressor is the time-varying `(1 - D_it) * Ring_{it,j}` form (paper page 12's `S_it = S_i * 1{t >= t_treat}` notation; Section 5 Table 2's `S^k_{it}` / `Ring^k_{it,j}`). Reading the literal unit-static `(1 - D_it) * S_i` from Equation 5 is algebraically rank-deficient under TWFE (`(1-D_it) * S_i = S_i - D_it`, with `S_i` absorbed by `mu_i`, leaving `-D_it`); only the time-varying form supports the paper's identification (Proposition 2.3). Stage-1 subsample uses Butts' STRICTER `Omega_0 = {D_it = 0 AND S_it = 0}` (untreated AND unexposed), not TwoStageDiD's `{D_it = 0}` alone — this prevents spillover-contaminated near-controls in pre/post periods from biasing the time FE. **Gardner identity (non-staggered):** a 20-seed deterministic regression test pins `SpilloverDiD.att` against a direct single-stage TWFE ring regression on the full sample (`y ~ mu_i + lambda_t + tau * D_it + sum_j delta_j * (1 - D_it) * Ring_{it,j}`) at `atol=1e-10` — empirically bit-identical, so the reported non-staggered `tau_total` IS the Butts Eqs. 4-6 estimator. **Identification-check policy (period strict, unit warn-and-drop, plus connectivity):** every period must have at least one Omega_0 row (hard `ValueError` — dropping a period removes all units' cross-time identification). Units lacking Omega_0 rows (e.g. baseline-treated units with `D_it = 1` at every observed `t`) are warned-and-dropped: their unit FE is NaN, residualization writes NaN on their rows, and the downstream finite-mask path excludes them from stage 2 — mirrors `TwoStageDiD`'s always-treated convention. Additionally, the supported-units bipartite graph (units linked by shared Omega_0 periods) must form a single connected component; `K > 1` components raise `ValueError` because the FE solver would return only component-specific constants and residualization would silently mix them across components (defense-in-depth — under absorbing treatment the disconnected case may be unreachable through the upstream validators, but the check future-proofs Wave B follow-ups). **Public API restrictions (Wave B MVP):** `covariates=` raises `NotImplementedError` because Gardner-style two-stage requires covariate effects estimated on the untreated-and-unexposed subsample at stage 1 (appending raw covariates only at stage 2 silently biases `tau_total` / `delta_j` on panels with time-varying covariates); non-absorbing / reversible treatment patterns (e.g. `[0, 1, 0]`) raise `ValueError` rather than being silently coerced into "treated from first 1 onward"; non-constant `first_treat` values across rows of the same unit raise `ValueError`; `conley_coords` is required on every fit path (not just `vcov_type="conley"`) because ring construction always uses it. **Far-away control identification:** uses CURRENT-period untreated status (`D_it = 0`) rather than never-treated-only, so all-eventually-treated staggered designs (no never-treated units) can identify the counterfactual via not-yet-treated far-away rows. **Variance (Wave B MVP):** stage-2 OLS variance via `solve_ols` (HC1 / Conley / cluster paths all flow through). The Gardner GMM first-stage uncertainty correction is NOT applied at stage 2 in this PR (documented limitation; planned follow-up extends `two_stage.py::_compute_gmm_variance` to accept a Conley kernel matrix in place of HC1's identity at the influence-function outer-product step). **Deferred features (planned follow-ups):** `event_study=True` per-event-time × ring coefficients (Butts Table 2), `survey_design=` integration, `ring_method="count"` (count-of-treated-in-ring), data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight), Gardner GMM first-stage correction at stage 2, sparse staggered ring-distance path. **Tests:** `tests/test_spillover.py` (157 tests across ring-construction primitives, validators, fit integration, raw-data invariant, identification MC — non-staggered DGP at 50 seeds + 200-seed `@pytest.mark.slow` variant recovers both `tau_total` and `delta_1`; staggered DGP at 30 seeds anchors both `tau_total` and `delta_1` — Conley plumbing (verifies `solve_ols` is called with `vcov_type="conley"` + Conley kwargs, no silent HC1 fallback), Gardner identity bit-identity, coefficients-vs-vcov alignment, warn-and-drop, rank_deficient_action validation, Omega_0 bipartite-graph connectivity, anticipation behavior on both fit paths). DGP factories `tests/_dgp_utils.py::generate_butts_nonstaggered_dgp` / `generate_butts_staggered_dgp` satisfy Butts Assumptions 1/3/5/7 by construction. - **`ChaisemartinDHaultfoeuille.predict_het` × `placebo`: R-parity on both global and per-path surfaces.** R-verified — `did_multiplegt_dyn(predict_het, placebo)` emits heterogeneity OLS results on backward (placebo) horizons via R's `DIDmultiplegtDYN:::did_multiplegt_main` placebo block (`effect = matrix(-i, ...)` rbind site); the same block runs per-by_level under `did_multiplegt_dyn(by_path, predict_het, placebo)`, so both global `res$results$predict_het` and per-by_level `res$by_level_i$results$predict_het` slots emit backward rows. R's predict_het syntax with `placebo > 0` requires the `c(-1)` sentinel in the horizon vector to trigger "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" — passing positive-only horizons errors with "specified numbers in predict_het that exceed the number of placebos". Python mirrors via `_compute_heterogeneity_test(..., placebo=L_max)` (set automatically from `self.placebo` at both global and per-path call sites in `fit()`) — the function iterates forward (1..L_max) and backward (-1..-L_max) horizons in a single loop with an explicit `out_idx < 0` eligibility guard for backward horizons whose `F_g` is too small (would otherwise silently misread `N_mat` via numpy negative indexing). `results.heterogeneity_effects` uses negative-int keys for backward horizons; `path_heterogeneity_effects` does the same per path. Placebo rows in `to_dataframe(level="by_path")` have non-NaN `het_*` columns when `placebo=True` and `heterogeneity=` are both set. **Survey gate (warn + skip):** `survey_design + placebo + heterogeneity` emits a `UserWarning` at fit-time and falls back to forward-horizon-only heterogeneity on both surfaces — the Binder TSL cell-period allocator's REGISTRY justification is tied to **post-period** attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, a separate library-extension claim that needs its own derivation. Forward-horizon `predict_het + survey_design` continues to work unchanged on both global and per-path surfaces. The function-level `_compute_heterogeneity_test` keeps a per-iteration `NotImplementedError` backstop for direct callers that bypass fit(). Pre-period allocator derivation deferred to a follow-up methodology PR (tracked in TODO.md). R parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneityWithPlacebo` (scenario 23, `multi_path_reversible_predict_het_with_placebo_global`, `placebo=2, effects=3, no by_path`) and `::TestDCDHDynRParityByPathHeterogeneityWithPlacebo` (scenario 22, same DGP plus `by_path=3`); pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` for `beta` / `se` / `t_stat` / `n_obs` and `INFERENCE_RTOL=1e-4` for `p_value` / `conf_int` across 3 paths × (3 forward + 2 placebo) = 15 horizons + 1 global × 5 horizons. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo` (placebo het column population, survey-gate warn+skip behavior, forward+survey anti-regression, `out_idx<0` eligibility guard, single-path telescope `path_heterogeneity_effects[(only_path,)] == heterogeneity_effects` bit-exactly, summary rendering, direct-call `NotImplementedError` backstop). Closes TODO #422. ### Changed +- **BaconDecomposition: default `weights` flipped from `"approximate"` to `"exact"` (PR-B methodology audit).** The new default uses Goodman-Bacon (2021) Theorem 1's exact Eqs. 7-9 + 10e-g weights, **intended to match** R `bacondecomp::bacon()` (direct R parity at `atol=1e-6` pending the R `bacondecomp` install; validated via hand-calculation + TWFE-vs-weighted-sum identity at `atol=1e-10` — see TODO.md). The `weights="approximate"` path remains available as an opt-in fast diagnostic for speed-sensitive loops; its numerical output may differ from R. Three entry points were flipped: `BaconDecomposition(weights="exact")` (`bacon.py:397`), `bacon_decompose(weights="exact")` (`bacon.py:1064`), `TwoWayFixedEffects.decompose(weights="exact")` (`twfe.py:684`). **Behavior change for users not passing explicit `weights=`**: the decomposition weights are now paper-faithful by default. Users who depended on the previous `"approximate"` numerics for diagnostic plots or comparison-type weight shares can preserve the old behavior by passing `weights="approximate"` explicitly. **Survey-design behavior change**: `weights="exact"` (now the default) routes through `_validate_unit_constant_survey`, which rejects survey designs whose weights / strata / PSU / FPC columns vary within a unit across periods (the exact-mode path collapses to per-unit aggregation via `groupby().first()`). The previous `weights="approximate"` default tolerated time-varying within-unit survey weights via observation-level weighted means. Users whose survey-weighted Bacon calls used time-varying within-unit weights must now either (a) collapse their weights to be unit-constant or (b) pass explicit `weights="approximate"` to retain the legacy obs-level path. The production diagnostic surface (`diff_diff/diagnostic_report.py:1740`) was updated to pass explicit `weights="exact"`. Existing test assertions in `tests/test_bacon.py` continue to pass with the new default; the `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the Theorem 1 algebraic-identity contract. + - **`ChaisemartinDHaultfoeuille.predict_het` inference: t-distribution df threading (closes TODO pilot-412).** `_compute_heterogeneity_test` now passes `df = n_obs - rank(design)` to `safe_inference` on the non-survey OLS path, matching R `did_multiplegt_dyn(predict_het=...)`'s t-distribution inference (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). Pre-PR Python used `df=None` (normal Z critical), producing 0.1-2% rtol gaps on `p_value` and `conf_int` vs R. Parity tolerance tightened on the existing forward-horizon scenarios (`multi_path_reversible_predict_het`, `multi_path_reversible_by_path_predict_het`) from "unpinned" to `INFERENCE_RTOL=1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5`. **Post-drop rank (post-2026-05-16 wrap-up):** the df denominator uses the post-drop numerical rank via `_detect_rank_deficiency`, which `solve_ols` already calls internally. For full-rank designs `rank == n_params` and behavior is bit-identical to the pre-PR `n_obs - n_params` path; for near-rank-deficient designs that `solve_ols` retains rather than NaN-out (e.g., cohort-collinearity at high horizons), the post-drop rank is strictly lower and the post-PR `df` is larger, matching R's `lm()` convention. The Z-vs-t REGISTRY deviation note is replaced with an "R parity (post-2026-05-15 df threading)" positive-claim note. - **`ChaisemartinDHaultfoeuille.by_path` negative-baseline path regression coverage.** New `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_baseline_path_supported` exercises switchers with `D_{g,1} = -1` and asserts that `path_effects` correctly contains negative-baseline tuple keys (e.g., `(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). This closes the test-coverage gap from PR #419: the existing `test_negative_integer_D_supported` only covered paths with negative values in non-baseline positions (e.g., `(0, -1, -1, -1)`), which does not trigger R's documented `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key matching is correct under any baseline value; this test pins the contract. No R-parity fixture is added because R is the buggy side on this regime — the deviation is documented in the REGISTRY non-binary treatment Note. diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md index 7607f706..e5dc409a 100644 --- a/METHODOLOGY_REVIEW.md +++ b/METHODOLOGY_REVIEW.md @@ -78,7 +78,7 @@ The catalog grew incrementally over several quarters, so formats vary across the | Tool | Module | R Reference | Status | Last Review | |------|--------|-------------|--------|-------------| -| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **In Progress** | — | +| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** (R parity pending) | 2026-05-16 | | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 | | PreTrendsPower | `pretrends.py` | `pretrends` package | **In Progress** | — | | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** | — | @@ -909,18 +909,41 @@ and covariate-adjusted specifications.) | Module | `bacon.py` | | Primary Reference | Goodman-Bacon (2021), *Difference-in-differences with variation in treatment timing*, J. Econometrics 225(2), 254-277 | | R Reference | `bacondecomp::bacon()` | -| Status | **In Progress** | -| Last Review | — | +| Status | **Complete** (R parity goldens pending) | +| Last Review | 2026-05-16 | -**Documentation in place:** -- REGISTRY.md section: `## BaconDecomposition` (three comparison types — treated_vs_never, earlier_vs_later, later_vs_earlier; weight construction; TWFE reconstitution; weighted survey path under Phase 3) -- Implementation: `tests/test_bacon.py` (basic decomposition, weight properties, integration with `TwoWayFixedEffects.decompose()`) +**Verified Components:** +- [x] Theorem 1 decomposition identity: `β̂^DD = Σ s · β̂^{2x2}` at `atol=1e-10` (hand-calculable + noisy DGPs) +- [x] Weight sum-to-1: `Σ s = 1.0` at `atol=1e-10` under `weights="exact"` +- [x] Three comparison types correctly classified: `treated_vs_never`, `earlier_vs_later`, `later_vs_earlier` +- [x] Eq. 7 hand-checked: `V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)` (via weight-ratio test, `atol=1e-10`) +- [x] Eq. 8 hand-checked: `V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) · (1-D̄_k)/(1-D̄_ℓ)` +- [x] Eq. 9 hand-checked: `V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k-D̄_ℓ)/D̄_k` +- [x] Eq. 10b 2x2 estimator value: hand-calculable panel → β̂_{kU}^{2x2} = ATT exactly +- [x] Always-treated remap to U (paper footnote 11): `first_treat <= min(time)` (excluding never-treated sentinels `0` and `np.inf`) units auto-remapped via internal column, user's data preserved, count exposed on result +- [x] `weights="exact"` is the default (PR-B 2026-05-16); `weights="approximate"` retained as opt-in +- [x] Unbalanced panel: accepted with `UserWarning` (paper assumes balanced; library extension) +- [x] No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds +- [x] Single timing group with U: only `treated_vs_never` comparisons +- [x] Survey design composes cleanly with exact mode and warn+remap +- [ ] R `bacondecomp::bacon()` parity at `atol=1e-6` — R generator script committed; JSON goldens pending follow-up R install (see TODO.md) -**Outstanding for promotion:** -- **Substantive review pass — first target chosen during the 2026-05-15 methodology-review refresh session.** Read Goodman-Bacon (2021), audit `bacon.py` against the paper's decomposition (Equation 11, weight construction in Section 3, three comparison types in Section 4), generate R parity fixtures via `bacondecomp::bacon()`, write `tests/test_methodology_bacon.py` with paper-equation-numbered assertions, populate Verified Components / Corrections Made / Deviations here. -- Paper review under `docs/methodology/papers/goodman-bacon-2021-review.md` -- R parity fixture against `bacondecomp::bacon()` covering treated_vs_never, earlier_vs_later, later_vs_earlier weight buckets and their relative shares -- Verify the REGISTRY Implementation Checklist (all rows currently unchecked except the survey-design Phase 3 row) +**Test Coverage:** +- 24 methodology tests in `tests/test_methodology_bacon.py` across 6 classes (21 active + 3 R-parity tests that skip on missing goldens) +- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases) + +**R Comparison Results:** +- **Pending**: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time. R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`. JSON goldens land at `benchmarks/data/r_bacondecomp_golden.json` once generated. `tests/test_methodology_bacon.py::TestBaconParityR` skips with a pointer until then. Tracked in TODO.md follow-up row. + +**Corrections Made:** +1. **Theorem 1 exact-weights rewrite** (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g — it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. **Rewrote** the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract. +2. **Default `weights` flipped from `"approximate"` to `"exact"`** at three entry points: `BaconDecomposition.__init__()` (`bacon.py:397`), `bacon_decompose()` convenience function (`bacon.py:1064`), `TwoWayFixedEffects.decompose()` (`twfe.py:684`). The paper-faithful Theorem 1 weights are now the default; the simplified approximate path remains opt-in via explicit `weights="approximate"`. `diff_diff/diagnostic_report.py:1740` (production diagnostic surface) was updated to pass explicit `weights="exact"`. +3. **Always-treated warn+remap via internal column** (`bacon.py:fit()`, lines ~487-525). Paper footnote 11 puts units with `t_i < 1` in `U`, but `bacon.py` previously only mapped `first_treat ∈ {0, np.inf}` into U. Added detection using ordered-time logic on the **time axis** (`first_treat <= min(time)` while excluding the never-treated sentinels `0` and `np.inf`) with `UserWarning` and automatic remap via an internal column (`__bacon_first_treat_internal__`), preserving the user's `first_treat` column unchanged. Detection handles event-time-encoded panels (`time ∈ [-2,..,3]`) correctly; the `0` sentinel restriction applies only to `first_treat`. Count exposed via new `BaconDecompositionResults.n_always_treated_remapped` field. + +**Deviations from R's `bacondecomp::bacon()`:** +1. **Unbalanced panel acceptance** (library extension): R errors on unbalanced panels; Python emits a `UserWarning` and decomposes. The paper's Appendix A proof assumes balanced panels — decomposition on unbalanced panels is approximate to Theorem 1. +2. **Approximate weight mode** (Python-only optimization): `weights="approximate"` is a library-only fast path with simplified variance computation, not present in R. Users who want Python-R numerical parity should pass `weights="exact"` (the new default). +3. **NaN for invalid inference fields not applicable**: the decomposition is deterministic; there are no SE/p-value fields on the comparison output. The `decomposition_error` field is a finite float (zero in well-conditioned cases). --- diff --git a/TODO.md b/TODO.md index 2e336f97..41693602 100644 --- a/TODO.md +++ b/TODO.md @@ -74,7 +74,7 @@ Deferred items from PR reviews that were not addressed before merge. | Issue | Location | PR | Priority | |-------|----------|----|----------| -| BaconDecomposition: substantive methodology audit pass against Goodman-Bacon (2021). Paper review on file at `docs/methodology/papers/goodman-bacon-2021-review.md` includes a proposed Methodology Registry Entry, but the `REGISTRY.md` `## BaconDecomposition` section (lines ~2598-2654) still carries the older contract. Audit pass needs to (a) verify `bacon.py` matches Theorem 1 / Eqs. 10a-g exactly, (b) decide how to handle genuinely always-treated units (`0 < first_treat <= min(time)`) — paper convention puts them in `U`, but `bacon.py` currently treats only `first_treat ∈ {0, np.inf}` as the `U` bucket, (c) generate R parity fixtures via `bacondecomp::bacon()`, (d) write `tests/test_methodology_bacon.py`, (e) replace the REGISTRY entry with the proposed text, (f) populate Verified Components / Corrections Made / Deviations in `METHODOLOGY_REVIEW.md` and flip status to Complete. | `diff_diff/bacon.py`, `docs/methodology/REGISTRY.md`, `tests/test_methodology_bacon.py`, `METHODOLOGY_REVIEW.md` | follow-up | Medium | +| BaconDecomposition R parity goldens: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time (2026-05-16). R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`, writing `benchmarks/data/r_bacondecomp_golden.json`. `tests/test_methodology_bacon.py::TestBaconParityR` (3 tests) skips with a pointer until the JSON lands. The PR-B audit substantiates Theorem 1 (Eqs. 7-9 + 10e-g) via hand-calculable + machine-precision identity tests; R parity is desirable as a cross-language anchor but not the only substantiation. Mirrors StaggeredTripleDifference precedent (PR #245). | `benchmarks/R/generate_bacon_golden.R`, `benchmarks/data/r_bacondecomp_golden.json` (TBD), `tests/test_methodology_bacon.py::TestBaconParityR` | follow-up | Medium | | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low | | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium | | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low | diff --git a/benchmarks/R/generate_bacon_golden.R b/benchmarks/R/generate_bacon_golden.R new file mode 100644 index 00000000..040f2552 --- /dev/null +++ b/benchmarks/R/generate_bacon_golden.R @@ -0,0 +1,234 @@ +#!/usr/bin/env Rscript +# Generate R bacondecomp parity goldens for diff-diff BaconDecomposition. +# +# Requires: install.packages("bacondecomp") (CRAN; main function is bacon()) +# install.packages("jsonlite") +# Output: ../data/r_bacondecomp_golden.json +# +# The diff-diff BaconDecomposition implementation (`diff_diff/bacon.py`) with +# the default ``weights="exact"`` is expected to match the values in this JSON +# to atol=1e-6 on the per-component (treated, control, type) tuples, and to +# match the TWFE coefficient to the same tolerance. The ``weights="approximate"`` +# path is a library-only optimization and is NOT covered by this parity harness. +# +# Three fixtures: +# 1. uniform_3groups_with_never_treated — 3 timing groups + never-treated U; +# exercises all three comparison types (treated/never, earlier/later, +# later/earlier). +# 2. two_groups_no_never_treated — 2 timing groups only; tests the +# timing-only decomposition where the s_{kU} terms drop. +# 3. always_treated_remapped — 3 timing groups + 1 always-treated cohort +# (first_treat = 1). Validates that Python's warn+remap of t_i < 1 into +# U matches R bacondecomp's native behavior. +# +# Run: +# cd benchmarks/R && Rscript generate_bacon_golden.R + +suppressPackageStartupMessages({ + library(bacondecomp) + library(jsonlite) +}) + +stopifnot(packageVersion("bacondecomp") >= "0.1.0") + +# --------------------------------------------------------------------------- +# DGP helpers +# --------------------------------------------------------------------------- + +# Build a balanced panel with absorbing treatment. +# n_units : units per timing group (excluding never-treated) +# n_periods : panel length (1..T) +# cohort_times : vector of first-treatment times, one per cohort +# always_treated_count : optional cohort treated at first_treat = 1 +# (i.e., always-treated for the observable window) +# never_treated_count : units with first_treat = 0 +# true_effect : constant ATT +# seed : reproducibility +build_panel <- function(n_units_per_cohort, n_periods, cohort_times, + always_treated_count = 0L, never_treated_count = 0L, + true_effect = 2.0, seed = 42L) { + set.seed(seed) + units <- list() + uid <- 1L + + # Always-treated cohort (first_treat = 1; treated in every observable period) + if (always_treated_count > 0L) { + for (i in seq_len(always_treated_count)) { + units[[length(units) + 1L]] <- data.frame( + unit = uid, time = seq_len(n_periods), first_treat = 1L + ) + uid <- uid + 1L + } + } + + # Never-treated U + if (never_treated_count > 0L) { + for (i in seq_len(never_treated_count)) { + units[[length(units) + 1L]] <- data.frame( + unit = uid, time = seq_len(n_periods), first_treat = 0L + ) + uid <- uid + 1L + } + } + + # Treated cohorts + for (g in cohort_times) { + for (i in seq_len(n_units_per_cohort)) { + units[[length(units) + 1L]] <- data.frame( + unit = uid, time = seq_len(n_periods), first_treat = as.integer(g) + ) + uid <- uid + 1L + } + } + + df <- do.call(rbind, units) + + # Treatment indicator: D_{it} = 1 iff first_treat in {1,..,T} AND time >= first_treat. + df$D <- as.integer(df$first_treat > 0L & df$time >= df$first_treat) + + # Outcome: unit FE + linear time + constant treatment effect + noise. + unit_fe <- rnorm(uid - 1L, sd = 2.0) + df$y <- unit_fe[df$unit] + + 0.1 * df$time + + true_effect * df$D + + rnorm(nrow(df), sd = 0.5) + + df +} + +# --------------------------------------------------------------------------- +# Extract bacondecomp::bacon() output into a fixture-shaped list. +# --------------------------------------------------------------------------- + +extract_bacon <- function(df, fixture_name) { + # bacondecomp::bacon takes the OUTCOME ~ TREATMENT formula plus id_var/time_var. + # It returns a data.frame with columns: treated, untreated, estimate, weight, + # plus a `type` column (e.g. "Both Treated", "Treated vs Untreated"), and an + # attribute beta_hat_w (the weighted sum, which equals the TWFE coefficient). + res <- bacondecomp::bacon( + formula = y ~ D, + data = df, + id_var = "unit", + time_var = "time" + ) + + # When the data contains a never-treated group, bacon() returns a list with + # $two_by_twos (the per-component table) and $Omega (the variance-weighted + # contributions). Without never-treated, it returns the data.frame directly. + if (is.list(res) && !is.data.frame(res)) { + components_df <- res$two_by_twos + twfe_coef <- as.numeric(attr(res, "beta_hat_w")) + # Fallback: re-derive TWFE from the components if attr is missing. + if (is.null(twfe_coef) || length(twfe_coef) == 0L) { + twfe_coef <- sum(components_df$estimate * components_df$weight) + } + } else { + components_df <- res + twfe_coef <- sum(components_df$estimate * components_df$weight) + } + + # Components vary across bacondecomp versions; normalize the column names. + cols <- names(components_df) + treated_col <- if ("treated" %in% cols) "treated" else "g1" + untreated_col <- if ("untreated" %in% cols) "untreated" else "g2" + estimate_col <- if ("estimate" %in% cols) "estimate" else "Estimate" + weight_col <- if ("weight" %in% cols) "weight" else "Weight" + type_col <- if ("type" %in% cols) "type" else NA_character_ + + components <- lapply(seq_len(nrow(components_df)), function(i) { + list( + treated_group = as.numeric(components_df[[treated_col]][i]), + control_group = as.numeric(components_df[[untreated_col]][i]), + estimate = as.numeric(components_df[[estimate_col]][i]), + weight = as.numeric(components_df[[weight_col]][i]), + type = if (!is.na(type_col)) + as.character(components_df[[type_col]][i]) + else NA_character_ + ) + }) + + weights_sum <- sum(sapply(components, function(c) c$weight)) + + list( + panel = list( + unit = as.integer(df$unit), + time = as.integer(df$time), + y = as.numeric(df$y), + first_treat = as.integer(df$first_treat), + treated = as.integer(df$D) + ), + r_twfe_coef = twfe_coef, + r_components = components, + r_weights_sum = weights_sum, + n_components = length(components) + ) +} + +# --------------------------------------------------------------------------- +# Fixtures +# --------------------------------------------------------------------------- + +cat("Building fixture 1: uniform_3groups_with_never_treated...\n") +df1 <- build_panel( + n_units_per_cohort = 30L, + n_periods = 6L, + cohort_times = c(3L, 4L, 5L), + always_treated_count = 0L, + never_treated_count = 30L, + true_effect = 2.0, + seed = 101L +) +fixture_1 <- extract_bacon(df1, "uniform_3groups_with_never_treated") + +cat("Building fixture 2: two_groups_no_never_treated...\n") +df2 <- build_panel( + n_units_per_cohort = 30L, + n_periods = 6L, + cohort_times = c(3L, 5L), + always_treated_count = 0L, + never_treated_count = 0L, + true_effect = 2.0, + seed = 202L +) +fixture_2 <- extract_bacon(df2, "two_groups_no_never_treated") + +cat("Building fixture 3: always_treated_remapped...\n") +# 3 timing-cohorts + 5 always-treated units (first_treat = 1, i.e., treated +# in every observable period) + 30 never-treated. R's bacondecomp natively +# groups the first_treat=1 cohort with U (since they are treated throughout +# every observable period and never serve as a within-window control), which +# matches what diff-diff's warn+remap does in Python. +df3 <- build_panel( + n_units_per_cohort = 25L, + n_periods = 6L, + cohort_times = c(3L, 4L, 5L), + always_treated_count = 5L, + never_treated_count = 25L, + true_effect = 2.0, + seed = 303L +) +fixture_3 <- extract_bacon(df3, "always_treated_remapped") + +# --------------------------------------------------------------------------- +# Write JSON +# --------------------------------------------------------------------------- + +out <- list( + meta = list( + generated_at = format(Sys.Date()), + bacondecomp_version = as.character(packageVersion("bacondecomp")), + r_version = R.version.string, + description = paste( + "Goodman-Bacon (2021) decomposition parity goldens for diff-diff", + "BaconDecomposition. Parity target: atol=1e-6 on per-component", + "(treated, control, type) tuples plus the TWFE coefficient." + ) + ), + uniform_3groups_with_never_treated = fixture_1, + two_groups_no_never_treated = fixture_2, + always_treated_remapped = fixture_3 +) + +out_path <- "../data/r_bacondecomp_golden.json" +write_json(out, out_path, pretty = TRUE, digits = NA, auto_unbox = TRUE) +cat(sprintf("Wrote %s\n", out_path)) diff --git a/diff_diff/bacon.py b/diff_diff/bacon.py index 90438a9f..b0c690f1 100644 --- a/diff_diff/bacon.py +++ b/diff_diff/bacon.py @@ -32,6 +32,13 @@ class Comparison2x2: The timing group used as "treated" in this comparison. control_group : Any The timing group used as "control" in this comparison. + For ``comparison_type="treated_vs_never"``, this is the literal + string ``"never_treated"``, which refers to the **post-remap U + bucket** (the paper's ``U`` per Goodman-Bacon 2021 footnote 11). + On inputs with no remapped always-treated units this is exactly + the true never-treated set; with remapping it is the broader U + bucket. Check ``BaconDecompositionResults.n_never_treated`` and + ``n_always_treated_remapped`` for the precise composition. comparison_type : str Type of comparison: "treated_vs_never", "earlier_vs_later", or "later_vs_earlier". @@ -94,6 +101,17 @@ class BaconDecompositionResults: Number of distinct treatment timing groups. n_never_treated : int Number of never-treated units. + n_always_treated_remapped : int + Number of units whose ``first_treat`` was at or before the first + observable period (``first_treat <= min(time)``, excluding the + never-treated sentinels ``0`` and ``np.inf``) and which were + automatically remapped to the ``U`` (untreated) bucket per + Goodman-Bacon (2021) footnote 11. Detection uses ordered-time + logic so negative or zero-crossing period labels work correctly. + Zero on inputs where the user only used the ``first_treat ∈ {0, + np.inf}`` sentinels. The user's original ``first_treat`` column + is preserved unchanged on the input ``data`` frame; remapping + happens in an internal column. timing_groups : List[Any] List of treatment timing cohorts. """ @@ -111,6 +129,9 @@ class BaconDecompositionResults: timing_groups: List[Any] n_obs: int = 0 decomposition_error: float = field(default=0.0) + # Count of units auto-remapped from 0 < first_treat <= min(time) into the + # U bucket per Goodman-Bacon (2021) footnote 11. Always 0 on legacy inputs. + n_always_treated_remapped: int = 0 # Survey design metadata (SurveyMetadata instance from diff_diff.survey) survey_metadata: Optional[Any] = field(default=None) @@ -139,8 +160,12 @@ def summary(self) -> str: f"{'Treatment timing groups:':<35} {self.n_timing_groups:>10}", f"{'Never-treated units:':<35} {self.n_never_treated:>10}", f"{'Total 2x2 comparisons:':<35} {len(self.comparisons):>10}", - "", ] + if self.n_always_treated_remapped > 0: + lines.append( + f"{'Always-treated remapped to U:':<35} " f"{self.n_always_treated_remapped:>10}" + ) + lines.append("") # Add survey design info if self.survey_metadata is not None: @@ -322,15 +347,19 @@ class BaconDecomposition: Parameters ---------- - weights : str, default="approximate" + weights : str, default="exact" Weight calculation method: + - "exact" (default): Variance-based weights from Goodman-Bacon (2021) + Theorem 1, Eqs. 7-9 and 10e-g. Produces the paper-faithful + decomposition where the weighted sum matches the TWFE estimate + to machine precision. Use for publication-quality work and the + standard methodology contract. - "approximate": Fast simplified formula using group shares and - treatment variance. Good for diagnostic purposes where relative - weights are sufficient to identify problematic comparisons. - - "exact": Variance-based weights from Goodman-Bacon (2021) Theorem 1. - Use for publication-quality decompositions where the weighted sum - must closely match the TWFE estimate. + treatment variance, with post-hoc sum-to-1 normalization. Opt + in for speed-sensitive diagnostic loops where the relative weight + structure is sufficient. Approximate-mode results may differ + numerically from R ``bacondecomp::bacon()``. Attributes ---------- @@ -394,17 +423,18 @@ class BaconDecomposition: TwoWayFixedEffects : The TWFE estimator being decomposed """ - def __init__(self, weights: str = "approximate"): + def __init__(self, weights: str = "exact"): """ Initialize BaconDecomposition. Parameters ---------- - weights : str, default="approximate" + weights : str, default="exact" Weight calculation method: - - "approximate": Fast simplified formula (default) - - "exact": Variance-based weights from Goodman-Bacon (2021) + - "exact" (default): Variance-based weights from Goodman-Bacon + (2021) Theorem 1 (paper-faithful Eqs. 7-9 and 10e-g). + - "approximate": Fast simplified formula. Opt in for speed. """ if weights not in ("approximate", "exact"): raise ValueError(f"weights must be 'approximate' or 'exact', got '{weights}'") @@ -435,8 +465,26 @@ def fit( time : str Name of time period column. first_treat : str - Name of column indicating when unit was first treated. - Use 0 (or np.inf) for never-treated units. + Name of column indicating when unit was first treated. The + values ``0`` and ``np.inf`` are **reserved as never-treated + sentinels** (not configurable today); a real treatment cohort + with ``first_treat == 0`` would be folded into ``U`` and + should instead be re-labeled to a non-sentinel value before + fitting. Units whose ``first_treat`` is at or before the + first observable period (``first_treat <= min(time)``, + excluding the never-treated sentinels ``0`` and ``np.inf``) + are automatically remapped to the ``U`` (untreated) bucket + per Goodman-Bacon (2021) footnote 11, with a + ``UserWarning``. Detection uses ordered-time logic on the + **time axis** so panels whose ``time`` column contains + negative or zero-crossing labels (e.g. event-time + ``time ∈ [-2,..,3]``) are handled correctly; the ``0`` + sentinel restriction applies only to ``first_treat``, not to + ``time``. The user's original ``first_treat`` column on + ``data`` is preserved unchanged; remapping happens in an + internal column. The count of remapped units is exposed on + the result as + ``BaconDecompositionResults.n_always_treated_remapped``. survey_design : SurveyDesign, optional Survey design specification for weighted estimation. When provided, all means and group shares use survey weights. @@ -488,6 +536,58 @@ def fit( df[time] = pd.to_numeric(df[time]) df[first_treat] = pd.to_numeric(df[first_treat]) + # Preserve the user-provided column name so we can count TRUE + # never-treated units below (post-remap, `first_treat` will be + # rebound to the internal column which folds remapped always-treated + # into the same `0` sentinel bucket as never-treated). + user_first_treat_col = first_treat + + # Always-treated remap (Goodman-Bacon 2021, footnote 11): + # The paper convention puts units treated before the first observable + # period into the U bucket alongside never-treated units. The library's + # prior sentinel-only convention (`first_treat ∈ {0, np.inf}`) is + # narrower than the paper's U. + # + # Detection uses ORDERED-TIME logic on the `time` axis + # (`first_treat <= min(time)`), NOT positive-sign restriction, so + # panels whose `time` column has negative or zero-crossing labels + # (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly. + # Sentinel rows (`first_treat ∈ {0, np.inf}`) are excluded from + # the remap so the never-treated contract is preserved. NOTE: the + # `0` sentinel restriction applies to `first_treat` only, not to + # `time`; a real treatment cohort with `first_treat == 0` is not + # supported today and would be folded into `U` (re-label such + # cohorts to a non-sentinel value before fitting). + # Remapping writes to an internal column; the user's `first_treat` + # column is preserved unchanged (df = data.copy() above). + df["__bacon_first_treat_internal__"] = df[first_treat] + min_period = df[time].min() + is_U_sentinel = (df["__bacon_first_treat_internal__"] == 0) | ( + df["__bacon_first_treat_internal__"] == np.inf + ) + always_treated_mask = (~is_U_sentinel) & ( + df["__bacon_first_treat_internal__"] <= min_period + ) + n_always_treated_remapped = int(df.loc[always_treated_mask, unit].nunique()) + if n_always_treated_remapped > 0: + warnings.warn( + f"Detected {n_always_treated_remapped} always-treated units " + f"(first_treat <= {min_period}, excluding sentinel values " + f"0 and np.inf). Remapping to U bucket per Goodman-Bacon " + f"(2021) footnote 11. The original first_treat column is " + f"preserved; remapping happens in an internal column. To " + f"silence this warning, recode the affected rows' " + f"first_treat values to 0 or np.inf in your input data " + f"before fitting.", + UserWarning, + stacklevel=2, + ) + df.loc[always_treated_mask, "__bacon_first_treat_internal__"] = 0 + # Rebind the local first_treat name to the internal column so all + # downstream df[first_treat] reads (and helper-function passthroughs) + # see the remapped data without per-site rewrites. + first_treat = "__bacon_first_treat_internal__" + # Check for balanced panel periods_per_unit = df.groupby(unit)[time].count() if periods_per_unit.nunique() > 1: @@ -502,13 +602,41 @@ def fit( time_periods = sorted(df[time].unique()) # Identify never-treated and timing groups - # Never-treated: first_treat = 0 or inf + # Never-treated: first_treat = 0 or np.inf (library sentinels). + # Timing groups: every other value in the (post-remap) internal + # column. Do NOT restrict to positive values — negative-coded + # event-time cohorts (e.g. first_treat=-1 on a panel with + # min(time)=-2) are valid timing groups. never_treated_mask = (df[first_treat] == 0) | (df[first_treat] == np.inf) - timing_groups = sorted([g for g in df[first_treat].unique() if g > 0 and g != np.inf]) + timing_groups = sorted([g for g in df[first_treat].unique() if g != 0 and g != np.inf]) + + # `n_never_treated` reports TRUE never-treated units, computed from + # the original user-provided column BEFORE the remap. Remapped + # always-treated units are reported separately via + # `n_always_treated_remapped` so the two counts do not double-count. + unit_info_user = df.groupby(unit).agg({user_first_treat_col: "first"}).reset_index() + n_never_treated = int( + ( + (unit_info_user[user_first_treat_col] == 0) + | (unit_info_user[user_first_treat_col] == np.inf) + ).sum() + ) - # Get unit-level treatment timing - unit_info = df.groupby(unit).agg({first_treat: "first"}).reset_index() - n_never_treated = ((unit_info[first_treat] == 0) | (unit_info[first_treat] == np.inf)).sum() + # `n_units_in_U_bucket` is the POST-remap count used to decide + # whether `treated_vs_never` (β̂_{kU}^{2x2}) comparisons should + # be generated. It includes BOTH true never-treated AND remapped + # always-treated units, since the paper convention puts them in + # the same `U` bucket (Goodman-Bacon 2021 footnote 11). Without + # this distinction from `n_never_treated`, panels whose U is + # composed entirely of remapped always-treated units would + # silently drop all β̂_{kU}^{2x2} terms and break the Theorem 1 + # identity at the loop gate. + unit_info_internal = df.groupby(unit).agg({first_treat: "first"}).reset_index() + n_units_in_U_bucket = int( + ( + (unit_info_internal[first_treat] == 0) | (unit_info_internal[first_treat] == np.inf) + ).sum() + ) # Create treatment indicator (D_it = 1 if treated at time t) # Use unique internal name to avoid conflicts with user data @@ -523,8 +651,11 @@ def fit( # Perform decomposition comparisons = [] - # 1. Treated vs Never-treated comparisons - if n_never_treated > 0: + # 1. Treated vs Never-treated comparisons. + # Gate on the POST-remap U bucket count so panels whose U is + # composed entirely of remapped always-treated units still emit + # β̂_{kU}^{2x2} terms (paper Goodman-Bacon 2021 footnote 11). + if n_units_in_U_bucket > 0: for g in timing_groups: comp = self._compute_treated_vs_never( df, @@ -638,6 +769,7 @@ def fit( timing_groups=timing_groups, n_obs=len(df), decomposition_error=decomp_error, + n_always_treated_remapped=n_always_treated_remapped, survey_metadata=survey_metadata, ) @@ -690,117 +822,153 @@ def _recompute_exact_weights( weights: Optional[np.ndarray] = None, ) -> None: """ - Recompute weights using exact variance-based formula from Theorem 1. - - This modifies comparison weights in-place to use the exact formula - from Goodman-Bacon (2021) which accounts for within-group variance - of the treatment indicator in each 2x2 comparison window. - - When survey weights are provided, uses weighted unit counts and - within-group variance of the treatment indicator. + Recompute weights using the exact Theorem 1 formula from + Goodman-Bacon (2021). + + Implements Eqs. 7-9 (subsample FE-adjusted treatment-dummy + variances) and Eqs. 10e-g (decomposition weights). Only the + NUMERATORS of Eqs. 10e-g are written here; the post-hoc + sum-to-1 normalization in ``fit()`` handles the ``V̂^D`` + denominator, which is mathematically equivalent per Theorem 1's + identity ``V̂^D = Σ numerators``. + + Notation (per paper §2, pp. 256-258): + n_k sample share of timing group k (fraction of units) + n_kU relative size of group k in pair (k, U): n_k/(n_k+n_U) + D̄_k share of periods group k spends treated: (T-k+1)/T + + Equations: + V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k) (Eq. 7) + V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) + · (1-D̄_k)/(1-D̄_ℓ) (Eq. 8) + V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k + · (D̄_k-D̄_ℓ)/D̄_k (Eq. 9) + + s_{kU} ∝ (n_k+n_U)^2 · V̂_{kU}^D (Eq. 10e) + s_{kℓ}^k ∝ ((n_k+n_ℓ)(1-D̄_ℓ))^2 · V̂_{kℓ}^{D,k} (Eq. 10f) + s_{kℓ}^ℓ ∝ ((n_k+n_ℓ)·D̄_k)^2 · V̂_{kℓ}^{D,ℓ} (Eq. 10g) + + When survey weights are provided, sample shares use weighted unit + counts (constant-within-unit weights are required and enforced + upstream by ``_validate_unit_constant_survey``). The ``D̄_k`` term + is panel-share-based and unaffected by sampling weights. + + Modifies ``comparisons[i].weight`` in place. The caller then + normalizes to sum to 1. """ - n_total_obs = len(df) - w_arr = weights if weights is not None else np.ones(n_total_obs) - # Store weights as a column for safe label-based subsetting - df = df.copy() - df["_sw"] = w_arr - w_total = np.sum(w_arr) - n_total_units = df[unit].nunique() - - for comp in comparisons: - # Get data for this specific comparison - if comp.comparison_type == "treated_vs_never": - pre_periods = [t for t in time_periods if t < comp.treated_group] - post_periods = [t for t in time_periods if t >= comp.treated_group] - # Get units in each group - units_treated = df[df[first_treat] == comp.treated_group][unit].unique() - units_control = df[(df[first_treat] == 0) | (df[first_treat] == np.inf)][ - unit - ].unique() - elif comp.comparison_type == "earlier_vs_later": - g_early = comp.treated_group - g_late = comp.control_group - pre_periods = [t for t in time_periods if t < g_early] - post_periods = [t for t in time_periods if g_early <= t < g_late] - units_treated = df[df[first_treat] == g_early][unit].unique() - units_control = df[df[first_treat] == g_late][unit].unique() - else: # later_vs_earlier - g_late = comp.treated_group - g_early = comp.control_group - pre_periods = [t for t in time_periods if g_early <= t < g_late] - post_periods = [t for t in time_periods if t >= g_late] - units_treated = df[df[first_treat] == g_late][unit].unique() - units_control = df[df[first_treat] == g_early][unit].unique() - - if not pre_periods or not post_periods: + # Panel length T (Eq. 7-9 use share-of-periods D̄_k = (T-k+1)/T) + T = len(time_periods) + if T <= 0: + for comp in comparisons: comp.weight = 0.0 - continue + return - # Subset to the 2x2 comparison sample - relevant_periods = set(pre_periods) | set(post_periods) - all_units = set(units_treated) | set(units_control) + # Per-unit first observation (treatment timing is unit-invariant) + df_copy = df.copy() + df_copy["_sw"] = weights if weights is not None else np.ones(len(df)) + unit_first = df_copy.groupby(unit).agg({first_treat: "first", "_sw": "first"}) + ft_per_unit = unit_first[first_treat] + sw_per_unit = unit_first["_sw"] - df_22 = df[(df[unit].isin(all_units)) & (df[time].isin(relevant_periods))] + # Total weighted unit mass (denominator for n_k, n_U) + if weights is None: + unit_mass_total = float(len(ft_per_unit)) - if len(df_22) == 0: - comp.weight = 0.0 - continue + def _mass(mask: pd.Series) -> float: + return float(int(mask.sum())) + + else: + unit_mass_total = float(sw_per_unit.sum()) - # Count units in this comparison - n_k = len(units_treated) - n_l = len(units_control) + def _mass(mask: pd.Series) -> float: + return float(sw_per_unit[mask].sum()) - if n_k == 0 or n_l == 0: + if unit_mass_total <= 0: + for comp in comparisons: comp.weight = 0.0 - continue - - # Weighted observation counts for the 2x2 sample - w_22 = df_22["_sw"].values - w_22_sum = np.sum(w_22) - - # Sample share of this comparison (weighted) - sample_share = w_22_sum / w_total - - # Weighted group shares within the 2x2 - treated_mask_22 = df_22[unit].isin(units_treated) - w_k = np.sum(w_22[treated_mask_22.values]) - n_k_share = w_k / w_22_sum if w_22_sum > 0 else 0.0 - - # Create treatment indicator for the 2x2 - T_pre = len(pre_periods) - T_post = len(post_periods) - T_window = T_pre + T_post - - # Variance of D within the 2x2 for treated group - # D = 0 in pre, D = 1 in post for treated units - # D = 0 for all periods for control units in this window - D_k = T_post / T_window # proportion treated for treated group - - # Within-comparison variance of treatment (weighted) - # Var(D) = n_k/(n_k+n_l) * D_k * (1-D_k) for the 2x2 - var_D_22 = n_k_share * D_k * (1 - D_k) - - # Exact weight: proportional to sample share * variance - # Scale by weighted unit share to account for subsample - # Use survey-weighted unit mass when weights present - if weights is not None: - # Sum of per-unit weights for treated + control units in this 2x2 - unit_w_k = ( - df_22.loc[treated_mask_22, "_sw"] - .groupby(df_22.loc[treated_mask_22, unit]) - .first() - .sum() - ) - unit_w_l = ( - df_22.loc[~treated_mask_22, "_sw"] - .groupby(df_22.loc[~treated_mask_22, unit]) - .first() - .sum() + return + + # U bucket sample share (never-treated + always-treated post-remap; + # remap to 0/inf already happened in fit() via __bacon_first_treat_internal__). + is_U = (ft_per_unit == 0) | (ft_per_unit == np.inf) + n_U = _mass(is_U) / unit_mass_total + + # Per-cohort sample shares n_g and panel-share D̄_g. + # Timing groups: exclude only the U sentinels (0, np.inf). Negative + # event-time cohorts (e.g. first_treat=-1 on a panel with min(time)=-2) + # are valid timing groups. + timing_groups = sorted(g for g in ft_per_unit.unique() if g != 0 and g != np.inf) + n_g: Dict[Any, float] = { + g: _mass(ft_per_unit == g) / unit_mass_total for g in timing_groups + } + # D̄_g = share of periods group g spends treated. + # For absorbing treatment with first-treatment time g over panel + # periods [1, T], the treated periods are [g, T], i.e. T - g + 1 + # periods. We use the actual period values from time_periods to + # handle panels indexed from 0 / arbitrary start. + t_arr = np.asarray(sorted(time_periods)) + D_bar: Dict[Any, float] = {g: float(np.sum(t_arr >= g)) / T for g in timing_groups} + + for comp in comparisons: + if comp.comparison_type == "treated_vs_never": + k = comp.treated_group + n_k = n_g.get(k, 0.0) + if n_k <= 0 or n_U <= 0: + comp.weight = 0.0 + continue + n_kU = n_k / (n_k + n_U) + D_k = D_bar.get(k, 0.0) + # Eq. 7 + V_kU = n_kU * (1.0 - n_kU) * D_k * (1.0 - D_k) + # Eq. 10e numerator + comp.weight = (n_k + n_U) ** 2 * V_kU + elif comp.comparison_type == "earlier_vs_later": + # k = early (treated in 2x2), ℓ = late (control during MID) + k = comp.treated_group + ell = comp.control_group + n_k = n_g.get(k, 0.0) + n_ell = n_g.get(ell, 0.0) + if n_k <= 0 or n_ell <= 0: + comp.weight = 0.0 + continue + D_k = D_bar.get(k, 0.0) + D_ell = D_bar.get(ell, 0.0) + # Eq. 8 requires D̄_ℓ < 1 (denominators 1 - D̄_ℓ) + if D_ell >= 1.0: + comp.weight = 0.0 + continue + n_kl = n_k / (n_k + n_ell) + # Eq. 8 + V_kl_k = ( + n_kl + * (1.0 - n_kl) + * (D_k - D_ell) + / (1.0 - D_ell) + * (1.0 - D_k) + / (1.0 - D_ell) ) - unit_share = (unit_w_k + unit_w_l) / w_total - else: - unit_share = (n_k + n_l) / n_total_units - comp.weight = sample_share * var_D_22 * unit_share + # Eq. 10f numerator + comp.weight = ((n_k + n_ell) * (1.0 - D_ell)) ** 2 * V_kl_k + else: # later_vs_earlier + # ℓ = late (treated in 2x2), k = early (already-treated control) + ell = comp.treated_group + k = comp.control_group + n_k = n_g.get(k, 0.0) + n_ell = n_g.get(ell, 0.0) + if n_k <= 0 or n_ell <= 0: + comp.weight = 0.0 + continue + D_k = D_bar.get(k, 0.0) + D_ell = D_bar.get(ell, 0.0) + # Eq. 9 requires D̄_k > 0 (denominators D̄_k) + if D_k <= 0.0: + comp.weight = 0.0 + continue + n_kl = n_k / (n_k + n_ell) + # Eq. 9 + V_kl_l = n_kl * (1.0 - n_kl) * D_ell / D_k * (D_k - D_ell) / D_k + # Eq. 10g numerator + comp.weight = ((n_k + n_ell) * D_k) ** 2 * V_kl_l def _compute_treated_vs_never( self, @@ -1061,7 +1229,7 @@ def bacon_decompose( unit: str, time: str, first_treat: str, - weights: str = "approximate", + weights: str = "exact", survey_design: object = None, ) -> BaconDecompositionResults: """ @@ -1082,15 +1250,42 @@ def bacon_decompose( time : str Name of time period column. first_treat : str - Name of column indicating when unit was first treated. - Use 0 (or np.inf) for never-treated units. - weights : str, default="approximate" + Name of column indicating when unit was first treated. The + values ``0`` and ``np.inf`` are **reserved as never-treated + sentinels**; a real treatment cohort with ``first_treat == 0`` + would be folded into ``U`` and should be re-labeled to a + non-sentinel value before fitting. Units whose ``first_treat`` + is at or before the first observable period + (``first_treat <= min(time)``, excluding the sentinels) are + automatically remapped to the ``U`` (untreated) bucket per + Goodman-Bacon (2021) footnote 11, with a ``UserWarning``. See + ``BaconDecomposition.fit()`` for the full contract and + ``BaconDecompositionResults.n_always_treated_remapped`` for the + count. The user's original ``first_treat`` column is preserved + unchanged. + weights : str, default="exact" Weight calculation method: - - "approximate": Fast simplified formula (default). Good for - diagnostic purposes where relative weights are sufficient. - - "exact": Variance-based weights from Goodman-Bacon (2021) - Theorem 1. Use for publication-quality decompositions. + - "exact" (default): Variance-based weights from Goodman-Bacon + (2021) Theorem 1, Eqs. 7-9 and 10e-g. Paper-faithful. + - "approximate": Fast simplified formula. Opt in for + speed-sensitive diagnostic loops; numerical output may differ + from R ``bacondecomp::bacon()``. + survey_design : SurveyDesign, optional + Survey design specification for weighted estimation. When provided, + cell means, group shares, and within-transform use survey weights. + The decomposition remains diagnostic (no survey vcov needed). + + **Default-flip caveat (PR-B, 2026-05-16):** the new + ``weights="exact"`` default routes through + ``_validate_unit_constant_survey``, which **rejects survey + designs whose weights / strata / PSU / FPC columns vary within + a unit across periods** (the exact path collapses to per-unit + aggregation via ``groupby().first()``). Users whose survey + design has time-varying within-unit columns must either (a) + collapse the columns to be unit-constant or (b) pass explicit + ``weights="approximate"`` to retain the legacy observation-level + weighted-means path. Returns ------- @@ -1106,7 +1301,10 @@ def bacon_decompose( -------- >>> from diff_diff import bacon_decompose >>> - >>> # Quick diagnostic (default) + >>> # Default: paper-faithful Goodman-Bacon (2021) Theorem 1 weights + >>> # (weights="exact"); intended to match R bacondecomp::bacon() at + >>> # atol=1e-6 (R parity goldens pending — see TODO.md "R parity + >>> # goldens generation" for the deferred validation step). >>> results = bacon_decompose( ... data=panel_df, ... outcome='earnings', @@ -1115,14 +1313,15 @@ def bacon_decompose( ... first_treat='treatment_year' ... ) >>> - >>> # Publication-quality exact decomposition - >>> results = bacon_decompose( + >>> # Opt-in: simplified-variance fast path for diagnostic loops + >>> # (numerical output may differ from R; sum-to-1 still holds). + >>> results_approx = bacon_decompose( ... data=panel_df, ... outcome='earnings', ... unit='state', ... time='year', ... first_treat='treatment_year', - ... weights='exact' + ... weights='approximate' ... ) >>> >>> # View summary diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py index f9a21940..e68f1c0f 100644 --- a/diff_diff/diagnostic_report.py +++ b/diff_diff/diagnostic_report.py @@ -1743,8 +1743,62 @@ def _check_bacon(self) -> Dict[str, Any]: unit=unit, time=time, first_treat=first_treat, + weights="exact", # paper-faithful Eqs. 7-9 / 10e-g survey_design=self._survey_design, ) + except ValueError as exc: + # PR #454 R1 P1: the exact-mode survey path enforces + # within-unit constancy of survey columns via + # ``_validate_unit_constant_survey``. Survey-backed reports on + # panels with time-varying within-unit weights / strata / PSU + # / FPC fail that check. The library still supports those + # panels via explicit ``bacon_decompose(weights="approximate", ...)``, + # so emit a structured skip pointing users at the + # ``precomputed={'bacon': ...}`` escape hatch rather than an + # opaque ``error`` block. + if "varies within units" in str(exc): + return { + "status": "skipped", + "reason": ( + "Survey design has within-unit-varying columns " + "(weights / strata / PSU / FPC), which the " + 'paper-faithful ``weights="exact"`` Bacon path ' + "rejects. Run ``bacon_decompose(data, ..., " + 'weights="approximate", survey_design=design)`` ' + "yourself and pass via " + "``DiagnosticReport(..., precomputed={'bacon': result})`` " + f"to populate this section. Validator detail: {exc}" + ), + } + return { + "status": "error", + "reason": f"bacon_decompose raised {type(exc).__name__}: {exc}", + } + except NotImplementedError as exc: + # PR #454 R4 P3: ``BaconDecomposition.fit()`` raises + # ``NotImplementedError`` for replicate-weight survey designs + # (bacon.py rejects them because Bacon is a diagnostic that + # does not compute replicate-based variance). Match the + # within-unit-varying skip pattern above: emit a structured + # skip naming the ``precomputed={'bacon': ...}`` escape hatch + # so survey-backed reports with replicate-weight designs + # produce an actionable skip rather than an opaque + # ``status="error"`` block. + return { + "status": "skipped", + "reason": ( + "Survey design uses replicate weights, which the " + "Bacon decomposition does not support (bacon is a " + "diagnostic and does not compute replicate-based " + "variance). To populate this section, run " + "``bacon_decompose(data, ..., " + "survey_design=SurveyDesign(weights=..., strata=..., " + "psu=..., fpc=...))`` with a TSL-based design and " + "pass via " + "``DiagnosticReport(..., precomputed={'bacon': result})``. " + f"Rejection detail: {exc}" + ), + } except Exception as exc: # noqa: BLE001 return { "status": "error", diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py index fefe8306..9f52d3dc 100644 --- a/diff_diff/twfe.py +++ b/diff_diff/twfe.py @@ -681,7 +681,7 @@ def decompose( unit: str, time: str, first_treat: str, - weights: str = "approximate", + weights: str = "exact", ) -> "BaconDecompositionResults": """ Perform Goodman-Bacon decomposition of TWFE estimate. @@ -702,14 +702,28 @@ def decompose( Name of time period column. first_treat : str Name of column indicating when each unit was first treated. - Use 0 (or np.inf) for never-treated units. - weights : str, default="approximate" + The values ``0`` and ``np.inf`` are **reserved as + never-treated sentinels**; a real treatment cohort with + ``first_treat == 0`` would be folded into ``U`` and should + be re-labeled to a non-sentinel value before fitting. Units + whose ``first_treat`` is at or before the first observable + period (``first_treat <= min(time)``, excluding the + sentinels) are automatically remapped to the ``U`` + (untreated) bucket per Goodman-Bacon (2021) footnote 11 + with a ``UserWarning``. See ``BaconDecomposition.fit()`` for + the full contract and + ``BaconDecompositionResults.n_always_treated_remapped`` for + the count. The user's original ``first_treat`` column is + preserved unchanged. + weights : str, default="exact" Weight calculation method: - - "approximate": Fast simplified formula (default). Good for - diagnostic purposes where relative weights are sufficient. - - "exact": Variance-based weights from Goodman-Bacon (2021) - Theorem 1. Use for publication-quality decompositions. + - "exact" (default): Variance-based weights from Goodman-Bacon + (2021) Theorem 1, Eqs. 7-9 and 10e-g. Paper-faithful and + the standard methodology contract. + - "approximate": Fast simplified formula. Opt in for + speed-sensitive diagnostic loops; numerical output may + differ from R ``bacondecomp::bacon()``. Returns ------- diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index b6dfdd42..d8bc1d0e 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -2598,61 +2598,82 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in ## BaconDecomposition -**Primary source:** [Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*, 225(2), 254-277.](https://doi.org/10.1016/j.jeconom.2021.03.014) +**Primary source:** [Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*, 225(2), 254-277.](https://doi.org/10.1016/j.jeconom.2021.03.014). Paper review on file: `docs/methodology/papers/goodman-bacon-2021-review.md`. + +**Scope:** Decomposes the two-way fixed-effects DD (TWFEDD) estimator in Equation (2) when treatment timing varies across units. Theorem 1 expresses `β̂^DD` as a **positively-weighted average of all possible 2x2 DD estimators** in the data, with weights summing to 1. The decomposition is a **diagnostic tool**, not a treatment-effect estimator: it explains *which comparisons drive* the TWFEDD coefficient and *why* the estimator can fail to identify an interpretable causal parameter when treatment effects vary over time. **Key implementation requirements:** *Assumption checks / warnings:* - Requires variation in treatment timing (staggered adoption) -- Warns if only one treatment cohort (decomposition not meaningful) -- Uses never-treated units as controls when present; falls back to timing-only comparisons otherwise +- Always-treated units (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`; paper footnote 11) are automatically remapped to the `U` (untreated) bucket with a `UserWarning`; see the `**Note (always-treated remap)**` below for the full ordered-time / sentinel contract +- Unbalanced panels are accepted with a `UserWarning`; the paper's Appendix A proof assumes balanced panels +- Falls back to timing-only comparisons when no never-treated units are present (no untreated group → `s_{kU}` terms drop, weights rescale to sum to 1; **VWCT and ΔATT can still bias the result** — see paper Eqs. 14-15) -*Estimator equation (as implemented):* +*Estimator equation (Theorem 1, Equation 10a):* -TWFE decomposes as: ``` -τ̂^TWFE = Σ_k s_k × τ̂_k +β̂^DD = Σ_{k ≠ U} s_{kU} · β̂_{kU}^{2x2} + + Σ_{k ≠ U} Σ_{ℓ > k} [ s_{kℓ}^k · β̂_{kℓ}^{2x2,k} + s_{kℓ}^ℓ · β̂_{kℓ}^{2x2,ℓ} ] ``` -where k indexes 2×2 comparisons and s_k are Bacon weights. -Three comparison types: -1. **Treated vs. Never-treated** (if never-treated exist): - ``` - τ̂_{T,U} = (Ȳ_{T,post} - Ȳ_{T,pre}) - (Ȳ_{U,post} - Ȳ_{U,pre}) - ``` +The three 2x2 estimators (Eqs. 10b-d): +``` +β̂_{kU}^{2x2} = (ȳ_k^POST(k) - ȳ_k^PRE(k)) - (ȳ_U^POST(k) - ȳ_U^PRE(k)) (Eq. 10b) +β̂_{kℓ}^{2x2,k} = (ȳ_k^MID(k,ℓ) - ȳ_k^PRE(k)) - (ȳ_ℓ^MID(k,ℓ) - ȳ_ℓ^PRE(k)) (Eq. 10c) +β̂_{kℓ}^{2x2,ℓ} = (ȳ_ℓ^POST(ℓ) - ȳ_ℓ^MID(k,ℓ)) - (ȳ_k^POST(ℓ) - ȳ_k^MID(k,ℓ)) (Eq. 10d) +``` -2. **Earlier vs. Later-treated** (Earlier as treated, Later as control pre-treatment): - ``` - τ̂_{k,l} = DiD using early-treated as treatment, late-treated as control - ``` +Comparison-type labels in `BaconDecompositionResults.comparisons`: +- `"treated_vs_never"` ↔ Eq. 10b +- `"earlier_vs_later"` ↔ Eq. 10c (k = early = treated; ℓ = late = control during MID) +- `"later_vs_earlier"` ↔ Eq. 10d (ℓ = late = treated; k = early = already-treated control) -3. **Later vs. Earlier-treated** (problematic: uses post-treatment outcomes as control): - ``` - τ̂_{l,k} = DiD using late-treated as treatment, early-treated (post) as control - ``` +*Weight construction (Eqs. 7-9 for variances, 10e-g for weights):* + +Fixed-effects-adjusted treatment-dummy variances: +``` +V̂_{kU}^D = n_{kU}(1 - n_{kU}) · D̄_k(1 - D̄_k) (Eq. 7) +V̂_{kℓ}^{D,k} = n_{kℓ}(1 - n_{kℓ}) · (D̄_k - D̄_ℓ)/(1 - D̄_ℓ) · (1 - D̄_k)/(1 - D̄_ℓ) (Eq. 8) +V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1 - n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k - D̄_ℓ)/D̄_k (Eq. 9) +``` + +Decomposition weights: +``` +s_{kU} = ((n_k + n_U)^2 · V̂_{kU}^D) / V̂^D (Eq. 10e) +s_{kℓ}^k = ((n_k + n_ℓ)(1 - D̄_ℓ))^2 · V̂_{kℓ}^{D,k} / V̂^D (Eq. 10f) +s_{kℓ}^ℓ = ((n_k + n_ℓ) · D̄_k)^2 · V̂_{kℓ}^{D,ℓ} / V̂^D (Eq. 10g) +``` -Weights depend on group sizes and variance in treatment timing. +Where `n_k` is the sample share of timing group `k`, `n_{kℓ} = n_k / (n_k + n_ℓ)`, and `D̄_k = (T - k + 1)/T` is the share of periods group `k` spends treated. Weights are **strictly positive** and sum to 1 (Theorem 1). *Standard errors:* -- Not typically computed (decomposition is exact) -- Individual 2×2 estimates can have SEs +- The decomposition is a deterministic algebraic identity; standard errors are not the paper's focus. The point estimates and weights are exact given the data, and on **balanced** panels `β̂^DD` from the decomposition matches the OLS coefficient from TWFEDD to machine precision under `weights="exact"`. The machine-precision claim does **not** extend to unbalanced panels (see edge cases below). +- Inference for the TWFEDD coefficient itself is typically cluster-robust at the unit level. *Edge cases:* - Continuous treatment: not supported, requires binary -- Weights may be negative for later-vs-earlier comparisons -- Single treatment time: no decomposition possible +- Single treatment time: K=1 with U is valid (only `treated_vs_never` terms); K=1 without U has no decomposition (zero variation in `D`). +- `D̄_k → 0` or `D̄_k → 1`: subsample treatment variance goes to zero, that timing group contributes zero weight +- Always-treated units: see `**Note (always-treated remap)**` below **Reference implementation(s):** -- R: `bacondecomp::bacon()` -- Stata: `bacondecomp` +- R: `bacondecomp::bacon()` (CRAN). Parity script at `benchmarks/R/generate_bacon_golden.R`; goldens pending follow-up R install (see TODO.md). +- Stata: `bacondecomp` (SSC). Authors: Goodman-Bacon, Goldring, Nichols (2019). **Requirements checklist:** -- [ ] Three comparison types: treated_vs_never, earlier_vs_later, later_vs_earlier -- [ ] Weights sum to approximately 1 (numerical precision) -- [ ] TWFE coefficient ≈ weighted sum of 2×2 estimates -- [ ] Visualization shows weight vs. estimate by comparison type +- [x] Three comparison types: treated_vs_never, earlier_vs_later, later_vs_earlier +- [x] Weights sum to 1 (machine precision under `weights="exact"` on **balanced** panels; see PR-B audit) +- [x] TWFE coefficient = weighted sum of 2×2 estimates (machine precision under `weights="exact"` on **balanced** panels) +- [x] Visualization shows weight vs. estimate by comparison type +- [x] Always-treated remap to U per Goodman-Bacon (2021) footnote 11 (PR-B audit) +- [x] Hand-calculable Theorem 1 verification: `tests/test_methodology_bacon.py::TestBaconHandCalculation` (7 tests, atol=1e-10) +- [ ] R `bacondecomp::bacon()` parity at atol=1e-6 (R generator script committed; JSON goldens pending follow-up R install — `tests/test_methodology_bacon.py::TestBaconParityR` skips when missing) - [x] Survey design support (Phase 3): weighted cell means, weighted within-transform, weighted group shares -- **Note:** Bacon decomposition with survey weights is diagnostic; exact-sum guarantee is approximate; `weights="exact"` requires within-unit-constant survey columns (approximate path accepts time-varying weights) +- **Note (weight modes):** `weights="exact"` (default, paper-faithful Eqs. 7-9 + 10e-g) vs `weights="approximate"` (simplified variance, opt-in for speed-sensitive diagnostic loops). The PR-A paper review (#451) and PR-B audit established `"exact"` as the default with the **intent** to match R `bacondecomp::bacon()` and the paper's Theorem 1 contract; R parity is validated by hand-calculation (atol=1e-10) and TWFE-vs-weighted-sum identity (atol=1e-10) but the direct R bit-by-bit parity at atol=1e-6 is still pending the R `bacondecomp` install — see Test Coverage checklist above. The approximate path is retained for backward compatibility; numerical output may differ from R. +- **Note (always-treated remap):** Units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning` — per paper footnote 11. Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly: a cohort at `first_treat=-1` on such a panel is a valid timing group; a cohort at `first_treat=-3` is remapped to U. The user's original `first_treat` column on the input `data` frame is preserved unchanged. The count of remapped units is surfaced via `BaconDecompositionResults.n_always_treated_remapped`. **Sentinel restriction:** `first_treat ∈ {0, np.inf}` is reserved as the never-treated marker and is not configurable today; a real treatment cohort with `first_treat == 0` would be folded into `U` and should be re-labeled to a non-sentinel value before fitting. The `0` reservation applies to `first_treat` only, not to `time`. +- **Note (Bacon survey diagnostic):** Bacon decomposition with survey weights is diagnostic; exact-sum guarantee holds at machine precision under `weights="exact"` **on balanced panels**. `weights="exact"` requires within-unit-constant survey columns (approximate path accepts time-varying weights). +- **Deviation (unbalanced-panel library extension):** Unbalanced panels are accepted with a `UserWarning` ("Unbalanced panel detected. Bacon decomposition assumes balanced panels. Results may be inaccurate."). Goodman-Bacon (2021) Appendix A's proof assumes a balanced panel; under unbalance, the Theorem 1 identity holds only approximately. The decomposition still returns finite, well-defined outputs but `weights="exact"` does NOT achieve the machine-precision algebraic identity that the balanced-panel claims above describe. --- diff --git a/tests/test_bacon.py b/tests/test_bacon.py index 349f66d8..cce72d3d 100644 --- a/tests/test_bacon.py +++ b/tests/test_bacon.py @@ -119,7 +119,11 @@ def test_weights_sum_to_one(self): assert abs(total_weight - 1.0) < 0.01, f"Weights sum to {total_weight}, not 1.0" def test_weighted_sum_equals_twfe(self): - """Test that weighted sum of 2x2 estimates equals TWFE.""" + """Test that weighted sum of 2x2 estimates equals TWFE. + + Under the default ``weights="exact"`` mode (Goodman-Bacon 2021 + Theorem 1, Eqs. 10a-g), the identity holds to machine precision. + """ data = generate_staggered_data(seed=456) results = bacon_decompose( @@ -127,14 +131,15 @@ def test_weighted_sum_equals_twfe(self): outcome='outcome', unit='unit', time='time', - first_treat='first_treat' + first_treat='first_treat', + weights='exact', ) weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) - # Allow for small numerical error - assert abs(results.twfe_estimate - weighted_sum) < 0.1, ( - f"TWFE ({results.twfe_estimate:.4f}) != weighted sum ({weighted_sum:.4f})" + assert abs(results.twfe_estimate - weighted_sum) < 1e-10, ( + f"TWFE ({results.twfe_estimate:.10f}) != weighted sum " + f"({weighted_sum:.10f}) under exact mode" ) def test_comparison_types(self): @@ -422,12 +427,18 @@ def test_plot_bacon_invalid_type(self): class TestWeightsParameter: """Tests for configurable weights parameter.""" - def test_approximate_weights_default(self): - """Test that approximate weights are used by default.""" + def test_exact_weights_default(self): + """Test that exact weights are used by default. + + Goodman-Bacon (2021) Theorem 1 is the paper-faithful default + (Eqs. 10a-g). Prior releases defaulted to ``"approximate"``; + the default flipped to ``"exact"`` in PR-B (2026-05-16) so the + diagnostic surface matches R ``bacondecomp::bacon()``. + """ data = generate_staggered_data(seed=789) decomp = BaconDecomposition() - assert decomp.weights == "approximate" + assert decomp.weights == "exact" results = decomp.fit( data, @@ -439,6 +450,33 @@ def test_approximate_weights_default(self): # Weights should sum to 1 total_weight = sum(c.weight for c in results.comparisons) + assert abs(total_weight - 1.0) < 1e-10 + + def test_approximate_weights_opt_in(self): + """Test the legacy approximate-weights path still works. + + ``weights="approximate"`` is retained for backward compatibility + and speed-sensitive diagnostic loops. Numerical output may differ + from R ``bacondecomp::bacon()`` (which implements the exact + Eqs. 7-9). The sum-to-1 contract is preserved via post-hoc + normalization. + """ + data = generate_staggered_data(seed=789) + + decomp = BaconDecomposition(weights="approximate") + assert decomp.weights == "approximate" + + results = decomp.fit( + data, + outcome='outcome', + unit='unit', + time='time', + first_treat='first_treat' + ) + + # Sum-to-1 contract preserved via normalization, but tolerance is + # looser than exact-mode (legacy approximate floor). + total_weight = sum(c.weight for c in results.comparisons) assert abs(total_weight - 1.0) < 0.01 def test_exact_weights(self): diff --git a/tests/test_methodology_bacon.py b/tests/test_methodology_bacon.py new file mode 100644 index 00000000..f4e43745 --- /dev/null +++ b/tests/test_methodology_bacon.py @@ -0,0 +1,1141 @@ +""" +Methodology verification tests for BaconDecomposition. + +Targets the Goodman-Bacon (2021) decomposition theorem: + + β̂^DD = Σ_{k≠U} s_{kU} · β̂_{kU}^{2x2} + + Σ_{k≠U} Σ_{ℓ>k} [s_{kℓ}^k · β̂_{kℓ}^{2x2,k} + s_{kℓ}^ℓ · β̂_{kℓ}^{2x2,ℓ}] (Eq. 10a) + +with weights from Eqs. 7-9 + 10e-g. See: +- ``docs/methodology/papers/goodman-bacon-2021-review.md`` paper review +- ``docs/methodology/REGISTRY.md`` ``## BaconDecomposition`` block +- ``METHODOLOGY_REVIEW.md`` ``BaconDecomposition`` section + +Test class breakdown: +- ``TestBaconHandCalculation`` — hand-calculable balanced panel; sum-to-1, + TWFE-vs-weighted-sum identity, per-Eq variance hand-checks at atol=1e-10. +- ``TestBaconParityR`` — R parity at atol=1e-6 against the committed + ``benchmarks/data/r_bacondecomp_golden.json`` (skip if missing). +- ``TestBaconAlwaysTreatedRemap`` — warn+remap of first_treat <= min(time) + (excluding never-treated sentinels 0 and np.inf) to U; user's first_treat + column preserved unchanged. +- ``TestBaconEdgeCases`` — no-untreated, single-cohort, boundary D̄_k, + unbalanced panel, constant-ATT recovery. +- ``TestBaconWeightModes`` — exact-is-default, approximate-opt-in, + exact-vs-approximate differ meaningfully. +- ``TestBaconSurveyDesignNarrowing`` — survey_design composes cleanly + with the remap and the new exact-mode default. +""" + +from __future__ import annotations + +import json +import warnings +from pathlib import Path +from typing import List, Tuple + +import numpy as np +import pandas as pd +import pytest + +from diff_diff import ( + BaconDecomposition, + bacon_decompose, +) + +# --------------------------------------------------------------------------- +# Hand-calculable DGP +# --------------------------------------------------------------------------- +# +# Balanced panel with two timing groups + never-treated, T = 4 periods, +# 3 units per group. The treatment shares are: +# +# D̄_2 = (T - 2 + 1) / T = 3/4 = 0.75 +# D̄_3 = (T - 3 + 1) / T = 2/4 = 0.50 +# D̄_U = 0 +# +# Sample shares (n_k = units in group / total units): +# +# n_2 = 3/9 = 1/3 +# n_3 = 3/9 = 1/3 +# n_U = 3/9 = 1/3 +# +# Theorem 1 weights (numerators of Eqs. 10e-g, before V̂^D normalization): +# +# s_{2U} ∝ (n_2 + n_U)^2 · V̂_{2U}^D = (2/3)^2 · (1/2)(1/2)(3/4)(1/4) = 0.020833... +# s_{3U} ∝ (n_3 + n_U)^2 · V̂_{3U}^D = (2/3)^2 · (1/2)(1/2)(1/2)(1/2) = 0.027778... +# s_{23}^k ∝ ((n_2+n_3)(1-D̄_3))^2 · V̂^{D,k}_{23} = ((2/3)(1/2))^2 · 0.0625 = 0.006944... +# s_{23}^ℓ ∝ ((n_2+n_3)·D̄_2)^2 · V̂^{D,ℓ}_{23} = ((2/3)(3/4))^2 · 0.05555 = 0.013889... +# +# V̂^D = Σ numerators = 0.069444... → normalized weights: +# +# s_{2U} = 0.3 +# s_{3U} = 0.4 +# s_{23}^k = 0.1 +# s_{23}^ℓ = 0.2 +# +# With constant ATT=5 across cohorts and periods (no noise), every 2x2 DD +# equals 5, so β̂^DD = 5 × (0.3 + 0.4 + 0.1 + 0.2) = 5 exactly. + + +def _hand_calc_panel(true_effect: float = 5.0) -> pd.DataFrame: + """Build the hand-calculable 2-cohort + U panel described above. + + No noise. Outcome y = unit_fe + 0.5*time + true_effect*D, with unit + fixed effects spaced so each unit's intercept is distinct. + """ + rows: List[Tuple] = [] + uid = 1 + # Group 2 (first_treat=2): 3 units, 4 periods each + for _ in range(3): + for t in range(1, 5): + D = 1 if t >= 2 else 0 + y = float(uid) * 10.0 + 0.5 * t + true_effect * D + rows.append((uid, t, y, 2)) + uid += 1 + # Group 3 (first_treat=3): 3 units + for _ in range(3): + for t in range(1, 5): + D = 1 if t >= 3 else 0 + y = float(uid) * 10.0 + 0.5 * t + true_effect * D + rows.append((uid, t, y, 3)) + uid += 1 + # Never-treated U (first_treat=0): 3 units + for _ in range(3): + for t in range(1, 5): + y = float(uid) * 10.0 + 0.5 * t + rows.append((uid, t, y, 0)) + uid += 1 + return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + + +def _staggered_data(seed: int = 42) -> pd.DataFrame: + """Larger DGP for sum-to-1 / TWFE-equals-sum checks beyond the toy case.""" + rng = np.random.default_rng(seed) + n_units = 60 + n_periods = 8 + cohort_periods = np.array([3, 5, 7]) + # 20% never-treated, evenly split across cohorts otherwise + cohort_assign = rng.choice( + np.concatenate([[0], cohort_periods]), + size=n_units, + p=[0.2, 0.27, 0.27, 0.26], + ) + rows = [] + for u in range(n_units): + ft = cohort_assign[u] + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, n_periods + 1): + D = int(ft > 0 and t >= ft) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.5 + rows.append((u + 1, t, y, int(ft))) + return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + + +# --------------------------------------------------------------------------- +# 1. Hand-calculable Theorem 1 verification +# --------------------------------------------------------------------------- + + +class TestBaconHandCalculation: + """Theorem 1 / Eqs. 7-9 + 10e-g verified on a minimal balanced panel. + + The hand-calculable DGP at module scope was hand-derived to produce + normalized weights ``{0.3, 0.4, 0.1, 0.2}`` and TWFE = ATT = 5 exactly + (constant treatment effect, no noise). Tolerances are tight (1e-10) + because the identity is purely algebraic. + """ + + def test_weights_sum_to_one(self) -> None: + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 1e-10 + + def test_twfe_equals_weighted_sum(self) -> None: + """Theorem 1's algebraic identity at machine precision.""" + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) + assert abs(results.twfe_estimate - weighted_sum) < 1e-10 + # Also check the dataclass-cached value. + assert results.decomposition_error < 1e-10 + + def test_three_comparison_types_present(self) -> None: + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + types = {c.comparison_type for c in results.comparisons} + assert "treated_vs_never" in types + assert "earlier_vs_later" in types + assert "later_vs_earlier" in types + + def test_eq_10b_treated_vs_never_value(self) -> None: + """β̂_{2U}^{2x2} = 5 on the hand-calc panel (constant ATT, no noise).""" + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + treated_vs_never = [ + c for c in results.comparisons if c.comparison_type == "treated_vs_never" + ] + # All treated_vs_never comparisons should yield ATT = 5 + for comp in treated_vs_never: + assert ( + abs(comp.estimate - 5.0) < 1e-10 + ), f"{comp.treated_group} vs U: estimate={comp.estimate}, expected 5.0" + + def test_eq_7_treated_untreated_variance(self) -> None: + """V̂_{kU}^D = n_kU(1-n_kU) · D̄_k(1-D̄_k) — verify via weight-share decomp. + + s_{2U} / s_{3U} = (n_2+n_U)^2 · V̂_{2U}^D / ((n_3+n_U)^2 · V̂_{3U}^D) + With equal cohort sizes, this simplifies to: + V̂_{2U}^D / V̂_{3U}^D = (D̄_2(1-D̄_2)) / (D̄_3(1-D̄_3)) + = (0.75 · 0.25) / (0.5 · 0.5) + = 0.1875 / 0.25 = 0.75 + So s_{2U} / s_{3U} = 0.75 (= 0.3 / 0.4). + """ + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + s2U = next( + c.weight + for c in results.comparisons + if c.comparison_type == "treated_vs_never" and c.treated_group == 2 + ) + s3U = next( + c.weight + for c in results.comparisons + if c.comparison_type == "treated_vs_never" and c.treated_group == 3 + ) + # Expected normalized: s2U=0.3, s3U=0.4 → ratio = 0.75 + assert abs(s2U / s3U - 0.75) < 1e-10 + + def test_eq_8_earlier_vs_later_variance(self) -> None: + """V̂_{kℓ}^{D,k}: weight s_{23}^k normalized expected = 0.1. + + Per the hand calc: s_{23}^k = 0.006944... / 0.069444... = 0.1 + """ + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + s_kl_k = next( + c.weight + for c in results.comparisons + if c.comparison_type == "earlier_vs_later" + and c.treated_group == 2 + and c.control_group == 3 + ) + assert abs(s_kl_k - 0.1) < 1e-10 + + def test_eq_9_later_vs_earlier_variance(self) -> None: + """V̂_{kℓ}^{D,ℓ}: weight s_{23}^ℓ normalized expected = 0.2. + + Per the hand calc: s_{23}^ℓ = 0.013889... / 0.069444... = 0.2 + """ + df = _hand_calc_panel() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + s_kl_l = next( + c.weight + for c in results.comparisons + if c.comparison_type == "later_vs_earlier" + and c.treated_group == 3 + and c.control_group == 2 + ) + assert abs(s_kl_l - 0.2) < 1e-10 + + +# --------------------------------------------------------------------------- +# 2. R parity (skip if golden JSON not committed) +# --------------------------------------------------------------------------- + +_R_GOLDEN_PATH = ( + Path(__file__).resolve().parent.parent / "benchmarks" / "data" / "r_bacondecomp_golden.json" +) + + +def _load_r_golden() -> dict: + if not _R_GOLDEN_PATH.exists(): + pytest.skip( + f"R parity goldens missing at {_R_GOLDEN_PATH}. To generate, " + "install R + `install.packages('bacondecomp')` + " + "`install.packages('jsonlite')` then `cd benchmarks/R && " + "Rscript generate_bacon_golden.R`. The R goldens are deferred " + "until R is provisioned (see TODO.md)." + ) + return json.loads(_R_GOLDEN_PATH.read_text()) + + +class TestBaconParityR: + """R `bacondecomp::bacon()` parity at atol=1e-6 (when goldens present).""" + + @pytest.fixture(scope="class") + def golden(self) -> dict: + return _load_r_golden() + + def test_twfe_coef_matches_r(self, golden) -> None: + for fixture_name, fix in golden.items(): + if fixture_name == "meta": + continue + panel = pd.DataFrame(fix["panel"]) + results = bacon_decompose( + panel, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + assert abs(results.twfe_estimate - fix["r_twfe_coef"]) < 1e-6, ( + f"{fixture_name}: TWFE Python={results.twfe_estimate} " f"vs R={fix['r_twfe_coef']}" + ) + + def test_weights_sum_matches_r(self, golden) -> None: + for fixture_name, fix in golden.items(): + if fixture_name == "meta": + continue + panel = pd.DataFrame(fix["panel"]) + results = bacon_decompose( + panel, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + py_sum = sum(c.weight for c in results.comparisons) + assert abs(py_sum - fix["r_weights_sum"]) < 1e-6, ( + f"{fixture_name}: weight sum Python={py_sum} " f"vs R={fix['r_weights_sum']}" + ) + + def test_component_estimates_match_r(self, golden) -> None: + """Match per-component estimates at atol=1e-6 across Python + R. + + Join key is ``(comparison_type, treated_group_float, control_canonical)`` + where ``control_canonical`` is the literal ``"U"`` for + ``treated_vs_never`` comparisons (Python stores the string + ``"never_treated"``; R may use ``Inf`` or its own string) and the + float-coerced cohort timing for timing-vs-timing comparisons. + Including ``comparison_type`` in the key disambiguates the two + directions of a timing pair (earlier_vs_later vs later_vs_earlier). + """ + + def _canonical_control(ctype: str, group): + if ctype == "treated_vs_never": + return "U" + return float(group) + + def _classify_r_type(c: dict, fixture_name: str) -> str: + # R bacondecomp's `type` strings vary across versions + # ("Treated vs Untreated", "Earlier vs Later Treated", ...). + # Fall back to inferring from the control_group: U sentinel + # (0, np.inf, or "never"-containing string) -> treated_vs_never; + # otherwise treated_group < control_group is earlier-vs-later. + t = c.get("type") or "" + if "never" in t.lower() or "untreated" in t.lower(): + return "treated_vs_never" + ctrl = c["control_group"] + if isinstance(ctrl, str) and "never" in ctrl.lower(): + return "treated_vs_never" + if isinstance(ctrl, (int, float)) and (ctrl == 0 or np.isinf(ctrl)): + return "treated_vs_never" + try: + if float(c["treated_group"]) < float(ctrl): + return "earlier_vs_later" + return "later_vs_earlier" + except (TypeError, ValueError): + pytest.fail( + f"{fixture_name}: cannot classify R component " + f"treated={c.get('treated_group')} " + f"control={ctrl} type={t!r}" + ) + + for fixture_name, fix in golden.items(): + if fixture_name == "meta": + continue + panel = pd.DataFrame(fix["panel"]) + results = bacon_decompose( + panel, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + py_estimates = {} + py_weights = {} + for c in results.comparisons: + key = ( + c.comparison_type, + float(c.treated_group), + _canonical_control(c.comparison_type, c.control_group), + ) + py_estimates[key] = c.estimate + py_weights[key] = c.weight + r_estimates: dict = {} + r_weights: dict = {} + for c in fix["r_components"]: + ctype = _classify_r_type(c, fixture_name) + key = ( + ctype, + float(c["treated_group"]), + _canonical_control(ctype, c["control_group"]), + ) + r_estimates[key] = c["estimate"] + r_weights[key] = c["weight"] + # Full-set equality: no Python component missing from R, no R + # component missing from Python. A dropped β̂_{kU} term or an + # extra spurious comparison would fail here. + py_keys = set(py_estimates) + r_keys = set(r_estimates) + missing_in_r = py_keys - r_keys + missing_in_py = r_keys - py_keys + assert not missing_in_r and not missing_in_py, ( + f"{fixture_name}: component sets differ. " + f"In Python but not R: {sorted(missing_in_r)}. " + f"In R but not Python: {sorted(missing_in_py)}." + ) + # Per-component estimate AND weight parity. + for k in py_keys: + assert abs(py_estimates[k] - r_estimates[k]) < 1e-6, ( + f"{fixture_name} {k}: estimate Python={py_estimates[k]} " + f"vs R={r_estimates[k]}" + ) + assert abs(py_weights[k] - r_weights[k]) < 1e-6, ( + f"{fixture_name} {k}: weight Python={py_weights[k]} " f"vs R={r_weights[k]}" + ) + + +# --------------------------------------------------------------------------- +# 3. Always-treated warn+remap +# --------------------------------------------------------------------------- + + +def _panel_with_always_treated() -> pd.DataFrame: + """Build a panel with a small always-treated cohort (first_treat=1).""" + rng = np.random.default_rng(7) + rows = [] + uid = 1 + # 4 always-treated units (first_treat=1; treated in every observable period) + for _ in range(4): + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, 7): + y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 1)) + uid += 1 + # 10 never-treated + for _ in range(10): + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, 7): + y = unit_fe + 0.05 * t + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 0)) + uid += 1 + # 10 cohort-3 + 10 cohort-5 + for cohort in (3, 5): + for _ in range(10): + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, 7): + D = int(t >= cohort) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3 + rows.append((uid, t, y, cohort)) + uid += 1 + return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + + +class TestBaconAlwaysTreatedRemap: + """Goodman-Bacon (2021) footnote 11 with the library's first-period + boundary convention. + + The paper's footnote 11 says units treated before the first observable + period (``t_i < 1`` under the paper's 1-indexed convention) belong in + ``U``. The library generalizes this to units whose + ``first_treat <= min(time)`` (i.e. includes ``first_treat == min(time)``, + which the paper's strict ``<`` shorthand excludes). The library + convention is pragmatic: units treated at the first observable period + have no untreated cell within the panel and cannot contribute to any + valid 2x2 DD as a treated cohort, so folding them into ``U`` mirrors + the always-treated handling. The never-treated sentinels + (``first_treat ∈ {0, np.inf}``) are excluded from the remap. + + bacon.py applies the remap via an internal column + (``__bacon_first_treat_internal__``), preserving the user's original + ``first_treat`` column unchanged. Detection uses ordered-time logic + on the **time axis**, so event-time-encoded panels + (``time ∈ [-2,..,3]``) are handled correctly. + """ + + def test_warn_emitted_on_remap(self) -> None: + df = _panel_with_always_treated() + with pytest.warns(UserWarning, match="Remapping to U bucket"): + bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + + def test_user_first_treat_column_unchanged(self) -> None: + """Regression test: user's data should not be mutated by the remap.""" + df = _panel_with_always_treated() + original = df["first_treat"].copy() + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + pd.testing.assert_series_equal(df["first_treat"], original) + + def test_n_always_treated_remapped_reported(self) -> None: + df = _panel_with_always_treated() + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + # 4 units were always-treated (first_treat=1) + assert results.n_always_treated_remapped == 4 + + def test_treated_vs_never_emitted_when_U_is_only_remapped_always_treated( + self, + ) -> None: + """Regression test for the R1 P0 finding. + + When the user supplies NO ``first_treat ∈ {0, np.inf}`` units (so + ``n_never_treated == 0``) but supplies always-treated units that + the remap reclassifies into ``U``, the ``treated_vs_never`` + comparison loop must still fire — gated on the POST-remap U + bucket, not the pre-remap never-treated count. Otherwise + Theorem 1's algebraic identity breaks because all + ``β̂_{kU}^{2x2}`` terms are silently dropped. + """ + rng = np.random.default_rng(13) + rows = [] + uid = 1 + # 8 always-treated units (first_treat=1) — these become the entire U bucket + for _ in range(8): + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, 7): + y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 1)) + uid += 1 + # 8 cohort-3 + 8 cohort-5 (no true never-treated) + for cohort in (3, 5): + for _ in range(8): + unit_fe = rng.standard_normal() * 2.0 + for t in range(1, 7): + D = int(t >= cohort) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3 + rows.append((uid, t, y, cohort)) + uid += 1 + df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + # n_never_treated counts TRUE never-treated only (user column = 0 or inf). + # All 8 U-bucket units came from the always-treated remap, so this is 0. + assert results.n_never_treated == 0 + assert results.n_always_treated_remapped == 8 + # CRITICAL: treated_vs_never comparisons MUST be present because the + # remapped U bucket is the paper's `U` per footnote 11. + types = {c.comparison_type for c in results.comparisons} + assert "treated_vs_never" in types, ( + "Theorem 1 violation: treated_vs_never comparisons were " + "silently dropped when U is composed of remapped always-treated " + "units only (pre-fix R1 P0 regression)." + ) + # Decomposition identity must hold at machine precision. + weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) + assert abs(results.twfe_estimate - weighted_sum) < 1e-10 + assert results.decomposition_error < 1e-10 + + def test_negative_first_treat_as_valid_timing_group(self) -> None: + """Regression for nonpositive-time encodings (event-time panels). + + On a panel with ``time = [-2, -1, 0, 1, 2, 3]``, a cohort with + ``first_treat = -1`` is a valid timing group with one observable + pre-period (t=-2). Detection must use ordered-time logic, not + positive-sign restriction. Prior to the R3 P0 fix this cohort + was silently dropped (both the remap mask and the timing_groups + filter required ``first_treat > 0``), violating Theorem 1. + """ + rng = np.random.default_rng(73) + rows = [] + uid = 1 + # 10 never-treated U + for _ in range(10): + unit_fe = rng.standard_normal() * 2.0 + for t in range(-2, 4): + y = unit_fe + 0.05 * t + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 0)) + uid += 1 + # 8 cohort treated at t=-1 (valid timing group, one pre-period) + for _ in range(8): + unit_fe = rng.standard_normal() * 2.0 + for t in range(-2, 4): + D = int(t >= -1) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3 + rows.append((uid, t, y, -1)) + uid += 1 + # 8 cohort treated at t=2 + for _ in range(8): + unit_fe = rng.standard_normal() * 2.0 + for t in range(-2, 4): + D = int(t >= 2) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 2)) + uid += 1 + df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + # -1 cohort should appear as a timing group, NOT remapped to U + assert -1 in results.timing_groups + assert 2 in results.timing_groups + assert results.n_always_treated_remapped == 0 + # Both cohorts should produce treated_vs_never comparisons + types_per_treated = {(c.treated_group, c.comparison_type) for c in results.comparisons} + assert (-1, "treated_vs_never") in types_per_treated + assert (2, "treated_vs_never") in types_per_treated + # Theorem 1 identity holds at machine precision. + weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) + assert abs(results.twfe_estimate - weighted_sum) < 1e-10 + assert results.decomposition_error < 1e-10 + + def test_negative_first_treat_below_min_time_remapped(self) -> None: + """``first_treat=-3`` on a panel with ``min(time)=-2`` is always-treated. + + The unit's first treatment is before the first observable period, + so per paper footnote 11 it goes in U. Detection must use + ordered-time logic (``first_treat <= min(time)``), not positive- + sign restriction. + """ + rng = np.random.default_rng(74) + rows = [] + uid = 1 + # 6 always-treated (first_treat=-3, pre-panel) + for _ in range(6): + unit_fe = rng.standard_normal() * 2.0 + for t in range(-2, 4): + y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3 + rows.append((uid, t, y, -3)) + uid += 1 + # 8 cohort treated at t=2 + for _ in range(8): + unit_fe = rng.standard_normal() * 2.0 + for t in range(-2, 4): + D = int(t >= 2) + y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3 + rows.append((uid, t, y, 2)) + uid += 1 + df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + assert results.n_always_treated_remapped == 6 + assert results.n_never_treated == 0 + # 2 should be the only timing group (always-treated are remapped to U). + assert 2 in results.timing_groups + assert -3 not in results.timing_groups + # treated_vs_never must fire because U is non-empty post-remap. + types = {c.comparison_type for c in results.comparisons} + assert "treated_vs_never" in types + # Theorem 1 identity holds at machine precision. + weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) + assert abs(results.twfe_estimate - weighted_sum) < 1e-10 + assert results.decomposition_error < 1e-10 + + def test_no_warning_on_sentinel_only_inputs(self) -> None: + """When first_treat ∈ {0, np.inf} only, no remap warning fires.""" + df = _staggered_data(seed=11) + # Replace any first_treat=1 (shouldn't be any in this DGP, but defensive) + df = df[df["first_treat"] != 1] + with warnings.catch_warnings(): + warnings.simplefilter("error", category=UserWarning) + try: + bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + except UserWarning as w: + if "Remapping to U bucket" in str(w): + pytest.fail(f"Unexpected remap warning on sentinel-only inputs: {w}") + + +# --------------------------------------------------------------------------- +# 4. Edge cases +# --------------------------------------------------------------------------- + + +class TestBaconEdgeCases: + """Edge cases enumerated in REGISTRY.md ## BaconDecomposition.""" + + def test_no_untreated_group(self) -> None: + """Only timing groups, no U. Weights still sum to 1 after normalization.""" + df = _staggered_data(seed=22) + # Drop never-treated + df = df[df["first_treat"] > 0].copy() + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 1e-10 + # No treated_vs_never terms — only timing-only + type_set = {c.comparison_type for c in results.comparisons} + assert "treated_vs_never" not in type_set + + def test_single_timing_group_with_never_treated(self) -> None: + """K=1 (one timing group) + U: only treated_vs_never comparisons.""" + rng = np.random.default_rng(33) + rows = [] + uid = 1 + for _ in range(10): + for t in range(1, 6): + D = int(t >= 3) + y = float(uid) + 0.1 * t + 2.0 * D + rng.standard_normal() * 0.2 + rows.append((uid, t, y, 3)) + uid += 1 + for _ in range(10): + for t in range(1, 6): + y = float(uid) + 0.1 * t + rng.standard_normal() * 0.2 + rows.append((uid, t, y, 0)) + uid += 1 + df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"]) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + type_set = {c.comparison_type for c in results.comparisons} + assert type_set == {"treated_vs_never"} + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 1e-10 + + def test_unbalanced_panel_warns(self) -> None: + df = _staggered_data(seed=44) + # Drop a few rows to unbalance + df = df.drop(df.sample(n=10, random_state=44).index).copy() + with pytest.warns(UserWarning, match="Unbalanced panel"): + bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + + def test_unbalanced_panel_finite_but_not_machine_precision(self) -> None: + """Unbalanced panels are a library extension (Goodman-Bacon Appendix A + proof assumes balanced panels). Decomposition still produces finite, + well-defined output, but the Theorem 1 identity holds only + approximately — do NOT assert ``decomposition_error < 1e-10`` + on unbalanced data; the REGISTRY Deviation block documents the + deviation explicitly. + """ + df = _staggered_data(seed=44) + df = df.drop(df.sample(n=10, random_state=44).index).copy() + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + # All outputs are finite + assert np.isfinite(results.twfe_estimate) + assert all(np.isfinite(c.weight) for c in results.comparisons) + assert all(np.isfinite(c.estimate) for c in results.comparisons) + # Weights still sum to ~1 after post-hoc normalization + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 1e-10 + # But the decomposition identity is NOT machine-precision on + # unbalanced data — REGISTRY's machine-precision claim is scoped + # to balanced panels (see Deviation note in REGISTRY entry). + # Empirically the error is small (well under 0.01) but not 1e-10. + + def test_constant_att_recovers_effect(self) -> None: + """ΔATT=0 + VWCT=0 → β̂^DD ≈ true ATT (sample noise only).""" + df = _staggered_data(seed=55) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + # True ATT = 2.0 in the DGP. With 60 units and clean DGP, the TWFE + # estimate should be within a few hundredths of truth. + assert abs(results.twfe_estimate - 2.0) < 0.2 + + def test_weighted_sum_machine_precision(self) -> None: + """The TWFE-vs-weighted-sum identity holds on noisy data too.""" + df = _staggered_data(seed=66) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + weighted_sum = sum(c.weight * c.estimate for c in results.comparisons) + assert abs(results.twfe_estimate - weighted_sum) < 1e-10 + + +# --------------------------------------------------------------------------- +# 5. Weight modes +# --------------------------------------------------------------------------- + + +class TestBaconWeightModes: + """``weights="exact"`` is the paper-faithful default; ``"approximate"`` opt-in.""" + + def test_exact_is_default(self) -> None: + est = BaconDecomposition() + assert est.weights == "exact" + + def test_approximate_still_supported(self) -> None: + est = BaconDecomposition(weights="approximate") + assert est.weights == "approximate" + df = _staggered_data(seed=77) + results = est.fit( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + ) + # Approximate mode produces valid output with weights summing to 1 + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 0.01 + + def test_exact_vs_approximate_differ_meaningfully(self) -> None: + """The two modes produce different relative weights (the approximate + path uses a simplified variance not matching Eqs. 7-9).""" + df = _staggered_data(seed=88) + r_exact = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + ) + r_approx = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="approximate", + ) + # Both have the same set of (treated, control) comparisons + assert len(r_exact.comparisons) == len(r_approx.comparisons) + # But at least one weight differs by more than 1e-6 (i.e., the modes + # are not numerically equivalent on a non-trivial DGP) + exact_map = { + (c.treated_group, c.control_group, c.comparison_type): c.weight + for c in r_exact.comparisons + } + approx_map = { + (c.treated_group, c.control_group, c.comparison_type): c.weight + for c in r_approx.comparisons + } + diffs = [abs(exact_map[k] - approx_map[k]) for k in exact_map if k in approx_map] + assert max(diffs) > 1e-6 + + +# --------------------------------------------------------------------------- +# 6. Survey-design narrowing +# --------------------------------------------------------------------------- + + +class TestBaconSurveyDesignNarrowing: + """survey_design= composes cleanly with the warn+remap and the new exact-mode default.""" + + def test_survey_design_compatible_with_exact_mode(self) -> None: + """``weights="exact"`` (new default) accepts ``survey_design=`` without + path-specific assertion failures. + """ + from diff_diff import SurveyDesign + + df = _staggered_data(seed=99) + # Constant-within-unit weights (required by exact-mode survey validator) + unit_w = df.groupby("unit").ngroup() * 0.1 + 1.0 + df = df.assign(w=unit_w.values) + sd = SurveyDesign(weights="w") + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + survey_design=sd, + ) + # Sum-to-1 contract still holds under survey weighting + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 1e-10 + + def test_survey_design_propagates_through_remap(self) -> None: + """When always-treated remap fires and survey_design is set, no crash + and the survey metadata is preserved in the result.""" + from diff_diff import SurveyDesign + + df = _panel_with_always_treated() + unit_w = df.groupby("unit").ngroup() * 0.1 + 1.0 + df = df.assign(w=unit_w.values) + sd = SurveyDesign(weights="w") + with warnings.catch_warnings(): + warnings.simplefilter("ignore", category=UserWarning) + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="exact", + survey_design=sd, + ) + assert results.n_always_treated_remapped == 4 + assert results.survey_metadata is not None + + def _time_varying_survey_panel(self) -> "tuple[pd.DataFrame, object]": + """Build a panel with time-varying within-unit survey weights. + + Used to exercise the post-PR-B default flip: ``weights="exact"`` + (now the default) routes through ``_validate_unit_constant_survey`` + and rejects time-varying within-unit weights; ``weights="approximate"`` + accepts them via observation-level weighted means. + """ + from diff_diff import SurveyDesign + + np.random.seed(42) + n_u, n_t = 20, 4 + df = pd.DataFrame( + { + "unit": np.repeat(range(n_u), n_t), + "time": np.tile(range(1, n_t + 1), n_u), + "first_treat": np.repeat(np.where(np.arange(n_u) < 10, 3, 0), n_t), + "y": np.random.randn(n_u * n_t), + # Time-varying weights (different per period within unit) + "w": np.random.uniform(0.5, 2.0, n_u * n_t), + } + ) + return df, SurveyDesign(weights="w") + + def test_default_bacon_decomposition_class_rejects_time_varying_weights( + self, + ) -> None: + """The new ``weights="exact"`` default routes ``BaconDecomposition()`` + through the unit-constant-survey validator. Time-varying within-unit + weights are rejected. Locks the public-default contract surfaced in + the PR-B Changed CHANGELOG entry. + """ + df, sd = self._time_varying_survey_panel() + with pytest.raises(ValueError, match="varies within units"): + BaconDecomposition().fit(df, "y", "unit", "time", "first_treat", survey_design=sd) + + def test_default_bacon_decompose_function_rejects_time_varying_weights( + self, + ) -> None: + """The new ``weights="exact"`` default flows through + ``bacon_decompose(...)`` (no explicit weights= kwarg). Same + rejection contract.""" + df, sd = self._time_varying_survey_panel() + with pytest.raises(ValueError, match="varies within units"): + bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + survey_design=sd, + ) + + def test_explicit_approximate_accepts_time_varying_weights(self) -> None: + """Users can opt back into the obs-level weighted-means path via + explicit ``weights="approximate"``. This is the documented + migration path in the PR-B Changed CHANGELOG entry. + """ + df, sd = self._time_varying_survey_panel() + # Should not raise; produces a valid decomposition via the legacy + # approximate path that tolerates obs-level weighted means. + results = bacon_decompose( + df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + weights="approximate", + survey_design=sd, + ) + total = sum(c.weight for c in results.comparisons) + assert abs(total - 1.0) < 0.01 + assert np.isfinite(results.twfe_estimate) + + def test_diagnostic_report_skips_with_structured_reason_on_replicate_weights( + self, + ) -> None: + """PR #454 R4 P3 regression: ``DiagnosticReport._check_bacon`` + emits ``status="skipped"`` (not ``"error"``) when the survey + design uses replicate weights, which Bacon rejects with + ``NotImplementedError`` upstream. The skip reason names the + ``precomputed={'bacon': ...}`` escape hatch and points users at + a TSL-based survey design as the supported alternative. + """ + from diff_diff import DiagnosticReport, SurveyDesign + + df, _ = self._time_varying_survey_panel() + df["rep_w1"] = 1.0 + df["rep_w2"] = 1.0 + sd_rep = SurveyDesign( + weights="w", + replicate_weights=["rep_w1", "rep_w2"], + replicate_method="BRR", + ) + + class _Stub: + pass + + dr = DiagnosticReport( + _Stub(), + data=df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + survey_design=sd_rep, + ) + block = dr._check_bacon() + assert block["status"] == "skipped" + reason = block["reason"] + assert "replicate weights" in reason + assert "precomputed" in reason + + def test_diagnostic_report_skips_with_structured_reason_on_time_varying_survey( + self, + ) -> None: + """PR #454 R1 P1 regression: ``DiagnosticReport._check_bacon`` now + emits ``status="skipped"`` (not ``"error"``) when the panel has + within-unit-varying survey columns. The skip reason names the + ``precomputed={'bacon': ...}`` + explicit ``weights="approximate"`` + escape hatch so users have a documented migration path. + """ + from diff_diff import DiagnosticReport + + df, sd = self._time_varying_survey_panel() + + class _Stub: + """Minimal results stub that does not carry survey_metadata, + so the survey-metadata-without-survey_design early-skip at + ``diagnostic_report.py:1723`` does not pre-empt this path.""" + + dr = DiagnosticReport( + _Stub(), + data=df, + outcome="y", + unit="unit", + time="time", + first_treat="first_treat", + survey_design=sd, + ) + block = dr._check_bacon() + assert block["status"] == "skipped" + reason = block["reason"] + assert "varies within units" not in reason or "approximate" in reason + assert "precomputed" in reason + assert 'weights="approximate"' in reason or "approximate" in reason