igerber · igerber · May 16, 2026 · May 16, 2026 · May 16, 2026 · May 16, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
@@ -78,7 +78,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
 
 | Tool | Module | R Reference | Status | Last Review |
 |------|--------|-------------|--------|-------------|
-| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **In Progress** | — |
+| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** (R parity pending) | 2026-05-16 |
 | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 |
 | PreTrendsPower | `pretrends.py` | `pretrends` package | **In Progress** | — |
 | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** | — |
@@ -909,18 +909,41 @@ and covariate-adjusted specifications.)
 | Module | `bacon.py` |
 | Primary Reference | Goodman-Bacon (2021), *Difference-in-differences with variation in treatment timing*, J. Econometrics 225(2), 254-277 |
 | R Reference | `bacondecomp::bacon()` |
-| Status | **In Progress** |
-| Last Review | — |
+| Status | **Complete** (R parity goldens pending) |
+| Last Review | 2026-05-16 |
 
-**Documentation in place:**
-- REGISTRY.md section: `## BaconDecomposition` (three comparison types — treated_vs_never, earlier_vs_later, later_vs_earlier; weight construction; TWFE reconstitution; weighted survey path under Phase 3)
-- Implementation: `tests/test_bacon.py` (basic decomposition, weight properties, integration with `TwoWayFixedEffects.decompose()`)
+**Verified Components:**
+- [x] Theorem 1 decomposition identity: `β̂^DD = Σ s · β̂^{2x2}` at `atol=1e-10` (hand-calculable + noisy DGPs)
+- [x] Weight sum-to-1: `Σ s = 1.0` at `atol=1e-10` under `weights="exact"`
+- [x] Three comparison types correctly classified: `treated_vs_never`, `earlier_vs_later`, `later_vs_earlier`
+- [x] Eq. 7 hand-checked: `V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)` (via weight-ratio test, `atol=1e-10`)
+- [x] Eq. 8 hand-checked: `V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) · (1-D̄_k)/(1-D̄_ℓ)`
+- [x] Eq. 9 hand-checked: `V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k-D̄_ℓ)/D̄_k`
+- [x] Eq. 10b 2x2 estimator value: hand-calculable panel → β̂_{kU}^{2x2} = ATT exactly
+- [x] Always-treated remap to U (paper footnote 11): `first_treat <= min(time)` (excluding never-treated sentinels `0` and `np.inf`) units auto-remapped via internal column, user's data preserved, count exposed on result
+- [x] `weights="exact"` is the default (PR-B 2026-05-16); `weights="approximate"` retained as opt-in
+- [x] Unbalanced panel: accepted with `UserWarning` (paper assumes balanced; library extension)
+- [x] No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds
+- [x] Single timing group with U: only `treated_vs_never` comparisons
+- [x] Survey design composes cleanly with exact mode and warn+remap
+- [ ] R `bacondecomp::bacon()` parity at `atol=1e-6` — R generator script committed; JSON goldens pending follow-up R install (see TODO.md)
 
-**Outstanding for promotion:**
-- **Substantive review pass — first target chosen during the 2026-05-15 methodology-review refresh session.** Read Goodman-Bacon (2021), audit `bacon.py` against the paper's decomposition (Equation 11, weight construction in Section 3, three comparison types in Section 4), generate R parity fixtures via `bacondecomp::bacon()`, write `tests/test_methodology_bacon.py` with paper-equation-numbered assertions, populate Verified Components / Corrections Made / Deviations here.
-- Paper review under `docs/methodology/papers/goodman-bacon-2021-review.md`
-- R parity fixture against `bacondecomp::bacon()` covering treated_vs_never, earlier_vs_later, later_vs_earlier weight buckets and their relative shares
-- Verify the REGISTRY Implementation Checklist (all rows currently unchecked except the survey-design Phase 3 row)
+**Test Coverage:**
+- 24 methodology tests in `tests/test_methodology_bacon.py` across 6 classes (21 active + 3 R-parity tests that skip on missing goldens)
+- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases)
+
+**R Comparison Results:**
+- **Pending**: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time. R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`. JSON goldens land at `benchmarks/data/r_bacondecomp_golden.json` once generated. `tests/test_methodology_bacon.py::TestBaconParityR` skips with a pointer until then. Tracked in TODO.md follow-up row.
+
+**Corrections Made:**
+1. **Theorem 1 exact-weights rewrite** (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g — it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. **Rewrote** the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract.
+2. **Default `weights` flipped from `"approximate"` to `"exact"`** at three entry points: `BaconDecomposition.__init__()` (`bacon.py:397`), `bacon_decompose()` convenience function (`bacon.py:1064`), `TwoWayFixedEffects.decompose()` (`twfe.py:684`). The paper-faithful Theorem 1 weights are now the default; the simplified approximate path remains opt-in via explicit `weights="approximate"`. `diff_diff/diagnostic_report.py:1740` (production diagnostic surface) was updated to pass explicit `weights="exact"`.
+3. **Always-treated warn+remap via internal column** (`bacon.py:fit()`, lines ~487-525). Paper footnote 11 puts units with `t_i < 1` in `U`, but `bacon.py` previously only mapped `first_treat ∈ {0, np.inf}` into U. Added detection using ordered-time logic on the **time axis** (`first_treat <= min(time)` while excluding the never-treated sentinels `0` and `np.inf`) with `UserWarning` and automatic remap via an internal column (`__bacon_first_treat_internal__`), preserving the user's `first_treat` column unchanged. Detection handles event-time-encoded panels (`time ∈ [-2,..,3]`) correctly; the `0` sentinel restriction applies only to `first_treat`. Count exposed via new `BaconDecompositionResults.n_always_treated_remapped` field.
+
+**Deviations from R's `bacondecomp::bacon()`:**
+1. **Unbalanced panel acceptance** (library extension): R errors on unbalanced panels; Python emits a `UserWarning` and decomposes. The paper's Appendix A proof assumes balanced panels — decomposition on unbalanced panels is approximate to Theorem 1.
+2. **Approximate weight mode** (Python-only optimization): `weights="approximate"` is a library-only fast path with simplified variance computation, not present in R. Users who want Python-R numerical parity should pass `weights="exact"` (the new default).
+3. **NaN for invalid inference fields not applicable**: the decomposition is deterministic; there are no SE/p-value fields on the comparison output. The `decomposition_error` field is a finite float (zero in well-conditioned cases).
 
 ---
 

diff --git a/TODO.md b/TODO.md
@@ -74,7 +74,7 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
-| BaconDecomposition: substantive methodology audit pass against Goodman-Bacon (2021). Paper review on file at `docs/methodology/papers/goodman-bacon-2021-review.md` includes a proposed Methodology Registry Entry, but the `REGISTRY.md` `## BaconDecomposition` section (lines ~2598-2654) still carries the older contract. Audit pass needs to (a) verify `bacon.py` matches Theorem 1 / Eqs. 10a-g exactly, (b) decide how to handle genuinely always-treated units (`0 < first_treat <= min(time)`) — paper convention puts them in `U`, but `bacon.py` currently treats only `first_treat ∈ {0, np.inf}` as the `U` bucket, (c) generate R parity fixtures via `bacondecomp::bacon()`, (d) write `tests/test_methodology_bacon.py`, (e) replace the REGISTRY entry with the proposed text, (f) populate Verified Components / Corrections Made / Deviations in `METHODOLOGY_REVIEW.md` and flip status to Complete. | `diff_diff/bacon.py`, `docs/methodology/REGISTRY.md`, `tests/test_methodology_bacon.py`, `METHODOLOGY_REVIEW.md` | follow-up | Medium |
+| BaconDecomposition R parity goldens: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time (2026-05-16). R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`, writing `benchmarks/data/r_bacondecomp_golden.json`. `tests/test_methodology_bacon.py::TestBaconParityR` (3 tests) skips with a pointer until the JSON lands. The PR-B audit substantiates Theorem 1 (Eqs. 7-9 + 10e-g) via hand-calculable + machine-precision identity tests; R parity is desirable as a cross-language anchor but not the only substantiation. Mirrors StaggeredTripleDifference precedent (PR #245). | `benchmarks/R/generate_bacon_golden.R`, `benchmarks/data/r_bacondecomp_golden.json` (TBD), `tests/test_methodology_bacon.py::TestBaconParityR` | follow-up | Medium |
 | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low |
 | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium |
 | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low |

diff --git a/benchmarks/R/generate_bacon_golden.R b/benchmarks/R/generate_bacon_golden.R
@@ -0,0 +1,234 @@
+#!/usr/bin/env Rscript
+# Generate R bacondecomp parity goldens for diff-diff BaconDecomposition.
+#
+# Requires: install.packages("bacondecomp")  (CRAN; main function is bacon())
+#           install.packages("jsonlite")
+# Output:   ../data/r_bacondecomp_golden.json
+#
+# The diff-diff BaconDecomposition implementation (`diff_diff/bacon.py`) with
+# the default ``weights="exact"`` is expected to match the values in this JSON
+# to atol=1e-6 on the per-component (treated, control, type) tuples, and to
+# match the TWFE coefficient to the same tolerance. The ``weights="approximate"``
+# path is a library-only optimization and is NOT covered by this parity harness.
+#
+# Three fixtures:
+#   1. uniform_3groups_with_never_treated — 3 timing groups + never-treated U;
+#      exercises all three comparison types (treated/never, earlier/later,
+#      later/earlier).
+#   2. two_groups_no_never_treated — 2 timing groups only; tests the
+#      timing-only decomposition where the s_{kU} terms drop.
+#   3. always_treated_remapped — 3 timing groups + 1 always-treated cohort
+#      (first_treat = 1). Validates that Python's warn+remap of t_i < 1 into
+#      U matches R bacondecomp's native behavior.
+#
+# Run:
+#   cd benchmarks/R && Rscript generate_bacon_golden.R
+
+suppressPackageStartupMessages({
+  library(bacondecomp)
+  library(jsonlite)
+})
+
+stopifnot(packageVersion("bacondecomp") >= "0.1.0")
+
+# ---------------------------------------------------------------------------
+# DGP helpers
+# ---------------------------------------------------------------------------
+
+# Build a balanced panel with absorbing treatment.
+#   n_units       : units per timing group (excluding never-treated)
+#   n_periods     : panel length (1..T)
+#   cohort_times  : vector of first-treatment times, one per cohort
+#   always_treated_count : optional cohort treated at first_treat = 1
+#                          (i.e., always-treated for the observable window)
+#   never_treated_count  : units with first_treat = 0
+#   true_effect          : constant ATT
+#   seed                 : reproducibility
+build_panel <- function(n_units_per_cohort, n_periods, cohort_times,
+                        always_treated_count = 0L, never_treated_count = 0L,
+                        true_effect = 2.0, seed = 42L) {
+  set.seed(seed)
+  units <- list()
+  uid <- 1L
+
+  # Always-treated cohort (first_treat = 1; treated in every observable period)
+  if (always_treated_count > 0L) {
+    for (i in seq_len(always_treated_count)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = 1L
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  # Never-treated U
+  if (never_treated_count > 0L) {
+    for (i in seq_len(never_treated_count)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = 0L
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  # Treated cohorts
+  for (g in cohort_times) {
+    for (i in seq_len(n_units_per_cohort)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = as.integer(g)
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  df <- do.call(rbind, units)
+
+  # Treatment indicator: D_{it} = 1 iff first_treat in {1,..,T} AND time >= first_treat.
+  df$D <- as.integer(df$first_treat > 0L & df$time >= df$first_treat)
+
+  # Outcome: unit FE + linear time + constant treatment effect + noise.
+  unit_fe <- rnorm(uid - 1L, sd = 2.0)
+  df$y <- unit_fe[df$unit] +
+          0.1 * df$time +
+          true_effect * df$D +
+          rnorm(nrow(df), sd = 0.5)
+
+  df
+}
+
+# ---------------------------------------------------------------------------
+# Extract bacondecomp::bacon() output into a fixture-shaped list.
+# ---------------------------------------------------------------------------
+
+extract_bacon <- function(df, fixture_name) {
+  # bacondecomp::bacon takes the OUTCOME ~ TREATMENT formula plus id_var/time_var.
+  # It returns a data.frame with columns: treated, untreated, estimate, weight,
+  # plus a `type` column (e.g. "Both Treated", "Treated vs Untreated"), and an
+  # attribute beta_hat_w (the weighted sum, which equals the TWFE coefficient).
+  res <- bacondecomp::bacon(
+    formula  = y ~ D,
+    data     = df,
+    id_var   = "unit",
+    time_var = "time"
+  )
+
+  # When the data contains a never-treated group, bacon() returns a list with
+  # $two_by_twos (the per-component table) and $Omega (the variance-weighted
+  # contributions). Without never-treated, it returns the data.frame directly.
+  if (is.list(res) && !is.data.frame(res)) {
+    components_df <- res$two_by_twos
+    twfe_coef <- as.numeric(attr(res, "beta_hat_w"))
+    # Fallback: re-derive TWFE from the components if attr is missing.
+    if (is.null(twfe_coef) || length(twfe_coef) == 0L) {
+      twfe_coef <- sum(components_df$estimate * components_df$weight)
+    }
+  } else {
+    components_df <- res
+    twfe_coef <- sum(components_df$estimate * components_df$weight)
+  }
+
+  # Components vary across bacondecomp versions; normalize the column names.
+  cols <- names(components_df)
+  treated_col <- if ("treated" %in% cols) "treated" else "g1"
+  untreated_col <- if ("untreated" %in% cols) "untreated" else "g2"
+  estimate_col <- if ("estimate" %in% cols) "estimate" else "Estimate"
+  weight_col <- if ("weight" %in% cols) "weight" else "Weight"
+  type_col <- if ("type" %in% cols) "type" else NA_character_
+
+  components <- lapply(seq_len(nrow(components_df)), function(i) {
+    list(
+      treated_group   = as.numeric(components_df[[treated_col]][i]),
+      control_group   = as.numeric(components_df[[untreated_col]][i]),
+      estimate        = as.numeric(components_df[[estimate_col]][i]),
+      weight          = as.numeric(components_df[[weight_col]][i]),
+      type            = if (!is.na(type_col))
+                          as.character(components_df[[type_col]][i])
+                        else NA_character_
+    )
+  })
+
+  weights_sum <- sum(sapply(components, function(c) c$weight))
+
+  list(
+    panel = list(
+      unit        = as.integer(df$unit),
+      time        = as.integer(df$time),
+      y           = as.numeric(df$y),
+      first_treat = as.integer(df$first_treat),
+      treated     = as.integer(df$D)
+    ),
+    r_twfe_coef    = twfe_coef,
+    r_components   = components,
+    r_weights_sum  = weights_sum,
+    n_components   = length(components)
+  )
+}
+
+# ---------------------------------------------------------------------------
+# Fixtures
+# ---------------------------------------------------------------------------
+
+cat("Building fixture 1: uniform_3groups_with_never_treated...\n")
+df1 <- build_panel(
+  n_units_per_cohort   = 30L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 4L, 5L),
+  always_treated_count = 0L,
+  never_treated_count  = 30L,
+  true_effect          = 2.0,
+  seed                 = 101L
+)
+fixture_1 <- extract_bacon(df1, "uniform_3groups_with_never_treated")
+
+cat("Building fixture 2: two_groups_no_never_treated...\n")
+df2 <- build_panel(
+  n_units_per_cohort   = 30L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 5L),
+  always_treated_count = 0L,
+  never_treated_count  = 0L,
+  true_effect          = 2.0,
+  seed                 = 202L
+)
+fixture_2 <- extract_bacon(df2, "two_groups_no_never_treated")
+
+cat("Building fixture 3: always_treated_remapped...\n")
+# 3 timing-cohorts + 5 always-treated units (first_treat = 1, i.e., treated
+# in every observable period) + 30 never-treated. R's bacondecomp natively
+# groups the first_treat=1 cohort with U (since they are treated throughout
+# every observable period and never serve as a within-window control), which
+# matches what diff-diff's warn+remap does in Python.
+df3 <- build_panel(
+  n_units_per_cohort   = 25L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 4L, 5L),
+  always_treated_count = 5L,
+  never_treated_count  = 25L,
+  true_effect          = 2.0,
+  seed                 = 303L
+)
+fixture_3 <- extract_bacon(df3, "always_treated_remapped")
+
+# ---------------------------------------------------------------------------
+# Write JSON
+# ---------------------------------------------------------------------------
+
+out <- list(
+  meta = list(
+    generated_at         = format(Sys.Date()),
+    bacondecomp_version  = as.character(packageVersion("bacondecomp")),
+    r_version            = R.version.string,
+    description          = paste(
+      "Goodman-Bacon (2021) decomposition parity goldens for diff-diff",
+      "BaconDecomposition. Parity target: atol=1e-6 on per-component",
+      "(treated, control, type) tuples plus the TWFE coefficient."
+    )
+  ),
+  uniform_3groups_with_never_treated = fixture_1,
+  two_groups_no_never_treated        = fixture_2,
+  always_treated_remapped            = fixture_3
+)
+
+out_path <- "../data/r_bacondecomp_golden.json"
+write_json(out, out_path, pretty = TRUE, digits = NA, auto_unbox = TRUE)
+cat(sprintf("Wrote %s\n", out_path))