diff --git a/CHANGELOG.md b/CHANGELOG.md
index 2e845ac3..964e35b9 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,10 +8,13 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **BaconDecomposition: Goodman-Bacon (2021) methodology audit (PR-B).** Closes the BaconDecomposition row in `METHODOLOGY_REVIEW.md` (status flipped from **In Progress** → **Complete (R parity goldens pending)**). Builds on the PR #451 paper review at `docs/methodology/papers/goodman-bacon-2021-review.md`. **Audit outcomes:** (1) Rewrote `_recompute_exact_weights` in `bacon.py` to actually implement Theorem 1 (Eqs. 7-9 + 10e-g) — the prior "exact" implementation was missing the `(1-n_kU)` factor in the subsample variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper; the post-hoc sum-to-1 normalization masked the relative-weight error but produced ~0.3% decomposition error vs TWFE on a 3-cohort + never-treated DGP. The rewrite computes the exact numerators of Eqs. 10e/f/g and lets the post-hoc normalization handle the `V̂^D` denominator (Theorem 1's identity guarantees `V̂^D = Σ numerators`). The TWFE-vs-weighted-sum identity now holds at `atol=1e-10` on both noisy and hand-calculable DGPs. (2) Added always-treated warn+remap per paper footnote 11: units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` (untreated) bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning`. Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (event-time encodings) are handled correctly; the `0` sentinel restriction applies only to `first_treat`, not to `time`, and a real treatment cohort with `first_treat == 0` would still be folded into U today (re-label such cohorts to a non-sentinel value before fitting). The user's original `first_treat` column is preserved unchanged. The count is surfaced as a new `BaconDecompositionResults.n_always_treated_remapped` dataclass field, rendered in `summary()` output when nonzero. **`n_never_treated` reports TRUE never-treated only**, computed from the original user column before remap — remapped always-treated units appear separately as `n_always_treated_remapped`, no double-counting. (3) New methodology test file `tests/test_methodology_bacon.py` (~24 tests across 6 classes: `TestBaconHandCalculation` hand-checks Eqs. 7-9 + 10b-d on a minimal balanced panel at `atol=1e-10`; `TestBaconParityR` skips with a pointer when goldens missing; `TestBaconAlwaysTreatedRemap` regression-tests warn+remap mechanics including user-data-preservation; `TestBaconEdgeCases` exercises no-untreated, single-cohort, unbalanced panel, constant-ATT recovery; `TestBaconWeightModes` locks the new exact-is-default contract; `TestBaconSurveyDesignNarrowing` confirms survey_design composes with exact mode and warn+remap). (4) R `bacondecomp::bacon()` parity generator committed at `benchmarks/R/generate_bacon_golden.R` covering three DGP fixtures (3-groups-with-U, 2-groups-no-U, always-treated-remapped); JSON goldens deferred until `bacondecomp` R package is installed (parity tests skip cleanly with an explicit pointer). (5) `docs/methodology/REGISTRY.md` `## BaconDecomposition` block replaced with the paper-review-sourced entry plus three new sub-notes: weight modes (exact vs approximate), always-treated remap, R parity status. **Explicit removal:** the prior REGISTRY block's "Weights may be negative for later-vs-earlier comparisons" claim was incorrect per Theorem 1 (decomposition weights are strictly positive and sum to 1; negative weights are an estimand-level phenomenon, not estimator-level) and is dropped from the new entry. Closes the BaconDecomposition follow-up tracked at `TODO.md` (the prior row added in PR #451 is replaced by a narrower R-parity-goldens deferral row).
 - **`SpilloverDiD` — ring-indicator spillover-aware DiD (Butts 2021).** New standalone estimator at `diff_diff/spillover.py` implementing two-stage Gardner methodology with ring-indicator covariates that identify direct effect on treated (`tau_total`) alongside per-ring spillover effects on near-control units (`delta_j`). Documented synthesis of ingredients (no single published software covers the exact recipe — `did2s` implements Gardner two-stage without rings; the Butts ring estimator has no R/Stata package): Butts (2021) Section 5 / Table 2 identification, Gardner (2022) two-stage residualize-then-fit, and the Conley spatial-HAC vcov shipped in 3.3.3. Handles both panel non-staggered (Equations 5/6/8) and Section 5 staggered timing in one estimator — non-staggered is the special case where all treated units share an onset time. **API:** `SpilloverDiD(rings=[0, 50, 100, 200], conley_coords=("lat","lon"), ...).fit(data, outcome="y", unit="unit", time="t", treatment="D")` (binary D auto-converted to `first_treat`) or `.fit(..., first_treat="first_treat")` (Gardner convention). Result: `SpilloverDiDResults(DiDResults)` with `.att` = `tau_total`, `.spillover_effects` (per-ring `pd.DataFrame` with `coef`/`se`/`t_stat`/`p_value`/`ci_low`/`ci_high`), `.ring_breakpoints`, `.d_bar`, `.n_units_ever_in_ring`, `.n_far_away_obs`, `.is_staggered`. `.coefficients` exposes all `(1+K)` stage-2 entries (`"treatment"` + `"_spillover_<ring_label>"`) plus an `"ATT"` alias keyed to vcov columns. **Methodology spec (committed):** stage-2 regressor is the time-varying `(1 - D_it) * Ring_{it,j}` form (paper page 12's `S_it = S_i * 1{t >= t_treat}` notation; Section 5 Table 2's `S^k_{it}` / `Ring^k_{it,j}`). Reading the literal unit-static `(1 - D_it) * S_i` from Equation 5 is algebraically rank-deficient under TWFE (`(1-D_it) * S_i = S_i - D_it`, with `S_i` absorbed by `mu_i`, leaving `-D_it`); only the time-varying form supports the paper's identification (Proposition 2.3). Stage-1 subsample uses Butts' STRICTER `Omega_0 = {D_it = 0 AND S_it = 0}` (untreated AND unexposed), not TwoStageDiD's `{D_it = 0}` alone — this prevents spillover-contaminated near-controls in pre/post periods from biasing the time FE. **Gardner identity (non-staggered):** a 20-seed deterministic regression test pins `SpilloverDiD.att` against a direct single-stage TWFE ring regression on the full sample (`y ~ mu_i + lambda_t + tau * D_it + sum_j delta_j * (1 - D_it) * Ring_{it,j}`) at `atol=1e-10` — empirically bit-identical, so the reported non-staggered `tau_total` IS the Butts Eqs. 4-6 estimator. **Identification-check policy (period strict, unit warn-and-drop, plus connectivity):** every period must have at least one Omega_0 row (hard `ValueError` — dropping a period removes all units' cross-time identification). Units lacking Omega_0 rows (e.g. baseline-treated units with `D_it = 1` at every observed `t`) are warned-and-dropped: their unit FE is NaN, residualization writes NaN on their rows, and the downstream finite-mask path excludes them from stage 2 — mirrors `TwoStageDiD`'s always-treated convention. Additionally, the supported-units bipartite graph (units linked by shared Omega_0 periods) must form a single connected component; `K > 1` components raise `ValueError` because the FE solver would return only component-specific constants and residualization would silently mix them across components (defense-in-depth — under absorbing treatment the disconnected case may be unreachable through the upstream validators, but the check future-proofs Wave B follow-ups). **Public API restrictions (Wave B MVP):** `covariates=` raises `NotImplementedError` because Gardner-style two-stage requires covariate effects estimated on the untreated-and-unexposed subsample at stage 1 (appending raw covariates only at stage 2 silently biases `tau_total` / `delta_j` on panels with time-varying covariates); non-absorbing / reversible treatment patterns (e.g. `[0, 1, 0]`) raise `ValueError` rather than being silently coerced into "treated from first 1 onward"; non-constant `first_treat` values across rows of the same unit raise `ValueError`; `conley_coords` is required on every fit path (not just `vcov_type="conley"`) because ring construction always uses it. **Far-away control identification:** uses CURRENT-period untreated status (`D_it = 0`) rather than never-treated-only, so all-eventually-treated staggered designs (no never-treated units) can identify the counterfactual via not-yet-treated far-away rows. **Variance (Wave B MVP):** stage-2 OLS variance via `solve_ols` (HC1 / Conley / cluster paths all flow through). The Gardner GMM first-stage uncertainty correction is NOT applied at stage 2 in this PR (documented limitation; planned follow-up extends `two_stage.py::_compute_gmm_variance` to accept a Conley kernel matrix in place of HC1's identity at the influence-function outer-product step). **Deferred features (planned follow-ups):** `event_study=True` per-event-time × ring coefficients (Butts Table 2), `survey_design=` integration, `ring_method="count"` (count-of-treated-in-ring), data-driven `d_bar` selection (Butts 2021b / Butts 2023 JUE Insight), Gardner GMM first-stage correction at stage 2, sparse staggered ring-distance path. **Tests:** `tests/test_spillover.py` (157 tests across ring-construction primitives, validators, fit integration, raw-data invariant, identification MC — non-staggered DGP at 50 seeds + 200-seed `@pytest.mark.slow` variant recovers both `tau_total` and `delta_1`; staggered DGP at 30 seeds anchors both `tau_total` and `delta_1` — Conley plumbing (verifies `solve_ols` is called with `vcov_type="conley"` + Conley kwargs, no silent HC1 fallback), Gardner identity bit-identity, coefficients-vs-vcov alignment, warn-and-drop, rank_deficient_action validation, Omega_0 bipartite-graph connectivity, anticipation behavior on both fit paths). DGP factories `tests/_dgp_utils.py::generate_butts_nonstaggered_dgp` / `generate_butts_staggered_dgp` satisfy Butts Assumptions 1/3/5/7 by construction.
 - **`ChaisemartinDHaultfoeuille.predict_het` × `placebo`: R-parity on both global and per-path surfaces.** R-verified — `did_multiplegt_dyn(predict_het, placebo)` emits heterogeneity OLS results on backward (placebo) horizons via R's `DIDmultiplegtDYN:::did_multiplegt_main` placebo block (`effect = matrix(-i, ...)` rbind site); the same block runs per-by_level under `did_multiplegt_dyn(by_path, predict_het, placebo)`, so both global `res$results$predict_het` and per-by_level `res$by_level_i$results$predict_het` slots emit backward rows. R's predict_het syntax with `placebo > 0` requires the `c(-1)` sentinel in the horizon vector to trigger "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" — passing positive-only horizons errors with "specified numbers in predict_het that exceed the number of placebos". Python mirrors via `_compute_heterogeneity_test(..., placebo=L_max)` (set automatically from `self.placebo` at both global and per-path call sites in `fit()`) — the function iterates forward (1..L_max) and backward (-1..-L_max) horizons in a single loop with an explicit `out_idx < 0` eligibility guard for backward horizons whose `F_g` is too small (would otherwise silently misread `N_mat` via numpy negative indexing). `results.heterogeneity_effects` uses negative-int keys for backward horizons; `path_heterogeneity_effects` does the same per path. Placebo rows in `to_dataframe(level="by_path")` have non-NaN `het_*` columns when `placebo=True` and `heterogeneity=` are both set. **Survey gate (warn + skip):** `survey_design + placebo + heterogeneity` emits a `UserWarning` at fit-time and falls back to forward-horizon-only heterogeneity on both surfaces — the Binder TSL cell-period allocator's REGISTRY justification is tied to **post-period** attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, a separate library-extension claim that needs its own derivation. Forward-horizon `predict_het + survey_design` continues to work unchanged on both global and per-path surfaces. The function-level `_compute_heterogeneity_test` keeps a per-iteration `NotImplementedError` backstop for direct callers that bypass fit(). Pre-period allocator derivation deferred to a follow-up methodology PR (tracked in TODO.md). R parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneityWithPlacebo` (scenario 23, `multi_path_reversible_predict_het_with_placebo_global`, `placebo=2, effects=3, no by_path`) and `::TestDCDHDynRParityByPathHeterogeneityWithPlacebo` (scenario 22, same DGP plus `by_path=3`); pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` for `beta` / `se` / `t_stat` / `n_obs` and `INFERENCE_RTOL=1e-4` for `p_value` / `conf_int` across 3 paths × (3 forward + 2 placebo) = 15 horizons + 1 global × 5 horizons. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo` (placebo het column population, survey-gate warn+skip behavior, forward+survey anti-regression, `out_idx<0` eligibility guard, single-path telescope `path_heterogeneity_effects[(only_path,)] == heterogeneity_effects` bit-exactly, summary rendering, direct-call `NotImplementedError` backstop). Closes TODO #422.
 
 ### Changed
+- **BaconDecomposition: default `weights` flipped from `"approximate"` to `"exact"` (PR-B methodology audit).** The new default uses Goodman-Bacon (2021) Theorem 1's exact Eqs. 7-9 + 10e-g weights, **intended to match** R `bacondecomp::bacon()` (direct R parity at `atol=1e-6` pending the R `bacondecomp` install; validated via hand-calculation + TWFE-vs-weighted-sum identity at `atol=1e-10` — see TODO.md). The `weights="approximate"` path remains available as an opt-in fast diagnostic for speed-sensitive loops; its numerical output may differ from R. Three entry points were flipped: `BaconDecomposition(weights="exact")` (`bacon.py:397`), `bacon_decompose(weights="exact")` (`bacon.py:1064`), `TwoWayFixedEffects.decompose(weights="exact")` (`twfe.py:684`). **Behavior change for users not passing explicit `weights=`**: the decomposition weights are now paper-faithful by default. Users who depended on the previous `"approximate"` numerics for diagnostic plots or comparison-type weight shares can preserve the old behavior by passing `weights="approximate"` explicitly. **Survey-design behavior change**: `weights="exact"` (now the default) routes through `_validate_unit_constant_survey`, which rejects survey designs whose weights / strata / PSU / FPC columns vary within a unit across periods (the exact-mode path collapses to per-unit aggregation via `groupby().first()`). The previous `weights="approximate"` default tolerated time-varying within-unit survey weights via observation-level weighted means. Users whose survey-weighted Bacon calls used time-varying within-unit weights must now either (a) collapse their weights to be unit-constant or (b) pass explicit `weights="approximate"` to retain the legacy obs-level path. The production diagnostic surface (`diff_diff/diagnostic_report.py:1740`) was updated to pass explicit `weights="exact"`. Existing test assertions in `tests/test_bacon.py` continue to pass with the new default; the `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the Theorem 1 algebraic-identity contract.
+
 - **`ChaisemartinDHaultfoeuille.predict_het` inference: t-distribution df threading (closes TODO pilot-412).** `_compute_heterogeneity_test` now passes `df = n_obs - rank(design)` to `safe_inference` on the non-survey OLS path, matching R `did_multiplegt_dyn(predict_het=...)`'s t-distribution inference (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). Pre-PR Python used `df=None` (normal Z critical), producing 0.1-2% rtol gaps on `p_value` and `conf_int` vs R. Parity tolerance tightened on the existing forward-horizon scenarios (`multi_path_reversible_predict_het`, `multi_path_reversible_by_path_predict_het`) from "unpinned" to `INFERENCE_RTOL=1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5`. **Post-drop rank (post-2026-05-16 wrap-up):** the df denominator uses the post-drop numerical rank via `_detect_rank_deficiency`, which `solve_ols` already calls internally. For full-rank designs `rank == n_params` and behavior is bit-identical to the pre-PR `n_obs - n_params` path; for near-rank-deficient designs that `solve_ols` retains rather than NaN-out (e.g., cohort-collinearity at high horizons), the post-drop rank is strictly lower and the post-PR `df` is larger, matching R's `lm()` convention. The Z-vs-t REGISTRY deviation note is replaced with an "R parity (post-2026-05-15 df threading)" positive-claim note.
 
 - **`ChaisemartinDHaultfoeuille.by_path` negative-baseline path regression coverage.** New `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_baseline_path_supported` exercises switchers with `D_{g,1} = -1` and asserts that `path_effects` correctly contains negative-baseline tuple keys (e.g., `(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). This closes the test-coverage gap from PR #419: the existing `test_negative_integer_D_supported` only covered paths with negative values in non-baseline positions (e.g., `(0, -1, -1, -1)`), which does not trigger R's documented `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key matching is correct under any baseline value; this test pins the contract. No R-parity fixture is added because R is the buggy side on this regime — the deviation is documented in the REGISTRY non-binary treatment Note.
diff --git a/METHODOLOGY_REVIEW.md b/METHODOLOGY_REVIEW.md
index 7607f706..e5dc409a 100644
--- a/METHODOLOGY_REVIEW.md
+++ b/METHODOLOGY_REVIEW.md
@@ -78,7 +78,7 @@ The catalog grew incrementally over several quarters, so formats vary across the
 
 | Tool | Module | R Reference | Status | Last Review |
 |------|--------|-------------|--------|-------------|
-| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **In Progress** | — |
+| BaconDecomposition | `bacon.py` | `bacondecomp::bacon()` | **Complete** (R parity pending) | 2026-05-16 |
 | HonestDiD | `honest_did.py` | `HonestDiD` package | **Complete** | 2026-04-01 |
 | PreTrendsPower | `pretrends.py` | `pretrends` package | **In Progress** | — |
 | PowerAnalysis | `power.py` | `pwr` / `DeclareDesign` | **In Progress** | — |
@@ -909,18 +909,41 @@ and covariate-adjusted specifications.)
 | Module | `bacon.py` |
 | Primary Reference | Goodman-Bacon (2021), *Difference-in-differences with variation in treatment timing*, J. Econometrics 225(2), 254-277 |
 | R Reference | `bacondecomp::bacon()` |
-| Status | **In Progress** |
-| Last Review | — |
+| Status | **Complete** (R parity goldens pending) |
+| Last Review | 2026-05-16 |
 
-**Documentation in place:**
-- REGISTRY.md section: `## BaconDecomposition` (three comparison types — treated_vs_never, earlier_vs_later, later_vs_earlier; weight construction; TWFE reconstitution; weighted survey path under Phase 3)
-- Implementation: `tests/test_bacon.py` (basic decomposition, weight properties, integration with `TwoWayFixedEffects.decompose()`)
+**Verified Components:**
+- [x] Theorem 1 decomposition identity: `β̂^DD = Σ s · β̂^{2x2}` at `atol=1e-10` (hand-calculable + noisy DGPs)
+- [x] Weight sum-to-1: `Σ s = 1.0` at `atol=1e-10` under `weights="exact"`
+- [x] Three comparison types correctly classified: `treated_vs_never`, `earlier_vs_later`, `later_vs_earlier`
+- [x] Eq. 7 hand-checked: `V̂_{kU}^D = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)` (via weight-ratio test, `atol=1e-10`)
+- [x] Eq. 8 hand-checked: `V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ) · (1-D̄_k)/(1-D̄_ℓ)`
+- [x] Eq. 9 hand-checked: `V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k-D̄_ℓ)/D̄_k`
+- [x] Eq. 10b 2x2 estimator value: hand-calculable panel → β̂_{kU}^{2x2} = ATT exactly
+- [x] Always-treated remap to U (paper footnote 11): `first_treat <= min(time)` (excluding never-treated sentinels `0` and `np.inf`) units auto-remapped via internal column, user's data preserved, count exposed on result
+- [x] `weights="exact"` is the default (PR-B 2026-05-16); `weights="approximate"` retained as opt-in
+- [x] Unbalanced panel: accepted with `UserWarning` (paper assumes balanced; library extension)
+- [x] No untreated group: `s_{kU}` terms drop, weights renormalize, sum-to-1 still holds
+- [x] Single timing group with U: only `treated_vs_never` comparisons
+- [x] Survey design composes cleanly with exact mode and warn+remap
+- [ ] R `bacondecomp::bacon()` parity at `atol=1e-6` — R generator script committed; JSON goldens pending follow-up R install (see TODO.md)
 
-**Outstanding for promotion:**
-- **Substantive review pass — first target chosen during the 2026-05-15 methodology-review refresh session.** Read Goodman-Bacon (2021), audit `bacon.py` against the paper's decomposition (Equation 11, weight construction in Section 3, three comparison types in Section 4), generate R parity fixtures via `bacondecomp::bacon()`, write `tests/test_methodology_bacon.py` with paper-equation-numbered assertions, populate Verified Components / Corrections Made / Deviations here.
-- Paper review under `docs/methodology/papers/goodman-bacon-2021-review.md`
-- R parity fixture against `bacondecomp::bacon()` covering treated_vs_never, earlier_vs_later, later_vs_earlier weight buckets and their relative shares
-- Verify the REGISTRY Implementation Checklist (all rows currently unchecked except the survey-design Phase 3 row)
+**Test Coverage:**
+- 24 methodology tests in `tests/test_methodology_bacon.py` across 6 classes (21 active + 3 R-parity tests that skip on missing goldens)
+- 32 existing tests in `tests/test_bacon.py` (basic decomposition, weight properties, weights-parameter API, TWFE integration, visualization, balanced-panel warnings, edge cases)
+
+**R Comparison Results:**
+- **Pending**: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time. R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`. JSON goldens land at `benchmarks/data/r_bacondecomp_golden.json` once generated. `tests/test_methodology_bacon.py::TestBaconParityR` skips with a pointer until then. Tracked in TODO.md follow-up row.
+
+**Corrections Made:**
+1. **Theorem 1 exact-weights rewrite** (`bacon.py:_recompute_exact_weights`, lines ~740-880). The previous "exact" mode implementation did not actually compute Eqs. 7-9 / 10e-g — it was missing the `(1 - n_kU)` factor in the within-subsample treatment variance, did not square the sample share, and added an extraneous `unit_share` factor not present in the paper. The post-hoc sum-to-1 normalization masked the relative-weight error but produced a decomposition error of ~0.3% (0.007 absolute) against TWFE on a 3-cohort + never-treated DGP. **Rewrote** the function to compute the exact numerators of Eqs. 10e/f/g (with proper Eqs. 7-9 variances) and let the post-hoc normalization handle the `V̂^D` denominator (Theorem 1 identity guarantees `V̂^D = Σ numerators`). Now matches TWFE at `atol=1e-10`. The existing `test_weighted_sum_equals_twfe` tolerance was tightened from `< 0.1` to `< 1e-10` to lock the contract.
+2. **Default `weights` flipped from `"approximate"` to `"exact"`** at three entry points: `BaconDecomposition.__init__()` (`bacon.py:397`), `bacon_decompose()` convenience function (`bacon.py:1064`), `TwoWayFixedEffects.decompose()` (`twfe.py:684`). The paper-faithful Theorem 1 weights are now the default; the simplified approximate path remains opt-in via explicit `weights="approximate"`. `diff_diff/diagnostic_report.py:1740` (production diagnostic surface) was updated to pass explicit `weights="exact"`.
+3. **Always-treated warn+remap via internal column** (`bacon.py:fit()`, lines ~487-525). Paper footnote 11 puts units with `t_i < 1` in `U`, but `bacon.py` previously only mapped `first_treat ∈ {0, np.inf}` into U. Added detection using ordered-time logic on the **time axis** (`first_treat <= min(time)` while excluding the never-treated sentinels `0` and `np.inf`) with `UserWarning` and automatic remap via an internal column (`__bacon_first_treat_internal__`), preserving the user's `first_treat` column unchanged. Detection handles event-time-encoded panels (`time ∈ [-2,..,3]`) correctly; the `0` sentinel restriction applies only to `first_treat`. Count exposed via new `BaconDecompositionResults.n_always_treated_remapped` field.
+
+**Deviations from R's `bacondecomp::bacon()`:**
+1. **Unbalanced panel acceptance** (library extension): R errors on unbalanced panels; Python emits a `UserWarning` and decomposes. The paper's Appendix A proof assumes balanced panels — decomposition on unbalanced panels is approximate to Theorem 1.
+2. **Approximate weight mode** (Python-only optimization): `weights="approximate"` is a library-only fast path with simplified variance computation, not present in R. Users who want Python-R numerical parity should pass `weights="exact"` (the new default).
+3. **NaN for invalid inference fields not applicable**: the decomposition is deterministic; there are no SE/p-value fields on the comparison output. The `decomposition_error` field is a finite float (zero in well-conditioned cases).
 
 ---
 
diff --git a/TODO.md b/TODO.md
index 2e336f97..41693602 100644
--- a/TODO.md
+++ b/TODO.md
@@ -74,7 +74,7 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Issue | Location | PR | Priority |
 |-------|----------|----|----------|
-| BaconDecomposition: substantive methodology audit pass against Goodman-Bacon (2021). Paper review on file at `docs/methodology/papers/goodman-bacon-2021-review.md` includes a proposed Methodology Registry Entry, but the `REGISTRY.md` `## BaconDecomposition` section (lines ~2598-2654) still carries the older contract. Audit pass needs to (a) verify `bacon.py` matches Theorem 1 / Eqs. 10a-g exactly, (b) decide how to handle genuinely always-treated units (`0 < first_treat <= min(time)`) — paper convention puts them in `U`, but `bacon.py` currently treats only `first_treat ∈ {0, np.inf}` as the `U` bucket, (c) generate R parity fixtures via `bacondecomp::bacon()`, (d) write `tests/test_methodology_bacon.py`, (e) replace the REGISTRY entry with the proposed text, (f) populate Verified Components / Corrections Made / Deviations in `METHODOLOGY_REVIEW.md` and flip status to Complete. | `diff_diff/bacon.py`, `docs/methodology/REGISTRY.md`, `tests/test_methodology_bacon.py`, `METHODOLOGY_REVIEW.md` | follow-up | Medium |
+| BaconDecomposition R parity goldens: `bacondecomp` R package not installed in the local R 4.5.2 library at PR-B authoring time (2026-05-16). R generator script committed at `benchmarks/R/generate_bacon_golden.R`; running it requires `install.packages("bacondecomp")` + `install.packages("jsonlite")` then `cd benchmarks/R && Rscript generate_bacon_golden.R`, writing `benchmarks/data/r_bacondecomp_golden.json`. `tests/test_methodology_bacon.py::TestBaconParityR` (3 tests) skips with a pointer until the JSON lands. The PR-B audit substantiates Theorem 1 (Eqs. 7-9 + 10e-g) via hand-calculable + machine-precision identity tests; R parity is desirable as a cross-language anchor but not the only substantiation. Mirrors StaggeredTripleDifference precedent (PR #245). | `benchmarks/R/generate_bacon_golden.R`, `benchmarks/data/r_bacondecomp_golden.json` (TBD), `tests/test_methodology_bacon.py::TestBaconParityR` | follow-up | Medium |
 | dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low |
 | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium |
 | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low |
diff --git a/benchmarks/R/generate_bacon_golden.R b/benchmarks/R/generate_bacon_golden.R
new file mode 100644
index 00000000..040f2552
--- /dev/null
+++ b/benchmarks/R/generate_bacon_golden.R
@@ -0,0 +1,234 @@
+#!/usr/bin/env Rscript
+# Generate R bacondecomp parity goldens for diff-diff BaconDecomposition.
+#
+# Requires: install.packages("bacondecomp")  (CRAN; main function is bacon())
+#           install.packages("jsonlite")
+# Output:   ../data/r_bacondecomp_golden.json
+#
+# The diff-diff BaconDecomposition implementation (`diff_diff/bacon.py`) with
+# the default ``weights="exact"`` is expected to match the values in this JSON
+# to atol=1e-6 on the per-component (treated, control, type) tuples, and to
+# match the TWFE coefficient to the same tolerance. The ``weights="approximate"``
+# path is a library-only optimization and is NOT covered by this parity harness.
+#
+# Three fixtures:
+#   1. uniform_3groups_with_never_treated — 3 timing groups + never-treated U;
+#      exercises all three comparison types (treated/never, earlier/later,
+#      later/earlier).
+#   2. two_groups_no_never_treated — 2 timing groups only; tests the
+#      timing-only decomposition where the s_{kU} terms drop.
+#   3. always_treated_remapped — 3 timing groups + 1 always-treated cohort
+#      (first_treat = 1). Validates that Python's warn+remap of t_i < 1 into
+#      U matches R bacondecomp's native behavior.
+#
+# Run:
+#   cd benchmarks/R && Rscript generate_bacon_golden.R
+
+suppressPackageStartupMessages({
+  library(bacondecomp)
+  library(jsonlite)
+})
+
+stopifnot(packageVersion("bacondecomp") >= "0.1.0")
+
+# ---------------------------------------------------------------------------
+# DGP helpers
+# ---------------------------------------------------------------------------
+
+# Build a balanced panel with absorbing treatment.
+#   n_units       : units per timing group (excluding never-treated)
+#   n_periods     : panel length (1..T)
+#   cohort_times  : vector of first-treatment times, one per cohort
+#   always_treated_count : optional cohort treated at first_treat = 1
+#                          (i.e., always-treated for the observable window)
+#   never_treated_count  : units with first_treat = 0
+#   true_effect          : constant ATT
+#   seed                 : reproducibility
+build_panel <- function(n_units_per_cohort, n_periods, cohort_times,
+                        always_treated_count = 0L, never_treated_count = 0L,
+                        true_effect = 2.0, seed = 42L) {
+  set.seed(seed)
+  units <- list()
+  uid <- 1L
+
+  # Always-treated cohort (first_treat = 1; treated in every observable period)
+  if (always_treated_count > 0L) {
+    for (i in seq_len(always_treated_count)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = 1L
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  # Never-treated U
+  if (never_treated_count > 0L) {
+    for (i in seq_len(never_treated_count)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = 0L
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  # Treated cohorts
+  for (g in cohort_times) {
+    for (i in seq_len(n_units_per_cohort)) {
+      units[[length(units) + 1L]] <- data.frame(
+        unit = uid, time = seq_len(n_periods), first_treat = as.integer(g)
+      )
+      uid <- uid + 1L
+    }
+  }
+
+  df <- do.call(rbind, units)
+
+  # Treatment indicator: D_{it} = 1 iff first_treat in {1,..,T} AND time >= first_treat.
+  df$D <- as.integer(df$first_treat > 0L & df$time >= df$first_treat)
+
+  # Outcome: unit FE + linear time + constant treatment effect + noise.
+  unit_fe <- rnorm(uid - 1L, sd = 2.0)
+  df$y <- unit_fe[df$unit] +
+          0.1 * df$time +
+          true_effect * df$D +
+          rnorm(nrow(df), sd = 0.5)
+
+  df
+}
+
+# ---------------------------------------------------------------------------
+# Extract bacondecomp::bacon() output into a fixture-shaped list.
+# ---------------------------------------------------------------------------
+
+extract_bacon <- function(df, fixture_name) {
+  # bacondecomp::bacon takes the OUTCOME ~ TREATMENT formula plus id_var/time_var.
+  # It returns a data.frame with columns: treated, untreated, estimate, weight,
+  # plus a `type` column (e.g. "Both Treated", "Treated vs Untreated"), and an
+  # attribute beta_hat_w (the weighted sum, which equals the TWFE coefficient).
+  res <- bacondecomp::bacon(
+    formula  = y ~ D,
+    data     = df,
+    id_var   = "unit",
+    time_var = "time"
+  )
+
+  # When the data contains a never-treated group, bacon() returns a list with
+  # $two_by_twos (the per-component table) and $Omega (the variance-weighted
+  # contributions). Without never-treated, it returns the data.frame directly.
+  if (is.list(res) && !is.data.frame(res)) {
+    components_df <- res$two_by_twos
+    twfe_coef <- as.numeric(attr(res, "beta_hat_w"))
+    # Fallback: re-derive TWFE from the components if attr is missing.
+    if (is.null(twfe_coef) || length(twfe_coef) == 0L) {
+      twfe_coef <- sum(components_df$estimate * components_df$weight)
+    }
+  } else {
+    components_df <- res
+    twfe_coef <- sum(components_df$estimate * components_df$weight)
+  }
+
+  # Components vary across bacondecomp versions; normalize the column names.
+  cols <- names(components_df)
+  treated_col <- if ("treated" %in% cols) "treated" else "g1"
+  untreated_col <- if ("untreated" %in% cols) "untreated" else "g2"
+  estimate_col <- if ("estimate" %in% cols) "estimate" else "Estimate"
+  weight_col <- if ("weight" %in% cols) "weight" else "Weight"
+  type_col <- if ("type" %in% cols) "type" else NA_character_
+
+  components <- lapply(seq_len(nrow(components_df)), function(i) {
+    list(
+      treated_group   = as.numeric(components_df[[treated_col]][i]),
+      control_group   = as.numeric(components_df[[untreated_col]][i]),
+      estimate        = as.numeric(components_df[[estimate_col]][i]),
+      weight          = as.numeric(components_df[[weight_col]][i]),
+      type            = if (!is.na(type_col))
+                          as.character(components_df[[type_col]][i])
+                        else NA_character_
+    )
+  })
+
+  weights_sum <- sum(sapply(components, function(c) c$weight))
+
+  list(
+    panel = list(
+      unit        = as.integer(df$unit),
+      time        = as.integer(df$time),
+      y           = as.numeric(df$y),
+      first_treat = as.integer(df$first_treat),
+      treated     = as.integer(df$D)
+    ),
+    r_twfe_coef    = twfe_coef,
+    r_components   = components,
+    r_weights_sum  = weights_sum,
+    n_components   = length(components)
+  )
+}
+
+# ---------------------------------------------------------------------------
+# Fixtures
+# ---------------------------------------------------------------------------
+
+cat("Building fixture 1: uniform_3groups_with_never_treated...\n")
+df1 <- build_panel(
+  n_units_per_cohort   = 30L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 4L, 5L),
+  always_treated_count = 0L,
+  never_treated_count  = 30L,
+  true_effect          = 2.0,
+  seed                 = 101L
+)
+fixture_1 <- extract_bacon(df1, "uniform_3groups_with_never_treated")
+
+cat("Building fixture 2: two_groups_no_never_treated...\n")
+df2 <- build_panel(
+  n_units_per_cohort   = 30L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 5L),
+  always_treated_count = 0L,
+  never_treated_count  = 0L,
+  true_effect          = 2.0,
+  seed                 = 202L
+)
+fixture_2 <- extract_bacon(df2, "two_groups_no_never_treated")
+
+cat("Building fixture 3: always_treated_remapped...\n")
+# 3 timing-cohorts + 5 always-treated units (first_treat = 1, i.e., treated
+# in every observable period) + 30 never-treated. R's bacondecomp natively
+# groups the first_treat=1 cohort with U (since they are treated throughout
+# every observable period and never serve as a within-window control), which
+# matches what diff-diff's warn+remap does in Python.
+df3 <- build_panel(
+  n_units_per_cohort   = 25L,
+  n_periods            = 6L,
+  cohort_times         = c(3L, 4L, 5L),
+  always_treated_count = 5L,
+  never_treated_count  = 25L,
+  true_effect          = 2.0,
+  seed                 = 303L
+)
+fixture_3 <- extract_bacon(df3, "always_treated_remapped")
+
+# ---------------------------------------------------------------------------
+# Write JSON
+# ---------------------------------------------------------------------------
+
+out <- list(
+  meta = list(
+    generated_at         = format(Sys.Date()),
+    bacondecomp_version  = as.character(packageVersion("bacondecomp")),
+    r_version            = R.version.string,
+    description          = paste(
+      "Goodman-Bacon (2021) decomposition parity goldens for diff-diff",
+      "BaconDecomposition. Parity target: atol=1e-6 on per-component",
+      "(treated, control, type) tuples plus the TWFE coefficient."
+    )
+  ),
+  uniform_3groups_with_never_treated = fixture_1,
+  two_groups_no_never_treated        = fixture_2,
+  always_treated_remapped            = fixture_3
+)
+
+out_path <- "../data/r_bacondecomp_golden.json"
+write_json(out, out_path, pretty = TRUE, digits = NA, auto_unbox = TRUE)
+cat(sprintf("Wrote %s\n", out_path))
diff --git a/diff_diff/bacon.py b/diff_diff/bacon.py
index 90438a9f..b0c690f1 100644
--- a/diff_diff/bacon.py
+++ b/diff_diff/bacon.py
@@ -32,6 +32,13 @@ class Comparison2x2:
         The timing group used as "treated" in this comparison.
     control_group : Any
         The timing group used as "control" in this comparison.
+        For ``comparison_type="treated_vs_never"``, this is the literal
+        string ``"never_treated"``, which refers to the **post-remap U
+        bucket** (the paper's ``U`` per Goodman-Bacon 2021 footnote 11).
+        On inputs with no remapped always-treated units this is exactly
+        the true never-treated set; with remapping it is the broader U
+        bucket. Check ``BaconDecompositionResults.n_never_treated`` and
+        ``n_always_treated_remapped`` for the precise composition.
     comparison_type : str
         Type of comparison: "treated_vs_never", "earlier_vs_later",
         or "later_vs_earlier".
@@ -94,6 +101,17 @@ class BaconDecompositionResults:
         Number of distinct treatment timing groups.
     n_never_treated : int
         Number of never-treated units.
+    n_always_treated_remapped : int
+        Number of units whose ``first_treat`` was at or before the first
+        observable period (``first_treat <= min(time)``, excluding the
+        never-treated sentinels ``0`` and ``np.inf``) and which were
+        automatically remapped to the ``U`` (untreated) bucket per
+        Goodman-Bacon (2021) footnote 11. Detection uses ordered-time
+        logic so negative or zero-crossing period labels work correctly.
+        Zero on inputs where the user only used the ``first_treat ∈ {0,
+        np.inf}`` sentinels. The user's original ``first_treat`` column
+        is preserved unchanged on the input ``data`` frame; remapping
+        happens in an internal column.
     timing_groups : List[Any]
         List of treatment timing cohorts.
     """
@@ -111,6 +129,9 @@ class BaconDecompositionResults:
     timing_groups: List[Any]
     n_obs: int = 0
     decomposition_error: float = field(default=0.0)
+    # Count of units auto-remapped from 0 < first_treat <= min(time) into the
+    # U bucket per Goodman-Bacon (2021) footnote 11. Always 0 on legacy inputs.
+    n_always_treated_remapped: int = 0
     # Survey design metadata (SurveyMetadata instance from diff_diff.survey)
     survey_metadata: Optional[Any] = field(default=None)
 
@@ -139,8 +160,12 @@ def summary(self) -> str:
             f"{'Treatment timing groups:':<35} {self.n_timing_groups:>10}",
             f"{'Never-treated units:':<35} {self.n_never_treated:>10}",
             f"{'Total 2x2 comparisons:':<35} {len(self.comparisons):>10}",
-            "",
         ]
+        if self.n_always_treated_remapped > 0:
+            lines.append(
+                f"{'Always-treated remapped to U:':<35} " f"{self.n_always_treated_remapped:>10}"
+            )
+        lines.append("")
 
         # Add survey design info
         if self.survey_metadata is not None:
@@ -322,15 +347,19 @@ class BaconDecomposition:
 
     Parameters
     ----------
-    weights : str, default="approximate"
+    weights : str, default="exact"
         Weight calculation method:
 
+        - "exact" (default): Variance-based weights from Goodman-Bacon (2021)
+          Theorem 1, Eqs. 7-9 and 10e-g. Produces the paper-faithful
+          decomposition where the weighted sum matches the TWFE estimate
+          to machine precision. Use for publication-quality work and the
+          standard methodology contract.
         - "approximate": Fast simplified formula using group shares and
-          treatment variance. Good for diagnostic purposes where relative
-          weights are sufficient to identify problematic comparisons.
-        - "exact": Variance-based weights from Goodman-Bacon (2021) Theorem 1.
-          Use for publication-quality decompositions where the weighted sum
-          must closely match the TWFE estimate.
+          treatment variance, with post-hoc sum-to-1 normalization. Opt
+          in for speed-sensitive diagnostic loops where the relative weight
+          structure is sufficient. Approximate-mode results may differ
+          numerically from R ``bacondecomp::bacon()``.
 
     Attributes
     ----------
@@ -394,17 +423,18 @@ class BaconDecomposition:
     TwoWayFixedEffects : The TWFE estimator being decomposed
     """
 
-    def __init__(self, weights: str = "approximate"):
+    def __init__(self, weights: str = "exact"):
         """
         Initialize BaconDecomposition.
 
         Parameters
         ----------
-        weights : str, default="approximate"
+        weights : str, default="exact"
             Weight calculation method:
 
-            - "approximate": Fast simplified formula (default)
-            - "exact": Variance-based weights from Goodman-Bacon (2021)
+            - "exact" (default): Variance-based weights from Goodman-Bacon
+              (2021) Theorem 1 (paper-faithful Eqs. 7-9 and 10e-g).
+            - "approximate": Fast simplified formula. Opt in for speed.
         """
         if weights not in ("approximate", "exact"):
             raise ValueError(f"weights must be 'approximate' or 'exact', got '{weights}'")
@@ -435,8 +465,26 @@ def fit(
         time : str
             Name of time period column.
         first_treat : str
-            Name of column indicating when unit was first treated.
-            Use 0 (or np.inf) for never-treated units.
+            Name of column indicating when unit was first treated. The
+            values ``0`` and ``np.inf`` are **reserved as never-treated
+            sentinels** (not configurable today); a real treatment cohort
+            with ``first_treat == 0`` would be folded into ``U`` and
+            should instead be re-labeled to a non-sentinel value before
+            fitting. Units whose ``first_treat`` is at or before the
+            first observable period (``first_treat <= min(time)``,
+            excluding the never-treated sentinels ``0`` and ``np.inf``)
+            are automatically remapped to the ``U`` (untreated) bucket
+            per Goodman-Bacon (2021) footnote 11, with a
+            ``UserWarning``. Detection uses ordered-time logic on the
+            **time axis** so panels whose ``time`` column contains
+            negative or zero-crossing labels (e.g. event-time
+            ``time ∈ [-2,..,3]``) are handled correctly; the ``0``
+            sentinel restriction applies only to ``first_treat``, not to
+            ``time``. The user's original ``first_treat`` column on
+            ``data`` is preserved unchanged; remapping happens in an
+            internal column. The count of remapped units is exposed on
+            the result as
+            ``BaconDecompositionResults.n_always_treated_remapped``.
         survey_design : SurveyDesign, optional
             Survey design specification for weighted estimation.
             When provided, all means and group shares use survey weights.
@@ -488,6 +536,58 @@ def fit(
         df[time] = pd.to_numeric(df[time])
         df[first_treat] = pd.to_numeric(df[first_treat])
 
+        # Preserve the user-provided column name so we can count TRUE
+        # never-treated units below (post-remap, `first_treat` will be
+        # rebound to the internal column which folds remapped always-treated
+        # into the same `0` sentinel bucket as never-treated).
+        user_first_treat_col = first_treat
+
+        # Always-treated remap (Goodman-Bacon 2021, footnote 11):
+        # The paper convention puts units treated before the first observable
+        # period into the U bucket alongside never-treated units. The library's
+        # prior sentinel-only convention (`first_treat ∈ {0, np.inf}`) is
+        # narrower than the paper's U.
+        #
+        # Detection uses ORDERED-TIME logic on the `time` axis
+        # (`first_treat <= min(time)`), NOT positive-sign restriction, so
+        # panels whose `time` column has negative or zero-crossing labels
+        # (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly.
+        # Sentinel rows (`first_treat ∈ {0, np.inf}`) are excluded from
+        # the remap so the never-treated contract is preserved. NOTE: the
+        # `0` sentinel restriction applies to `first_treat` only, not to
+        # `time`; a real treatment cohort with `first_treat == 0` is not
+        # supported today and would be folded into `U` (re-label such
+        # cohorts to a non-sentinel value before fitting).
+        # Remapping writes to an internal column; the user's `first_treat`
+        # column is preserved unchanged (df = data.copy() above).
+        df["__bacon_first_treat_internal__"] = df[first_treat]
+        min_period = df[time].min()
+        is_U_sentinel = (df["__bacon_first_treat_internal__"] == 0) | (
+            df["__bacon_first_treat_internal__"] == np.inf
+        )
+        always_treated_mask = (~is_U_sentinel) & (
+            df["__bacon_first_treat_internal__"] <= min_period
+        )
+        n_always_treated_remapped = int(df.loc[always_treated_mask, unit].nunique())
+        if n_always_treated_remapped > 0:
+            warnings.warn(
+                f"Detected {n_always_treated_remapped} always-treated units "
+                f"(first_treat <= {min_period}, excluding sentinel values "
+                f"0 and np.inf). Remapping to U bucket per Goodman-Bacon "
+                f"(2021) footnote 11. The original first_treat column is "
+                f"preserved; remapping happens in an internal column. To "
+                f"silence this warning, recode the affected rows' "
+                f"first_treat values to 0 or np.inf in your input data "
+                f"before fitting.",
+                UserWarning,
+                stacklevel=2,
+            )
+            df.loc[always_treated_mask, "__bacon_first_treat_internal__"] = 0
+        # Rebind the local first_treat name to the internal column so all
+        # downstream df[first_treat] reads (and helper-function passthroughs)
+        # see the remapped data without per-site rewrites.
+        first_treat = "__bacon_first_treat_internal__"
+
         # Check for balanced panel
         periods_per_unit = df.groupby(unit)[time].count()
         if periods_per_unit.nunique() > 1:
@@ -502,13 +602,41 @@ def fit(
         time_periods = sorted(df[time].unique())
 
         # Identify never-treated and timing groups
-        # Never-treated: first_treat = 0 or inf
+        # Never-treated: first_treat = 0 or np.inf (library sentinels).
+        # Timing groups: every other value in the (post-remap) internal
+        # column. Do NOT restrict to positive values — negative-coded
+        # event-time cohorts (e.g. first_treat=-1 on a panel with
+        # min(time)=-2) are valid timing groups.
         never_treated_mask = (df[first_treat] == 0) | (df[first_treat] == np.inf)
-        timing_groups = sorted([g for g in df[first_treat].unique() if g > 0 and g != np.inf])
+        timing_groups = sorted([g for g in df[first_treat].unique() if g != 0 and g != np.inf])
+
+        # `n_never_treated` reports TRUE never-treated units, computed from
+        # the original user-provided column BEFORE the remap. Remapped
+        # always-treated units are reported separately via
+        # `n_always_treated_remapped` so the two counts do not double-count.
+        unit_info_user = df.groupby(unit).agg({user_first_treat_col: "first"}).reset_index()
+        n_never_treated = int(
+            (
+                (unit_info_user[user_first_treat_col] == 0)
+                | (unit_info_user[user_first_treat_col] == np.inf)
+            ).sum()
+        )
 
-        # Get unit-level treatment timing
-        unit_info = df.groupby(unit).agg({first_treat: "first"}).reset_index()
-        n_never_treated = ((unit_info[first_treat] == 0) | (unit_info[first_treat] == np.inf)).sum()
+        # `n_units_in_U_bucket` is the POST-remap count used to decide
+        # whether `treated_vs_never` (β̂_{kU}^{2x2}) comparisons should
+        # be generated. It includes BOTH true never-treated AND remapped
+        # always-treated units, since the paper convention puts them in
+        # the same `U` bucket (Goodman-Bacon 2021 footnote 11). Without
+        # this distinction from `n_never_treated`, panels whose U is
+        # composed entirely of remapped always-treated units would
+        # silently drop all β̂_{kU}^{2x2} terms and break the Theorem 1
+        # identity at the loop gate.
+        unit_info_internal = df.groupby(unit).agg({first_treat: "first"}).reset_index()
+        n_units_in_U_bucket = int(
+            (
+                (unit_info_internal[first_treat] == 0) | (unit_info_internal[first_treat] == np.inf)
+            ).sum()
+        )
 
         # Create treatment indicator (D_it = 1 if treated at time t)
         # Use unique internal name to avoid conflicts with user data
@@ -523,8 +651,11 @@ def fit(
         # Perform decomposition
         comparisons = []
 
-        # 1. Treated vs Never-treated comparisons
-        if n_never_treated > 0:
+        # 1. Treated vs Never-treated comparisons.
+        # Gate on the POST-remap U bucket count so panels whose U is
+        # composed entirely of remapped always-treated units still emit
+        # β̂_{kU}^{2x2} terms (paper Goodman-Bacon 2021 footnote 11).
+        if n_units_in_U_bucket > 0:
             for g in timing_groups:
                 comp = self._compute_treated_vs_never(
                     df,
@@ -638,6 +769,7 @@ def fit(
             timing_groups=timing_groups,
             n_obs=len(df),
             decomposition_error=decomp_error,
+            n_always_treated_remapped=n_always_treated_remapped,
             survey_metadata=survey_metadata,
         )
 
@@ -690,117 +822,153 @@ def _recompute_exact_weights(
         weights: Optional[np.ndarray] = None,
     ) -> None:
         """
-        Recompute weights using exact variance-based formula from Theorem 1.
-
-        This modifies comparison weights in-place to use the exact formula
-        from Goodman-Bacon (2021) which accounts for within-group variance
-        of the treatment indicator in each 2x2 comparison window.
-
-        When survey weights are provided, uses weighted unit counts and
-        within-group variance of the treatment indicator.
+        Recompute weights using the exact Theorem 1 formula from
+        Goodman-Bacon (2021).
+
+        Implements Eqs. 7-9 (subsample FE-adjusted treatment-dummy
+        variances) and Eqs. 10e-g (decomposition weights). Only the
+        NUMERATORS of Eqs. 10e-g are written here; the post-hoc
+        sum-to-1 normalization in ``fit()`` handles the ``V̂^D``
+        denominator, which is mathematically equivalent per Theorem 1's
+        identity ``V̂^D = Σ numerators``.
+
+        Notation (per paper §2, pp. 256-258):
+            n_k        sample share of timing group k (fraction of units)
+            n_kU       relative size of group k in pair (k, U): n_k/(n_k+n_U)
+            D̄_k        share of periods group k spends treated: (T-k+1)/T
+
+        Equations:
+            V̂_{kU}^D     = n_{kU}(1-n_{kU}) · D̄_k(1-D̄_k)                 (Eq. 7)
+            V̂_{kℓ}^{D,k} = n_{kℓ}(1-n_{kℓ}) · (D̄_k-D̄_ℓ)/(1-D̄_ℓ)
+                          · (1-D̄_k)/(1-D̄_ℓ)                              (Eq. 8)
+            V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1-n_{kℓ}) · D̄_ℓ/D̄_k
+                          · (D̄_k-D̄_ℓ)/D̄_k                                (Eq. 9)
+
+            s_{kU}    ∝ (n_k+n_U)^2 · V̂_{kU}^D                            (Eq. 10e)
+            s_{kℓ}^k  ∝ ((n_k+n_ℓ)(1-D̄_ℓ))^2 · V̂_{kℓ}^{D,k}                (Eq. 10f)
+            s_{kℓ}^ℓ  ∝ ((n_k+n_ℓ)·D̄_k)^2 · V̂_{kℓ}^{D,ℓ}                   (Eq. 10g)
+
+        When survey weights are provided, sample shares use weighted unit
+        counts (constant-within-unit weights are required and enforced
+        upstream by ``_validate_unit_constant_survey``). The ``D̄_k`` term
+        is panel-share-based and unaffected by sampling weights.
+
+        Modifies ``comparisons[i].weight`` in place. The caller then
+        normalizes to sum to 1.
         """
-        n_total_obs = len(df)
-        w_arr = weights if weights is not None else np.ones(n_total_obs)
-        # Store weights as a column for safe label-based subsetting
-        df = df.copy()
-        df["_sw"] = w_arr
-        w_total = np.sum(w_arr)
-        n_total_units = df[unit].nunique()
-
-        for comp in comparisons:
-            # Get data for this specific comparison
-            if comp.comparison_type == "treated_vs_never":
-                pre_periods = [t for t in time_periods if t < comp.treated_group]
-                post_periods = [t for t in time_periods if t >= comp.treated_group]
-                # Get units in each group
-                units_treated = df[df[first_treat] == comp.treated_group][unit].unique()
-                units_control = df[(df[first_treat] == 0) | (df[first_treat] == np.inf)][
-                    unit
-                ].unique()
-            elif comp.comparison_type == "earlier_vs_later":
-                g_early = comp.treated_group
-                g_late = comp.control_group
-                pre_periods = [t for t in time_periods if t < g_early]
-                post_periods = [t for t in time_periods if g_early <= t < g_late]
-                units_treated = df[df[first_treat] == g_early][unit].unique()
-                units_control = df[df[first_treat] == g_late][unit].unique()
-            else:  # later_vs_earlier
-                g_late = comp.treated_group
-                g_early = comp.control_group
-                pre_periods = [t for t in time_periods if g_early <= t < g_late]
-                post_periods = [t for t in time_periods if t >= g_late]
-                units_treated = df[df[first_treat] == g_late][unit].unique()
-                units_control = df[df[first_treat] == g_early][unit].unique()
-
-            if not pre_periods or not post_periods:
+        # Panel length T (Eq. 7-9 use share-of-periods D̄_k = (T-k+1)/T)
+        T = len(time_periods)
+        if T <= 0:
+            for comp in comparisons:
                 comp.weight = 0.0
-                continue
+            return
 
-            # Subset to the 2x2 comparison sample
-            relevant_periods = set(pre_periods) | set(post_periods)
-            all_units = set(units_treated) | set(units_control)
+        # Per-unit first observation (treatment timing is unit-invariant)
+        df_copy = df.copy()
+        df_copy["_sw"] = weights if weights is not None else np.ones(len(df))
+        unit_first = df_copy.groupby(unit).agg({first_treat: "first", "_sw": "first"})
+        ft_per_unit = unit_first[first_treat]
+        sw_per_unit = unit_first["_sw"]
 
-            df_22 = df[(df[unit].isin(all_units)) & (df[time].isin(relevant_periods))]
+        # Total weighted unit mass (denominator for n_k, n_U)
+        if weights is None:
+            unit_mass_total = float(len(ft_per_unit))
 
-            if len(df_22) == 0:
-                comp.weight = 0.0
-                continue
+            def _mass(mask: pd.Series) -> float:
+                return float(int(mask.sum()))
+
+        else:
+            unit_mass_total = float(sw_per_unit.sum())
 
-            # Count units in this comparison
-            n_k = len(units_treated)
-            n_l = len(units_control)
+            def _mass(mask: pd.Series) -> float:
+                return float(sw_per_unit[mask].sum())
 
-            if n_k == 0 or n_l == 0:
+        if unit_mass_total <= 0:
+            for comp in comparisons:
                 comp.weight = 0.0
-                continue
-
-            # Weighted observation counts for the 2x2 sample
-            w_22 = df_22["_sw"].values
-            w_22_sum = np.sum(w_22)
-
-            # Sample share of this comparison (weighted)
-            sample_share = w_22_sum / w_total
-
-            # Weighted group shares within the 2x2
-            treated_mask_22 = df_22[unit].isin(units_treated)
-            w_k = np.sum(w_22[treated_mask_22.values])
-            n_k_share = w_k / w_22_sum if w_22_sum > 0 else 0.0
-
-            # Create treatment indicator for the 2x2
-            T_pre = len(pre_periods)
-            T_post = len(post_periods)
-            T_window = T_pre + T_post
-
-            # Variance of D within the 2x2 for treated group
-            # D = 0 in pre, D = 1 in post for treated units
-            # D = 0 for all periods for control units in this window
-            D_k = T_post / T_window  # proportion treated for treated group
-
-            # Within-comparison variance of treatment (weighted)
-            # Var(D) = n_k/(n_k+n_l) * D_k * (1-D_k) for the 2x2
-            var_D_22 = n_k_share * D_k * (1 - D_k)
-
-            # Exact weight: proportional to sample share * variance
-            # Scale by weighted unit share to account for subsample
-            # Use survey-weighted unit mass when weights present
-            if weights is not None:
-                # Sum of per-unit weights for treated + control units in this 2x2
-                unit_w_k = (
-                    df_22.loc[treated_mask_22, "_sw"]
-                    .groupby(df_22.loc[treated_mask_22, unit])
-                    .first()
-                    .sum()
-                )
-                unit_w_l = (
-                    df_22.loc[~treated_mask_22, "_sw"]
-                    .groupby(df_22.loc[~treated_mask_22, unit])
-                    .first()
-                    .sum()
+            return
+
+        # U bucket sample share (never-treated + always-treated post-remap;
+        # remap to 0/inf already happened in fit() via __bacon_first_treat_internal__).
+        is_U = (ft_per_unit == 0) | (ft_per_unit == np.inf)
+        n_U = _mass(is_U) / unit_mass_total
+
+        # Per-cohort sample shares n_g and panel-share D̄_g.
+        # Timing groups: exclude only the U sentinels (0, np.inf). Negative
+        # event-time cohorts (e.g. first_treat=-1 on a panel with min(time)=-2)
+        # are valid timing groups.
+        timing_groups = sorted(g for g in ft_per_unit.unique() if g != 0 and g != np.inf)
+        n_g: Dict[Any, float] = {
+            g: _mass(ft_per_unit == g) / unit_mass_total for g in timing_groups
+        }
+        # D̄_g = share of periods group g spends treated.
+        # For absorbing treatment with first-treatment time g over panel
+        # periods [1, T], the treated periods are [g, T], i.e. T - g + 1
+        # periods. We use the actual period values from time_periods to
+        # handle panels indexed from 0 / arbitrary start.
+        t_arr = np.asarray(sorted(time_periods))
+        D_bar: Dict[Any, float] = {g: float(np.sum(t_arr >= g)) / T for g in timing_groups}
+
+        for comp in comparisons:
+            if comp.comparison_type == "treated_vs_never":
+                k = comp.treated_group
+                n_k = n_g.get(k, 0.0)
+                if n_k <= 0 or n_U <= 0:
+                    comp.weight = 0.0
+                    continue
+                n_kU = n_k / (n_k + n_U)
+                D_k = D_bar.get(k, 0.0)
+                # Eq. 7
+                V_kU = n_kU * (1.0 - n_kU) * D_k * (1.0 - D_k)
+                # Eq. 10e numerator
+                comp.weight = (n_k + n_U) ** 2 * V_kU
+            elif comp.comparison_type == "earlier_vs_later":
+                # k = early (treated in 2x2), ℓ = late (control during MID)
+                k = comp.treated_group
+                ell = comp.control_group
+                n_k = n_g.get(k, 0.0)
+                n_ell = n_g.get(ell, 0.0)
+                if n_k <= 0 or n_ell <= 0:
+                    comp.weight = 0.0
+                    continue
+                D_k = D_bar.get(k, 0.0)
+                D_ell = D_bar.get(ell, 0.0)
+                # Eq. 8 requires D̄_ℓ < 1 (denominators 1 - D̄_ℓ)
+                if D_ell >= 1.0:
+                    comp.weight = 0.0
+                    continue
+                n_kl = n_k / (n_k + n_ell)
+                # Eq. 8
+                V_kl_k = (
+                    n_kl
+                    * (1.0 - n_kl)
+                    * (D_k - D_ell)
+                    / (1.0 - D_ell)
+                    * (1.0 - D_k)
+                    / (1.0 - D_ell)
                 )
-                unit_share = (unit_w_k + unit_w_l) / w_total
-            else:
-                unit_share = (n_k + n_l) / n_total_units
-            comp.weight = sample_share * var_D_22 * unit_share
+                # Eq. 10f numerator
+                comp.weight = ((n_k + n_ell) * (1.0 - D_ell)) ** 2 * V_kl_k
+            else:  # later_vs_earlier
+                # ℓ = late (treated in 2x2), k = early (already-treated control)
+                ell = comp.treated_group
+                k = comp.control_group
+                n_k = n_g.get(k, 0.0)
+                n_ell = n_g.get(ell, 0.0)
+                if n_k <= 0 or n_ell <= 0:
+                    comp.weight = 0.0
+                    continue
+                D_k = D_bar.get(k, 0.0)
+                D_ell = D_bar.get(ell, 0.0)
+                # Eq. 9 requires D̄_k > 0 (denominators D̄_k)
+                if D_k <= 0.0:
+                    comp.weight = 0.0
+                    continue
+                n_kl = n_k / (n_k + n_ell)
+                # Eq. 9
+                V_kl_l = n_kl * (1.0 - n_kl) * D_ell / D_k * (D_k - D_ell) / D_k
+                # Eq. 10g numerator
+                comp.weight = ((n_k + n_ell) * D_k) ** 2 * V_kl_l
 
     def _compute_treated_vs_never(
         self,
@@ -1061,7 +1229,7 @@ def bacon_decompose(
     unit: str,
     time: str,
     first_treat: str,
-    weights: str = "approximate",
+    weights: str = "exact",
     survey_design: object = None,
 ) -> BaconDecompositionResults:
     """
@@ -1082,15 +1250,42 @@ def bacon_decompose(
     time : str
         Name of time period column.
     first_treat : str
-        Name of column indicating when unit was first treated.
-        Use 0 (or np.inf) for never-treated units.
-    weights : str, default="approximate"
+        Name of column indicating when unit was first treated. The
+        values ``0`` and ``np.inf`` are **reserved as never-treated
+        sentinels**; a real treatment cohort with ``first_treat == 0``
+        would be folded into ``U`` and should be re-labeled to a
+        non-sentinel value before fitting. Units whose ``first_treat``
+        is at or before the first observable period
+        (``first_treat <= min(time)``, excluding the sentinels) are
+        automatically remapped to the ``U`` (untreated) bucket per
+        Goodman-Bacon (2021) footnote 11, with a ``UserWarning``. See
+        ``BaconDecomposition.fit()`` for the full contract and
+        ``BaconDecompositionResults.n_always_treated_remapped`` for the
+        count. The user's original ``first_treat`` column is preserved
+        unchanged.
+    weights : str, default="exact"
         Weight calculation method:
 
-        - "approximate": Fast simplified formula (default). Good for
-          diagnostic purposes where relative weights are sufficient.
-        - "exact": Variance-based weights from Goodman-Bacon (2021)
-          Theorem 1. Use for publication-quality decompositions.
+        - "exact" (default): Variance-based weights from Goodman-Bacon
+          (2021) Theorem 1, Eqs. 7-9 and 10e-g. Paper-faithful.
+        - "approximate": Fast simplified formula. Opt in for
+          speed-sensitive diagnostic loops; numerical output may differ
+          from R ``bacondecomp::bacon()``.
+    survey_design : SurveyDesign, optional
+        Survey design specification for weighted estimation. When provided,
+        cell means, group shares, and within-transform use survey weights.
+        The decomposition remains diagnostic (no survey vcov needed).
+
+        **Default-flip caveat (PR-B, 2026-05-16):** the new
+        ``weights="exact"`` default routes through
+        ``_validate_unit_constant_survey``, which **rejects survey
+        designs whose weights / strata / PSU / FPC columns vary within
+        a unit across periods** (the exact path collapses to per-unit
+        aggregation via ``groupby().first()``). Users whose survey
+        design has time-varying within-unit columns must either (a)
+        collapse the columns to be unit-constant or (b) pass explicit
+        ``weights="approximate"`` to retain the legacy observation-level
+        weighted-means path.
 
     Returns
     -------
@@ -1106,7 +1301,10 @@ def bacon_decompose(
     --------
     >>> from diff_diff import bacon_decompose
     >>>
-    >>> # Quick diagnostic (default)
+    >>> # Default: paper-faithful Goodman-Bacon (2021) Theorem 1 weights
+    >>> # (weights="exact"); intended to match R bacondecomp::bacon() at
+    >>> # atol=1e-6 (R parity goldens pending — see TODO.md "R parity
+    >>> # goldens generation" for the deferred validation step).
     >>> results = bacon_decompose(
     ...     data=panel_df,
     ...     outcome='earnings',
@@ -1115,14 +1313,15 @@ def bacon_decompose(
     ...     first_treat='treatment_year'
     ... )
     >>>
-    >>> # Publication-quality exact decomposition
-    >>> results = bacon_decompose(
+    >>> # Opt-in: simplified-variance fast path for diagnostic loops
+    >>> # (numerical output may differ from R; sum-to-1 still holds).
+    >>> results_approx = bacon_decompose(
     ...     data=panel_df,
     ...     outcome='earnings',
     ...     unit='state',
     ...     time='year',
     ...     first_treat='treatment_year',
-    ...     weights='exact'
+    ...     weights='approximate'
     ... )
     >>>
     >>> # View summary
diff --git a/diff_diff/diagnostic_report.py b/diff_diff/diagnostic_report.py
index f9a21940..e68f1c0f 100644
--- a/diff_diff/diagnostic_report.py
+++ b/diff_diff/diagnostic_report.py
@@ -1743,8 +1743,62 @@ def _check_bacon(self) -> Dict[str, Any]:
                 unit=unit,
                 time=time,
                 first_treat=first_treat,
+                weights="exact",  # paper-faithful Eqs. 7-9 / 10e-g
                 survey_design=self._survey_design,
             )
+        except ValueError as exc:
+            # PR #454 R1 P1: the exact-mode survey path enforces
+            # within-unit constancy of survey columns via
+            # ``_validate_unit_constant_survey``. Survey-backed reports on
+            # panels with time-varying within-unit weights / strata / PSU
+            # / FPC fail that check. The library still supports those
+            # panels via explicit ``bacon_decompose(weights="approximate", ...)``,
+            # so emit a structured skip pointing users at the
+            # ``precomputed={'bacon': ...}`` escape hatch rather than an
+            # opaque ``error`` block.
+            if "varies within units" in str(exc):
+                return {
+                    "status": "skipped",
+                    "reason": (
+                        "Survey design has within-unit-varying columns "
+                        "(weights / strata / PSU / FPC), which the "
+                        'paper-faithful ``weights="exact"`` Bacon path '
+                        "rejects. Run ``bacon_decompose(data, ..., "
+                        'weights="approximate", survey_design=design)`` '
+                        "yourself and pass via "
+                        "``DiagnosticReport(..., precomputed={'bacon': result})`` "
+                        f"to populate this section. Validator detail: {exc}"
+                    ),
+                }
+            return {
+                "status": "error",
+                "reason": f"bacon_decompose raised {type(exc).__name__}: {exc}",
+            }
+        except NotImplementedError as exc:
+            # PR #454 R4 P3: ``BaconDecomposition.fit()`` raises
+            # ``NotImplementedError`` for replicate-weight survey designs
+            # (bacon.py rejects them because Bacon is a diagnostic that
+            # does not compute replicate-based variance). Match the
+            # within-unit-varying skip pattern above: emit a structured
+            # skip naming the ``precomputed={'bacon': ...}`` escape hatch
+            # so survey-backed reports with replicate-weight designs
+            # produce an actionable skip rather than an opaque
+            # ``status="error"`` block.
+            return {
+                "status": "skipped",
+                "reason": (
+                    "Survey design uses replicate weights, which the "
+                    "Bacon decomposition does not support (bacon is a "
+                    "diagnostic and does not compute replicate-based "
+                    "variance). To populate this section, run "
+                    "``bacon_decompose(data, ..., "
+                    "survey_design=SurveyDesign(weights=..., strata=..., "
+                    "psu=..., fpc=...))`` with a TSL-based design and "
+                    "pass via "
+                    "``DiagnosticReport(..., precomputed={'bacon': result})``. "
+                    f"Rejection detail: {exc}"
+                ),
+            }
         except Exception as exc:  # noqa: BLE001
             return {
                 "status": "error",
diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py
index fefe8306..9f52d3dc 100644
--- a/diff_diff/twfe.py
+++ b/diff_diff/twfe.py
@@ -681,7 +681,7 @@ def decompose(
         unit: str,
         time: str,
         first_treat: str,
-        weights: str = "approximate",
+        weights: str = "exact",
     ) -> "BaconDecompositionResults":
         """
         Perform Goodman-Bacon decomposition of TWFE estimate.
@@ -702,14 +702,28 @@ def decompose(
             Name of time period column.
         first_treat : str
             Name of column indicating when each unit was first treated.
-            Use 0 (or np.inf) for never-treated units.
-        weights : str, default="approximate"
+            The values ``0`` and ``np.inf`` are **reserved as
+            never-treated sentinels**; a real treatment cohort with
+            ``first_treat == 0`` would be folded into ``U`` and should
+            be re-labeled to a non-sentinel value before fitting. Units
+            whose ``first_treat`` is at or before the first observable
+            period (``first_treat <= min(time)``, excluding the
+            sentinels) are automatically remapped to the ``U``
+            (untreated) bucket per Goodman-Bacon (2021) footnote 11
+            with a ``UserWarning``. See ``BaconDecomposition.fit()`` for
+            the full contract and
+            ``BaconDecompositionResults.n_always_treated_remapped`` for
+            the count. The user's original ``first_treat`` column is
+            preserved unchanged.
+        weights : str, default="exact"
             Weight calculation method:
 
-            - "approximate": Fast simplified formula (default). Good for
-              diagnostic purposes where relative weights are sufficient.
-            - "exact": Variance-based weights from Goodman-Bacon (2021)
-              Theorem 1. Use for publication-quality decompositions.
+            - "exact" (default): Variance-based weights from Goodman-Bacon
+              (2021) Theorem 1, Eqs. 7-9 and 10e-g. Paper-faithful and
+              the standard methodology contract.
+            - "approximate": Fast simplified formula. Opt in for
+              speed-sensitive diagnostic loops; numerical output may
+              differ from R ``bacondecomp::bacon()``.
 
         Returns
         -------
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index b6dfdd42..d8bc1d0e 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -2598,61 +2598,82 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 
 ## BaconDecomposition
 
-**Primary source:** [Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*, 225(2), 254-277.](https://doi.org/10.1016/j.jeconom.2021.03.014)
+**Primary source:** [Goodman-Bacon, A. (2021). Difference-in-differences with variation in treatment timing. *Journal of Econometrics*, 225(2), 254-277.](https://doi.org/10.1016/j.jeconom.2021.03.014). Paper review on file: `docs/methodology/papers/goodman-bacon-2021-review.md`.
+
+**Scope:** Decomposes the two-way fixed-effects DD (TWFEDD) estimator in Equation (2) when treatment timing varies across units. Theorem 1 expresses `β̂^DD` as a **positively-weighted average of all possible 2x2 DD estimators** in the data, with weights summing to 1. The decomposition is a **diagnostic tool**, not a treatment-effect estimator: it explains *which comparisons drive* the TWFEDD coefficient and *why* the estimator can fail to identify an interpretable causal parameter when treatment effects vary over time.
 
 **Key implementation requirements:**
 
 *Assumption checks / warnings:*
 - Requires variation in treatment timing (staggered adoption)
-- Warns if only one treatment cohort (decomposition not meaningful)
-- Uses never-treated units as controls when present; falls back to timing-only comparisons otherwise
+- Always-treated units (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`; paper footnote 11) are automatically remapped to the `U` (untreated) bucket with a `UserWarning`; see the `**Note (always-treated remap)**` below for the full ordered-time / sentinel contract
+- Unbalanced panels are accepted with a `UserWarning`; the paper's Appendix A proof assumes balanced panels
+- Falls back to timing-only comparisons when no never-treated units are present (no untreated group → `s_{kU}` terms drop, weights rescale to sum to 1; **VWCT and ΔATT can still bias the result** — see paper Eqs. 14-15)
 
-*Estimator equation (as implemented):*
+*Estimator equation (Theorem 1, Equation 10a):*
 
-TWFE decomposes as:
 ```
-τ̂^TWFE = Σ_k s_k × τ̂_k
+β̂^DD = Σ_{k ≠ U} s_{kU} · β̂_{kU}^{2x2}
+      + Σ_{k ≠ U} Σ_{ℓ > k} [ s_{kℓ}^k · β̂_{kℓ}^{2x2,k} + s_{kℓ}^ℓ · β̂_{kℓ}^{2x2,ℓ} ]
 ```
-where k indexes 2×2 comparisons and s_k are Bacon weights.
 
-Three comparison types:
-1. **Treated vs. Never-treated** (if never-treated exist):
-   ```
-   τ̂_{T,U} = (Ȳ_{T,post} - Ȳ_{T,pre}) - (Ȳ_{U,post} - Ȳ_{U,pre})
-   ```
+The three 2x2 estimators (Eqs. 10b-d):
+```
+β̂_{kU}^{2x2}   = (ȳ_k^POST(k)   - ȳ_k^PRE(k))    - (ȳ_U^POST(k)   - ȳ_U^PRE(k))         (Eq. 10b)
+β̂_{kℓ}^{2x2,k} = (ȳ_k^MID(k,ℓ) - ȳ_k^PRE(k))    - (ȳ_ℓ^MID(k,ℓ) - ȳ_ℓ^PRE(k))         (Eq. 10c)
+β̂_{kℓ}^{2x2,ℓ} = (ȳ_ℓ^POST(ℓ)  - ȳ_ℓ^MID(k,ℓ)) - (ȳ_k^POST(ℓ)  - ȳ_k^MID(k,ℓ))        (Eq. 10d)
+```
 
-2. **Earlier vs. Later-treated** (Earlier as treated, Later as control pre-treatment):
-   ```
-   τ̂_{k,l} = DiD using early-treated as treatment, late-treated as control
-   ```
+Comparison-type labels in `BaconDecompositionResults.comparisons`:
+- `"treated_vs_never"` ↔ Eq. 10b
+- `"earlier_vs_later"` ↔ Eq. 10c (k = early = treated; ℓ = late = control during MID)
+- `"later_vs_earlier"` ↔ Eq. 10d (ℓ = late = treated; k = early = already-treated control)
 
-3. **Later vs. Earlier-treated** (problematic: uses post-treatment outcomes as control):
-   ```
-   τ̂_{l,k} = DiD using late-treated as treatment, early-treated (post) as control
-   ```
+*Weight construction (Eqs. 7-9 for variances, 10e-g for weights):*
+
+Fixed-effects-adjusted treatment-dummy variances:
+```
+V̂_{kU}^D     = n_{kU}(1 - n_{kU}) · D̄_k(1 - D̄_k)                              (Eq. 7)
+V̂_{kℓ}^{D,k} = n_{kℓ}(1 - n_{kℓ}) · (D̄_k - D̄_ℓ)/(1 - D̄_ℓ) · (1 - D̄_k)/(1 - D̄_ℓ)   (Eq. 8)
+V̂_{kℓ}^{D,ℓ} = n_{kℓ}(1 - n_{kℓ}) · D̄_ℓ/D̄_k · (D̄_k - D̄_ℓ)/D̄_k                  (Eq. 9)
+```
+
+Decomposition weights:
+```
+s_{kU}   = ((n_k + n_U)^2 · V̂_{kU}^D)        / V̂^D                                  (Eq. 10e)
+s_{kℓ}^k = ((n_k + n_ℓ)(1 - D̄_ℓ))^2 · V̂_{kℓ}^{D,k} / V̂^D                            (Eq. 10f)
+s_{kℓ}^ℓ = ((n_k + n_ℓ) · D̄_k)^2 · V̂_{kℓ}^{D,ℓ}    / V̂^D                            (Eq. 10g)
+```
 
-Weights depend on group sizes and variance in treatment timing.
+Where `n_k` is the sample share of timing group `k`, `n_{kℓ} = n_k / (n_k + n_ℓ)`, and `D̄_k = (T - k + 1)/T` is the share of periods group `k` spends treated. Weights are **strictly positive** and sum to 1 (Theorem 1).
 
 *Standard errors:*
-- Not typically computed (decomposition is exact)
-- Individual 2×2 estimates can have SEs
+- The decomposition is a deterministic algebraic identity; standard errors are not the paper's focus. The point estimates and weights are exact given the data, and on **balanced** panels `β̂^DD` from the decomposition matches the OLS coefficient from TWFEDD to machine precision under `weights="exact"`. The machine-precision claim does **not** extend to unbalanced panels (see edge cases below).
+- Inference for the TWFEDD coefficient itself is typically cluster-robust at the unit level.
 
 *Edge cases:*
 - Continuous treatment: not supported, requires binary
-- Weights may be negative for later-vs-earlier comparisons
-- Single treatment time: no decomposition possible
+- Single treatment time: K=1 with U is valid (only `treated_vs_never` terms); K=1 without U has no decomposition (zero variation in `D`).
+- `D̄_k → 0` or `D̄_k → 1`: subsample treatment variance goes to zero, that timing group contributes zero weight
+- Always-treated units: see `**Note (always-treated remap)**` below
 
 **Reference implementation(s):**
-- R: `bacondecomp::bacon()`
-- Stata: `bacondecomp`
+- R: `bacondecomp::bacon()` (CRAN). Parity script at `benchmarks/R/generate_bacon_golden.R`; goldens pending follow-up R install (see TODO.md).
+- Stata: `bacondecomp` (SSC). Authors: Goodman-Bacon, Goldring, Nichols (2019).
 
 **Requirements checklist:**
-- [ ] Three comparison types: treated_vs_never, earlier_vs_later, later_vs_earlier
-- [ ] Weights sum to approximately 1 (numerical precision)
-- [ ] TWFE coefficient ≈ weighted sum of 2×2 estimates
-- [ ] Visualization shows weight vs. estimate by comparison type
+- [x] Three comparison types: treated_vs_never, earlier_vs_later, later_vs_earlier
+- [x] Weights sum to 1 (machine precision under `weights="exact"` on **balanced** panels; see PR-B audit)
+- [x] TWFE coefficient = weighted sum of 2×2 estimates (machine precision under `weights="exact"` on **balanced** panels)
+- [x] Visualization shows weight vs. estimate by comparison type
+- [x] Always-treated remap to U per Goodman-Bacon (2021) footnote 11 (PR-B audit)
+- [x] Hand-calculable Theorem 1 verification: `tests/test_methodology_bacon.py::TestBaconHandCalculation` (7 tests, atol=1e-10)
+- [ ] R `bacondecomp::bacon()` parity at atol=1e-6 (R generator script committed; JSON goldens pending follow-up R install — `tests/test_methodology_bacon.py::TestBaconParityR` skips when missing)
 - [x] Survey design support (Phase 3): weighted cell means, weighted within-transform, weighted group shares
-- **Note:** Bacon decomposition with survey weights is diagnostic; exact-sum guarantee is approximate; `weights="exact"` requires within-unit-constant survey columns (approximate path accepts time-varying weights)
+- **Note (weight modes):** `weights="exact"` (default, paper-faithful Eqs. 7-9 + 10e-g) vs `weights="approximate"` (simplified variance, opt-in for speed-sensitive diagnostic loops). The PR-A paper review (#451) and PR-B audit established `"exact"` as the default with the **intent** to match R `bacondecomp::bacon()` and the paper's Theorem 1 contract; R parity is validated by hand-calculation (atol=1e-10) and TWFE-vs-weighted-sum identity (atol=1e-10) but the direct R bit-by-bit parity at atol=1e-6 is still pending the R `bacondecomp` install — see Test Coverage checklist above. The approximate path is retained for backward compatibility; numerical output may differ from R.
+- **Note (always-treated remap):** Units whose `first_treat` is at or before the first observable period (`first_treat <= min(time)`, excluding the never-treated sentinels `0` and `np.inf`) are automatically remapped to the `U` bucket via an internal column (`__bacon_first_treat_internal__`) with a `UserWarning` — per paper footnote 11. Detection uses ordered-time logic on the **time axis**, so panels whose `time` column has negative or zero-crossing labels (e.g. event-time `time ∈ [-2,..,3]`) are handled correctly: a cohort at `first_treat=-1` on such a panel is a valid timing group; a cohort at `first_treat=-3` is remapped to U. The user's original `first_treat` column on the input `data` frame is preserved unchanged. The count of remapped units is surfaced via `BaconDecompositionResults.n_always_treated_remapped`. **Sentinel restriction:** `first_treat ∈ {0, np.inf}` is reserved as the never-treated marker and is not configurable today; a real treatment cohort with `first_treat == 0` would be folded into `U` and should be re-labeled to a non-sentinel value before fitting. The `0` reservation applies to `first_treat` only, not to `time`.
+- **Note (Bacon survey diagnostic):** Bacon decomposition with survey weights is diagnostic; exact-sum guarantee holds at machine precision under `weights="exact"` **on balanced panels**. `weights="exact"` requires within-unit-constant survey columns (approximate path accepts time-varying weights).
+- **Deviation (unbalanced-panel library extension):** Unbalanced panels are accepted with a `UserWarning` ("Unbalanced panel detected. Bacon decomposition assumes balanced panels. Results may be inaccurate."). Goodman-Bacon (2021) Appendix A's proof assumes a balanced panel; under unbalance, the Theorem 1 identity holds only approximately. The decomposition still returns finite, well-defined outputs but `weights="exact"` does NOT achieve the machine-precision algebraic identity that the balanced-panel claims above describe.
 
 ---
 
diff --git a/tests/test_bacon.py b/tests/test_bacon.py
index 349f66d8..cce72d3d 100644
--- a/tests/test_bacon.py
+++ b/tests/test_bacon.py
@@ -119,7 +119,11 @@ def test_weights_sum_to_one(self):
         assert abs(total_weight - 1.0) < 0.01, f"Weights sum to {total_weight}, not 1.0"
 
     def test_weighted_sum_equals_twfe(self):
-        """Test that weighted sum of 2x2 estimates equals TWFE."""
+        """Test that weighted sum of 2x2 estimates equals TWFE.
+
+        Under the default ``weights="exact"`` mode (Goodman-Bacon 2021
+        Theorem 1, Eqs. 10a-g), the identity holds to machine precision.
+        """
         data = generate_staggered_data(seed=456)
 
         results = bacon_decompose(
@@ -127,14 +131,15 @@ def test_weighted_sum_equals_twfe(self):
             outcome='outcome',
             unit='unit',
             time='time',
-            first_treat='first_treat'
+            first_treat='first_treat',
+            weights='exact',
         )
 
         weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
 
-        # Allow for small numerical error
-        assert abs(results.twfe_estimate - weighted_sum) < 0.1, (
-            f"TWFE ({results.twfe_estimate:.4f}) != weighted sum ({weighted_sum:.4f})"
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10, (
+            f"TWFE ({results.twfe_estimate:.10f}) != weighted sum "
+            f"({weighted_sum:.10f}) under exact mode"
         )
 
     def test_comparison_types(self):
@@ -422,12 +427,18 @@ def test_plot_bacon_invalid_type(self):
 class TestWeightsParameter:
     """Tests for configurable weights parameter."""
 
-    def test_approximate_weights_default(self):
-        """Test that approximate weights are used by default."""
+    def test_exact_weights_default(self):
+        """Test that exact weights are used by default.
+
+        Goodman-Bacon (2021) Theorem 1 is the paper-faithful default
+        (Eqs. 10a-g). Prior releases defaulted to ``"approximate"``;
+        the default flipped to ``"exact"`` in PR-B (2026-05-16) so the
+        diagnostic surface matches R ``bacondecomp::bacon()``.
+        """
         data = generate_staggered_data(seed=789)
 
         decomp = BaconDecomposition()
-        assert decomp.weights == "approximate"
+        assert decomp.weights == "exact"
 
         results = decomp.fit(
             data,
@@ -439,6 +450,33 @@ def test_approximate_weights_default(self):
 
         # Weights should sum to 1
         total_weight = sum(c.weight for c in results.comparisons)
+        assert abs(total_weight - 1.0) < 1e-10
+
+    def test_approximate_weights_opt_in(self):
+        """Test the legacy approximate-weights path still works.
+
+        ``weights="approximate"`` is retained for backward compatibility
+        and speed-sensitive diagnostic loops. Numerical output may differ
+        from R ``bacondecomp::bacon()`` (which implements the exact
+        Eqs. 7-9). The sum-to-1 contract is preserved via post-hoc
+        normalization.
+        """
+        data = generate_staggered_data(seed=789)
+
+        decomp = BaconDecomposition(weights="approximate")
+        assert decomp.weights == "approximate"
+
+        results = decomp.fit(
+            data,
+            outcome='outcome',
+            unit='unit',
+            time='time',
+            first_treat='first_treat'
+        )
+
+        # Sum-to-1 contract preserved via normalization, but tolerance is
+        # looser than exact-mode (legacy approximate floor).
+        total_weight = sum(c.weight for c in results.comparisons)
         assert abs(total_weight - 1.0) < 0.01
 
     def test_exact_weights(self):
diff --git a/tests/test_methodology_bacon.py b/tests/test_methodology_bacon.py
new file mode 100644
index 00000000..f4e43745
--- /dev/null
+++ b/tests/test_methodology_bacon.py
@@ -0,0 +1,1141 @@
+"""
+Methodology verification tests for BaconDecomposition.
+
+Targets the Goodman-Bacon (2021) decomposition theorem:
+
+    β̂^DD = Σ_{k≠U} s_{kU} · β̂_{kU}^{2x2}
+         + Σ_{k≠U} Σ_{ℓ>k} [s_{kℓ}^k · β̂_{kℓ}^{2x2,k} + s_{kℓ}^ℓ · β̂_{kℓ}^{2x2,ℓ}]   (Eq. 10a)
+
+with weights from Eqs. 7-9 + 10e-g. See:
+- ``docs/methodology/papers/goodman-bacon-2021-review.md``  paper review
+- ``docs/methodology/REGISTRY.md`` ``## BaconDecomposition`` block
+- ``METHODOLOGY_REVIEW.md`` ``BaconDecomposition`` section
+
+Test class breakdown:
+- ``TestBaconHandCalculation`` — hand-calculable balanced panel; sum-to-1,
+  TWFE-vs-weighted-sum identity, per-Eq variance hand-checks at atol=1e-10.
+- ``TestBaconParityR`` — R parity at atol=1e-6 against the committed
+  ``benchmarks/data/r_bacondecomp_golden.json`` (skip if missing).
+- ``TestBaconAlwaysTreatedRemap`` — warn+remap of first_treat <= min(time)
+  (excluding never-treated sentinels 0 and np.inf) to U; user's first_treat
+  column preserved unchanged.
+- ``TestBaconEdgeCases`` — no-untreated, single-cohort, boundary D̄_k,
+  unbalanced panel, constant-ATT recovery.
+- ``TestBaconWeightModes`` — exact-is-default, approximate-opt-in,
+  exact-vs-approximate differ meaningfully.
+- ``TestBaconSurveyDesignNarrowing`` — survey_design composes cleanly
+  with the remap and the new exact-mode default.
+"""
+
+from __future__ import annotations
+
+import json
+import warnings
+from pathlib import Path
+from typing import List, Tuple
+
+import numpy as np
+import pandas as pd
+import pytest
+
+from diff_diff import (
+    BaconDecomposition,
+    bacon_decompose,
+)
+
+# ---------------------------------------------------------------------------
+# Hand-calculable DGP
+# ---------------------------------------------------------------------------
+#
+# Balanced panel with two timing groups + never-treated, T = 4 periods,
+# 3 units per group. The treatment shares are:
+#
+#   D̄_2 = (T - 2 + 1) / T = 3/4 = 0.75
+#   D̄_3 = (T - 3 + 1) / T = 2/4 = 0.50
+#   D̄_U = 0
+#
+# Sample shares (n_k = units in group / total units):
+#
+#   n_2 = 3/9 = 1/3
+#   n_3 = 3/9 = 1/3
+#   n_U = 3/9 = 1/3
+#
+# Theorem 1 weights (numerators of Eqs. 10e-g, before V̂^D normalization):
+#
+#   s_{2U}    ∝ (n_2 + n_U)^2 · V̂_{2U}^D    = (2/3)^2 · (1/2)(1/2)(3/4)(1/4)  = 0.020833...
+#   s_{3U}    ∝ (n_3 + n_U)^2 · V̂_{3U}^D    = (2/3)^2 · (1/2)(1/2)(1/2)(1/2)  = 0.027778...
+#   s_{23}^k  ∝ ((n_2+n_3)(1-D̄_3))^2 · V̂^{D,k}_{23}  = ((2/3)(1/2))^2 · 0.0625 = 0.006944...
+#   s_{23}^ℓ  ∝ ((n_2+n_3)·D̄_2)^2 · V̂^{D,ℓ}_{23}    = ((2/3)(3/4))^2 · 0.05555 = 0.013889...
+#
+# V̂^D = Σ numerators = 0.069444... → normalized weights:
+#
+#   s_{2U}   = 0.3
+#   s_{3U}   = 0.4
+#   s_{23}^k = 0.1
+#   s_{23}^ℓ = 0.2
+#
+# With constant ATT=5 across cohorts and periods (no noise), every 2x2 DD
+# equals 5, so β̂^DD = 5 × (0.3 + 0.4 + 0.1 + 0.2) = 5 exactly.
+
+
+def _hand_calc_panel(true_effect: float = 5.0) -> pd.DataFrame:
+    """Build the hand-calculable 2-cohort + U panel described above.
+
+    No noise. Outcome y = unit_fe + 0.5*time + true_effect*D, with unit
+    fixed effects spaced so each unit's intercept is distinct.
+    """
+    rows: List[Tuple] = []
+    uid = 1
+    # Group 2 (first_treat=2): 3 units, 4 periods each
+    for _ in range(3):
+        for t in range(1, 5):
+            D = 1 if t >= 2 else 0
+            y = float(uid) * 10.0 + 0.5 * t + true_effect * D
+            rows.append((uid, t, y, 2))
+        uid += 1
+    # Group 3 (first_treat=3): 3 units
+    for _ in range(3):
+        for t in range(1, 5):
+            D = 1 if t >= 3 else 0
+            y = float(uid) * 10.0 + 0.5 * t + true_effect * D
+            rows.append((uid, t, y, 3))
+        uid += 1
+    # Never-treated U (first_treat=0): 3 units
+    for _ in range(3):
+        for t in range(1, 5):
+            y = float(uid) * 10.0 + 0.5 * t
+            rows.append((uid, t, y, 0))
+        uid += 1
+    return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+
+
+def _staggered_data(seed: int = 42) -> pd.DataFrame:
+    """Larger DGP for sum-to-1 / TWFE-equals-sum checks beyond the toy case."""
+    rng = np.random.default_rng(seed)
+    n_units = 60
+    n_periods = 8
+    cohort_periods = np.array([3, 5, 7])
+    # 20% never-treated, evenly split across cohorts otherwise
+    cohort_assign = rng.choice(
+        np.concatenate([[0], cohort_periods]),
+        size=n_units,
+        p=[0.2, 0.27, 0.27, 0.26],
+    )
+    rows = []
+    for u in range(n_units):
+        ft = cohort_assign[u]
+        unit_fe = rng.standard_normal() * 2.0
+        for t in range(1, n_periods + 1):
+            D = int(ft > 0 and t >= ft)
+            y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.5
+            rows.append((u + 1, t, y, int(ft)))
+    return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+
+
+# ---------------------------------------------------------------------------
+# 1. Hand-calculable Theorem 1 verification
+# ---------------------------------------------------------------------------
+
+
+class TestBaconHandCalculation:
+    """Theorem 1 / Eqs. 7-9 + 10e-g verified on a minimal balanced panel.
+
+    The hand-calculable DGP at module scope was hand-derived to produce
+    normalized weights ``{0.3, 0.4, 0.1, 0.2}`` and TWFE = ATT = 5 exactly
+    (constant treatment effect, no noise). Tolerances are tight (1e-10)
+    because the identity is purely algebraic.
+    """
+
+    def test_weights_sum_to_one(self) -> None:
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 1e-10
+
+    def test_twfe_equals_weighted_sum(self) -> None:
+        """Theorem 1's algebraic identity at machine precision."""
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10
+        # Also check the dataclass-cached value.
+        assert results.decomposition_error < 1e-10
+
+    def test_three_comparison_types_present(self) -> None:
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        types = {c.comparison_type for c in results.comparisons}
+        assert "treated_vs_never" in types
+        assert "earlier_vs_later" in types
+        assert "later_vs_earlier" in types
+
+    def test_eq_10b_treated_vs_never_value(self) -> None:
+        """β̂_{2U}^{2x2} = 5 on the hand-calc panel (constant ATT, no noise)."""
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        treated_vs_never = [
+            c for c in results.comparisons if c.comparison_type == "treated_vs_never"
+        ]
+        # All treated_vs_never comparisons should yield ATT = 5
+        for comp in treated_vs_never:
+            assert (
+                abs(comp.estimate - 5.0) < 1e-10
+            ), f"{comp.treated_group} vs U: estimate={comp.estimate}, expected 5.0"
+
+    def test_eq_7_treated_untreated_variance(self) -> None:
+        """V̂_{kU}^D = n_kU(1-n_kU) · D̄_k(1-D̄_k) — verify via weight-share decomp.
+
+        s_{2U} / s_{3U} = (n_2+n_U)^2 · V̂_{2U}^D / ((n_3+n_U)^2 · V̂_{3U}^D)
+        With equal cohort sizes, this simplifies to:
+        V̂_{2U}^D / V̂_{3U}^D = (D̄_2(1-D̄_2)) / (D̄_3(1-D̄_3))
+                            = (0.75 · 0.25) / (0.5 · 0.5)
+                            = 0.1875 / 0.25 = 0.75
+        So s_{2U} / s_{3U} = 0.75 (= 0.3 / 0.4).
+        """
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        s2U = next(
+            c.weight
+            for c in results.comparisons
+            if c.comparison_type == "treated_vs_never" and c.treated_group == 2
+        )
+        s3U = next(
+            c.weight
+            for c in results.comparisons
+            if c.comparison_type == "treated_vs_never" and c.treated_group == 3
+        )
+        # Expected normalized: s2U=0.3, s3U=0.4 → ratio = 0.75
+        assert abs(s2U / s3U - 0.75) < 1e-10
+
+    def test_eq_8_earlier_vs_later_variance(self) -> None:
+        """V̂_{kℓ}^{D,k}: weight s_{23}^k normalized expected = 0.1.
+
+        Per the hand calc: s_{23}^k = 0.006944... / 0.069444... = 0.1
+        """
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        s_kl_k = next(
+            c.weight
+            for c in results.comparisons
+            if c.comparison_type == "earlier_vs_later"
+            and c.treated_group == 2
+            and c.control_group == 3
+        )
+        assert abs(s_kl_k - 0.1) < 1e-10
+
+    def test_eq_9_later_vs_earlier_variance(self) -> None:
+        """V̂_{kℓ}^{D,ℓ}: weight s_{23}^ℓ normalized expected = 0.2.
+
+        Per the hand calc: s_{23}^ℓ = 0.013889... / 0.069444... = 0.2
+        """
+        df = _hand_calc_panel()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        s_kl_l = next(
+            c.weight
+            for c in results.comparisons
+            if c.comparison_type == "later_vs_earlier"
+            and c.treated_group == 3
+            and c.control_group == 2
+        )
+        assert abs(s_kl_l - 0.2) < 1e-10
+
+
+# ---------------------------------------------------------------------------
+# 2. R parity (skip if golden JSON not committed)
+# ---------------------------------------------------------------------------
+
+_R_GOLDEN_PATH = (
+    Path(__file__).resolve().parent.parent / "benchmarks" / "data" / "r_bacondecomp_golden.json"
+)
+
+
+def _load_r_golden() -> dict:
+    if not _R_GOLDEN_PATH.exists():
+        pytest.skip(
+            f"R parity goldens missing at {_R_GOLDEN_PATH}. To generate, "
+            "install R + `install.packages('bacondecomp')` + "
+            "`install.packages('jsonlite')` then `cd benchmarks/R && "
+            "Rscript generate_bacon_golden.R`. The R goldens are deferred "
+            "until R is provisioned (see TODO.md)."
+        )
+    return json.loads(_R_GOLDEN_PATH.read_text())
+
+
+class TestBaconParityR:
+    """R `bacondecomp::bacon()` parity at atol=1e-6 (when goldens present)."""
+
+    @pytest.fixture(scope="class")
+    def golden(self) -> dict:
+        return _load_r_golden()
+
+    def test_twfe_coef_matches_r(self, golden) -> None:
+        for fixture_name, fix in golden.items():
+            if fixture_name == "meta":
+                continue
+            panel = pd.DataFrame(fix["panel"])
+            results = bacon_decompose(
+                panel,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+            assert abs(results.twfe_estimate - fix["r_twfe_coef"]) < 1e-6, (
+                f"{fixture_name}: TWFE Python={results.twfe_estimate} " f"vs R={fix['r_twfe_coef']}"
+            )
+
+    def test_weights_sum_matches_r(self, golden) -> None:
+        for fixture_name, fix in golden.items():
+            if fixture_name == "meta":
+                continue
+            panel = pd.DataFrame(fix["panel"])
+            results = bacon_decompose(
+                panel,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+            py_sum = sum(c.weight for c in results.comparisons)
+            assert abs(py_sum - fix["r_weights_sum"]) < 1e-6, (
+                f"{fixture_name}: weight sum Python={py_sum} " f"vs R={fix['r_weights_sum']}"
+            )
+
+    def test_component_estimates_match_r(self, golden) -> None:
+        """Match per-component estimates at atol=1e-6 across Python + R.
+
+        Join key is ``(comparison_type, treated_group_float, control_canonical)``
+        where ``control_canonical`` is the literal ``"U"`` for
+        ``treated_vs_never`` comparisons (Python stores the string
+        ``"never_treated"``; R may use ``Inf`` or its own string) and the
+        float-coerced cohort timing for timing-vs-timing comparisons.
+        Including ``comparison_type`` in the key disambiguates the two
+        directions of a timing pair (earlier_vs_later vs later_vs_earlier).
+        """
+
+        def _canonical_control(ctype: str, group):
+            if ctype == "treated_vs_never":
+                return "U"
+            return float(group)
+
+        def _classify_r_type(c: dict, fixture_name: str) -> str:
+            # R bacondecomp's `type` strings vary across versions
+            # ("Treated vs Untreated", "Earlier vs Later Treated", ...).
+            # Fall back to inferring from the control_group: U sentinel
+            # (0, np.inf, or "never"-containing string) -> treated_vs_never;
+            # otherwise treated_group < control_group is earlier-vs-later.
+            t = c.get("type") or ""
+            if "never" in t.lower() or "untreated" in t.lower():
+                return "treated_vs_never"
+            ctrl = c["control_group"]
+            if isinstance(ctrl, str) and "never" in ctrl.lower():
+                return "treated_vs_never"
+            if isinstance(ctrl, (int, float)) and (ctrl == 0 or np.isinf(ctrl)):
+                return "treated_vs_never"
+            try:
+                if float(c["treated_group"]) < float(ctrl):
+                    return "earlier_vs_later"
+                return "later_vs_earlier"
+            except (TypeError, ValueError):
+                pytest.fail(
+                    f"{fixture_name}: cannot classify R component "
+                    f"treated={c.get('treated_group')} "
+                    f"control={ctrl} type={t!r}"
+                )
+
+        for fixture_name, fix in golden.items():
+            if fixture_name == "meta":
+                continue
+            panel = pd.DataFrame(fix["panel"])
+            results = bacon_decompose(
+                panel,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+            py_estimates = {}
+            py_weights = {}
+            for c in results.comparisons:
+                key = (
+                    c.comparison_type,
+                    float(c.treated_group),
+                    _canonical_control(c.comparison_type, c.control_group),
+                )
+                py_estimates[key] = c.estimate
+                py_weights[key] = c.weight
+            r_estimates: dict = {}
+            r_weights: dict = {}
+            for c in fix["r_components"]:
+                ctype = _classify_r_type(c, fixture_name)
+                key = (
+                    ctype,
+                    float(c["treated_group"]),
+                    _canonical_control(ctype, c["control_group"]),
+                )
+                r_estimates[key] = c["estimate"]
+                r_weights[key] = c["weight"]
+            # Full-set equality: no Python component missing from R, no R
+            # component missing from Python. A dropped β̂_{kU} term or an
+            # extra spurious comparison would fail here.
+            py_keys = set(py_estimates)
+            r_keys = set(r_estimates)
+            missing_in_r = py_keys - r_keys
+            missing_in_py = r_keys - py_keys
+            assert not missing_in_r and not missing_in_py, (
+                f"{fixture_name}: component sets differ. "
+                f"In Python but not R: {sorted(missing_in_r)}. "
+                f"In R but not Python: {sorted(missing_in_py)}."
+            )
+            # Per-component estimate AND weight parity.
+            for k in py_keys:
+                assert abs(py_estimates[k] - r_estimates[k]) < 1e-6, (
+                    f"{fixture_name} {k}: estimate Python={py_estimates[k]} "
+                    f"vs R={r_estimates[k]}"
+                )
+                assert abs(py_weights[k] - r_weights[k]) < 1e-6, (
+                    f"{fixture_name} {k}: weight Python={py_weights[k]} " f"vs R={r_weights[k]}"
+                )
+
+
+# ---------------------------------------------------------------------------
+# 3. Always-treated warn+remap
+# ---------------------------------------------------------------------------
+
+
+def _panel_with_always_treated() -> pd.DataFrame:
+    """Build a panel with a small always-treated cohort (first_treat=1)."""
+    rng = np.random.default_rng(7)
+    rows = []
+    uid = 1
+    # 4 always-treated units (first_treat=1; treated in every observable period)
+    for _ in range(4):
+        unit_fe = rng.standard_normal() * 2.0
+        for t in range(1, 7):
+            y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3
+            rows.append((uid, t, y, 1))
+        uid += 1
+    # 10 never-treated
+    for _ in range(10):
+        unit_fe = rng.standard_normal() * 2.0
+        for t in range(1, 7):
+            y = unit_fe + 0.05 * t + rng.standard_normal() * 0.3
+            rows.append((uid, t, y, 0))
+        uid += 1
+    # 10 cohort-3 + 10 cohort-5
+    for cohort in (3, 5):
+        for _ in range(10):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(1, 7):
+                D = int(t >= cohort)
+                y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, cohort))
+            uid += 1
+    return pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+
+
+class TestBaconAlwaysTreatedRemap:
+    """Goodman-Bacon (2021) footnote 11 with the library's first-period
+    boundary convention.
+
+    The paper's footnote 11 says units treated before the first observable
+    period (``t_i < 1`` under the paper's 1-indexed convention) belong in
+    ``U``. The library generalizes this to units whose
+    ``first_treat <= min(time)`` (i.e. includes ``first_treat == min(time)``,
+    which the paper's strict ``<`` shorthand excludes). The library
+    convention is pragmatic: units treated at the first observable period
+    have no untreated cell within the panel and cannot contribute to any
+    valid 2x2 DD as a treated cohort, so folding them into ``U`` mirrors
+    the always-treated handling. The never-treated sentinels
+    (``first_treat ∈ {0, np.inf}``) are excluded from the remap.
+
+    bacon.py applies the remap via an internal column
+    (``__bacon_first_treat_internal__``), preserving the user's original
+    ``first_treat`` column unchanged. Detection uses ordered-time logic
+    on the **time axis**, so event-time-encoded panels
+    (``time ∈ [-2,..,3]``) are handled correctly.
+    """
+
+    def test_warn_emitted_on_remap(self) -> None:
+        df = _panel_with_always_treated()
+        with pytest.warns(UserWarning, match="Remapping to U bucket"):
+            bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+
+    def test_user_first_treat_column_unchanged(self) -> None:
+        """Regression test: user's data should not be mutated by the remap."""
+        df = _panel_with_always_treated()
+        original = df["first_treat"].copy()
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        pd.testing.assert_series_equal(df["first_treat"], original)
+
+    def test_n_always_treated_remapped_reported(self) -> None:
+        df = _panel_with_always_treated()
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        # 4 units were always-treated (first_treat=1)
+        assert results.n_always_treated_remapped == 4
+
+    def test_treated_vs_never_emitted_when_U_is_only_remapped_always_treated(
+        self,
+    ) -> None:
+        """Regression test for the R1 P0 finding.
+
+        When the user supplies NO ``first_treat ∈ {0, np.inf}`` units (so
+        ``n_never_treated == 0``) but supplies always-treated units that
+        the remap reclassifies into ``U``, the ``treated_vs_never``
+        comparison loop must still fire — gated on the POST-remap U
+        bucket, not the pre-remap never-treated count. Otherwise
+        Theorem 1's algebraic identity breaks because all
+        ``β̂_{kU}^{2x2}`` terms are silently dropped.
+        """
+        rng = np.random.default_rng(13)
+        rows = []
+        uid = 1
+        # 8 always-treated units (first_treat=1) — these become the entire U bucket
+        for _ in range(8):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(1, 7):
+                y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, 1))
+            uid += 1
+        # 8 cohort-3 + 8 cohort-5 (no true never-treated)
+        for cohort in (3, 5):
+            for _ in range(8):
+                unit_fe = rng.standard_normal() * 2.0
+                for t in range(1, 7):
+                    D = int(t >= cohort)
+                    y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3
+                    rows.append((uid, t, y, cohort))
+                uid += 1
+        df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        # n_never_treated counts TRUE never-treated only (user column = 0 or inf).
+        # All 8 U-bucket units came from the always-treated remap, so this is 0.
+        assert results.n_never_treated == 0
+        assert results.n_always_treated_remapped == 8
+        # CRITICAL: treated_vs_never comparisons MUST be present because the
+        # remapped U bucket is the paper's `U` per footnote 11.
+        types = {c.comparison_type for c in results.comparisons}
+        assert "treated_vs_never" in types, (
+            "Theorem 1 violation: treated_vs_never comparisons were "
+            "silently dropped when U is composed of remapped always-treated "
+            "units only (pre-fix R1 P0 regression)."
+        )
+        # Decomposition identity must hold at machine precision.
+        weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10
+        assert results.decomposition_error < 1e-10
+
+    def test_negative_first_treat_as_valid_timing_group(self) -> None:
+        """Regression for nonpositive-time encodings (event-time panels).
+
+        On a panel with ``time = [-2, -1, 0, 1, 2, 3]``, a cohort with
+        ``first_treat = -1`` is a valid timing group with one observable
+        pre-period (t=-2). Detection must use ordered-time logic, not
+        positive-sign restriction. Prior to the R3 P0 fix this cohort
+        was silently dropped (both the remap mask and the timing_groups
+        filter required ``first_treat > 0``), violating Theorem 1.
+        """
+        rng = np.random.default_rng(73)
+        rows = []
+        uid = 1
+        # 10 never-treated U
+        for _ in range(10):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(-2, 4):
+                y = unit_fe + 0.05 * t + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, 0))
+            uid += 1
+        # 8 cohort treated at t=-1 (valid timing group, one pre-period)
+        for _ in range(8):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(-2, 4):
+                D = int(t >= -1)
+                y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, -1))
+            uid += 1
+        # 8 cohort treated at t=2
+        for _ in range(8):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(-2, 4):
+                D = int(t >= 2)
+                y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, 2))
+            uid += 1
+        df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        # -1 cohort should appear as a timing group, NOT remapped to U
+        assert -1 in results.timing_groups
+        assert 2 in results.timing_groups
+        assert results.n_always_treated_remapped == 0
+        # Both cohorts should produce treated_vs_never comparisons
+        types_per_treated = {(c.treated_group, c.comparison_type) for c in results.comparisons}
+        assert (-1, "treated_vs_never") in types_per_treated
+        assert (2, "treated_vs_never") in types_per_treated
+        # Theorem 1 identity holds at machine precision.
+        weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10
+        assert results.decomposition_error < 1e-10
+
+    def test_negative_first_treat_below_min_time_remapped(self) -> None:
+        """``first_treat=-3`` on a panel with ``min(time)=-2`` is always-treated.
+
+        The unit's first treatment is before the first observable period,
+        so per paper footnote 11 it goes in U. Detection must use
+        ordered-time logic (``first_treat <= min(time)``), not positive-
+        sign restriction.
+        """
+        rng = np.random.default_rng(74)
+        rows = []
+        uid = 1
+        # 6 always-treated (first_treat=-3, pre-panel)
+        for _ in range(6):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(-2, 4):
+                y = unit_fe + 0.05 * t + 2.0 + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, -3))
+            uid += 1
+        # 8 cohort treated at t=2
+        for _ in range(8):
+            unit_fe = rng.standard_normal() * 2.0
+            for t in range(-2, 4):
+                D = int(t >= 2)
+                y = unit_fe + 0.05 * t + 2.0 * D + rng.standard_normal() * 0.3
+                rows.append((uid, t, y, 2))
+            uid += 1
+        df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        assert results.n_always_treated_remapped == 6
+        assert results.n_never_treated == 0
+        # 2 should be the only timing group (always-treated are remapped to U).
+        assert 2 in results.timing_groups
+        assert -3 not in results.timing_groups
+        # treated_vs_never must fire because U is non-empty post-remap.
+        types = {c.comparison_type for c in results.comparisons}
+        assert "treated_vs_never" in types
+        # Theorem 1 identity holds at machine precision.
+        weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10
+        assert results.decomposition_error < 1e-10
+
+    def test_no_warning_on_sentinel_only_inputs(self) -> None:
+        """When first_treat ∈ {0, np.inf} only, no remap warning fires."""
+        df = _staggered_data(seed=11)
+        # Replace any first_treat=1 (shouldn't be any in this DGP, but defensive)
+        df = df[df["first_treat"] != 1]
+        with warnings.catch_warnings():
+            warnings.simplefilter("error", category=UserWarning)
+            try:
+                bacon_decompose(
+                    df,
+                    outcome="y",
+                    unit="unit",
+                    time="time",
+                    first_treat="first_treat",
+                    weights="exact",
+                )
+            except UserWarning as w:
+                if "Remapping to U bucket" in str(w):
+                    pytest.fail(f"Unexpected remap warning on sentinel-only inputs: {w}")
+
+
+# ---------------------------------------------------------------------------
+# 4. Edge cases
+# ---------------------------------------------------------------------------
+
+
+class TestBaconEdgeCases:
+    """Edge cases enumerated in REGISTRY.md ## BaconDecomposition."""
+
+    def test_no_untreated_group(self) -> None:
+        """Only timing groups, no U. Weights still sum to 1 after normalization."""
+        df = _staggered_data(seed=22)
+        # Drop never-treated
+        df = df[df["first_treat"] > 0].copy()
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 1e-10
+        # No treated_vs_never terms — only timing-only
+        type_set = {c.comparison_type for c in results.comparisons}
+        assert "treated_vs_never" not in type_set
+
+    def test_single_timing_group_with_never_treated(self) -> None:
+        """K=1 (one timing group) + U: only treated_vs_never comparisons."""
+        rng = np.random.default_rng(33)
+        rows = []
+        uid = 1
+        for _ in range(10):
+            for t in range(1, 6):
+                D = int(t >= 3)
+                y = float(uid) + 0.1 * t + 2.0 * D + rng.standard_normal() * 0.2
+                rows.append((uid, t, y, 3))
+            uid += 1
+        for _ in range(10):
+            for t in range(1, 6):
+                y = float(uid) + 0.1 * t + rng.standard_normal() * 0.2
+                rows.append((uid, t, y, 0))
+            uid += 1
+        df = pd.DataFrame(rows, columns=["unit", "time", "y", "first_treat"])
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        type_set = {c.comparison_type for c in results.comparisons}
+        assert type_set == {"treated_vs_never"}
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 1e-10
+
+    def test_unbalanced_panel_warns(self) -> None:
+        df = _staggered_data(seed=44)
+        # Drop a few rows to unbalance
+        df = df.drop(df.sample(n=10, random_state=44).index).copy()
+        with pytest.warns(UserWarning, match="Unbalanced panel"):
+            bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+
+    def test_unbalanced_panel_finite_but_not_machine_precision(self) -> None:
+        """Unbalanced panels are a library extension (Goodman-Bacon Appendix A
+        proof assumes balanced panels). Decomposition still produces finite,
+        well-defined output, but the Theorem 1 identity holds only
+        approximately — do NOT assert ``decomposition_error < 1e-10``
+        on unbalanced data; the REGISTRY Deviation block documents the
+        deviation explicitly.
+        """
+        df = _staggered_data(seed=44)
+        df = df.drop(df.sample(n=10, random_state=44).index).copy()
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+            )
+        # All outputs are finite
+        assert np.isfinite(results.twfe_estimate)
+        assert all(np.isfinite(c.weight) for c in results.comparisons)
+        assert all(np.isfinite(c.estimate) for c in results.comparisons)
+        # Weights still sum to ~1 after post-hoc normalization
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 1e-10
+        # But the decomposition identity is NOT machine-precision on
+        # unbalanced data — REGISTRY's machine-precision claim is scoped
+        # to balanced panels (see Deviation note in REGISTRY entry).
+        # Empirically the error is small (well under 0.01) but not 1e-10.
+
+    def test_constant_att_recovers_effect(self) -> None:
+        """ΔATT=0 + VWCT=0 → β̂^DD ≈ true ATT (sample noise only)."""
+        df = _staggered_data(seed=55)
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        # True ATT = 2.0 in the DGP. With 60 units and clean DGP, the TWFE
+        # estimate should be within a few hundredths of truth.
+        assert abs(results.twfe_estimate - 2.0) < 0.2
+
+    def test_weighted_sum_machine_precision(self) -> None:
+        """The TWFE-vs-weighted-sum identity holds on noisy data too."""
+        df = _staggered_data(seed=66)
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        weighted_sum = sum(c.weight * c.estimate for c in results.comparisons)
+        assert abs(results.twfe_estimate - weighted_sum) < 1e-10
+
+
+# ---------------------------------------------------------------------------
+# 5. Weight modes
+# ---------------------------------------------------------------------------
+
+
+class TestBaconWeightModes:
+    """``weights="exact"`` is the paper-faithful default; ``"approximate"`` opt-in."""
+
+    def test_exact_is_default(self) -> None:
+        est = BaconDecomposition()
+        assert est.weights == "exact"
+
+    def test_approximate_still_supported(self) -> None:
+        est = BaconDecomposition(weights="approximate")
+        assert est.weights == "approximate"
+        df = _staggered_data(seed=77)
+        results = est.fit(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+        )
+        # Approximate mode produces valid output with weights summing to 1
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 0.01
+
+    def test_exact_vs_approximate_differ_meaningfully(self) -> None:
+        """The two modes produce different relative weights (the approximate
+        path uses a simplified variance not matching Eqs. 7-9)."""
+        df = _staggered_data(seed=88)
+        r_exact = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+        )
+        r_approx = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="approximate",
+        )
+        # Both have the same set of (treated, control) comparisons
+        assert len(r_exact.comparisons) == len(r_approx.comparisons)
+        # But at least one weight differs by more than 1e-6 (i.e., the modes
+        # are not numerically equivalent on a non-trivial DGP)
+        exact_map = {
+            (c.treated_group, c.control_group, c.comparison_type): c.weight
+            for c in r_exact.comparisons
+        }
+        approx_map = {
+            (c.treated_group, c.control_group, c.comparison_type): c.weight
+            for c in r_approx.comparisons
+        }
+        diffs = [abs(exact_map[k] - approx_map[k]) for k in exact_map if k in approx_map]
+        assert max(diffs) > 1e-6
+
+
+# ---------------------------------------------------------------------------
+# 6. Survey-design narrowing
+# ---------------------------------------------------------------------------
+
+
+class TestBaconSurveyDesignNarrowing:
+    """survey_design= composes cleanly with the warn+remap and the new exact-mode default."""
+
+    def test_survey_design_compatible_with_exact_mode(self) -> None:
+        """``weights="exact"`` (new default) accepts ``survey_design=`` without
+        path-specific assertion failures.
+        """
+        from diff_diff import SurveyDesign
+
+        df = _staggered_data(seed=99)
+        # Constant-within-unit weights (required by exact-mode survey validator)
+        unit_w = df.groupby("unit").ngroup() * 0.1 + 1.0
+        df = df.assign(w=unit_w.values)
+        sd = SurveyDesign(weights="w")
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="exact",
+            survey_design=sd,
+        )
+        # Sum-to-1 contract still holds under survey weighting
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 1e-10
+
+    def test_survey_design_propagates_through_remap(self) -> None:
+        """When always-treated remap fires and survey_design is set, no crash
+        and the survey metadata is preserved in the result."""
+        from diff_diff import SurveyDesign
+
+        df = _panel_with_always_treated()
+        unit_w = df.groupby("unit").ngroup() * 0.1 + 1.0
+        df = df.assign(w=unit_w.values)
+        sd = SurveyDesign(weights="w")
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", category=UserWarning)
+            results = bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                weights="exact",
+                survey_design=sd,
+            )
+        assert results.n_always_treated_remapped == 4
+        assert results.survey_metadata is not None
+
+    def _time_varying_survey_panel(self) -> "tuple[pd.DataFrame, object]":
+        """Build a panel with time-varying within-unit survey weights.
+
+        Used to exercise the post-PR-B default flip: ``weights="exact"``
+        (now the default) routes through ``_validate_unit_constant_survey``
+        and rejects time-varying within-unit weights; ``weights="approximate"``
+        accepts them via observation-level weighted means.
+        """
+        from diff_diff import SurveyDesign
+
+        np.random.seed(42)
+        n_u, n_t = 20, 4
+        df = pd.DataFrame(
+            {
+                "unit": np.repeat(range(n_u), n_t),
+                "time": np.tile(range(1, n_t + 1), n_u),
+                "first_treat": np.repeat(np.where(np.arange(n_u) < 10, 3, 0), n_t),
+                "y": np.random.randn(n_u * n_t),
+                # Time-varying weights (different per period within unit)
+                "w": np.random.uniform(0.5, 2.0, n_u * n_t),
+            }
+        )
+        return df, SurveyDesign(weights="w")
+
+    def test_default_bacon_decomposition_class_rejects_time_varying_weights(
+        self,
+    ) -> None:
+        """The new ``weights="exact"`` default routes ``BaconDecomposition()``
+        through the unit-constant-survey validator. Time-varying within-unit
+        weights are rejected. Locks the public-default contract surfaced in
+        the PR-B Changed CHANGELOG entry.
+        """
+        df, sd = self._time_varying_survey_panel()
+        with pytest.raises(ValueError, match="varies within units"):
+            BaconDecomposition().fit(df, "y", "unit", "time", "first_treat", survey_design=sd)
+
+    def test_default_bacon_decompose_function_rejects_time_varying_weights(
+        self,
+    ) -> None:
+        """The new ``weights="exact"`` default flows through
+        ``bacon_decompose(...)`` (no explicit weights= kwarg). Same
+        rejection contract."""
+        df, sd = self._time_varying_survey_panel()
+        with pytest.raises(ValueError, match="varies within units"):
+            bacon_decompose(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                first_treat="first_treat",
+                survey_design=sd,
+            )
+
+    def test_explicit_approximate_accepts_time_varying_weights(self) -> None:
+        """Users can opt back into the obs-level weighted-means path via
+        explicit ``weights="approximate"``. This is the documented
+        migration path in the PR-B Changed CHANGELOG entry.
+        """
+        df, sd = self._time_varying_survey_panel()
+        # Should not raise; produces a valid decomposition via the legacy
+        # approximate path that tolerates obs-level weighted means.
+        results = bacon_decompose(
+            df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            weights="approximate",
+            survey_design=sd,
+        )
+        total = sum(c.weight for c in results.comparisons)
+        assert abs(total - 1.0) < 0.01
+        assert np.isfinite(results.twfe_estimate)
+
+    def test_diagnostic_report_skips_with_structured_reason_on_replicate_weights(
+        self,
+    ) -> None:
+        """PR #454 R4 P3 regression: ``DiagnosticReport._check_bacon``
+        emits ``status="skipped"`` (not ``"error"``) when the survey
+        design uses replicate weights, which Bacon rejects with
+        ``NotImplementedError`` upstream. The skip reason names the
+        ``precomputed={'bacon': ...}`` escape hatch and points users at
+        a TSL-based survey design as the supported alternative.
+        """
+        from diff_diff import DiagnosticReport, SurveyDesign
+
+        df, _ = self._time_varying_survey_panel()
+        df["rep_w1"] = 1.0
+        df["rep_w2"] = 1.0
+        sd_rep = SurveyDesign(
+            weights="w",
+            replicate_weights=["rep_w1", "rep_w2"],
+            replicate_method="BRR",
+        )
+
+        class _Stub:
+            pass
+
+        dr = DiagnosticReport(
+            _Stub(),
+            data=df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            survey_design=sd_rep,
+        )
+        block = dr._check_bacon()
+        assert block["status"] == "skipped"
+        reason = block["reason"]
+        assert "replicate weights" in reason
+        assert "precomputed" in reason
+
+    def test_diagnostic_report_skips_with_structured_reason_on_time_varying_survey(
+        self,
+    ) -> None:
+        """PR #454 R1 P1 regression: ``DiagnosticReport._check_bacon`` now
+        emits ``status="skipped"`` (not ``"error"``) when the panel has
+        within-unit-varying survey columns. The skip reason names the
+        ``precomputed={'bacon': ...}`` + explicit ``weights="approximate"``
+        escape hatch so users have a documented migration path.
+        """
+        from diff_diff import DiagnosticReport
+
+        df, sd = self._time_varying_survey_panel()
+
+        class _Stub:
+            """Minimal results stub that does not carry survey_metadata,
+            so the survey-metadata-without-survey_design early-skip at
+            ``diagnostic_report.py:1723`` does not pre-empt this path."""
+
+        dr = DiagnosticReport(
+            _Stub(),
+            data=df,
+            outcome="y",
+            unit="unit",
+            time="time",
+            first_treat="first_treat",
+            survey_design=sd,
+        )
+        block = dr._check_bacon()
+        assert block["status"] == "skipped"
+        reason = block["reason"]
+        assert "varies within units" not in reason or "approximate" in reason
+        assert "precomputed" in reason
+        assert 'weights="approximate"' in reason or "approximate" in reason