From fb82c5d58dc415c74d5a72ad5ca35d582d912bf6 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 09:01:43 -0400
Subject: [PATCH 1/8] twfe: lift HC2/HC2-BM via inline full-dummy auto-route
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Replaces the unconditional NotImplementedError that
`TwoWayFixedEffects.fit()` (twfe.py) raised for
`vcov_type in {"hc2","hc2_bm"}` with an inline full-dummy branch.
TWFE has no absorb=/fixed_effects= parameter to swap (unit + time FE
are baked into the estimator's identity), so the auto-route trick
used for DiD-absorb / MPD-absorb doesn't apply directly. Instead,
`TwoWayFixedEffects.fit()` bypasses the within-transform on
hc2/hc2_bm and stacks [intercept, treated×post, covariates,
factor(unit), factor(time)] explicitly so the leverage correction
and BM DOF compute on the full FE projection (FWL preserves
coefficients and residuals but NOT the hat matrix).
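
A minimal NumPy sketch of the FWL point above, not the code in
twfe.py. The synthetic panel, the function names (`hat_diag`,
`fwl_demo`), and the data-generating process are illustrative
assumptions. It compares the full-dummy fit with the within-transformed
fit: the treated×post coefficient and the residuals should agree, while
the leverages do not.

```python
import numpy as np


def hat_diag(X):
    """Leverages h_ii = x_i' (X'X)^{-1} x_i for a full-rank design X."""
    return np.einsum("ij,jk,ik->i", X, np.linalg.pinv(X.T @ X), X)


def fwl_demo(n_units=8, n_periods=4, seed=0):
    rng = np.random.default_rng(seed)
    unit = np.repeat(np.arange(n_units), n_periods)
    period = np.tile(np.arange(n_periods), n_units)
    post = (period >= n_periods // 2).astype(float)
    treated = (unit % 2 == 0).astype(float)
    d = treated * post  # treated x post regressor
    y = (1.0 + 0.7 * d + rng.normal(size=n_units)[unit]
         + rng.normal(size=n_units * n_periods))

    # Full-dummy design: intercept, treated x post, unit dummies
    # (first level dropped), post dummy.
    U = (unit[:, None] == np.arange(1, n_units)).astype(float)
    X_full = np.column_stack([np.ones_like(d), d, U, post])
    beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
    resid_full = y - X_full @ beta_full

    # Within fit: residualize y and d on the FE block, then regress.
    Z = np.column_stack([np.ones_like(d), U, post])
    M = np.eye(len(y)) - Z @ np.linalg.pinv(Z)
    d_t, y_t = M @ d, M @ y
    beta_within = (d_t @ y_t) / (d_t @ d_t)
    resid_within = y_t - beta_within * d_t

    h_full = hat_diag(X_full)           # trace = rank of X_full
    h_within = d_t**2 / (d_t @ d_t)     # trace = 1 (single regressor)
    return (beta_full[1], beta_within, resid_full, resid_within,
            h_full, h_within)
```

Because HC2 divides each squared residual by `1 - h_ii`, feeding it the
within-design leverages would rescale the meat with the wrong weights
even though the residuals themselves are identical.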

**Auto-cluster default:** preserved on hc2_bm (routes to CR2-BM at
unit) and on hc2 + wild_bootstrap; dropped on explicit hc2 +
analytical to match the one-way contract (the linalg validator
rejects hc2 + cluster_ids).

**Surface change disclosure** (matches DiD-absorb / MPD-absorb):
under vcov_type in {"hc2","hc2_bm"}, result.coefficients,
result.vcov, result.residuals, result.fitted_values, and
result.r_squared reflect the full-dummy fit. FE-dummy entries are
included alongside the "ATT" key (len(coefficients) ==
vcov.shape[0] invariant upheld). result.att, its SE, and analytical
inference are unchanged (FWL-equivalent).

**Rejected combos:** vcov_type in {"hc2","hc2_bm"} + replicate-
weight survey designs raises NotImplementedError because the
replicate path re-demeans per replicate, which doesn't compose with
the full-dummy build. Survey variance precedence: any resolved
SurveyDesign drives variance via TSL/replicate (matches existing
DiD/MPD contract), not the analytical small-sample sandwich.

Verified at atol=1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and
`lm() + clubSandwich::vcovCR(cluster=seq_len(n), type="CR2") +
coef_test()$df_Satt` on a new `twfe_two_period` scenario in
benchmarks/data/clubsandwich_cr2_golden.json. New tests:
- TestFitBehavior (10 behavioral tests including refactor regression
  vs DiD(fixed_effects=[unit, time]) at atol=1e-12, auto-cluster
  distinguishability check vs one-way HC2-BM at 1% gap, replicate-
  weight rejection, coefficients-vs-vcov alignment invariant)
- TestTWFEHC2RParity (3 R-parity tests at atol=1e-10)
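
The singleton-cluster check used for the R parity above can be
illustrated with a short NumPy sketch. It assumes unweighted OLS and the
adjustment `A_g = (I - H_gg)^{-1/2}`; `hc2_vcov`, `cr2_vcov`, and
`singleton_demo` are hypothetical helpers written for this note, not
diff_diff or clubSandwich functions, and the Satterthwaite DOF is not
computed here. With one observation per cluster, the CR2 sandwich should
coincide with HC2.

```python
import numpy as np


def hc2_vcov(X, resid):
    """HC2 sandwich: squared residuals scaled by 1 / (1 - h_ii)."""
    bread = np.linalg.inv(X.T @ X)
    h = np.einsum("ij,jk,ik->i", X, bread, X)
    meat = (X * (resid**2 / (1.0 - h))[:, None]).T @ X
    return bread @ meat @ bread


def cr2_vcov(X, resid, cluster):
    """CR2 sandwich with A_g = (I - H_gg)^{-1/2} (unweighted OLS only)."""
    bread = np.linalg.inv(X.T @ X)
    meat = np.zeros((X.shape[1], X.shape[1]))
    for g in np.unique(cluster):
        idx = cluster == g
        Xg, eg = X[idx], resid[idx]
        Hgg = Xg @ bread @ Xg.T
        # Symmetric inverse square root via eigendecomposition,
        # dropping numerically zero eigenvalues.
        w, V = np.linalg.eigh(np.eye(idx.sum()) - Hgg)
        keep = w > 1e-12
        A = (V[:, keep] / np.sqrt(w[keep])) @ V[:, keep].T
        s = Xg.T @ A @ eg
        meat += np.outer(s, s)
    return bread @ meat @ bread


def singleton_demo(n=30, seed=1):
    rng = np.random.default_rng(seed)
    X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
    y = X @ np.array([1.0, 0.5, -0.3]) + rng.normal(size=n)
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    v_hc2 = hc2_vcov(X, resid)
    v_single = cr2_vcov(X, resid, np.arange(n))        # one obs per cluster
    v_grouped = cr2_vcov(X, resid, np.arange(n) // 5)  # six clusters of 5
    return v_hc2, v_single, v_grouped
```

For a singleton cluster `H_gg` is the scalar `h_ii`, so `A_g` reduces to
`1 / sqrt(1 - h_ii)` and each cluster's contribution to the meat is the
HC2 term for that observation.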

Lifts Gate 1 of the six HC2/HC2-BM NotImplementedError gates — the
last absorbed-FE gate. Remaining gates: weighted one-way HC2-BM,
weighted CR2-BM (both blocked on the clubSandwich WLS algebra
derivation).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                                 |   1 +
 TODO.md                                      |   4 +-
 benchmarks/R/generate_clubsandwich_golden.R  |  51 ++
 benchmarks/data/clubsandwich_cr2_golden.json |  18 +-
 diff_diff/twfe.py                            | 237 +++++----
 docs/methodology/REGISTRY.md                 |   4 +-
 tests/test_estimators_vcov_type.py           | 301 ++++++++++-
 tests/test_methodology_twfe.py               | 506 +++++++++++++------
 8 files changed, 863 insertions(+), 259 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 8dc9fb16..d2be9da0 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` now supported** (`diff_diff/twfe.py:155`). Lifts Gate 1 of the six HC2/HC2-BM `NotImplementedError` gates — the last absorbed-FE gate (DiD-absorb shipped earlier, MPD-absorb shipped earlier, MPD cluster+contrast-DOF shipped earlier in this release). Unlike DiD / MPD, TWFE has no `absorb=` / `fixed_effects=` parameter to swap (unit + time FEs are baked into the estimator's identity), so the same auto-route trick isn't applicable. Instead, `TwoWayFixedEffects.fit()` bypasses the within-transform when `vcov_type in {"hc2","hc2_bm"}` and stacks the full-dummy design `[intercept, treated×post, covariates, factor(unit), factor(time)]` explicitly, then runs OLS through the standard `solve_ols` path so the leverage correction `h_ii = x_i' (X'X)^{-1} x_i` and CR2 Bell-McCaffrey adjustment `A_g = (I - H_gg)^{-1/2}` compute on the full FE projection (FWL preserves coefficients and residuals but NOT the hat matrix). Verified at `atol=1e-10` vs `lm(y ~ treat_post + factor(unit) + factor(post)) + sandwich::vcovHC(type="HC2")` for HC2, vs `clubSandwich::vcovCR(cluster=seq_len(n), type="CR2") + coef_test()$df_Satt` for the singleton-cluster one-way HC2-BM Satterthwaite DOF, and vs `vcovCR(cluster=unit, type="CR2")` for the auto-cluster CR2-BM path (new `twfe_two_period` scenario in `benchmarks/data/clubsandwich_cr2_golden.json`). **Auto-cluster default:** TWFE's unit auto-cluster is preserved on `hc2_bm` (routes to CR2-BM at unit) and on `hc2 + wild_bootstrap` (the bootstrap consumes the cluster structure for resampling regardless of the analytical sandwich choice); dropped on explicit `hc2 + analytical` to match the one-way contract (the linalg validator rejects `hc2 + cluster_ids`). **User-visible surface change** (matches the DiD-absorb / MPD-absorb disclosures above): under `vcov_type in {"hc2","hc2_bm"}`, `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, and `result.r_squared` reflect the full-dummy fit rather than the within-transformed reduced fit (FE-dummy entries are included alongside the `"ATT"` key; `r_squared` is computed on the un-demeaned outcome; residuals / fitted are on the original scale; `len(result.coefficients) == result.vcov.shape[0]` invariant upheld). `result.att`, its SE, and analytical inference are unchanged (FWL-equivalent). HC1 / CR1 / Conley / classical paths remain on the within-transform. **Survey-design scope** (mirrors DiD-absorb): when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization or replicate-weight variance) takes precedence over the analytical HC2/HC2-BM sandwich; the full-dummy build only changes FE handling. **Rejected combos:** `vcov_type in {"hc2","hc2_bm"}` + replicate-weight survey designs (BRR / Fay / JK1 / JKn / SDR) raises `NotImplementedError` at `twfe.py:~233` because the replicate path re-demeans per replicate, which doesn't compose with the full-dummy build (would require per-replicate full-dummy refit); workaround: use `vcov_type="hc1"` for replicate-weight CR1. `hc2_bm + weights` remains blocked at the linalg validator (same gate as Gates 4-5 — weighted CR2 variants). New tests: `tests/test_estimators_vcov_type.py::TestFitBehavior` (9 tests: rejection flip → behavioral; refactor regression vs `DifferenceInDifferences(fixed_effects=[unit, time])` at `atol=1e-12`; auto-cluster default coverage on `hc2_bm`; explicit `hc2 + analytical` no-auto-cluster; `hc2 + wild_bootstrap` auto-cluster preserved; `hc2 / hc2_bm + replicate` rejection; always-treated unit finite ATT; coefficients-vs-vcov alignment invariant); `tests/test_methodology_twfe.py::TestTWFEHC2RParity` (3 R-parity tests at `atol=1e-10`).
 - **Agent-discoverability contract test (`tests/test_agent_discoverability.py`).** New static-snapshot test pinning the agent-facing surface introduced by PR #464: `__all__` membership of `agent_workflow` / `profile_panel` / `get_llm_guide` / `practitioner_next_steps` / `BusinessReport`; `dir(diff_diff)` head-first ordering against `_AGENT_FACING_ORDER` (catches drift in the `_OrderedName` `__lt__` ordering trick); `_OrderedName` `isinstance(_, str)` + str-method compatibility; `dir()` full-namespace + `inspect.getmembers` parity; top-level `__doc__` first-paragraph mention of `agent_workflow` + named references to the 5-step workflow primitives; `agent_workflow()` script content references each downstream helper by name; canonical estimator class names (CallawaySantAnna, ContinuousDiD, HeterogeneousAdoptionDiD, etc.) remain importable. No live API calls; runs in the default pytest suite. Closes [issue #461](https://github.com/igerber/diff-diff/issues/461) (snapshot variant — live-agent regression test deferred to a separate follow-up that depends on causal-llm-eval packaging its harness). Also closes the `__dir__()` contract-test row from `TODO.md` that PR #464 deferred here.
 - **`diff_diff.agent_workflow(df, unit=..., time=..., treatment=..., outcome=...)` — stateless orchestrator for LLM-agent discoverability** (`diff_diff/agent_workflow.py`). Prints (and returns as dict) a copy-pasteable 5-step workflow with the caller's column names templated in: `profile_panel` → `get_llm_guide("autonomous")` → `<Estimator>(...).fit(df, ...)` → `practitioner_next_steps(result)` → `BusinessReport(result).full_report()`. The function calls nothing internally and does not inspect `df`; it is a guided tour, not a router. Surfaces the canonical workflow primitives (`profile_panel`, `get_llm_guide`, `practitioner_next_steps`, `BusinessReport`) that cold-start agent dry-passes at [igerber/causal-llm-eval](https://github.com/igerber/causal-llm-eval) showed agents practically never reach for on their own. Output structure: `{"profile_call", "guide_call", "fit_candidates", "validation_calls", "reporting_call", "script"}`; `fit_candidates` is a flat list of estimator/diagnostic class names referenced in the workflow patterns (each must remain importable on `diff_diff`, locked by `tests/test_agent_workflow.py::test_fit_candidates_all_importable`). Closes [issue #460](https://github.com/igerber/diff-diff/issues/460).
 - **Top-level `__doc__` rewritten to lead with the agent workflow** (`diff_diff/__init__.py`). `help(diff_diff)` now opens with the `agent_workflow(df, ...)` recommendation as the first non-blank paragraph; `get_llm_guide("full")` and `get_llm_guide("practitioner")` pointers preserved for the existing `tests/test_guides.py::test_module_docstring_mentions_helper` guard.
diff --git a/TODO.md b/TODO.md
index 83f35525..e11b2a73 100644
--- a/TODO.md
+++ b/TODO.md
@@ -99,8 +99,8 @@ Deferred items from PR reviews that were not addressed before merge.
 
 | Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium |
 | Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
-| HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits — REMAINING sub-gate: `TwoWayFixedEffects` (`twfe.py:154` rejects unconditionally). The DiD sub-gate and the MultiPeriodDiD sub-gate were both lifted via auto-route to `fixed_effects=` internally (DiD: PR #458, ~1e-10 vs clubSandwich; MPD: this release, ~1e-10 vs sandwich::vcovHC and clubSandwich::vcovCR). TWFE has no equivalent `fixed_effects=` code path (always within-transforms), so the same auto-route surgery is not directly applicable — lifting requires either building the full-dummy design inline or refactoring TWFE to delegate to DiD. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix; HC1/CR1 are unaffected (no leverage term). | `twfe.py::fit` | follow-up | Medium |
 | Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
+| `TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` with replicate-weight survey designs raises `NotImplementedError` (`twfe.py:~233`). The replicate path re-demeans per replicate (re-demeaning depends on the per-replicate weight vector), which doesn't compose with the full-dummy HC2/HC2-BM build — a correct implementation would need per-replicate full-dummy refit. Workaround: use `vcov_type="hc1"` for replicate-weight CR1. | `twfe.py::fit` | follow-up | Low |
 | Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
 | Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
 | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
@@ -193,7 +193,7 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,
 #### Tier C — Heavy / derivation required
 
 - HonestDiD Δ^RM ARP conditional/hybrid confidence sets (`honest_did.py`)
-- Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey + HC2/CR2 on absorbed-FE (linalg derivations + R parity harness) (`linalg.py`, `estimators.py::DifferenceInDifferences.fit`, `estimators.py::MultiPeriodDiD.fit`, `twfe.py::fit`)
+- Weighted one-way Bell-McCaffrey + weighted CR2 Bell-McCaffrey (linalg derivations + R parity harness) (`linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_compute_cr2_bm`)
 - Multi-absorb weighted demeaning: alternating-projection iteration for N>1 absorb + weights (`estimators.py`)
 - ImputationDiD dense `(A0'A0).toarray()` OOM: alternative dense fallback or richer sparse strategy (`imputation.py:1531`)
 - HAD mass-point `vcov_type ∈ {hc2, hc2_bm}`: 2SLS-specific leverage derivation (`had.py::_fit_mass_point_2sls`)
diff --git a/benchmarks/R/generate_clubsandwich_golden.R b/benchmarks/R/generate_clubsandwich_golden.R
index 74aaf88d..a2dc1338 100644
--- a/benchmarks/R/generate_clubsandwich_golden.R
+++ b/benchmarks/R/generate_clubsandwich_golden.R
@@ -232,6 +232,57 @@ output$mpd_clustered_avg_att_dof <- list(
   n_post_periods = length(post_names)
 )
 
+# --- TwoWayFixedEffects HC2 / HC2-BM scenario (Gate 1 lift PR) ---------------
+# Mirrors TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"}) on a 4-period
+# panel with a binary pre/post split. TWFE's `time` parameter is the post
+# indicator, so the FE design is factor(unit) + factor(post), NOT
+# factor(period). HC2 SE pinned via sandwich::vcovHC; one-way HC2-BM DOF
+# via the singleton-cluster CR2 trick (Pustejovsky-Tipton 2018 Section 3.3
+# — CR2 with cluster=seq_len(n) reduces to Imbens-Kolesar BM). CR2-BM
+# clustered at unit pinned separately for the auto-cluster path.
+
+set.seed(20260518)
+n_twfe_units <- 8
+n_twfe_periods <- 4
+twfe_treated_units <- c(1, 3, 5, 7)
+twfe_post_start <- 3
+d_twfe <- expand.grid(unit = seq_len(n_twfe_units),
+                      period = seq_len(n_twfe_periods))
+d_twfe$treated <- as.integer(d_twfe$unit %in% twfe_treated_units)
+d_twfe$post <- as.integer(d_twfe$period >= twfe_post_start)
+d_twfe$treat_post <- d_twfe$treated * d_twfe$post
+twfe_alpha_unit <- rnorm(n_twfe_units, mean = 0, sd = 1)
+twfe_gamma_time <- rnorm(n_twfe_periods, mean = 0, sd = 0.5)
+d_twfe$y <- 1.0 + 0.7 * d_twfe$treat_post +
+            twfe_alpha_unit[d_twfe$unit] +
+            twfe_gamma_time[d_twfe$period] +
+            rnorm(nrow(d_twfe), sd = 0.4)
+fit_twfe <- lm(y ~ treat_post + factor(unit) + factor(post), data = d_twfe)
+vcov_twfe_hc2 <- sandwich::vcovHC(fit_twfe, type = "HC2")
+# Singleton-cluster CR2 trick for one-way HC2-BM DOF.
+vcov_twfe_cr2_one_way <- vcovCR(fit_twfe, cluster = seq_len(nrow(d_twfe)),
+                                type = "CR2")
+ct_twfe_one_way <- coef_test(fit_twfe, vcov = vcov_twfe_cr2_one_way)
+# CR2-BM clustered at unit (the TWFE auto-cluster default).
+vcov_twfe_cr2_unit <- vcovCR(fit_twfe, cluster = d_twfe$unit, type = "CR2")
+ct_twfe_unit <- coef_test(fit_twfe, vcov = vcov_twfe_cr2_unit)
+output$twfe_two_period <- list(
+  unit = d_twfe$unit,
+  period = d_twfe$period,
+  treated = d_twfe$treated,
+  post = d_twfe$post,
+  treat_post = d_twfe$treat_post,
+  y = d_twfe$y,
+  coef = as.numeric(coef(fit_twfe)),
+  coef_names = names(coef(fit_twfe)),
+  vcov_hc2 = as.numeric(vcov_twfe_hc2),
+  vcov_hc2_shape = dim(vcov_twfe_hc2),
+  vcov_cr2_one_way = as.numeric(vcov_twfe_cr2_one_way),
+  dof_bm_one_way = as.numeric(ct_twfe_one_way$df_Satt),
+  vcov_cr2_unit = as.numeric(vcov_twfe_cr2_unit),
+  dof_bm_unit = as.numeric(ct_twfe_unit$df_Satt)
+)
+
 output$meta <- list(
   source = "clubSandwich",
   clubSandwich_version = as.character(packageVersion("clubSandwich")),
diff --git a/benchmarks/data/clubsandwich_cr2_golden.json b/benchmarks/data/clubsandwich_cr2_golden.json
index 5e406154..539f5efb 100644
--- a/benchmarks/data/clubsandwich_cr2_golden.json
+++ b/benchmarks/data/clubsandwich_cr2_golden.json
@@ -82,11 +82,27 @@
     "reference_period": 1,
     "n_post_periods": 3
   },
+  "twfe_two_period": {
+    "unit": [1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8, 1, 2, 3, 4, 5, 6, 7, 8],
+    "period": [1, 1, 1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3, 3, 3, 3, 4, 4, 4, 4, 4, 4, 4, 4],
+    "treated": [1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
+    "post": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1, 1],
+    "treat_post": [0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0, 1, 0],
+    "y": [2.650364154679115, 0.5250992296125498, 2.538325280436974, 1.803468688851707, 0.390388546319389, 0.9019986445013289, 0.8232587072837394, 0.8653454679955752, 2.08429332526299, 0.0638119452881384, 2.722749318503621, 0.8169670582027474, 0.2459517838623851, 0.6804813712532132, 0.8556684840526531, 1.398839980876758, 2.620848972130513, 0.09909203941612038, 2.832338679868128, 0.7704342845402335, 0.8560980445011976, 0.5425351511582146, 0.9299248903311188, 1.744787275814005, 3.087638383603313, -0.7232315532492211, 2.27084735211901, 0.7045197493403264, 0.9856491992396943, 0.2839259051193889, 0.6881762347785356, 0.1525997107657043],
+    "coef": [2.488253574158318, 0.6802339974809865, -2.279476294911592, -0.0197210511870492, -1.246821764944735, -1.991264315438316, -1.668433942170453, -1.78652912980747, -1.230276101315478, -0.4351687279596561],
+    "coef_names": ["(Intercept)", "treat_post", "factor(unit)2", "factor(unit)3", "factor(unit)4", "factor(unit)5", "factor(unit)6", "factor(unit)7", "factor(unit)8", "factor(post)1"],
+    "vcov_hc2": [0.03700174102669025, -0.01209920750084357, -0.03700174102669036, -0.034148276882185, -0.0370017410266902, -0.03090579095064369, -0.03700174102669018, -0.03160431930844195, -0.03700174102669018, -7.600244650009306e-18, -0.01209920750084357, 0.07671968417768542, 0.03015788065425394, 0.008722820785758987, 0.06270949142009662, 0.002237848922676195, 0.04028563730391647, 0.003634905638272913, -0.01045609013314296, -0.05718235232384992, -0.03700174102669031, 0.03015788065425393, 0.08384338199488832, 0.03414827688218509, 0.0570406314820595, 0.03090579095064373, 0.04582870442396941, 0.03160431930844203, 0.02045784070543967, -0.01805867315341034, -0.034148276882185, 0.008722820785758943, 0.03414827688218509, 0.05520728093577265, 0.03414827688218493, 0.02978686648930553, 0.03414827688218493, 0.02978686648930548, 0.03414827688218493, 3.26088479668859e-17, -0.03700174102669024, 0.06270949142009664, 0.05704063148205959, 0.034148276882185, 0.1194701811546974, 0.0309057909506437, 0.06210450980689071, 0.03160431930844199, 0.03673364608836099, -0.05061028391925308, -0.03090579095064366, 0.002237848922676207, 0.03090579095064374, 0.02978686648930551, 0.03090579095064364, 0.04312575858084388, 0.03090579095064361, 0.02978686648930553, 0.03090579095064363, 9.288215089373668e-18, -0.03700174102669024, 0.04028563730391648, 0.04582870442396951, 0.034148276882185, 0.06210450980689072, 0.03090579095064367, 0.0564599920123676, 0.03160431930844197, 0.02552171903027089, -0.02818642980307293, -0.03160431930844198, 0.003634905638272877, 0.03160431930844206, 0.0297868664893055, 0.03160431930844194, 0.02978686648930557, 0.03160431930844192, 0.03939002087733653, 0.03160431930844193, 1.566395541000208e-17, -0.03700174102669019, -0.01045609013314297, 0.02045784070543973, 0.034148276882185, 0.03673364608836095, 0.03090579095064368, 0.02552171903027083, 0.03160431930844194, 0.1340805551581075, 0.02255529763398656, 4.385790535446187e-19, -0.05718235232384991, -0.01805867315341033, -1.35757095501444e-17, -0.05061028391925306, 1.208677012866869e-17, -0.02818642980307294, -4.970499585804175e-17, 0.02255529763398653, 0.05718235232384995],
+    "vcov_hc2_shape": [10, 10],
+    "vcov_cr2_one_way": [0.03700174102669018, -0.01209920750084357, -0.03700174102669025, -0.0341482768821849, -0.03700174102669003, -0.03090579095064359, -0.03700174102669005, -0.03160431930844183, -0.03700174102669001, 4.444145962929492e-17, -0.01209920750084359, 0.07671968417768549, 0.03015788065425398, 0.00872282078575895, 0.06270949142009666, 0.002237848922676209, 0.04028563730391651, 0.003634905638272878, -0.01045609013314298, -0.05718235232384999, -0.03700174102669025, 0.03015788065425397, 0.08384338199488831, 0.03414827688218498, 0.05704063148205939, 0.03090579095064365, 0.04582870442396931, 0.03160431930844191, 0.02045784070543955, -0.01805867315341042, -0.03414827688218493, 0.008722820785758943, 0.03414827688218502, 0.05520728093577257, 0.0341482768821848, 0.02978686648930545, 0.03414827688218481, 0.02978686648930537, 0.03414827688218482, -9.024515456557491e-18, -0.0370017410266902, 0.06270949142009669, 0.05704063148205952, 0.03414827688218489, 0.1194701811546973, 0.03090579095064362, 0.06210450980689061, 0.03160431930844186, 0.03673364608836086, -0.05061028391925316, -0.0309057909506436, 0.002237848922676207, 0.03090579095064367, 0.02978686648930542, 0.03090579095064351, 0.0431257585808438, 0.0309057909506435, 0.02978686648930542, 0.03090579095064351, -3.928404223797695e-17, -0.0370017410266902, 0.0402856373039165, 0.04582870442396944, 0.03414827688218489, 0.06210450980689061, 0.0309057909506436, 0.05645999201236749, 0.03160431930844185, 0.02552171903027075, -0.02818642980307301, -0.03160431930844192, 0.00363490563827287, 0.03160431930844198, 0.02978686648930541, 0.03160431930844181, 0.02978686648930549, 0.03160431930844181, 0.03939002087733642, 0.03160431930844181, -3.290830191734854e-17, -0.03700174102669015, -0.01045609013314295, 0.02045784070543967, 0.03414827688218488, 0.03673364608836084, 0.0309057909506436, 0.02552171903027073, 0.03160431930844181, 0.1340805551581074, 0.0225552976339865, 2.819415466917353e-17, -0.05718235232384999, -0.01805867315341039, 4.638886947612098e-18, -0.05061028391925316, -2.260769939086745e-17, -0.02818642980307301, -4.623554890608812e-17, 0.0225552976339865, 0.05718235232385],
+    "dof_bm_one_way": [3.425821064552667, 21.999999999999837, 6.851642129105291, 5.761904761904771, 6.851642129105294, 5.761904761904765, 6.851642129105291, 5.761904761904764, 6.85164212910529, 10.999999999999979],
+    "vcov_cr2_unit": [0.007651392098640002, -0.01530278419727998, -0.007651392098640009, -1.340972185906478e-17, -0.007651392098640004, -1.786362460978703e-17, -0.007651392098640002, -1.652260459528918e-17, -0.007651392098640006, 8.815980097977497e-19, -0.01530278419727999, 0.04018425503992974, 0.02009212751996491, 2.723300676932653e-17, 0.0200921275199649, 3.747012207819093e-17, 0.02009212751996489, 3.337015158794735e-17, 0.0200921275199649, -0.009578686645369807, -0.00765139209864001, 0.02009212751996491, 0.01004606375998247, 1.361650338466344e-17, 0.01004606375998247, 1.873506103909563e-17, 0.01004606375998246, 1.668507579397384e-17, 0.01004606375998247, -0.00478934332268491, -1.340972185906478e-17, 2.723300676932653e-17, 1.361650338466343e-17, 1.668199508743466e-31, 1.361650338466343e-17, 1.656912369884294e-31, 1.361650338466342e-17, 1.631551769724017e-31, 1.361650338466343e-17, -4.135630511972947e-19, -0.007651392098640005, 0.0200921275199649, 0.01004606375998247, 1.361650338466343e-17, 0.01004606375998247, 1.873506103909562e-17, 0.01004606375998246, 1.668507579397383e-17, 0.01004606375998247, -0.004789343322684908, -1.786362460978703e-17, 3.747012207819092e-17, 1.873506103909563e-17, 1.656912369884295e-31, 1.873506103909562e-17, 1.697016749718632e-31, 1.873506103909561e-17, 1.615308081491078e-31, 1.873506103909562e-17, -1.742872858617179e-18, -0.007651392098640003, 0.02009212751996489, 0.01004606375998247, 1.361650338466342e-17, 0.01004606375998246, 1.873506103909562e-17, 0.01004606375998246, 1.668507579397383e-17, 0.01004606375998246, -0.004789343322684904, -1.652260459528918e-17, 3.337015158794737e-17, 1.668507579397385e-17, 1.631551769724017e-31, 1.668507579397384e-17, 1.615308081491077e-31, 1.668507579397383e-17, 1.665183639542917e-31, 1.668507579397384e-17, -3.249423973693037e-19, -0.007651392098640007, 0.0200921275199649, 0.01004606375998247, 1.361650338466343e-17, 0.01004606375998247, 1.873506103909563e-17, 0.01004606375998246, 1.668507579397384e-17, 0.01004606375998246, -0.004789343322684903, 3.612539513184754e-18, -0.009578686645369814, -0.004789343322684914, -4.135630511972954e-19, -0.004789343322684916, -1.742872858617175e-18, -0.004789343322684907, -3.249423973692947e-19, -0.004789343322684909, 0.009578686645369807],
+    "dof_bm_unit": [3, 6.000000000000002, 5.999999999999998, 1.027080069278844, 6.000000000000001, 1.038147635656014, 6.000000000000003, 1.078234257225623, 5.999999999999999, 2.999999999999998]
+  },
   "meta": {
     "source": "clubSandwich",
     "clubSandwich_version": "0.7.0",
     "R_version": "R version 4.5.2 (2025-10-31)",
-    "generated_at": "2026-05-18 01:50:55 UTC",
+    "generated_at": "2026-05-19 01:30:25 UTC",
     "note": "CR2 Bell-McCaffrey cluster-robust parity target for diff_diff._compute_cr2_bm"
   }
 }
diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py
index 9f52d3dc..72a4c270 100644
--- a/diff_diff/twfe.py
+++ b/diff_diff/twfe.py
@@ -36,14 +36,16 @@ class TwoWayFixedEffects(DifferenceInDifferences):
         parameter passed to `fit()`). This differs from
         DifferenceInDifferences where cluster=None means no clustering.
 
-        **Exception:** when ``vcov_type="classical"`` and
+        **Exception (one-way analytical):** when
+        ``vcov_type in {"classical", "hc2"}`` is explicit AND
         ``inference="analytical"``, the unit auto-cluster is dropped
-        because the classical family is by construction one-way only and
-        the validator rejects ``cluster_ids + classical``. The user's
-        explicit choice of the classical family wins over the TWFE default
-        in that narrow analytical-inference case. Under
-        ``inference="wild_bootstrap"`` the auto-cluster is preserved (the
-        bootstrap uses the cluster structure to resample residuals).
+        because these families are by construction one-way only and the
+        validator rejects ``cluster_ids + classical`` / ``cluster_ids +
+        hc2``. The user's explicit one-way choice wins over the TWFE
+        default. Under ``inference="wild_bootstrap"`` the auto-cluster
+        is preserved regardless of ``vcov_type`` (the bootstrap uses the
+        cluster structure to resample residuals). On ``hc2_bm`` the
+        auto-cluster is also preserved (routes to CR2-BM at unit).
     alpha : float, default=0.05
         Significance level for confidence intervals.
 
@@ -55,17 +57,22 @@ class TwoWayFixedEffects(DifferenceInDifferences):
 
     where α_i are unit fixed effects and γ_t are time fixed effects.
 
-    **HC2 / Bell-McCaffrey are not available on TWFE.** Because TWFE uses
-    within-transformation (demeaning) to absorb the fixed effects, the
-    reduced design's hat matrix is not the full FE projection; HC2 leverage
-    and CR2 Bell-McCaffrey corrections on the demeaned design would produce
-    silently-wrong small-sample SEs (FWL preserves coefficients, not the
-    hat matrix). ``vcov_type in {"hc2","hc2_bm"}`` therefore raises
-    ``NotImplementedError`` with workarounds: use ``vcov_type="hc1"`` (HC1/
-    CR1 survive FWL), or switch to ``DifferenceInDifferences(fixed_effects=
-    [...])`` where the dummies appear in the full design. Tracked in
-    ``TODO.md`` under Methodology/Correctness; also documented in
-    ``docs/methodology/REGISTRY.md``.
+    **HC2 / Bell-McCaffrey are supported via an internal full-dummy build.**
+    Because TWFE's within-transformation preserves coefficients but not the
+    hat matrix, HC2 leverage and CR2 Bell-McCaffrey corrections on the
+    demeaned design would produce wrong small-sample SEs. When
+    ``vcov_type in {"hc2","hc2_bm"}``, TWFE bypasses the within-transform
+    and builds the full-dummy design ``[intercept, treated×post,
+    covariates, unit_dummies, time_dummies]`` directly, so the leverage
+    correction and BM DOF compute on the full FE projection. Under this
+    path, ``result.coefficients``, ``result.vcov``, ``result.residuals``,
+    ``result.fitted_values``, and ``result.r_squared`` reflect the
+    full-dummy fit rather than the within-transformed reduced fit; the
+    ATT coefficient, its SE, and analytical inference are unchanged.
+    Auto-cluster-at-unit is preserved on ``hc2_bm`` (routes to CR2-BM at
+    unit) and on ``hc2`` + ``wild_bootstrap``; dropped on explicit ``hc2``
+    + ``analytical`` to match the one-way contract. Documented in
+    ``docs/methodology/REGISTRY.md`` under the scope-limitation note.
 
     **Conley spatial-HAC (``vcov_type="conley"``) is supported via the
     block-decomposed panel sandwich (matches R ``conleyreg`` with
@@ -137,34 +144,15 @@ def fit(  # type: ignore[override]
         if unit not in data.columns:
             raise ValueError(f"Unit column '{unit}' not found in data")
 
-        # Reject HC2 / HC2 + Bell-McCaffrey on TWFE (and any absorbed-FE fit).
-        # TWFE demeans outcomes and regressors via within-transformation before
-        # solving OLS, and passes only the reduced (already-residualized)
-        # regressor matrix into ``LinearRegression``. The HC2 leverage
-        # correction ``h_ii = x_i' (X'X)^{-1} x_i`` and the CR2 Bell-McCaffrey
-        # adjustment matrix ``A_g = (I - H_gg)^{-1/2}`` both depend on the
-        # FULL fixed-effects hat matrix, not the residualized one: FWL
-        # preserves coefficients but NOT the hat matrix, so applying HC2 or
-        # CR2 to the demeaned design produces the wrong leverage and the
-        # wrong Bell-McCaffrey DOF. The correct fix (compute leverage from
-        # the full absorbed projection) is deferred to a follow-up PR; until
-        # then, reject fast rather than ship silently-wrong small-sample SEs.
-        # HC1 and CR1 are unaffected (no leverage term, meat uses only the
-        # residuals which FWL preserves). Tracked in TODO.md.
-        if self.vcov_type in ("hc2", "hc2_bm"):
-            raise NotImplementedError(
-                f"TwoWayFixedEffects(vcov_type={self.vcov_type!r}) is not "
-                "yet supported: TWFE uses within-transformation (demeaning) "
-                "before OLS, and the HC2 leverage correction / CR2 Bell-"
-                "McCaffrey DOF depend on the full FE hat matrix, not the "
-                "residualized one (FWL preserves coefficients but not "
-                "leverage). Applying HC2/CR2-BM to the demeaned design "
-                "would produce silently-wrong small-sample inference. Use "
-                "vcov_type='hc1' (HC1/CR1 preserve correctly under FWL), or "
-                "switch to fixed_effects= dummies on DifferenceInDifferences "
-                "for a full-dummy design where HC2/CR2-BM are computed on "
-                "the full projection."
-            )
+        # HC2 / HC2 Bell-McCaffrey are supported via the inline
+        # full-dummy build below (see the ``use_full_dummy`` branch in the
+        # design-construction block). FWL preserves coefficients and
+        # residuals but NOT the hat matrix, so HC2 leverage and CR2-BM
+        # DOF must compute on the full FE projection; building the design
+        # with explicit unit + time dummies routes through ``solve_ols``'s
+        # full-design hat matrix. HC1/CR1 paths remain on the demeaned
+        # design (no leverage term).
+        use_full_dummy = self.vcov_type in ("hc2", "hc2_bm")
 
         # Phase 2 panel block-decomposed Conley (matches R conleyreg).
         # FWL composability: the within-transformed scores S = X_demeaned *
@@ -230,69 +218,127 @@ def fit(  # type: ignore[override]
                 "survey designs. Replicate weights provide their own variance "
                 "estimation."
             )
+        # Replicate weights + HC2 / HC2-BM is incompatible with the
+        # full-dummy auto-route: the replicate path re-demeans per
+        # replicate (re-demeaning depends on the per-replicate weight
+        # vector), which doesn't compose with the full-dummy design
+        # build. A correct implementation would need to re-build the
+        # full-dummy X per replicate and recompute the HC2 leverage,
+        # which is deferred. Mirrors the
+        # ``linalg.py::_validate_vcov_args`` ``hc2_bm + weights`` gate.
+        if _uses_replicate_twfe and self.vcov_type in ("hc2", "hc2_bm"):
+            raise NotImplementedError(
+                f"TwoWayFixedEffects(vcov_type={self.vcov_type!r}) with "
+                "replicate-weight survey designs is not yet supported: the "
+                "replicate path re-demeans per replicate, which does not "
+                "compose with the full-dummy HC2/HC2-BM build (would need "
+                "per-replicate full-dummy refit). Use vcov_type='hc1' for "
+                "replicate-weight CR1, or drop to analytical inference."
+            )
 
         # Unit-level clustering is the TWFE default when `cluster` is not
-        # explicitly provided. But the one-way ``classical`` family is by
-        # construction not cluster-robust and the validator in
-        # ``compute_robust_vcov`` rejects ``cluster_ids + vcov_type=="classical"``.
-        # When the user EXPLICITLY asks for ``classical`` analytical inference
-        # (via ``vcov_type="classical"``) and does NOT set ``cluster=``,
-        # honor that choice by disabling the auto-cluster.
+        # explicitly provided. But the one-way ``classical`` and ``hc2``
+        # families are by construction not cluster-robust and the validator
+        # in ``compute_robust_vcov`` rejects
+        # ``cluster_ids + vcov_type in ("classical","hc2")``. When the user
+        # EXPLICITLY asks for one of these analytical-one-way families AND
+        # does NOT set ``cluster=``, honor that choice by disabling the
+        # auto-cluster.
         #
         # When ``"classical"`` is IMPLICIT (from the legacy alias
         # ``robust=False``), keep the unit auto-cluster so
         # ``_resolve_effective_vcov_type`` below can remap it to ``"hc1"``
         # and preserve the historical CR1-at-unit behavior. Wild-bootstrap
-        # inference also keeps the unit auto-cluster regardless (bootstrap
-        # consumes cluster structure for resampling). ``hc2``/``hc2_bm``
-        # don't reach this block — they are rejected above.
+        # inference also keeps the unit auto-cluster regardless of
+        # ``vcov_type`` (bootstrap consumes cluster structure for
+        # resampling, independent of the analytical sandwich). ``hc2_bm``
+        # also keeps the auto-cluster (routes to CR2-BM at unit).
         if self.cluster is not None:
             cluster_var: Optional[str] = self.cluster
         elif (
-            self.vcov_type == "classical"
+            self.vcov_type in ("classical", "hc2")
             and self._vcov_type_explicit
             and self.inference == "analytical"
         ):
-            # Explicit classical + analytical inference: drop the auto-cluster
-            # so the validator doesn't reject ``cluster_ids + classical``.
+            # Explicit one-way analytical vcov: drop the auto-cluster so
+            # the validator doesn't reject ``cluster_ids`` with these
+            # families. Wild-bootstrap is exempt because the bootstrap
+            # uses the cluster structure for resampling regardless of
+            # the analytical sandwich choice.
             cluster_var = None
         else:
             cluster_var = unit
 
-        # Create treatment × post interaction from raw data before demeaning.
-        # This must be within-transformed alongside the outcome and covariates
-        # so that the regression uses demeaned regressors (FWL theorem).
+        # Create treatment × post interaction from raw data.
         data = data.copy()
         data["_treatment_post"] = data[treatment] * data[time]
 
-        # Demean outcome, covariates, AND interaction in a single pass
-        all_vars = [outcome] + (covariates or []) + ["_treatment_post"]
-        data_demeaned = _within_transform_util(
-            data,
-            all_vars,
-            unit,
-            time,
-            suffix="_demeaned",
-            weights=survey_weights,
-        )
-
-        # Extract variables for regression
-        y = data_demeaned[f"{outcome}_demeaned"].values
-        X_list = [data_demeaned["_treatment_post_demeaned"].values]
-
-        if covariates:
-            for cov in covariates:
-                X_list.append(data_demeaned[f"{cov}_demeaned"].values)
-
-        X = np.column_stack([np.ones(len(y))] + X_list)
-
-        # ATT is the coefficient on treatment_post (index 1)
-        att_idx = 1
-
-        # Degrees of freedom adjustment for fixed effects
         n_units = data[unit].nunique()
         n_times = data[time].nunique()
-        df_adjustment = n_units + n_times - 2
+
+        if use_full_dummy:
+            # HC2 / HC2-BM full-dummy build: bypass the within-transform
+            # and stack [intercept, treated×post, covariates, unit_dummies,
+            # time_dummies] explicitly. FWL preserves the ATT coefficient
+            # and residuals, but NOT the hat matrix — so the leverage
+            # correction `h_ii = x_i' (X'X)^{-1} x_i` and the CR2 Bell-
+            # McCaffrey adjustment matrix `A_g = (I - H_gg)^{-1/2}` must
+            # be computed on the full FE projection. Pivoted-QR rank
+            # detection in `solve_ols` cleanly drops any collinear FE
+            # dummies (e.g. an always-treated unit × treatment_post
+            # collinearity) without poisoning the ATT.
+            y = data[outcome].values.astype(np.float64)
+            cov_arrs = [data[c].values.astype(np.float64) for c in (covariates or [])]
+            unit_dummies_df = pd.get_dummies(data[unit], prefix=f"_fe_{unit}", drop_first=True)
+            time_dummies_df = pd.get_dummies(data[time], prefix=f"_fe_{time}", drop_first=True)
+            unit_dummies = unit_dummies_df.values.astype(np.float64)
+            time_dummies = time_dummies_df.values.astype(np.float64)
+            X = np.column_stack(
+                [np.ones(len(data)), data["_treatment_post"].values]
+                + cov_arrs
+                + [unit_dummies, time_dummies]
+            )
+            # FEs are now in X explicitly; solve_ols's n - k accounting
+            # already subtracts them, so the extra unit + time DOF
+            # adjustment used on the within-transform path would
+            # double-count.
+            df_adjustment = 0
+            # var_names parallels the X columns so the downstream
+            # `coefficients` dict can mirror the full-dummy vcov shape
+            # (matching the MPD invariant
+            # ``len(result.coefficients) == result.vcov.shape[0]``).
+            _twfe_var_names: Optional[List[str]] = (
+                ["const", "ATT"]
+                + list(covariates or [])
+                + list(unit_dummies_df.columns)
+                + list(time_dummies_df.columns)
+            )
+        else:
+            # Default within-transform path (HC1 / classical / Conley):
+            # demean outcome, covariates, AND interaction in a single pass
+            # so the regression uses demeaned regressors (FWL theorem).
+            all_vars = [outcome] + (covariates or []) + ["_treatment_post"]
+            data_demeaned = _within_transform_util(
+                data,
+                all_vars,
+                unit,
+                time,
+                suffix="_demeaned",
+                weights=survey_weights,
+            )
+            y = data_demeaned[f"{outcome}_demeaned"].values
+            X_list = [data_demeaned["_treatment_post_demeaned"].values]
+            if covariates:
+                for cov in covariates:
+                    X_list.append(data_demeaned[f"{cov}_demeaned"].values)
+            X = np.column_stack([np.ones(len(y))] + X_list)
+            df_adjustment = n_units + n_times - 2
+            # Within-transform path: FE dummies are NOT in X (they're absorbed
+            # by demeaning). var_names cover the visible columns only.
+            _twfe_var_names = ["const", "ATT"] + list(covariates or [])
+
+        # ATT is the coefficient on treatment_post (index 1) on both branches.
+        att_idx = 1
 
         # Always use LinearRegression for initial fit (unified code path)
         # For wild bootstrap, we don't need cluster SEs from the initial fit.
@@ -571,6 +617,21 @@ def _refit_twfe(w_r):
         else:
             _twfe_cluster_label = unit
 
+        # Build the coefficients dict mirroring the actual X columns. On the
+        # full-dummy path this surfaces the FE-dummy entries alongside the ATT;
+        # on the within-transform path it only carries the visible
+        # [const, ATT, covariates] columns. The "ATT" name at index 1 is
+        # preserved as the ATT key on both paths, so existing
+        # `result.coefficients["ATT"]` consumers continue to work. The
+        # invariant ``len(coefficients) == vcov.shape[0]`` is now upheld on the
+        # full-dummy path (matches the MPD absorb auto-route invariant
+        # checked at tests/test_estimators_vcov_type.py:1611).
+        coef_array = np.asarray(reg.coefficients_)
+        _coefficients_dict: dict = (
+            {name: float(c) for name, c in zip(_twfe_var_names, coef_array)}
+            if _twfe_var_names is not None
+            else {"ATT": float(att)}
+        )
         self.results_ = DiDResults(
             att=att,
             se=se,
@@ -581,7 +642,7 @@ def _refit_twfe(w_r):
             n_treated=n_treated,
             n_control=n_control,
             alpha=self.alpha,
-            coefficients={"ATT": float(att)},
+            coefficients=_coefficients_dict,
             vcov=vcov,
             residuals=residuals,
             fitted_values=fitted,
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 3efdcdbc..97a52af2 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -2559,9 +2559,9 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 - [x] Phase 1a: HC2 + Bell-McCaffrey DOF correction in `diff_diff/linalg.py` via `vcov_type="hc2_bm"` enum (both one-way and CR2 cluster-robust with Imbens-Kolesar / Pustejovsky-Tipton Satterthwaite DOF). Weighted cluster CR2 raises `NotImplementedError` and is tracked as Phase 2+ in `TODO.md`.
     - **Note (scope limitation on absorbed FE):** HC2 and HC2 + Bell-McCaffrey on within-transformed designs still depend on the FULL FE hat matrix because FWL preserves coefficients and residuals but NOT the hat matrix: `h_ii = x_i' (X'X)^{-1} x_i` on the reduced design is not the diagonal of the full FE projection, and CR2's block adjustment `A_g = (I - H_gg)^{-1/2}` likewise depends on the full cluster-block hat matrix. The status across the three estimators that previously rejected this combination:
         - **`DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (auto-route).** When the user pairs `absorb=` with HC2 / HC2-BM, `DiD.fit()` internally promotes the absorb columns to `fixed_effects=` so the existing full-dummy code path computes the algebraically correct vcov from the full FE projection. Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=..., type="CR2")` (singleton-cluster CR2 trick for one-way HC2-BM Satterthwaite DOF; PT2018 §3.3 unweighted CR2 algebra). **User-visible surface change**: under the auto-route, the entire `DiDResults` (coefficients, vcov, residuals, fitted_values, r_squared) reflect the full-dummy fit rather than the within-transformed fit — the FE-dummy entries are included in `result.coefficients` / `result.vcov`, `r_squared` is computed on the un-demeaned outcome, and `residuals` / `fitted_values` are on the original scale. `result.att` is unaffected (FWL-equivalent). HC1/CR1 paths on `absorb=` are unchanged (no leverage term). **Survey-design scope**: when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization / replicate weights) takes precedence over the analytical HC2/HC2-BM sandwich; the auto-route only changes the FE handling (removing the prior reject) and does not redirect to the analytical small-sample sandwich on survey fits.
-        - **`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` — still rejects.** TWFE is a standalone class with no `fixed_effects=` equivalent path, so the same auto-route surgery used for DiD-absorb and MPD-absorb is not directly applicable; lifting requires building the full-dummy design inline or refactoring TWFE to delegate to DiD. Tracked as a follow-up in `TODO.md`.
+        - **`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (inline full-dummy build).** TWFE has no `absorb=` / `fixed_effects=` parameter (the unit + time FE are baked into the estimator's identity), so the same parameter-swap auto-route used for DiD-absorb / MPD-absorb is not directly applicable. Instead, `TwoWayFixedEffects.fit()` bypasses the within-transform when `vcov_type in {"hc2","hc2_bm"}` and builds the full-dummy design `[intercept, treated×post, covariates, factor(unit), factor(time)]` explicitly, then runs OLS through the standard `solve_ols` path so the leverage correction and BM DOF compute on the full FE projection. Verified at atol=1e-10 vs `lm(y ~ treat_post + factor(unit) + factor(post)) + sandwich::vcovHC(type="HC2")` for HC2 and vs `clubSandwich::vcovCR(cluster=seq_len(n), type="CR2")` for the singleton-cluster one-way HC2-BM Satterthwaite DOF; vs `vcovCR(cluster=unit, type="CR2")` for the auto-cluster CR2-BM path. **Auto-cluster default:** TWFE's unit auto-cluster is preserved on `hc2_bm` (routes to CR2-BM at unit) and on `hc2 + wild_bootstrap` (the bootstrap consumes the cluster structure for resampling regardless of the analytical sandwich choice); dropped on explicit `hc2 + analytical` to match the one-way contract (the linalg validator rejects `hc2 + cluster_ids`). `hc2_bm + analytical` with no explicit cluster yields the auto-cluster CR2-BM path. **User-visible surface change** (matches the DiD-absorb / MPD-absorb disclosure above): under HC2 / HC2-BM, `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, and `result.r_squared` reflect the full-dummy fit rather than the within-transformed reduced fit (FE-dummy entries are included, `r_squared` is computed on the un-demeaned outcome, residuals/fitted are on the original scale). `result.att`, its SE, and analytical inference are unchanged (FWL-equivalent). HC1 / CR1 / Conley / classical paths remain on the within-transform (no leverage term in those vcov families). **Survey-design scope** (mirrors the DiD-absorb auto-route contract above): when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization for analytical-weight designs, or replicate-weight variance for BRR/Fay/JK1/JKn/SDR) takes precedence over the analytical HC2/HC2-BM sandwich; the full-dummy build only changes the FE handling (removing the prior reject) and does not redirect to the analytical small-sample sandwich on survey fits. **Replicate-weight survey designs** are blocked at the estimator level: `vcov_type in {"hc2","hc2_bm"}` + replicate weights raises `NotImplementedError` because the replicate refit path re-demeans per replicate, which doesn't compose with the full-dummy build (would require per-replicate full-dummy refit); workaround: use `vcov_type="hc1"` for replicate-weight CR1. `hc2_bm + weights` remains rejected upstream by the linalg validator (same gate as Gates 4-5 — weighted CR2 variants).
         - **`MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (auto-route).** Same auto-route pattern as `DifferenceInDifferences`: `MultiPeriodDiD.fit()` internally promotes the absorb columns to `fixed_effects=` for HC2 / HC2-BM callers, so the existing full-dummy code path computes the algebraically correct vcov from the full FE projection on the event-study design (`treated + period_X dummies + treated:period_X interactions + factor(unit)`). Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=1:n, type="CR2")` on a 5-cohort × 5-period event-study fixture; the parity target is a per-period interaction `treated:period_X` because MPD requires the `treated` column to be a time-invariant ever-treated indicator, which lies in the span of the intercept and the post-auto-route unit FE dummies (under `pd.get_dummies(..., drop_first=True)` the dropped reference unit is implicit in the intercept, so the exact alias relation depends on the omitted FE category — it is NOT simply "the sum of treated-cohort unit dummies"). `solve_ols` drops one column from the collinear set under R-style rank-deficiency handling; in the shipped parity fixture (4 ever-treated cohorts of 5 units + 1 never-treated cohort of 5 units) it drops a unit dummy from the never-treated cohort (`unit_25`) and the `treated` main effect remains finite, but the specific column that gets NaN'd is pivot-order and dummy-coding dependent. Either way, the slope coefficients (`treated:period_X`) and the post-period-average `avg_att` are identified and invariant to which column was dropped. Same `MultiPeriodDiDResults` surface change as DiD: `vcov`, `residuals`, `fitted_values`, `r_squared`, and `coefficients` reflect the full-dummy fit, with `period_effects[t].effect` / `.se` / `.p_value` / `.conf_int` invariant by FWL. HC1/CR1 paths on `absorb=` are unchanged (no leverage term). Same survey-design scope as DiD: replicate-weight variance routes through the standard `compute_replicate_vcov` path on the fixed full-dummy design rather than the per-replicate refit branch (which targets the demeaning path); since the auto-routed design does not depend on replicate weights, no refit is needed. **Redundant time-FE skip:** when the routed (or directly-supplied) `fixed_effects` list contains the `time` column, MPD silently skips emitting `<time>_<X>` dummies for that entry because the design already absorbs time via the non-reference period dummies. Without the skip, those blocks collide on dummy names and `MultiPeriodDiDResults.coefficients` (built as `{name: coef for name, coef in zip(var_names, coefficients)}`) would silently drop duplicates, breaking the coefficients-vs-vcov alignment that downstream consumers (HonestDiD sub-VCV extraction, BusinessReport, etc.) rely on. The skip applies to BOTH the new `absorb=` auto-route AND the pre-existing `fixed_effects=[<time_col>]` invocation (pre-PR, `fixed_effects=["unit", time]` produced a dict with `len < vcov.shape[0]` and NaN values overwriting the real event-study period coefficients).
-        - Workarounds for the still-rejecting paths: use `vcov_type="hc1"` (HC1/CR1 have no leverage term and survive FWL), or switch to `fixed_effects=` dummies so the hat matrix is computed on the full design.
+        - All three previously-rejecting absorbed-FE paths are now SUPPORTED. Weighted-CR2 variants (Gates 4-5: `vcov_type="hc2_bm" + weights`; weighted one-way HC2-BM) remain blocked at the `linalg.py::_validate_vcov_args` level pending the clubSandwich WLS algebra derivation.
 - [x] Phase 1a: `vcov_type` enum threaded through `DifferenceInDifferences` (`MultiPeriodDiD`, `TwoWayFixedEffects` inherit); `robust=True` <=> `vcov_type="hc1"`, `robust=False` <=> `vcov_type="classical"`. Conflict detection at `__init__`. Results summary prints the variance-family label.
     - **Note (`MultiPeriodDiD(cluster=..., vcov_type="hc2_bm")` — SUPPORTED via cluster-aware contrast DOF):** the scalar-coefficient `DifferenceInDifferences` path uses `_compute_cr2_bm`'s per-coefficient Satterthwaite DOF directly for the single-ATT contrast, but `MultiPeriodDiD` also reports a post-period-average ATT constructed as a *contrast* of the event-study coefficients (`avg_att = (1/n_post) Σ_{t ≥ t_treat} β_t`). Pre-PR the combination raised `NotImplementedError` because the cluster-aware CR2 Bell-McCaffrey Satterthwaite DOF for an arbitrary linear combination was not implemented — only the per-coefficient case existed. The new `_compute_cr2_bm_contrast_dof` helper in `diff_diff/linalg.py` generalizes the per-coefficient loop to arbitrary `(k, m)` contrast matrices using the identical Pustejovsky-Tipton 2018 Section 4 algebra (`q = X bread_inv c`, `omega_g = A_g X_g bread_inv c`, `DOF = trace(B)² / trace(B²)`), and `_compute_cr2_bm` is refactored to call it with `contrasts=eye(k)` so the per-coefficient case is recovered at machine precision (atol=1e-10, see refactor regression in `tests/test_linalg_hc2_bm.py::TestCR2BMContrastDOF`). `MultiPeriodDiD.fit()` extends its existing avg_att DOF block to branch on cluster presence: one-way `_compute_bm_dof_from_contrasts` for `effective_cluster_ids is None`, cluster-aware `_compute_cr2_bm_contrast_dof` otherwise. R parity verified against clubSandwich's `Wald_test(constraints=matrix(c, 1), test="HTZ")$df_denom` at atol=1e-10 on the `mpd_clustered_avg_att_dof` fixture in `benchmarks/data/clubsandwich_cr2_golden.json` (Wald_test's HTZ on a 1-row constraint matrix yields the Satterthwaite t-test DOF). Cluster IDs are per-observation length `n` and are NOT subscripted by the rank-deficient column-drop mask `_kept` — the helper accepts the full `effective_cluster_ids` array. Weighted CR2-BM (`survey_design=` paths) remains a separate gate.
 - [x] Phase 1a: `clubSandwich::vcovCR(..., type="CR2")` parity harness committed: R script at `benchmarks/R/generate_clubsandwich_golden.R` plus the authoritative R-generated JSON at `benchmarks/data/clubsandwich_cr2_golden.json` (`"source": "clubSandwich"`, with `clubSandwich_version`, `R_version`, and `generated_at` captured in `meta` for forensic traceability). The parity test at `tests/test_linalg_hc2_bm.py::TestCR2BMCluster::test_cr2_parity_with_golden` runs at 1e-6 tolerance and passes at ≤ 7.1e-15 across all three datasets — Python's `_compute_cr2_bm` matches clubSandwich at machine precision.
diff --git a/tests/test_estimators_vcov_type.py b/tests/test_estimators_vcov_type.py
index 691b051b..01ae1730 100644
--- a/tests/test_estimators_vcov_type.py
+++ b/tests/test_estimators_vcov_type.py
@@ -588,9 +588,7 @@ def test_multi_period_cluster_hc2_bm_avg_att_uses_clubsandwich_dof(self):
         # last-half-of-periods rule and computes avg_att over [3, 4] on this
         # 4-period panel, but the R fixture's `c_avg` is over [2, 3, 4] —
         # the DOFs happen to coincide here but the avg_att estimands differ.
-        post_periods = [
-            int(name.rsplit("_", 1)[1]) for name in d["post_interaction_names"]
-        ]
+        post_periods = [int(name.rsplit("_", 1)[1]) for name in d["post_interaction_names"]]
         res = MultiPeriodDiD(vcov_type="hc2_bm", cluster="unit").fit(
             data,
             outcome="y",
@@ -656,28 +654,285 @@ def test_multi_period_fit_honors_hc2_bm(self):
         ci_width = r_hc2bm.avg_conf_int[1] - r_hc2bm.avg_conf_int[0]
         assert ci_width > 0
 
-    def test_twfe_rejects_hc2_and_hc2_bm(self):
-        """TWFE rejects vcov_type in {hc2, hc2_bm} because it uses within-
-        transformation. HC2 leverage on the reduced design is not the hat
-        matrix of the full FE projection (FWL preserves coefficients, not
-        the hat matrix), so applying HC2/CR2-BM to the demeaned regressors
-        would silently ship wrong small-sample SEs. The fit must raise with
-        a pointer to HC1 (which has no leverage term and survives FWL) or
-        fixed_effects= dummies as workarounds.
+    def test_twfe_hc2_and_hc2_bm_produce_finite_inference(self):
+        """TWFE with vcov_type in {hc2, hc2_bm} now produces finite inference
+        via the inline full-dummy build (Gate 1 lift).
+
+        FWL preserves coefficients and residuals but NOT the hat matrix, so
+        HC2 leverage and CR2-BM DOF must compute on the full FE projection.
+        TWFE.fit bypasses the within-transform on these vcov_types and stacks
+        [intercept, treated*post, covariates, unit_dummies, time_dummies]
+        explicitly.
         """
         data = _make_did_panel(n_units=20)
-        for bad in ("hc2", "hc2_bm"):
-            with pytest.raises(
-                NotImplementedError,
-                match="TwoWayFixedEffects.*not yet supported",
-            ):
-                TwoWayFixedEffects(vcov_type=bad).fit(
-                    data,
-                    outcome="y",
-                    treatment="treated",
-                    time="time",
-                    unit="unit",
-                )
+        for vcov in ("hc2", "hc2_bm"):
+            res = TwoWayFixedEffects(vcov_type=vcov).fit(
+                data,
+                outcome="y",
+                treatment="treated",
+                time="time",
+                unit="unit",
+            )
+            assert np.isfinite(res.att), f"{vcov}: ATT not finite"
+            assert np.isfinite(res.se), f"{vcov}: SE not finite"
+            assert res.se > 0, f"{vcov}: SE not positive"
+            assert np.isfinite(res.p_value), f"{vcov}: p-value not finite"
+            ci = res.conf_int
+            assert np.isfinite(ci[0]) and np.isfinite(ci[1]), f"{vcov}: CI not finite"
+
+    def test_twfe_hc2_matches_did_fixed_effects_full_dummy(self):
+        """TWFE(vcov_type='hc2') matches DifferenceInDifferences with
+        fixed_effects=[unit, time] at atol=1e-12 (same full-dummy algebra).
+
+        Compares only .att and .se — the full .coefficients dict may differ
+        because pd.get_dummies(drop_first=True) reference-category ordering
+        is not guaranteed identical between TWFE's inline build and DiD's
+        fixed_effects= branch.
+        """
+        data = _make_did_panel(n_units=20)
+        res_twfe = TwoWayFixedEffects(vcov_type="hc2").fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        res_did = DifferenceInDifferences(vcov_type="hc2").fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+        )
+        np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
+
+    def test_twfe_hc2_bm_matches_did_fixed_effects_full_dummy(self):
+        """Same refactor-regression check as the hc2 variant, for hc2_bm.
+
+        Note: TWFE's hc2_bm path auto-clusters at unit (preserved), while DiD
+        does NOT auto-cluster — so we explicitly pass cluster='unit' to DiD
+        to align the inference paths.
+        """
+        data = _make_did_panel(n_units=20)
+        res_twfe = TwoWayFixedEffects(vcov_type="hc2_bm").fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        res_did = DifferenceInDifferences(vcov_type="hc2_bm", cluster="unit").fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+        )
+        np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
+
+    def test_twfe_hc2_bm_auto_clusters_at_unit(self):
+        """TWFE(vcov_type='hc2_bm') with no explicit cluster routes to CR2-BM
+        at unit (auto-cluster default preserved on the hc2_bm path).
+
+        Two-pronged verification, both required to distinguish CR2-BM-at-unit
+        from one-way HC2-BM:
+
+        (1) **Equivalence check against a reference path**:
+            DifferenceInDifferences(vcov_type='hc2_bm', cluster='unit',
+            fixed_effects=[unit, time]). Both paths share the full-dummy
+            design and the same CR2-BM Satterthwaite DOF at unit, so ATT
+            and SE match to within atol=1e-12.
+
+        (2) **Inequality check against one-way HC2-BM on the same X**:
+            on the 20-unit × 4-period fixture built below, CR2-BM-at-unit and
+            one-way HC2-BM produce numerically different SEs (ratio ~1.22).
+            Without this check, the test would pass even if TWFE silently
+            fell through to one-way HC2-BM (on a 2-period panel the two
+            paths happen to coincide numerically, defeating the equivalence
+            check above). The 4-period fixture separates them.
+        """
+        # Multi-period panel: cluster blocks of size 4 do NOT coincide with
+        # the unit FE structure in the same way 2-obs clusters would.
+        rng = np.random.default_rng(20260420)
+        n_units, n_periods = 20, 4
+        rows = []
+        for i in range(n_units):
+            treated = int(i >= n_units // 2)
+            for t in range(n_periods):
+                post = int(t >= n_periods // 2)
+                y = rng.normal(0.0, 1.0) + 0.5 * treated + 1.0 * treated * post
+                rows.append({"unit": i, "time": post, "treated": treated, "y": y})
+        data = pd.DataFrame(rows)
+
+        res_twfe = TwoWayFixedEffects(vcov_type="hc2_bm").fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        # Auto-cluster fires; result reports unit as the cluster name.
+        assert res_twfe.cluster_name == "unit"
+        assert np.isfinite(res_twfe.se) and res_twfe.se > 0
+
+        # (1) Reference path: explicit CR2-BM at unit via DiD's fixed_effects=
+        # branch. TWFE's auto-cluster should land on the same algebra at
+        # machine precision.
+        res_did = DifferenceInDifferences(vcov_type="hc2_bm", cluster="unit").fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+        )
+        np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
+
+        # (2) Sanity: the auto-clustered SE must NOT equal the one-way
+        # HC2-BM SE on the same full-dummy X. If it did, a regression where
+        # TWFE silently dropped the auto-cluster (one-way fall-through) would
+        # slip through the equivalence check above.
+        from diff_diff.linalg import solve_ols
+
+        df_local = data.copy()
+        df_local["_tp"] = df_local["treated"] * df_local["time"]
+        unit_dummies = pd.get_dummies(
+            df_local["unit"], prefix="_fe_unit", drop_first=True
+        ).values.astype(np.float64)
+        time_dummies = pd.get_dummies(
+            df_local["time"], prefix="_fe_time", drop_first=True
+        ).values.astype(np.float64)
+        X = np.column_stack(
+            [
+                np.ones(len(df_local)),
+                df_local["_tp"].values.astype(np.float64),
+                unit_dummies,
+                time_dummies,
+            ]
+        )
+        y = df_local["y"].values.astype(np.float64)
+        _, _, vcov_one_way = solve_ols(X, y, vcov_type="hc2_bm")
+        se_one_way_att = float(np.sqrt(vcov_one_way[1, 1]))
+        # Use a meaningful tolerance: on this fixture the two SEs differ by
+        # ~22%; require at least 1% gap to lock in the distinction.
+        assert abs(res_twfe.se - se_one_way_att) / se_one_way_att > 0.01, (
+            f"auto-cluster CR2-BM SE ({res_twfe.se}) coincides with one-way "
+            f"HC2-BM SE ({se_one_way_att}); the test cannot distinguish "
+            "the two paths on this fixture, so a regression where TWFE "
+            "silently drops the unit cluster would not be caught."
+        )
+
+    def test_twfe_hc2_explicit_no_auto_cluster_analytical(self):
+        """Explicit `vcov_type='hc2'` + analytical inference drops the unit
+        auto-cluster (one-way HC2; the linalg validator rejects hc2 + cluster).
+        """
+        data = _make_did_panel(n_units=20)
+        res = TwoWayFixedEffects(vcov_type="hc2", inference="analytical").fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        assert np.isfinite(res.att)
+        assert np.isfinite(res.se)
+        # No auto-cluster on explicit one-way hc2 + analytical.
+        assert res.cluster_name is None
+
+    def test_twfe_hc2_wild_bootstrap_keeps_auto_cluster(self):
+        """Wild-bootstrap inference on TWFE(vcov_type='hc2') must keep the
+        unit auto-cluster (bootstrap resampling uses the cluster structure).
+
+        Regression for the auto-cluster sub-guard: omitting the
+        `inference == "analytical"` companion would crash wild_bootstrap
+        with a `np.unique(None)` TypeError.
+        """
+        data = _make_did_panel(n_units=20)
+        res = TwoWayFixedEffects(
+            vcov_type="hc2",
+            inference="wild_bootstrap",
+            n_bootstrap=50,
+            seed=1,
+        ).fit(data, outcome="y", treatment="treated", time="time", unit="unit")
+        assert np.isfinite(res.se)
+        assert res.se > 0
+        # Bootstrap consumed unit-level clusters.
+        assert res.n_clusters == 20
+
+    @pytest.mark.parametrize("vcov", ["hc2", "hc2_bm"])
+    def test_twfe_rejects_replicate_weights_under_hc2(self, vcov):
+        """TWFE + hc2/hc2_bm + replicate-weight survey design raises
+        NotImplementedError.
+
+        The replicate path re-demeans per replicate (re-demeaning depends
+        on the per-replicate weight vector), which doesn't compose with
+        the full-dummy build. Documented scope limit; tracked in TODO.md.
+        """
+        data = _make_did_panel(n_units=20).copy()
+        # Attach full-sample weight + 4 BRR replicate-weight columns.
+        rng = np.random.default_rng(0)
+        data["weight"] = 1.0
+        rep_cols = [f"rep{r}" for r in range(4)]
+        for col in rep_cols:
+            data[col] = rng.choice([0.5, 1.5], size=len(data))
+        sd = SurveyDesign(
+            weights="weight",
+            replicate_weights=rep_cols,
+            replicate_method="BRR",
+            weight_type="pweight",
+        )
+        with pytest.raises(
+            NotImplementedError,
+            match=r"replicate-weight.*not yet supported",
+        ):
+            TwoWayFixedEffects(vcov_type=vcov).fit(
+                data,
+                outcome="y",
+                treatment="treated",
+                time="time",
+                unit="unit",
+                survey_design=sd,
+            )
+
+    def test_twfe_hc2_always_treated_unit_finite_att(self):
+        """Always-treated unit (D=1 in all periods) doesn't poison the ATT
+        on the full-dummy HC2 path.
+
+        The footgun flagged in the plan was theoretical (an always-treated
+        unit's treat_post entries could be collinear with the unit dummy). In
+        practice, on a 2-period DiD with at least one switching cohort, the
+        design retains full rank. Pivoted QR in solve_ols would cleanly drop
+        any column that did become linearly dependent on a more degenerate design.
+        """
+        data = _make_did_panel(n_units=20)
+        # Make unit 0 always-treated (treated=1 in both periods).
+        data = data.copy()
+        data.loc[data["unit"] == 0, "treated"] = 1
+        # No need to recompute treat * time for the always-treated rows:
+        # TWFE.fit builds _treatment_post internally from data[treatment] *
+        # data[time], so only data["treated"] and data["time"] must be right.
+        res = TwoWayFixedEffects(vcov_type="hc2_bm").fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        assert np.isfinite(res.att)
+        assert np.isfinite(res.se)
+        assert res.se > 0
+
+    @pytest.mark.parametrize("vcov", ["hc2", "hc2_bm"])
+    def test_twfe_hc2_coefficients_align_with_vcov(self, vcov):
+        """Under the full-dummy HC2/HC2-BM path, `result.coefficients` must
+        carry one entry per `result.vcov` column (no duplicates, no
+        collapsing).
+
+        Mirrors the MPD invariant at
+        ``test_absorb_hc2_bm_coefficients_align_with_vcov`` (line 1611)
+        and the REGISTRY/CHANGELOG promise that the full-dummy fit
+        exposes the FE-dummy entries alongside the ATT key.
+        """
+        from collections import Counter
+
+        data = _make_did_panel(n_units=20)
+        res = TwoWayFixedEffects(vcov_type=vcov).fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        assert res.vcov is not None
+        assert res.vcov.shape[0] == res.vcov.shape[1]
+        assert len(res.coefficients) == res.vcov.shape[0], (
+            f"[{vcov}] coefficients dict length ({len(res.coefficients)}) "
+            f"must match vcov dimension ({res.vcov.shape[0]}); duplicate var_names "
+            "or hardcoded {'ATT': ...} would break this invariant."
+        )
+        dups = {k: v for k, v in Counter(res.coefficients.keys()).items() if v > 1}
+        assert not dups, f"[{vcov}] duplicate names in coefficients: {dups}"
+        # Backward-compat: ATT key still resolves to the ATT coefficient.
+        assert "ATT" in res.coefficients
+        assert np.isclose(res.coefficients["ATT"], res.att, atol=1e-12)
 
     def test_twfe_results_record_cluster_name(self):
         """TWFE results should label the auto-clustered SE with the unit column."""
diff --git a/tests/test_methodology_twfe.py b/tests/test_methodology_twfe.py
index 631d4658..da94224f 100644
--- a/tests/test_methodology_twfe.py
+++ b/tests/test_methodology_twfe.py
@@ -27,7 +27,6 @@
 from diff_diff.linalg import LinearRegression
 from diff_diff.utils import within_transform
 
-
 # =============================================================================
 # R Availability Fixtures
 # =============================================================================
@@ -99,13 +98,15 @@ def generate_twfe_panel(
                 y += treatment_effect
             y += np.random.normal(0, noise_sd)
 
-            data.append({
-                "unit": unit,
-                "period": period,
-                "treated": int(is_treated),
-                "post": post,
-                "outcome": y,
-            })
+            data.append(
+                {
+                    "unit": unit,
+                    "period": period,
+                    "treated": int(is_treated),
+                    "post": post,
+                    "outcome": y,
+                }
+            )
 
     return pd.DataFrame(data)
 
@@ -117,18 +118,24 @@ def generate_hand_calculable_panel() -> pd.DataFrame:
     4 units (2 treated, 2 control) × 2 periods = 8 observations.
     No noise, so ATT is exactly 3.0.
     """
-    return pd.DataFrame({
-        "unit": [0, 0, 1, 1, 2, 2, 3, 3],
-        "period": [0, 1, 0, 1, 0, 1, 0, 1],
-        "treated": [1, 1, 1, 1, 0, 0, 0, 0],
-        "post": [0, 1, 0, 1, 0, 1, 0, 1],
-        "outcome": [
-            10.0, 15.0,  # Unit 0 (treated): pre=10, post=15 (diff=5)
-            12.0, 17.0,  # Unit 1 (treated): pre=12, post=17 (diff=5)
-            8.0, 10.0,   # Unit 2 (control): pre=8, post=10 (diff=2)
-            6.0, 8.0,    # Unit 3 (control): pre=6, post=8 (diff=2)
-        ],
-    })
+    return pd.DataFrame(
+        {
+            "unit": [0, 0, 1, 1, 2, 2, 3, 3],
+            "period": [0, 1, 0, 1, 0, 1, 0, 1],
+            "treated": [1, 1, 1, 1, 0, 0, 0, 0],
+            "post": [0, 1, 0, 1, 0, 1, 0, 1],
+            "outcome": [
+                10.0,
+                15.0,  # Unit 0 (treated): pre=10, post=15 (diff=5)
+                12.0,
+                17.0,  # Unit 1 (treated): pre=12, post=17 (diff=5)
+                8.0,
+                10.0,  # Unit 2 (control): pre=8, post=10 (diff=2)
+                6.0,
+                8.0,  # Unit 3 (control): pre=6, post=8 (diff=2)
+            ],
+        }
+    )
     # ATT = (mean treated diff) - (mean control diff) = 5.0 - 2.0 = 3.0
 
 
@@ -187,9 +194,7 @@ def test_twfe_att_matches_hand_calculated_demeaned_ols(self):
 
         # Run TWFE
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         # Manual demeaned OLS: demean both y and the interaction term
         data_with_tp = data.copy()
@@ -217,9 +222,7 @@ def test_twfe_att_matches_basic_did_for_two_period_design(self):
 
         # Basic DiD
         did = DifferenceInDifferences(robust=True, cluster="unit")
-        did_results = did.fit(
-            data, outcome="outcome", treatment="treated", time="post"
-        )
+        did_results = did.fit(data, outcome="outcome", treatment="treated", time="post")
 
         np.testing.assert_allclose(twfe_results.att, did_results.att, rtol=1e-10)
 
@@ -242,8 +245,13 @@ def test_within_transform_weighted_warns_on_nonconvergence(self):
 
         with pytest.warns(UserWarning, match="did not converge"):
             within_transform(
-                data, ["outcome"], "unit", "period",
-                weights=weights, max_iter=1, tol=1e-15,
+                data,
+                ["outcome"],
+                "unit",
+                "period",
+                weights=weights,
+                max_iter=1,
+                tol=1e-15,
             )
 
     def test_within_transform_weighted_no_warning_on_convergence(self):
@@ -272,7 +280,7 @@ def _run_r_feols_twfe(data_path: str, covariates=None) -> Dict[str, Any]:
     else:
         formula = "outcome ~ treated:post | unit + post"
 
-    r_script = f'''
+    r_script = f"""
     suppressMessages(library(fixest))
     suppressMessages(library(jsonlite))
 
@@ -309,7 +317,7 @@ def _run_r_feols_twfe(data_path: str, covariates=None) -> Dict[str, Any]:
     )
 
     cat(toJSON(output, pretty = TRUE, digits = 15))
-    '''
+    """
 
     result = subprocess.run(
         ["Rscript", "-e", r_script],
@@ -351,13 +359,15 @@ def r_benchmark_panel_data(tmp_path_factory):
                 y += 3.0
             y += np.random.normal(0, 0.5)
 
-            data.append({
-                "unit": unit,
-                "period": period,
-                "treated": int(is_treated),
-                "post": post,
-                "outcome": y,
-            })
+            data.append(
+                {
+                    "unit": unit,
+                    "period": period,
+                    "treated": int(is_treated),
+                    "post": post,
+                    "outcome": y,
+                }
+            )
 
     df = pd.DataFrame(data)
     tmp_dir = tmp_path_factory.mktemp("r_benchmark")
@@ -388,14 +398,16 @@ def r_benchmark_panel_data_with_covariate(tmp_path_factory):
                 y += 3.0
             y += np.random.normal(0, 0.5)
 
-            data.append({
-                "unit": unit,
-                "period": period,
-                "treated": int(is_treated),
-                "post": post,
-                "outcome": y,
-                "x1": x1,
-            })
+            data.append(
+                {
+                    "unit": unit,
+                    "period": period,
+                    "treated": int(is_treated),
+                    "post": post,
+                    "outcome": y,
+                    "x1": x1,
+                }
+            )
 
     df = pd.DataFrame(data)
     tmp_dir = tmp_path_factory.mktemp("r_benchmark_cov")
@@ -445,7 +457,9 @@ def test_att_matches_r_twfe(self, r_twfe_results, r_benchmark_panel_data):
         py_results = self._run_python_twfe(data)
 
         np.testing.assert_allclose(
-            py_results.att, r_twfe_results["att"], rtol=1e-3,
+            py_results.att,
+            r_twfe_results["att"],
+            rtol=1e-3,
             err_msg=f"ATT mismatch: Python={py_results.att:.6f}, R={r_twfe_results['att']:.6f}",
         )
 
@@ -456,7 +470,9 @@ def test_se_matches_r_twfe(self, r_twfe_results, r_benchmark_panel_data):
         py_results = self._run_python_twfe(data)
 
         np.testing.assert_allclose(
-            py_results.se, r_twfe_results["se"], rtol=0.01,
+            py_results.se,
+            r_twfe_results["se"],
+            rtol=0.01,
             err_msg=f"SE mismatch: Python={py_results.se:.6f}, R={r_twfe_results['se']:.6f}",
         )
 
@@ -467,7 +483,9 @@ def test_pvalue_matches_r_twfe(self, r_twfe_results, r_benchmark_panel_data):
         py_results = self._run_python_twfe(data)
 
         np.testing.assert_allclose(
-            py_results.p_value, r_twfe_results["p_value"], atol=0.01,
+            py_results.p_value,
+            r_twfe_results["p_value"],
+            atol=0.01,
             err_msg=f"P-value mismatch: Python={py_results.p_value:.6f}, R={r_twfe_results['p_value']:.6f}",
         )
 
@@ -478,11 +496,15 @@ def test_ci_matches_r_twfe(self, r_twfe_results, r_benchmark_panel_data):
         py_results = self._run_python_twfe(data)
 
         np.testing.assert_allclose(
-            py_results.conf_int[0], r_twfe_results["ci_lower"], rtol=0.01,
+            py_results.conf_int[0],
+            r_twfe_results["ci_lower"],
+            rtol=0.01,
             err_msg=f"CI lower mismatch: Python={py_results.conf_int[0]:.6f}, R={r_twfe_results['ci_lower']:.6f}",
         )
         np.testing.assert_allclose(
-            py_results.conf_int[1], r_twfe_results["ci_upper"], rtol=0.01,
+            py_results.conf_int[1],
+            r_twfe_results["ci_upper"],
+            rtol=0.01,
             err_msg=f"CI upper mismatch: Python={py_results.conf_int[1]:.6f}, R={r_twfe_results['ci_upper']:.6f}",
         )
 
@@ -495,7 +517,9 @@ def test_att_matches_r_with_covariate(
         py_results = self._run_python_twfe(data, covariates=["x1"])
 
         np.testing.assert_allclose(
-            py_results.att, r_twfe_results_with_covariate["att"], rtol=1e-3,
+            py_results.att,
+            r_twfe_results_with_covariate["att"],
+            rtol=1e-3,
             err_msg=f"ATT w/ cov mismatch: Python={py_results.att:.6f}, R={r_twfe_results_with_covariate['att']:.6f}",
         )
 
@@ -508,7 +532,9 @@ def test_se_matches_r_with_covariate(
         py_results = self._run_python_twfe(data, covariates=["x1"])
 
         np.testing.assert_allclose(
-            py_results.se, r_twfe_results_with_covariate["se"], rtol=0.01,
+            py_results.se,
+            r_twfe_results_with_covariate["se"],
+            rtol=0.01,
             err_msg=f"SE w/ cov mismatch: Python={py_results.se:.6f}, R={r_twfe_results_with_covariate['se']:.6f}",
         )
 
@@ -545,10 +571,14 @@ def test_staggered_treatment_warning_multiperiod_time(self):
                 else:
                     treated = 0
                 y = 10.0 + unit * 0.1 + period * 0.5 + treated * 3.0 + np.random.normal(0, 0.5)
-                data.append({
-                    "unit": unit, "period": period, "treated": treated,
-                    "outcome": y,
-                })
+                data.append(
+                    {
+                        "unit": unit,
+                        "period": period,
+                        "treated": treated,
+                        "outcome": y,
+                    }
+                )
         df = pd.DataFrame(data)
 
         twfe = TwoWayFixedEffects(robust=True)
@@ -562,9 +592,9 @@ def test_staggered_treatment_warning_multiperiod_time(self):
 
         # Multi-period time warning also fires (time="period" has 5 unique values)
         multiperiod_warnings = [x for x in w if "unique values" in str(x.message)]
-        assert len(multiperiod_warnings) > 0, (
-            "Expected multi-period time warning when time='period' with 5 values"
-        )
+        assert (
+            len(multiperiod_warnings) > 0
+        ), "Expected multi-period time warning when time='period' with 5 values"
 
     def test_staggered_warning_not_fired_with_binary_time(self):
         """Staggered warning does NOT fire with binary time (known limitation).
@@ -590,10 +620,15 @@ def test_staggered_warning_not_fired_with_binary_time(self):
                     treated = 0
                 post = 1 if period >= 2 else 0
                 y = 10.0 + unit * 0.1 + period * 0.5 + treated * 3.0 + np.random.normal(0, 0.5)
-                data.append({
-                    "unit": unit, "period": period, "post": post,
-                    "treated": treated, "outcome": y,
-                })
+                data.append(
+                    {
+                        "unit": unit,
+                        "period": period,
+                        "post": post,
+                        "treated": treated,
+                        "outcome": y,
+                    }
+                )
         df = pd.DataFrame(data)
 
         twfe = TwoWayFixedEffects(robust=True)
@@ -603,9 +638,9 @@ def test_staggered_warning_not_fired_with_binary_time(self):
             twfe.fit(df, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         staggered_warnings = [x for x in w if "Staggered treatment" in str(x.message)]
-        assert len(staggered_warnings) == 0, (
-            "Staggered warning should NOT fire with binary time (known limitation)"
-        )
+        assert (
+            len(staggered_warnings) == 0
+        ), "Staggered warning should NOT fire with binary time (known limitation)"
 
     def test_multiperiod_time_warning(self):
         """Multi-period time column triggers UserWarning advising binary post indicator."""
@@ -617,9 +652,9 @@ def test_multiperiod_time_warning(self):
             twfe.fit(data, outcome="outcome", treatment="treated", time="period", unit="unit")
 
         multiperiod_warnings = [x for x in w if "unique values" in str(x.message)]
-        assert len(multiperiod_warnings) > 0, (
-            "Expected multi-period time warning when time has >2 unique values"
-        )
+        assert (
+            len(multiperiod_warnings) > 0
+        ), "Expected multi-period time warning when time has >2 unique values"
         msg = str(multiperiod_warnings[0].message)
         assert "binary" in msg, "Warning should mention binary post indicator"
         assert "post" in msg, "Warning should mention post indicator"
@@ -634,9 +669,9 @@ def test_binary_time_no_multiperiod_warning(self):
             twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         multiperiod_warnings = [x for x in w if "unique values" in str(x.message)]
-        assert len(multiperiod_warnings) == 0, (
-            "Multi-period time warning should NOT fire with binary time"
-        )
+        assert (
+            len(multiperiod_warnings) == 0
+        ), "Multi-period time warning should NOT fire with binary time"
 
     def test_non_binary_time_values_warning(self):
         """Non-{0,1} binary time values emit warning but ATT is correct."""
@@ -651,9 +686,7 @@ def test_non_binary_time_values_warning(self):
             )
 
         non_binary_warnings = [x for x in w if "instead of {0, 1}" in str(x.message)]
-        assert len(non_binary_warnings) > 0, (
-            "Expected warning about non-{0,1} binary time values"
-        )
+        assert len(non_binary_warnings) > 0, "Expected warning about non-{0,1} binary time values"
         assert np.isfinite(results.att), "ATT should be finite"
         np.testing.assert_allclose(results.att, 3.0, rtol=1e-10)
 
@@ -666,14 +699,17 @@ def test_boolean_time_no_warning(self):
         with warnings.catch_warnings(record=True) as w:
             warnings.simplefilter("always")
             twfe.fit(
-                data, outcome="outcome", treatment="treated",
-                time="post_bool", unit="unit",
+                data,
+                outcome="outcome",
+                treatment="treated",
+                time="post_bool",
+                unit="unit",
             )
 
         non_binary_warnings = [x for x in w if "instead of {0, 1}" in str(x.message)]
-        assert len(non_binary_warnings) == 0, (
-            "Boolean time values should NOT trigger non-{0,1} warning"
-        )
+        assert (
+            len(non_binary_warnings) == 0
+        ), "Boolean time values should NOT trigger non-{0,1} warning"
 
     def test_att_invariant_to_time_encoding(self):
         """ATT, SE, and p-value are identical for {0,1} vs {2020,2021} time encoding."""
@@ -694,15 +730,21 @@ def test_att_invariant_to_time_encoding(self):
             )
 
         np.testing.assert_allclose(
-            results_binary.att, results_year.att, rtol=1e-10,
+            results_binary.att,
+            results_year.att,
+            rtol=1e-10,
             err_msg="ATT should be invariant to time encoding",
         )
         np.testing.assert_allclose(
-            results_binary.se, results_year.se, rtol=1e-10,
+            results_binary.se,
+            results_year.se,
+            rtol=1e-10,
             err_msg="SE should be invariant to time encoding",
         )
         np.testing.assert_allclose(
-            results_binary.p_value, results_year.p_value, rtol=1e-10,
+            results_binary.p_value,
+            results_year.p_value,
+            rtol=1e-10,
             err_msg="P-value should be invariant to time encoding",
         )
 
@@ -723,7 +765,9 @@ def test_auto_clusters_at_unit_level(self):
         )
 
         np.testing.assert_allclose(
-            results_default.se, results_explicit.se, rtol=1e-12,
+            results_default.se,
+            results_explicit.se,
+            rtol=1e-12,
         )
         # Config should be immutable
         assert twfe_default.cluster is None
@@ -740,9 +784,7 @@ def test_df_adjustment_for_absorbed_fe(self):
 
         # Run TWFE
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         # Manual: demean both y and the interaction, then run LinearRegression
         data_with_tp = data.copy()
@@ -766,7 +808,9 @@ def test_df_adjustment_for_absorbed_fe(self):
         manual_se = reg.get_inference(1).se
 
         np.testing.assert_allclose(
-            results.se, manual_se, rtol=1e-10,
+            results.se,
+            manual_se,
+            rtol=1e-10,
             err_msg=f"SE df-adjustment mismatch: TWFE={results.se:.8f}, manual={manual_se:.8f}",
         )
 
@@ -776,13 +820,15 @@ def test_covariate_collinear_with_interaction_raises_error(self):
         Adding bad_cov = treated * post duplicates the internal _treatment_post
         variable, making the demeaned design matrix rank-deficient.
         """
-        data = pd.DataFrame({
-            "unit": [0, 0, 1, 1, 2, 2, 3, 3],
-            "period": [0, 1, 0, 1, 0, 1, 0, 1],
-            "treated": [1, 1, 1, 1, 0, 0, 0, 0],
-            "post": [0, 1, 0, 1, 0, 1, 0, 1],
-            "outcome": [10.0, 11.0, 12.0, 13.0, 8.0, 9.0, 6.0, 7.0],
-        })
+        data = pd.DataFrame(
+            {
+                "unit": [0, 0, 1, 1, 2, 2, 3, 3],
+                "period": [0, 1, 0, 1, 0, 1, 0, 1],
+                "treated": [1, 1, 1, 1, 0, 0, 0, 0],
+                "post": [0, 1, 0, 1, 0, 1, 0, 1],
+                "outcome": [10.0, 11.0, 12.0, 13.0, 8.0, 9.0, 6.0, 7.0],
+            }
+        )
 
         # bad_cov = treated * post duplicates the internal _treatment_post column
         data["bad_cov"] = data["treated"] * data["post"]
@@ -790,8 +836,12 @@ def test_covariate_collinear_with_interaction_raises_error(self):
         twfe = TwoWayFixedEffects(robust=True, rank_deficient_action="error")
         with pytest.raises(ValueError):
             twfe.fit(
-                data, outcome="outcome", treatment="treated", time="post",
-                unit="unit", covariates=["bad_cov"],
+                data,
+                outcome="outcome",
+                treatment="treated",
+                time="post",
+                unit="unit",
+                covariates=["bad_cov"],
             )
 
     def test_covariate_collinearity_warns_not_errors(self):
@@ -864,9 +914,7 @@ def test_unbalanced_panel_produces_valid_results(self):
         data = data.drop(index=drop_indices).reset_index(drop=True)
 
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         assert np.isfinite(results.att), "ATT should be finite for unbalanced panel"
         assert results.se > 0, "SE should be positive"
@@ -879,8 +927,11 @@ def test_unit_column_missing_raises_error(self):
         twfe = TwoWayFixedEffects(robust=True)
         with pytest.raises(ValueError, match="not found"):
             twfe.fit(
-                data, outcome="outcome", treatment="treated",
-                time="post", unit="nonexistent_unit",
+                data,
+                outcome="outcome",
+                treatment="treated",
+                time="post",
+                unit="nonexistent_unit",
             )
 
     def test_decompose_integration(self):
@@ -900,12 +951,14 @@ def test_decompose_integration(self):
             for period in range(1, 6):
                 treated = 1 if (first_treat > 0 and period >= first_treat) else 0
                 y = 10.0 + unit * 0.1 + period * 0.5 + treated * 2.0 + np.random.normal(0, 0.5)
-                data.append({
-                    "unit": unit,
-                    "period": period,
-                    "outcome": y,
-                    "first_treat": first_treat,
-                })
+                data.append(
+                    {
+                        "unit": unit,
+                        "period": period,
+                        "outcome": y,
+                        "first_treat": first_treat,
+                    }
+                )
 
         df = pd.DataFrame(data)
 
@@ -978,8 +1031,10 @@ def test_cluster_se_differs_from_hc1_se(self):
         manual_cluster_se = cluster_reg.get_inference(1).se
 
         np.testing.assert_allclose(
-            twfe_results.se, manual_cluster_se, rtol=1e-10,
-            err_msg="TWFE SE should match manually computed cluster SE"
+            twfe_results.se,
+            manual_cluster_se,
+            rtol=1e-10,
+            err_msg="TWFE SE should match manually computed cluster SE",
         )
 
     def test_vcov_positive_semidefinite(self):
@@ -987,14 +1042,12 @@ def test_vcov_positive_semidefinite(self):
         data = generate_twfe_panel(n_units=20, n_periods=4, seed=42)
 
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         eigenvalues = np.linalg.eigvalsh(results.vcov)
-        assert np.all(eigenvalues >= -1e-10), (
-            f"VCoV has negative eigenvalues: {eigenvalues[eigenvalues < -1e-10]}"
-        )
+        assert np.all(
+            eigenvalues >= -1e-10
+        ), f"VCoV has negative eigenvalues: {eigenvalues[eigenvalues < -1e-10]}"
 
 
 # =============================================================================
@@ -1013,9 +1066,7 @@ def test_wild_bootstrap_produces_valid_inference(self, ci_params):
         twfe = TwoWayFixedEffects(
             robust=True, inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         assert np.isfinite(results.se) and results.se > 0
         assert 0 <= results.p_value <= 1
@@ -1034,9 +1085,7 @@ def test_wild_bootstrap_weight_types(self, ci_params, weight_type):
             bootstrap_weights=weight_type,
             seed=42,
         )
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         assert np.isfinite(results.se) and results.se > 0
         assert 0 <= results.p_value <= 1
@@ -1045,12 +1094,8 @@ def test_inference_parameter_routing(self):
         """inference='wild_bootstrap' routes to wild bootstrap method."""
         data = generate_twfe_panel(n_units=20, n_periods=2, seed=42)
 
-        twfe = TwoWayFixedEffects(
-            robust=True, inference="wild_bootstrap", n_bootstrap=99, seed=42
-        )
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        twfe = TwoWayFixedEffects(robust=True, inference="wild_bootstrap", n_bootstrap=99, seed=42)
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         assert results.inference_method == "wild_bootstrap"
 
@@ -1069,13 +1114,18 @@ def test_get_params_returns_all_parameters(self):
         params = twfe.get_params()
 
         expected_keys = {
-            "robust", "cluster", "alpha", "inference",
-            "n_bootstrap", "bootstrap_weights", "seed",
+            "robust",
+            "cluster",
+            "alpha",
+            "inference",
+            "n_bootstrap",
+            "bootstrap_weights",
+            "seed",
             "rank_deficient_action",
         }
-        assert expected_keys.issubset(params.keys()), (
-            f"Missing params: {expected_keys - params.keys()}"
-        )
+        assert expected_keys.issubset(
+            params.keys()
+        ), f"Missing params: {expected_keys - params.keys()}"
 
     def test_set_params_modifies_attributes(self):
         """set_params() modifies estimator attributes."""
@@ -1089,9 +1139,7 @@ def test_summary_contains_key_info(self):
         """summary() output contains ATT."""
         data = generate_hand_calculable_panel()
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         summary = results.summary()
         assert "ATT" in summary
@@ -1100,9 +1148,7 @@ def test_to_dict_contains_all_fields(self):
         """to_dict() contains required fields."""
         data = generate_hand_calculable_panel()
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         d = results.to_dict()
         for key in ["att", "se", "t_stat", "p_value", "n_obs"]:
@@ -1117,9 +1163,7 @@ def test_residuals_plus_fitted_equals_demeaned_outcome(self):
         data = generate_twfe_panel(n_units=20, n_periods=4, seed=42)
 
         twfe = TwoWayFixedEffects(robust=True)
-        results = twfe.fit(
-            data, outcome="outcome", treatment="treated", time="post", unit="unit"
-        )
+        results = twfe.fit(data, outcome="outcome", treatment="treated", time="post", unit="unit")
 
         # Within-transform by unit + post (same as TWFE internally does)
         demeaned = within_transform(data, ["outcome"], "unit", "post")
@@ -1127,6 +1171,182 @@ def test_residuals_plus_fitted_equals_demeaned_outcome(self):
 
         reconstructed = results.residuals + results.fitted_values
         np.testing.assert_allclose(
-            reconstructed, y_demeaned, rtol=1e-10,
+            reconstructed,
+            y_demeaned,
+            rtol=1e-10,
             err_msg="residuals + fitted_values should equal demeaned outcome",
         )
+
+
+# =============================================================================
+# HC2 / HC2 Bell-McCaffrey R parity (Gate 1)
+# =============================================================================
+
+
+def _load_twfe_golden_scenario():
+    """Load the `twfe_two_period` scenario from clubsandwich_cr2_golden.json.
+
+    Returns the parsed scenario dict, or None if the JSON / scenario is
+    missing (caller should pytest.skip).
+    """
+    import json
+    from pathlib import Path
+
+    golden_path = (
+        Path(__file__).parent.parent / "benchmarks" / "data" / "clubsandwich_cr2_golden.json"
+    )
+    if not golden_path.exists():
+        return None
+    with open(golden_path) as f:
+        golden = json.load(f)
+    return golden.get("twfe_two_period")
+
+
+class TestTWFEHC2RParity:
+    """R parity for TwoWayFixedEffects with vcov_type in {hc2, hc2_bm}.
+
+    These tests pin Python's ATT SE / BM DOF on the new full-dummy
+    auto-route path against the R targets in
+    benchmarks/data/clubsandwich_cr2_golden.json under the
+    `twfe_two_period` scenario. Tolerance is atol=1e-10 (the
+    same target used for the existing absorbed-FE DiD / MPD parity tests
+    in tests/test_linalg_hc2_bm.py).
+
+    Skips when the golden JSON or the scenario is missing — regenerate
+    via ``Rscript benchmarks/R/generate_clubsandwich_golden.R``.
+    """
+
+    def _build_panel(self, scenario):
+        return pd.DataFrame(
+            {
+                "unit": scenario["unit"],
+                "period": scenario["period"],
+                "treated": scenario["treated"],
+                "post": scenario["post"],
+                "y": scenario["y"],
+            }
+        )
+
+    def test_twfe_hc2_se_matches_r_lm_vcovHC(self):
+        """TwoWayFixedEffects(vcov_type='hc2') ATT SE matches R
+        sandwich::vcovHC(lm(y ~ treat_post + factor(unit) + factor(post)),
+        type='HC2') at atol=1e-10.
+
+        Singleton-cluster CR2 trick verified separately by the BM DOF test
+        below; here we pin the HC2 vcov diagonal on the ATT coefficient.
+        """
+        scenario = _load_twfe_golden_scenario()
+        if scenario is None:
+            pytest.skip(
+                "twfe_two_period scenario not in golden JSON; regenerate via "
+                "`Rscript benchmarks/R/generate_clubsandwich_golden.R`."
+            )
+        data = self._build_panel(scenario)
+        res = TwoWayFixedEffects(vcov_type="hc2").fit(
+            data, outcome="y", treatment="treated", time="post", unit="unit"
+        )
+        vcov_R = np.array(scenario["vcov_hc2"]).reshape(scenario["vcov_hc2_shape"], order="F")
+        # ATT is the 2nd coef (index 1) in the R design
+        # `lm(y ~ treat_post + factor(unit) + factor(post))`.
+        att_idx = scenario["coef_names"].index("treat_post")
+        se_R = float(np.sqrt(vcov_R[att_idx, att_idx]))
+        np.testing.assert_allclose(res.se, se_R, atol=1e-10, rtol=0)
+
+    def test_twfe_hc2_bm_dof_matches_singleton_cluster_cr2(self):
+        """One-way HC2-BM DOF matches clubSandwich's singleton-cluster CR2
+        Satterthwaite DOF (Pustejovsky-Tipton 2018 Section 3.3; the trick is
+        that CR2 with cluster=seq_len(n) reduces to Imbens-Kolesar BM).
+
+        Pinned via the analytical one-way HC2-BM path (no auto-cluster).
+        TwoWayFixedEffects(vcov_type='hc2_bm') auto-clusters at unit and
+        so routes to cluster-aware CR2-BM, which is not what this test
+        targets. Instead, rebuild the same full-dummy X that TWFE uses
+        internally and call the linalg helper (compute_robust_vcov)
+        directly with no cluster_ids.
+        """
+        scenario = _load_twfe_golden_scenario()
+        if scenario is None:
+            pytest.skip("twfe_two_period scenario not in golden JSON.")
+        if "dof_bm_one_way" not in scenario:
+            pytest.skip(
+                "twfe_two_period scenario does not include dof_bm_one_way; "
+                "regenerate via the R script."
+            )
+        data = self._build_panel(scenario)
+        # Build the same full-dummy design TWFE uses internally for
+        # vcov_type='hc2_bm', then call compute_robust_vcov directly to
+        # extract the per-coef BM DOF (the one-way HC2-BM path).
+        from diff_diff.linalg import compute_robust_vcov, solve_ols
+
+        data_local = data.copy()
+        data_local["_tp"] = data_local["treated"] * data_local["post"]
+        unit_d = pd.get_dummies(
+            data_local["unit"], prefix="_fe_unit", drop_first=True
+        ).values.astype(np.float64)
+        time_d = pd.get_dummies(
+            data_local["post"], prefix="_fe_post", drop_first=True
+        ).values.astype(np.float64)
+        X = np.column_stack(
+            [
+                np.ones(len(data_local)),
+                data_local["_tp"].values.astype(np.float64),
+                unit_d,
+                time_d,
+            ]
+        )
+        y = data_local["y"].values.astype(np.float64)
+        _, residuals, _ = solve_ols(X, y, vcov_type="hc2")
+        _, dof_bm_one_way = compute_robust_vcov(X, residuals, vcov_type="hc2_bm", return_dof=True)
+        att_idx = scenario["coef_names"].index("treat_post")
+        dof_R = float(scenario["dof_bm_one_way"][att_idx])
+        np.testing.assert_allclose(float(dof_bm_one_way[att_idx]), dof_R, atol=1e-10, rtol=0)
+
+    def test_twfe_hc2_bm_clustered_at_unit_dof_matches_clubsandwich(self):
+        """CR2-BM DOF clustered at unit matches clubSandwich
+        vcovCR(cluster=unit, type='CR2') + coef_test()$df_Satt at
+        atol=1e-10.
+
+        This is the inference path triggered by
+        TwoWayFixedEffects(vcov_type='hc2_bm') on its default auto-cluster
+        (cluster=unit).
+        """
+        scenario = _load_twfe_golden_scenario()
+        if scenario is None:
+            pytest.skip("twfe_two_period scenario not in golden JSON.")
+        if "dof_bm_unit" not in scenario:
+            pytest.skip(
+                "twfe_two_period scenario does not include dof_bm_unit; "
+                "regenerate via the R script."
+            )
+        data = self._build_panel(scenario)
+        from diff_diff.linalg import compute_robust_vcov, solve_ols
+
+        data_local = data.copy()
+        data_local["_tp"] = data_local["treated"] * data_local["post"]
+        unit_d = pd.get_dummies(
+            data_local["unit"], prefix="_fe_unit", drop_first=True
+        ).values.astype(np.float64)
+        time_d = pd.get_dummies(
+            data_local["post"], prefix="_fe_post", drop_first=True
+        ).values.astype(np.float64)
+        X = np.column_stack(
+            [
+                np.ones(len(data_local)),
+                data_local["_tp"].values.astype(np.float64),
+                unit_d,
+                time_d,
+            ]
+        )
+        y = data_local["y"].values.astype(np.float64)
+        cluster_ids = np.asarray(data_local["unit"].values)
+        _, residuals, _ = solve_ols(X, y, vcov_type="hc2")
+        _, dof_bm_unit = compute_robust_vcov(
+            X,
+            residuals,
+            cluster_ids=cluster_ids,
+            vcov_type="hc2_bm",
+            return_dof=True,
+        )
+        att_idx = scenario["coef_names"].index("treat_post")
+        dof_R = float(scenario["dof_bm_unit"][att_idx])
+        np.testing.assert_allclose(float(dof_bm_unit[att_idx]), dof_R, atol=1e-10, rtol=0)

From b0ccc33c77cbf04637ecc5abe5ca34af19445140 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 09:20:29 -0400
Subject: [PATCH 2/8] twfe: add memory guard, full-surface regression test,
 refactor TODO row
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Addresses CI Codex review findings on PR #469:

P2 (Performance): the HC2/HC2-BM full-dummy build can OOM on large
TWFE panels (n × (n_units + n_times) float64 entries). Add a memory-
size warning at >50M entries (~400 MB) suggesting hc1 (within-
transform) for large panels.
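
As a rough illustration of the arithmetic behind the threshold (a
minimal sketch, not the code in twfe.py; the helper name
`full_dummy_entries` and the panel dimensions are hypothetical), the
entry count for a balanced panel can be estimated like this:

```python
def full_dummy_entries(n_obs, n_units, n_times, n_covs=0):
    """Entries in a dense [intercept, treated*post, covariates, unit FE, time FE] design."""
    # One reference level is dropped from each set of FE dummies.
    n_cols = 2 + n_covs + max(0, n_units - 1) + max(0, n_times - 1)
    return n_obs * n_cols


# Hypothetical balanced panel: 5,000 units observed over 10 periods.
entries = full_dummy_entries(n_obs=5_000 * 10, n_units=5_000, n_times=10)
size_gb = entries * 8 / 1e9  # float64 = 8 bytes per entry
exceeds_guard = entries > 50_000_000  # same threshold as the warning
print(f"{entries:,} entries, ~{size_gb:.2f} GB, warn={exceeds_guard}")
# prints: 250,500,000 entries, ~2.00 GB, warn=True
```

The design grows roughly as n_obs * n_units, so the unit count tends to
dominate the footprint long before the period count does.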

P3 (Docs/Tests): new tests pinned ATT/SE/DOF but not the documented
full-surface change (residuals/fitted_values/r_squared reflect the
full-dummy fit). Add `test_twfe_hc2_full_surface_matches_did_fixed_effects`
parametrized over hc2/hc2_bm, asserting agreement with
DifferenceInDifferences(fixed_effects=[unit, time]) at atol=1e-12 on
all three fields.
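
Why the surface changes at all can be seen in a small NumPy sketch
(illustrative only; it uses a toy balanced panel and plain
`np.linalg.lstsq`, not the diff_diff code path, and every variable
name below is an assumption). FWL keeps the treatment coefficient and
the residuals, but fitted values and leverages differ between the
within-transformed fit and the full-dummy fit:

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_times = 6, 4
unit = np.repeat(np.arange(n_units), n_times)
time = np.tile(np.arange(n_times), n_units)
d = ((unit >= 3) & (time >= 2)).astype(float)  # treated x post indicator
y = 1.0 + 0.5 * unit + 0.3 * time + 2.0 * d + rng.normal(size=unit.size)

# Full-dummy design: intercept, treatment, unit and time dummies
# (first level of each dropped).
U = (unit[:, None] == np.arange(1, n_units)).astype(float)
T = (time[:, None] == np.arange(1, n_times)).astype(float)
X_full = np.column_stack([np.ones_like(y), d, U, T])
beta_full, *_ = np.linalg.lstsq(X_full, y, rcond=None)
fitted_full = X_full @ beta_full
resid_full = y - fitted_full


def demean(v):
    """Two-way within transform; valid as written for a balanced panel."""
    m = v.reshape(n_units, n_times)
    return (m - m.mean(1, keepdims=True) - m.mean(0, keepdims=True) + m.mean()).ravel()


y_w, d_w = demean(y), demean(d)
beta_w = (d_w @ y_w) / (d_w @ d_w)
fitted_w = beta_w * d_w
resid_w = y_w - fitted_w

# Leverages: diagonal of the hat matrix on each design.
h_full = np.einsum("ij,ij->i", X_full @ np.linalg.inv(X_full.T @ X_full), X_full)
h_w = d_w**2 / (d_w @ d_w)

print(np.isclose(beta_full[1], beta_w))    # same treatment coefficient
print(np.allclose(resid_full, resid_w))    # same residuals
print(np.allclose(fitted_full, fitted_w))  # fitted values differ
print(np.allclose(h_full, h_w))            # leverages differ
```

In this sketch the first two checks hold and the last two do not,
which is the reason HC2 leverage terms need the full-dummy design
while HC1 can stay on the within-transform.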

P3 (Maintainability): TWFE's inline full-dummy builder duplicates
DiD's fixed_effects= dummy-construction logic. Substantive refactor —
better as a follow-up than inline in this PR. Added a TODO row.

P3 (Tech debt — replicate weights): already tracked, no action needed.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 TODO.md                            |  1 +
 diff_diff/twfe.py                  | 20 +++++++++++++++++++
 tests/test_estimators_vcov_type.py | 31 ++++++++++++++++++++++++++++++
 3 files changed, 52 insertions(+)

diff --git a/TODO.md b/TODO.md
index e11b2a73..e67026a1 100644
--- a/TODO.md
+++ b/TODO.md
@@ -101,6 +101,7 @@ Deferred items from PR reviews that were not addressed before merge.
 | Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
 | Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
 | `TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` with replicate-weight survey designs raises `NotImplementedError` (`twfe.py:~233`). The replicate path re-demeans per replicate (re-demeaning depends on the per-replicate weight vector), which doesn't compose with the full-dummy HC2/HC2-BM build — a correct implementation would need per-replicate full-dummy refit. Workaround: use `vcov_type="hc1"` for replicate-weight CR1. | `twfe.py::fit` | follow-up | Low |
+| TWFE's HC2/HC2-BM inline full-dummy build (`twfe.py:280-315`) duplicates the dummy-construction logic in `DifferenceInDifferences(fixed_effects=...)` (`estimators.py:478-486`). Extract a shared helper (or delegate TWFE's HC2/HC2-BM path to DiD's `fixed_effects=` branch, with TWFE-specific cluster default threading) to reduce drift risk on FE naming, survey behavior, and result-surface conventions. Substantive refactor — touches both estimators. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit` | follow-up | Low |
 | Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
 | Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
 | `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py
index 72a4c270..cb2f1bae 100644
--- a/diff_diff/twfe.py
+++ b/diff_diff/twfe.py
@@ -287,6 +287,26 @@ def fit(  # type: ignore[override]
             # detection in `solve_ols` cleanly drops any collinear FE
             # dummies (e.g. an always-treated unit × treatment_post
             # collinearity) without poisoning the ATT.
+            # Memory guard: the full-dummy design materializes a dense
+            # n × (1 + 1 + n_covs + (n_units-1) + (n_times-1)) matrix.
+            # On large TWFE panels (n_units > 5000 typical) this can blow
+            # up working memory. Warn when the design exceeds ~50M float64
+            # entries (~400 MB) so users can switch to HC1 (within-
+            # transform path) for those panels.
+            _design_cols = 2 + len(covariates or []) + max(0, n_units - 1) + max(0, n_times - 1)
+            _design_entries = len(data) * _design_cols
+            if _design_entries > 50_000_000:
+                warnings.warn(
+                    f"TwoWayFixedEffects(vcov_type={self.vcov_type!r}) builds a "
+                    f"dense {len(data)} × {_design_cols} full-dummy design "
+                    f"(~{_design_entries / 1e6:.1f}M float64 entries, "
+                    f"~{_design_entries * 8 / 1e9:.2f} GB). For panels with "
+                    f"many units/periods, consider vcov_type='hc1' (within-"
+                    "transform path; no leverage term, lower memory) unless "
+                    "small-sample HC2/HC2-BM inference is required.",
+                    UserWarning,
+                    stacklevel=2,
+                )
             y = data[outcome].values.astype(np.float64)
             cov_arrs = [data[c].values.astype(np.float64) for c in (covariates or [])]
             unit_dummies_df = pd.get_dummies(data[unit], prefix=f"_fe_{unit}", drop_first=True)
diff --git a/tests/test_estimators_vcov_type.py b/tests/test_estimators_vcov_type.py
index 01ae1730..d58c45da 100644
--- a/tests/test_estimators_vcov_type.py
+++ b/tests/test_estimators_vcov_type.py
@@ -934,6 +934,37 @@ def test_twfe_hc2_coefficients_align_with_vcov(self, vcov):
         assert "ATT" in res.coefficients
         assert np.isclose(res.coefficients["ATT"], res.att, atol=1e-12)
 
+    @pytest.mark.parametrize("vcov", ["hc2", "hc2_bm"])
+    def test_twfe_hc2_full_surface_matches_did_fixed_effects(self, vcov):
+        """Under the HC2/HC2-BM full-dummy path, the entire `DiDResults`
+        surface (residuals, fitted_values, r_squared) reflects the
+        full-dummy fit and matches DiD(fixed_effects=[unit, time]) to
+        atol=1e-12, not just ATT/SE.
+
+        Regression for the REGISTRY/CHANGELOG disclosure that under
+        `vcov_type in {"hc2","hc2_bm"}`, `result.residuals`,
+        `result.fitted_values`, and `result.r_squared` reflect the
+        un-demeaned full-dummy fit (matching DiD-absorb / MPD-absorb
+        auto-route behavior).
+        """
+        data = _make_did_panel(n_units=20)
+        res_twfe = TwoWayFixedEffects(vcov_type=vcov).fit(
+            data, outcome="y", treatment="treated", time="time", unit="unit"
+        )
+        cluster_kwarg = "unit" if vcov == "hc2_bm" else None
+        res_did = DifferenceInDifferences(vcov_type=vcov, cluster=cluster_kwarg).fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+        )
+        assert res_twfe.residuals is not None and res_did.residuals is not None
+        assert res_twfe.fitted_values is not None and res_did.fitted_values is not None
+        np.testing.assert_allclose(res_twfe.residuals, res_did.residuals, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.fitted_values, res_did.fitted_values, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.r_squared, res_did.r_squared, atol=1e-12)
+
     def test_twfe_results_record_cluster_name(self):
         """TWFE results should label the auto-clustered SE with the unit column."""
         rng = np.random.default_rng(1)

From 0a2706655c849b9296325ee621d7e94d9bb42f63 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 09:32:55 -0400
Subject: [PATCH 3/8] twfe: add survey-weighted regression test for HC2/HC2-BM
 full-dummy path

CI review flagged the missing end-to-end test for the new non-replicate
survey-weighted hc2/hc2_bm TWFE path. A regression there could ship
while existing tests still pass.

New parametrized test compares TWFE(vcov_type=vcov, cluster=...) with
SurveyDesign(weights=...) against DifferenceInDifferences(vcov_type=vcov,
cluster=..., fixed_effects=['unit', 'time']) with the same design. Both
paths feed the survey-resolved full-dummy X to LinearRegression's
compute_survey_vcov (TSL), so ATT and SE agree to within atol=1e-12.

Test documents the explicit-cluster requirement on hc2_bm: TWFE's
implicit auto-cluster + survey path intentionally drops PSU injection
(per the survey-design scope rule in _resolve_effective_cluster);
the explicit cluster='unit' form is what aligns with DiD's clustered-
survey behavior and is the documented user-facing way to invoke
clustered survey-aware HC2-BM on TWFE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/test_estimators_vcov_type.py | 51 ++++++++++++++++++++++++++++++
 1 file changed, 51 insertions(+)

diff --git a/tests/test_estimators_vcov_type.py b/tests/test_estimators_vcov_type.py
index d58c45da..286b71d8 100644
--- a/tests/test_estimators_vcov_type.py
+++ b/tests/test_estimators_vcov_type.py
@@ -965,6 +965,57 @@ def test_twfe_hc2_full_surface_matches_did_fixed_effects(self, vcov):
         np.testing.assert_allclose(res_twfe.fitted_values, res_did.fitted_values, atol=1e-12)
         np.testing.assert_allclose(res_twfe.r_squared, res_did.r_squared, atol=1e-12)
 
+    @pytest.mark.parametrize("vcov", ["hc2", "hc2_bm"])
+    def test_twfe_hc2_with_survey_weights_matches_did_fixed_effects(self, vcov):
+        """TWFE(vcov_type in {'hc2','hc2_bm'}) with a non-replicate
+        SurveyDesign(weights=...) routes through the full-dummy build,
+        with survey TSL variance taking precedence over the analytical
+        HC2/HC2-BM sandwich (per the documented survey-design scope).
+
+        End-to-end consistency check: TWFE's auto-route on the full-dummy
+        design under survey weights must match
+        DifferenceInDifferences(fixed_effects=[unit, time]) with the same
+        survey design and cluster. Both paths feed the survey-resolved
+        design to LinearRegression's compute_survey_vcov (TSL) on an
+        identical full-dummy X, so ATT and SE agree to within
+        atol=1e-12. Regression for the concern that the survey path could
+        revert to the within-transform branch or mishandle PSU injection
+        under the new FE route.
+
+        Note: `cluster='unit'` is passed EXPLICITLY to TWFE on the
+        `hc2_bm` branch to align with DiD's explicit-cluster + survey
+        PSU-injection convention. Without explicit cluster, TWFE's
+        survey-design scope rule (twfe.py:_resolve_effective_cluster
+        branch) drops the auto-cluster from PSU injection — that's
+        intentional but causes the path to diverge from DiD here. The
+        explicit-cluster form is the documented user-facing way to
+        invoke clustered survey-aware HC2-BM on TWFE.
+        """
+        data = _make_did_panel(n_units=20).copy()
+        rng = np.random.default_rng(7)
+        data["w"] = rng.uniform(0.5, 2.0, size=len(data))
+        sd = SurveyDesign(weights="w")
+        # hc2_bm: explicit cluster on both paths so PSU injection matches.
+        cluster_kwarg = "unit" if vcov == "hc2_bm" else None
+        res_twfe = TwoWayFixedEffects(vcov_type=vcov, cluster=cluster_kwarg).fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            unit="unit",
+            survey_design=sd,
+        )
+        res_did = DifferenceInDifferences(vcov_type=vcov, cluster=cluster_kwarg).fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+            survey_design=sd,
+        )
+        np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
+
     def test_twfe_results_record_cluster_name(self):
         """TWFE results should label the auto-clustered SE with the unit column."""
         rng = np.random.default_rng(1)

From 382824162ef52d45a83481d1c9b302579646b9cc Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 09:43:15 -0400
Subject: [PATCH 4/8] twfe: add strata+psu survey parity test; clarify
 auto-cluster wording for survey path
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI review polish:

- Add `test_twfe_hc2_with_survey_strata_psu_matches_did_fixed_effects`
  (parametrized over hc2/hc2_bm) — extends the weights-only survey
  regression with `SurveyDesign(weights="w", strata="stratum",
  psu="psu")`. Both TWFE and DiD(fixed_effects=[unit, time]) paths
  feed the survey-resolved full-dummy X to LinearRegression's TSL
  with stratified-design adjustments, so ATT/SE agree to within
  atol=1e-12.

- Clarify the "auto-cluster preserved on hc2_bm" wording in
  TWFE docstring and REGISTRY: the auto-cluster default applies to
  the non-survey analytical path. Under `survey_design=` with no
  explicit `cluster=`, TWFE keeps the documented implicit-PSU path
  (auto-cluster NOT injected into survey PSU); users wanting
  unit-level PSU under a survey design must pass explicit
  `cluster="unit"` or set `survey_design.psu`.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/twfe.py                  |  8 +++++-
 docs/methodology/REGISTRY.md       |  2 +-
 tests/test_estimators_vcov_type.py | 45 ++++++++++++++++++++++++++++++
 3 files changed, 53 insertions(+), 2 deletions(-)

diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py
index cb2f1bae..75253349 100644
--- a/diff_diff/twfe.py
+++ b/diff_diff/twfe.py
@@ -71,7 +71,13 @@ class TwoWayFixedEffects(DifferenceInDifferences):
     ATT coefficient, its SE, and analytical inference are unchanged.
     Auto-cluster-at-unit is preserved on ``hc2_bm`` (routes to CR2-BM at
     unit) and on ``hc2`` + ``wild_bootstrap``; dropped on explicit ``hc2``
-    + ``analytical`` to match the one-way contract. Documented in
+    + ``analytical`` to match the one-way contract. **This wording applies
+    to the non-survey analytical path**: under ``survey_design=`` with no
+    explicit ``cluster=``, TWFE intentionally keeps the documented
+    implicit-PSU path (auto-cluster is NOT injected into the survey PSU
+    structure) — users who want unit-level PSU injection under a survey
+    design must pass explicit ``cluster="unit"`` or set
+    ``survey_design.psu``. Documented in
     ``docs/methodology/REGISTRY.md`` under the scope-limitation note.
 
     **Conley spatial-HAC (``vcov_type="conley"``) is supported via the
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 97a52af2..7b18af42 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -2559,7 +2559,7 @@ Shipped in `diff_diff/had_pretests.py` as `stute_joint_pretest()` (residuals-in
 - [x] Phase 1a: HC2 + Bell-McCaffrey DOF correction in `diff_diff/linalg.py` via `vcov_type="hc2_bm"` enum (both one-way and CR2 cluster-robust with Imbens-Kolesar / Pustejovsky-Tipton Satterthwaite DOF). Weighted cluster CR2 raises `NotImplementedError` and is tracked as Phase 2+ in `TODO.md`.
     - **Note (scope limitation on absorbed FE):** HC2 and HC2 + Bell-McCaffrey on within-transformed designs still depend on the FULL FE hat matrix because FWL preserves coefficients and residuals but NOT the hat matrix: `h_ii = x_i' (X'X)^{-1} x_i` on the reduced design is not the diagonal of the full FE projection, and CR2's block adjustment `A_g = (I - H_gg)^{-1/2}` likewise depends on the full cluster-block hat matrix. The status across the three estimators that previously rejected this combination:
         - **`DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (auto-route).** When the user pairs `absorb=` with HC2 / HC2-BM, `DiD.fit()` internally promotes the absorb columns to `fixed_effects=` so the existing full-dummy code path computes the algebraically correct vcov from the full FE projection. Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=..., type="CR2")` (singleton-cluster CR2 trick for one-way HC2-BM Satterthwaite DOF; PT2018 §3.3 unweighted CR2 algebra). **User-visible surface change**: under the auto-route, the entire `DiDResults` (coefficients, vcov, residuals, fitted_values, r_squared) reflect the full-dummy fit rather than the within-transformed fit — the FE-dummy entries are included in `result.coefficients` / `result.vcov`, `r_squared` is computed on the un-demeaned outcome, and `residuals` / `fitted_values` are on the original scale. `result.att` is unaffected (FWL-equivalent). HC1/CR1 paths on `absorb=` are unchanged (no leverage term). **Survey-design scope**: when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization / replicate weights) takes precedence over the analytical HC2/HC2-BM sandwich; the auto-route only changes the FE handling (removing the prior reject) and does not redirect to the analytical small-sample sandwich on survey fits.
-        - **`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (inline full-dummy build).** TWFE has no `absorb=` / `fixed_effects=` parameter (the unit + time FE are baked into the estimator's identity), so the same parameter-swap auto-route used for DiD-absorb / MPD-absorb is not directly applicable. Instead, `TwoWayFixedEffects.fit()` bypasses the within-transform when `vcov_type in {"hc2","hc2_bm"}` and builds the full-dummy design `[intercept, treated×post, covariates, factor(unit), factor(time)]` explicitly, then runs OLS through the standard `solve_ols` path so the leverage correction and BM DOF compute on the full FE projection. Verified at atol=1e-10 vs `lm(y ~ treat_post + factor(unit) + factor(post)) + sandwich::vcovHC(type="HC2")` for HC2 and vs `clubSandwich::vcovCR(cluster=seq_len(n), type="CR2")` for the singleton-cluster one-way HC2-BM Satterthwaite DOF; vs `vcovCR(cluster=unit, type="CR2")` for the auto-cluster CR2-BM path. **Auto-cluster default:** TWFE's unit auto-cluster is preserved on `hc2_bm` (routes to CR2-BM at unit) and on `hc2 + wild_bootstrap` (the bootstrap consumes the cluster structure for resampling regardless of the analytical sandwich choice); dropped on explicit `hc2 + analytical` to match the one-way contract (the linalg validator rejects `hc2 + cluster_ids`). `hc2_bm + analytical` with no explicit cluster yields the auto-cluster CR2-BM path. **User-visible surface change** (matches the DiD-absorb / MPD-absorb disclosure above): under HC2 / HC2-BM, `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, and `result.r_squared` reflect the full-dummy fit rather than the within-transformed reduced fit (FE-dummy entries are included, `r_squared` is computed on the un-demeaned outcome, residuals/fitted are on the original scale). `result.att`, its SE, and analytical inference are unchanged (FWL-equivalent). 
HC1 / CR1 / Conley / classical paths remain on the within-transform (no leverage term in those vcov families). **Survey-design scope** (mirrors the DiD-absorb auto-route contract above): when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization for analytical-weight designs, or replicate-weight variance for BRR/Fay/JK1/JKn/SDR) takes precedence over the analytical HC2/HC2-BM sandwich; the full-dummy build only changes the FE handling (removing the prior reject) and does not redirect to the analytical small-sample sandwich on survey fits. **Replicate-weight survey designs** are blocked at the estimator level: `vcov_type in {"hc2","hc2_bm"}` + replicate weights raises `NotImplementedError` because the replicate refit path re-demeans per replicate, which doesn't compose with the full-dummy build (would require per-replicate full-dummy refit); workaround: use `vcov_type="hc1"` for replicate-weight CR1. `hc2_bm + weights` remains rejected upstream by the linalg validator (same gate as Gates 4-5 — weighted CR2 variants).
+        - **`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (inline full-dummy build).** TWFE has no `absorb=` / `fixed_effects=` parameter (the unit + time FE are baked into the estimator's identity), so the same parameter-swap auto-route used for DiD-absorb / MPD-absorb is not directly applicable. Instead, `TwoWayFixedEffects.fit()` bypasses the within-transform when `vcov_type in {"hc2","hc2_bm"}` and builds the full-dummy design `[intercept, treated×post, covariates, factor(unit), factor(time)]` explicitly, then runs OLS through the standard `solve_ols` path so the leverage correction and BM DOF compute on the full FE projection. Verified at atol=1e-10 vs `lm(y ~ treat_post + factor(unit) + factor(post)) + sandwich::vcovHC(type="HC2")` for HC2 and vs `clubSandwich::vcovCR(cluster=seq_len(n), type="CR2")` for the singleton-cluster one-way HC2-BM Satterthwaite DOF; vs `vcovCR(cluster=unit, type="CR2")` for the auto-cluster CR2-BM path. **Auto-cluster default (non-survey analytical path):** TWFE's unit auto-cluster is preserved on `hc2_bm` (routes to CR2-BM at unit) and on `hc2 + wild_bootstrap` (the bootstrap consumes the cluster structure for resampling regardless of the analytical sandwich choice); dropped on explicit `hc2 + analytical` to match the one-way contract (the linalg validator rejects `hc2 + cluster_ids`). `hc2_bm + analytical` with no explicit cluster yields the auto-cluster CR2-BM path. **Survey-design exception:** under `survey_design=` with no explicit `cluster=`, TWFE intentionally keeps the documented implicit-PSU path (the auto-cluster is NOT injected into the survey PSU structure); users who want unit-level PSU injection under a survey design must pass explicit `cluster="unit"` or set `survey_design.psu` directly. 
**User-visible surface change** (matches the DiD-absorb / MPD-absorb disclosure above): under HC2 / HC2-BM, `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, and `result.r_squared` reflect the full-dummy fit rather than the within-transformed reduced fit (FE-dummy entries are included, `r_squared` is computed on the un-demeaned outcome, residuals/fitted are on the original scale). `result.att`, its SE, and analytical inference are unchanged (FWL-equivalent). HC1 / CR1 / Conley / classical paths remain on the within-transform (no leverage term in those vcov families). **Survey-design scope** (mirrors the DiD-absorb auto-route contract above): when `survey_design=` is supplied, the existing survey variance path (Taylor-series linearization for analytical-weight designs, or replicate-weight variance for BRR/Fay/JK1/JKn/SDR) takes precedence over the analytical HC2/HC2-BM sandwich; the full-dummy build only changes the FE handling (removing the prior reject) and does not redirect to the analytical small-sample sandwich on survey fits. **Replicate-weight survey designs** are blocked at the estimator level: `vcov_type in {"hc2","hc2_bm"}` + replicate weights raises `NotImplementedError` because the replicate refit path re-demeans per replicate, which doesn't compose with the full-dummy build (would require per-replicate full-dummy refit); workaround: use `vcov_type="hc1"` for replicate-weight CR1. `hc2_bm + weights` remains rejected upstream by the linalg validator (same gate as Gates 4-5 — weighted CR2 variants).
         - **`MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"})` — SUPPORTED (auto-route).** Same auto-route pattern as `DifferenceInDifferences`: `MultiPeriodDiD.fit()` internally promotes the absorb columns to `fixed_effects=` for HC2 / HC2-BM callers, so the existing full-dummy code path computes the algebraically correct vcov from the full FE projection on the event-study design (`treated + period_X dummies + treated:period_X interactions + factor(unit)`). Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=1:n, type="CR2")` on a 5-cohort × 5-period event-study fixture; the parity target is a per-period interaction `treated:period_X` because MPD requires the `treated` column to be a time-invariant ever-treated indicator, which lies in the span of the intercept and the post-auto-route unit FE dummies (under `pd.get_dummies(..., drop_first=True)` the dropped reference unit is implicit in the intercept, so the exact alias relation depends on the omitted FE category — it is NOT simply "the sum of treated-cohort unit dummies"). `solve_ols` drops one column from the collinear set under R-style rank-deficiency handling; in the shipped parity fixture (4 ever-treated cohorts of 5 units + 1 never-treated cohort of 5 units) it drops a unit dummy from the never-treated cohort (`unit_25`) and the `treated` main effect remains finite, but the specific column that gets NaN'd is pivot-order and dummy-coding dependent. Either way, the slope coefficients (`treated:period_X`) and the post-period-average `avg_att` are identified and invariant to which column was dropped. Same `MultiPeriodDiDResults` surface change as DiD: `vcov`, `residuals`, `fitted_values`, `r_squared`, and `coefficients` reflect the full-dummy fit, with `period_effects[t].effect` / `.se` / `.p_value` / `.conf_int` invariant by FWL. HC1/CR1 paths on `absorb=` are unchanged (no leverage term). 
Same survey-design scope as DiD: replicate-weight variance routes through the standard `compute_replicate_vcov` path on the fixed full-dummy design rather than the per-replicate refit branch (which targets the demeaning path); since the auto-routed design does not depend on replicate weights, no refit is needed. **Redundant time-FE skip:** when the routed (or directly-supplied) `fixed_effects` list contains the `time` column, MPD silently skips emitting `<time>_<X>` dummies for that entry because the design already absorbs time via the non-reference period dummies. Without the skip, those blocks collide on dummy names and `MultiPeriodDiDResults.coefficients` (built as `{name: coef for name, coef in zip(var_names, coefficients)}`) would silently drop duplicates, breaking the coefficients-vs-vcov alignment that downstream consumers (HonestDiD sub-VCV extraction, BusinessReport, etc.) rely on. The skip applies to BOTH the new `absorb=` auto-route AND the pre-existing `fixed_effects=[<time_col>]` invocation (pre-PR, `fixed_effects=["unit", time]` produced a dict with `len < vcov.shape[0]` and NaN values overwriting the real event-study period coefficients).
         - All three previously-rejecting absorbed-FE paths are now SUPPORTED. Weighted-CR2 variants (Gates 4-5: `vcov_type="hc2_bm" + weights`; weighted one-way HC2-BM) remain blocked at the `linalg.py::_validate_vcov_args` level pending the clubSandwich WLS algebra derivation.
 - [x] Phase 1a: `vcov_type` enum threaded through `DifferenceInDifferences` (`MultiPeriodDiD`, `TwoWayFixedEffects` inherit); `robust=True` <=> `vcov_type="hc1"`, `robust=False` <=> `vcov_type="classical"`. Conflict detection at `__init__`. Results summary prints the variance-family label.
diff --git a/tests/test_estimators_vcov_type.py b/tests/test_estimators_vcov_type.py
index 286b71d8..ffec660c 100644
--- a/tests/test_estimators_vcov_type.py
+++ b/tests/test_estimators_vcov_type.py
@@ -1016,6 +1016,51 @@ def test_twfe_hc2_with_survey_weights_matches_did_fixed_effects(self, vcov):
         np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
         np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
 
+    @pytest.mark.parametrize("vcov", ["hc2", "hc2_bm"])
+    def test_twfe_hc2_with_survey_strata_psu_matches_did_fixed_effects(self, vcov):
+        """TWFE(vcov_type in {'hc2','hc2_bm'}) with a full SurveyDesign
+        (weights + strata + psu) routes through the full-dummy build, with
+        survey TSL variance (including stratified-design adjustments)
+        taking precedence over the analytical sandwich.
+
+        Extends the weights-only regression with a stratified, clustered
+        survey design (strata + PSU). Verifies that TWFE's full-dummy route
+        threads strata / PSU columns to LinearRegression's survey
+        variance path identically to DiD's fixed_effects= branch, so
+        ATT and SE pass assert_allclose(atol=1e-12) under non-trivial
+        survey design metadata.
+        """
+        data = _make_did_panel(n_units=20).copy()
+        rng = np.random.default_rng(11)
+        data["w"] = rng.uniform(0.5, 2.0, size=len(data))
+        # Stratum = unit cohort (treated vs control); PSU = unit. Both
+        # constant within each unit, satisfying typical survey-design
+        # constraints. Globally unique PSU ids per SurveyDesign convention.
+        data["stratum"] = data["treated"].astype(int)
+        data["psu"] = data["unit"].astype(int)
+        sd = SurveyDesign(weights="w", strata="stratum", psu="psu")
+        # Under hc2_bm, pass explicit cluster='unit' to both estimators so
+        # PSU injection matches; under hc2, pass cluster=None (one-way HC2).
+        cluster_kwarg = "unit" if vcov == "hc2_bm" else None
+        res_twfe = TwoWayFixedEffects(vcov_type=vcov, cluster=cluster_kwarg).fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            unit="unit",
+            survey_design=sd,
+        )
+        res_did = DifferenceInDifferences(vcov_type=vcov, cluster=cluster_kwarg).fit(
+            data,
+            outcome="y",
+            treatment="treated",
+            time="time",
+            fixed_effects=["unit", "time"],
+            survey_design=sd,
+        )
+        np.testing.assert_allclose(res_twfe.att, res_did.att, atol=1e-12)
+        np.testing.assert_allclose(res_twfe.se, res_did.se, atol=1e-12)
+
     def test_twfe_results_record_cluster_name(self):
         """TWFE results should label the auto-clustered SE with the unit column."""
         rng = np.random.default_rng(1)

From baceb22bda068f738f540f24fd6b98d7a12af7ce Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 11:53:26 -0400
Subject: [PATCH 5/8] twfe: clear stale TWFE-rejection claims in CHANGELOG
 release notes
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI review (R4) flagged that the new top entry says TWFE HC2/HC2-BM is
supported, but two older bullets in the same [Unreleased] block (DiD-
absorb and MPD-absorb entries) still claim "TwoWayFixedEffects
rejection remains as a follow-up" / "remain as follow-ups". Those
sentences were factually correct when those PRs landed but become
contradictory once Gate 1 ships in the same release.

Replace both with cross-references to the Gate 1 top entry, noting that
TWFE uses a separate full-dummy branch (no fixed_effects= equivalent
inside TWFE) rather than the absorb→fixed_effects parameter swap used
by DiD/MPD.

Per feedback_changelog_accuracy_fixes.md: scanned all CHANGELOG bullets
for similar stale claims; the remaining matches are unrelated entries
(Spillover, BaconDecomposition, Conley waves) that don't reference TWFE
rejection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md | 4 ++--
 1 file changed, 2 insertions(+), 2 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index d2be9da0..e0d494dc 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -20,8 +20,8 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 - **Helper API: `compute_pretrends_power` and `compute_mdv` now accept `violation_weights` and `pretest_form` (PR-B Step 6).** Closes the PR-A R18 helper/class API gap that previously made `violation_type='custom'` unusable from the helper functions. Helpers now forward both new parameters to the underlying `PreTrendsPower` class. Default `pretest_form='nis'` matches the class default. All existing helper call sites in `test_pretrends.py` and `test_pretrends_event_study.py` continue to pass without changes because the form-invariance of most assertions allowed the default flip with only 3 tests needing targeted updates.
 - **NEW `tests/test_methodology_pretrends.py` (PR-B Step 7).** Roth (2022) Section II.A-B paper-equation-numbered Verified Components walk-through. 8 classes, 30+ tests covering K=1 closed-form (Proposition 2 proof), NIS box probability via MC simulation cross-check, Propositions 1-4 simulation parity, linear-units γ-scale verification on regular / irregular / pandas.Period grids, custom-weight persistence regression, JSON-serializability of `to_dict`, CS/SA full-VCV adapter regression, helper API end-to-end, NIS-vs-Wald differentiation, and skip-gated `TestPretrendsParityR` stubs for PR-C R-package goldens.
 - **`benchmarks/R/generate_pretrends_golden.R` (PR-B Step 12).** R generator script for the PR-C deferred goldens. Script committed with a `<PR-C-PIN>` placeholder commit reference; PR-C pins the audited `pretrends` revision, runs the script, commits the JSON goldens at `benchmarks/data/r_pretrends_golden.json`, and activates the parity tests.
-- **`MultiPeriodDiD(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:1476`). Mirrors the DiD-absorb auto-route shipped earlier in this release: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, `MultiPeriodDiD.fit()` promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov on the event-study design (`treated + period_X dummies + treated:period_X interactions + factor(unit)`). Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=1:n, type="CR2")` on a 5-cohort × 5-period event-study fixture (new `tests/test_estimators_vcov_type.py::TestMPDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `mpd_absorbed_fe_did`). HC1/CR1 paths on `absorb=` are unchanged (no leverage term). `TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` rejection remains as a follow-up (different fit-path structure — no `fixed_effects=` equivalent inside TWFE). **Behavioral note (full `MultiPeriodDiDResults` surface change under auto-route):** under the auto-route, the entire returned `MultiPeriodDiDResults` reflects the full-dummy fit rather than the within-transformed fit — `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, `result.r_squared` all include the FE-dummy entries / un-demeaned values. `result.period_effects[t].effect` / `.se` / `.p_value` / `.conf_int` and `result.avg_att` / `.avg_se` are invariant to this routing (FWL guarantee). MPD requires a time-invariant ever-treated indicator that lies in the span of the intercept and the post-auto-route unit FE dummies (the exact alias depends on the omitted FE reference category under `pd.get_dummies(drop_first=True)`, not just on "the sum of treated-cohort unit dummies"), so `solve_ols` drops one column from that collinear set under R-style rank-deficiency handling. 
Which specific column is dropped is pivot-order and dummy-coding dependent (in the shipped parity fixture it is a never-treated unit dummy, not the `treated` main effect itself). The per-period interaction coefficients (`treated:period_X`) and `avg_att` are identified and invariant to that choice; parity tests target those rather than the `treated` main effect. **Survey-design scope (replicate weights):** when `survey_design=` uses replicate weights, the auto-route short-circuits the absorb-refit branch at `estimators.py:1693` and routes through the standard `compute_replicate_vcov` path on the fixed full-dummy design — correct because the design does not depend on replicate weights so no per-replicate refit is needed. **Redundant time-FE skip:** when the routed (or directly-supplied) `fixed_effects` list contains the `time` column, MPD silently skips emitting `<time>_<X>` dummies for that entry because the design already absorbs the time dimension via the non-reference period dummies; without the skip, the two blocks would collide on dummy names and the `coefficients` dict would silently collapse duplicates under `var_names`-keyed construction, breaking the coefficients-vs-vcov alignment that downstream consumers rely on. This applies to both the new `absorb=` auto-route and the pre-existing `fixed_effects=[<time_col>]` invocation.
-- **`DifferenceInDifferences(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:382`). Previously raised `NotImplementedError` because the HC2 leverage correction and CR2 Bell-McCaffrey DOF depend on the FULL FE hat matrix, while within-transformation (FWL) preserves coefficients and residuals but not the hat. Lift via internal auto-route: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, the fit promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov. Empirically matches `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=..., type="CR2")` at ~1e-10 (verified via new `tests/test_estimators_vcov_type.py::TestDiDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `absorbed_fe_did`, with the R generator using the singleton-cluster CR2 trick for one-way HC2-BM Satterthwaite DOF). HC1/CR1 paths unchanged. `MultiPeriodDiD(absorb=...)` and `TwoWayFixedEffects` rejections remain as follow-ups (different fit-path structure). **Behavioral note (full `DiDResults` surface change under auto-route):** under the auto-route, the entire returned `DiDResults` reflects the full-dummy fit rather than the within-transformed fit. Specifically, `result.coefficients` and `result.vcov` include the FE-dummy entries (matching the `fixed_effects=` path), `result.residuals` and `result.fitted_values` are on the un-demeaned outcome scale, and `result.r_squared` is computed on the un-demeaned outcome (so it absorbs the FE variance and will typically be higher than the within-R²). `result.att` is invariant to this routing (FWL guarantee). Downstream consumers reading `result.att` are unaffected; consumers reading the broader result surface should expect the full-dummy values. 
**Survey-design scope:** the auto-route changes the FE handling (and removes the prior absorbed-FE rejection), but `survey_design=` continues to drive its own variance path (Taylor-series linearization or replicate-weight variance, per the existing survey contract) rather than the analytical HC2/HC2-BM sandwich. The auto-route is therefore methodologically meaningful for non-survey fits and for the FE-handling side of survey fits; analytical small-sample inference under `vcov_type in {"hc2","hc2_bm"}` is bypassed when a survey design is supplied.
+- **`MultiPeriodDiD(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:1476`). Mirrors the DiD-absorb auto-route shipped earlier in this release: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, `MultiPeriodDiD.fit()` promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov on the event-study design (`treated + period_X dummies + treated:period_X interactions + factor(unit)`). Verified at ~1e-10 vs `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=1:n, type="CR2")` on a 5-cohort × 5-period event-study fixture (new `tests/test_estimators_vcov_type.py::TestMPDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `mpd_absorbed_fe_did`). HC1/CR1 paths on `absorb=` are unchanged (no leverage term). (`TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` was lifted later in this same release via an inline full-dummy build — see the top entry; TWFE has no `fixed_effects=` equivalent inside the estimator, so it gets a separate full-dummy branch rather than the absorb→fixed_effects parameter swap used here.) **Behavioral note (full `MultiPeriodDiDResults` surface change under auto-route):** under the auto-route, the entire returned `MultiPeriodDiDResults` reflects the full-dummy fit rather than the within-transformed fit — `result.coefficients`, `result.vcov`, `result.residuals`, `result.fitted_values`, `result.r_squared` all include the FE-dummy entries / un-demeaned values. `result.period_effects[t].effect` / `.se` / `.p_value` / `.conf_int` and `result.avg_att` / `.avg_se` are invariant to this routing (FWL guarantee). 
MPD requires a time-invariant ever-treated indicator that lies in the span of the intercept and the post-auto-route unit FE dummies (the exact alias depends on the omitted FE reference category under `pd.get_dummies(drop_first=True)`, not just on "the sum of treated-cohort unit dummies"), so `solve_ols` drops one column from that collinear set under R-style rank-deficiency handling. Which specific column is dropped is pivot-order and dummy-coding dependent (in the shipped parity fixture it is a never-treated unit dummy, not the `treated` main effect itself). The per-period interaction coefficients (`treated:period_X`) and `avg_att` are identified and invariant to that choice; parity tests target those rather than the `treated` main effect. **Survey-design scope (replicate weights):** when `survey_design=` uses replicate weights, the auto-route short-circuits the absorb-refit branch at `estimators.py:1693` and routes through the standard `compute_replicate_vcov` path on the fixed full-dummy design — correct because the design does not depend on replicate weights so no per-replicate refit is needed. **Redundant time-FE skip:** when the routed (or directly-supplied) `fixed_effects` list contains the `time` column, MPD silently skips emitting `<time>_<X>` dummies for that entry because the design already absorbs the time dimension via the non-reference period dummies; without the skip, the two blocks would collide on dummy names and the `coefficients` dict would silently collapse duplicates under `var_names`-keyed construction, breaking the coefficients-vs-vcov alignment that downstream consumers rely on. This applies to both the new `absorb=` auto-route and the pre-existing `fixed_effects=[<time_col>]` invocation.
+- **`DifferenceInDifferences(absorb=..., vcov_type in {"hc2", "hc2_bm"})` now supported** (`diff_diff/estimators.py:382`). Previously raised `NotImplementedError` because the HC2 leverage correction and CR2 Bell-McCaffrey DOF depend on the FULL FE hat matrix, while within-transformation (FWL) preserves coefficients and residuals but not the hat. Lift via internal auto-route: when `absorb=` is paired with `vcov_type in {"hc2","hc2_bm"}`, the fit promotes the absorb columns to `fixed_effects=` internally so the existing full-dummy-design code path computes the algebraically correct vcov. Empirically matches `lm() + sandwich::vcovHC(type="HC2")` and `lm() + clubSandwich::vcovCR(cluster=..., type="CR2")` at ~1e-10 (verified via new `tests/test_estimators_vcov_type.py::TestDiDAbsorbedFERParity` against `benchmarks/data/clubsandwich_cr2_golden.json` scenario `absorbed_fe_did`, with the R generator using the singleton-cluster CR2 trick for one-way HC2-BM Satterthwaite DOF). HC1/CR1 paths unchanged. (`MultiPeriodDiD(absorb=...)` and `TwoWayFixedEffects(vcov_type in {"hc2","hc2_bm"})` were both lifted later in this same release — see the top entries; both use the same algebra on different fit-path structures.) **Behavioral note (full `DiDResults` surface change under auto-route):** under the auto-route, the entire returned `DiDResults` reflects the full-dummy fit rather than the within-transformed fit. Specifically, `result.coefficients` and `result.vcov` include the FE-dummy entries (matching the `fixed_effects=` path), `result.residuals` and `result.fitted_values` are on the un-demeaned outcome scale, and `result.r_squared` is computed on the un-demeaned outcome (so it absorbs the FE variance and will typically be higher than the within-R²). `result.att` is invariant to this routing (FWL guarantee). Downstream consumers reading `result.att` are unaffected; consumers reading the broader result surface should expect the full-dummy values. 
**Survey-design scope:** the auto-route changes the FE handling (and removes the prior absorbed-FE rejection), but `survey_design=` continues to drive its own variance path (Taylor-series linearization or replicate-weight variance, per the existing survey contract) rather than the analytical HC2/HC2-BM sandwich. The auto-route is therefore methodologically meaningful for non-survey fits and for the FE-handling side of survey fits; analytical small-sample inference under `vcov_type in {"hc2","hc2_bm"}` is bypassed when a survey design is supplied.
 - **`SpilloverDiD` Gardner GMM first-stage uncertainty correction across HC1 / Conley / cluster (Wave D).** Closes the documented Wave B/C "SEs biased downward by a few percent" caveat. **Documented synthesis** of Butts (2021) Section 3.1 (the IF construction for spillover-aware DiD) + Gardner (2022) Section 4 (the two-stage GMM sandwich) + Conley (1999) (the spatial kernel). No reference software combines all three — `did2s` (Butts & Gardner) implements the Gardner correction without rings or Conley; `conleyreg` and `acreg` implement Conley without the two-stage correction. Wave D is the synthesis. Applies unconditionally under `vcov_type ∈ {"hc1", "conley", "cluster"}` for both `event_study=False` AND `event_study=True`. **Formula** (Butts 2021 §3.1 + Gardner 2022 §4): `psi_i = gamma_hat' * X_{10,i} * eps_{10,i} - X_{2,i} * eps_{2,i}` where `gamma_hat = (X_10' X_10)^{-1} (X_1' X_2)` is the stage-1-projection-of-stage-2 cross-moment; meat = `Psi' K Psi` with `K` dispatched by `vcov_type` (identity for HC1, block-indicator for cluster, spatial kernel for Conley); vcov = `(X_2' X_2)^{-1} @ meat @ (X_2' X_2)^{-1}`. **Finite-sample multipliers:** `n/(n-p)` for HC1; `G/(G-1) * (n-1)/(n-p)` for cluster CR1; no multiplier for Conley (preserves `conleyreg` / Wave B convention). **Public surface:** `vcov_type="classical"` now raises `NotImplementedError` upfront (the Wave D synthesis has not been derived for the homoskedastic meat structure `sigma_hat^2 * (X_10' X_10)`); REGISTRY's "vcov_type restrictions" block updated accordingly. **Point estimates unchanged** (`tau_total`, `delta_j`, event-study `tau_k` / `delta_jk` are byte-identical to Wave B/C); SE values shift upward by 1-few percent depending on first-stage residual variance. 
**Implementation:** new module-level helper `_compute_gmm_corrected_meat` in `diff_diff/two_stage.py` (NOT a modification of the existing `_compute_gmm_variance` method — TwoStageDiD's path is unchanged); new module-level helper `_build_butts_fe_design_csr` in `diff_diff/spillover.py`; new module-level helper `_compute_conley_meat` in `diff_diff/conley.py` factored out of `_compute_conley_vcov` so the same kernel-application code path handles both standard sandwich (`X * residuals`) and Wave D IF outer product (`Psi`) cases. **No new public API kwarg** — the correction is unconditional. Wave D variance mode dispatch derives from the public contract: `vcov_type="conley"` → `"conley"`; `cluster=<col>` → `"cluster"` (CR1); otherwise `"hc1"`. **Wave B/C SE goldens re-pinned** at `tests/test_spillover.py::TestSpilloverDiDEventStudyBackwardCompat` (constants renamed `_WAVE_B_GOLDEN_*` → `_WAVE_D_GOLDEN_*`; pre-Wave-D references retained as commented baselines for the directional inflation invariant `_WAVE_B_UNCORRECTED_*`). **Tests:** new test classes `TestSpilloverDiDWaveDGmmCorrectedHc1Hand` (hand-derived `Psi` on a 4-unit × 3-period over-identified panel — matches at `atol=1e-12`), `TestSpilloverDiDWaveDGmmCorrectedEventStudy` (vcov shape on event-study path), `TestSpilloverDiDWaveDGmmCorrectedNanInferenceContract` (rank-deficient column propagation), `TestSpilloverDiDWaveDGmmCorrectedValidatorWiring` (Conley validator fires from the new helper), `TestSpilloverDiDWaveDGmmCorrectedFitIdempotence` (clone + repeat-fit bit-identity per `feedback_fit_does_not_mutate_config`), `TestSpilloverDiDWaveDPublicVarianceContract` (end-to-end public `cluster=<col>` CR1 routing, single-cluster rejection, classical NotImplementedError). Closes the Gardner-GMM follow-up row in `TODO.md`.
 - **BaconDecomposition R parity goldens.** Closes the PR-B deferral row in `TODO.md`. JSON goldens at `benchmarks/data/r_bacondecomp_golden.json` generated from the committed `benchmarks/R/generate_bacon_golden.R` script (3 fixtures: `uniform_3groups_with_never_treated`, `two_groups_no_never_treated`, `always_treated_remapped`) against `bacondecomp 0.1.1` on R 4.5.2. `tests/test_methodology_bacon.py::TestBaconParityR` now active (4 tests, no skips): TWFE coefficient parity at `atol=1e-6` across all 3 fixtures; weights-sum parity at `atol=1e-6` across all 3 fixtures; per-component estimate + weight parity at `atol=1e-6` on the 2 non-remap fixtures **and on the 6 timing-vs-timing rows of `always_treated_remapped`** (carve-out narrowed to U-bucket rows only); plus a dedicated fold-back test (`test_always_treated_remapped_fold_back_matches_r`) that pins the **documented convention divergence** on `always_treated_remapped` (R keeps `first_treat=1` as a distinct timing cohort and emits `Later vs Always Treated` comparisons; Python's paper-footnote-11 convention remaps those units to `U` and folds them into a single `treated_vs_never` cell per treated cohort) by aggregating R's split rows per cohort and asserting they match Python's single fold at `atol=1e-6`. The aggregate is invariant per Theorem 1; the per-component breakdown differs structurally between conventions but the fold-back is now directly asserted. New `**Note (R parity convention divergence on always-treated)**` and `**Deviation (first-period boundary extension on always-treated remap)**` in `docs/methodology/REGISTRY.md`. **First-period boundary deviation:** the paper uses strict `t_i < 1` for the always-treated bucket; the library uses the inclusive `first_treat <= min(time)` rule and folds `first_treat == min(time)` cohorts into `U`. R does NOT apply this fold (it keeps such cohorts as their own bucket). When `min(time) > 1` the rules coincide. 
Explicitly labeled in REGISTRY's Deviations block and mirrored in `METHODOLOGY_REVIEW.md` and `bacon.py`. METHODOLOGY_REVIEW.md tracker row promoted `**Complete** (R parity goldens pending)` → `**Complete**`.
 - **`generate_ddd_panel_data` — panel-structured DGP for Triple-Difference power analysis** (`diff_diff/prep_dgp.py`). New public function exported from `diff_diff` and `diff_diff.prep` for panel DDD simulations. Cross-sectional `generate_ddd_data` remains available unchanged. Produces a balanced panel of `n_units × n_periods` with two unit-level binary dimensions (`group`, `partition`) and a derived `post = 1[period >= treatment_period]` indicator; columns: `unit, period, outcome, group, partition, post, treated, true_effect` (+ `x1, x2` when `add_covariates=True`). DDD-CPT identification holds because the `group * partition` interaction enters as a unit-level (time-invariant) term, leaving the triple-interaction `treatment_effect * group * partition * post` as the sole source of differential group × partition trend. Compatible with `TripleDifference(cluster="unit").fit(..., time="post")` (the cluster kwarg is required because `TripleDifference` is the repeated-cross-section `panel=FALSE` estimator and unclustered SE on panel-generated rows understates variance under within-unit serial correlation; the point estimate `att` is invariant to clustering — see the new `TripleDifference` REGISTRY note on panel-shaped input). Users get panel-realistic unit fixed effects and within-unit serial correlation while the binary 2×2×2 estimator surface is unchanged. **Stratified allocation:** the partition split is drawn stratified-by-group at the requested `partition_frac` so every `(group, partition)` cell receives at least one unit; a targeted `ValueError` is raised at fit-time when the rounded cell counts (`n_units`, `group_frac`, `partition_frac`) would leave any cell empty. This guarantees the 2x2x2 DDD surface is populated for any valid input — independent marginal sampling (the cross-sectional `generate_ddd_data` convention) could collapse cells when marginals are small (e.g., `n_units=4, group_frac=partition_frac=0.25`). 
Validates `1 <= treatment_period < n_periods`, `group_frac` and `partition_frac` strictly in `(0, 1)`, and `n_units >= 4`. Deterministic recovery (`noise_sd=0`) matches `treatment_effect` to ~1e-15 (covered by `tests/test_prep.py::TestGenerateDddPanelData`, 16 tests including infeasible-config rejection and smallest-feasible-config round-trip through `TripleDifference.fit`). `power.simulate_power` is NOT yet auto-routed to the panel DGP for `TripleDifference` (the existing `_ddd_dgp_kwargs` registry entry still ignores `n_periods` and the existing `_check_ddd_dgp_compat` warning still fires on non-default kwargs) — that wiring is tracked as a follow-up in TODO.md.

From 7f5df65fb038ac807fd7d7ef84e1d34af3b91c71 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 12:06:26 -0400
Subject: [PATCH 6/8] utils: fix wild_bootstrap NaN propagation on
 rank-deficient designs

CI review (R5) identified a P1 bug in wild_bootstrap_se() that was
newly reachable via the TWFE HC2/HC2-BM full-dummy path:

Before this fix, wild_bootstrap_se built each draw's pseudo-outcome
as `y_star = X @ beta_restricted`. When solve_ols dropped a rank-
deficient nuisance column (e.g. a time-invariant covariate collinear
with the unit FE on the full-dummy design), beta_restricted contained
NaN on the dropped slot, and X @ beta_restricted propagated NaN
through every observation. The ATT was analytically identified but
the bootstrap crashed because y_star was all-NaN.

Pre-PR this was unreachable on TWFE (the within-transform absorbed
time-invariant covariates before they entered X), but the new full-
dummy HC2/HC2-BM branch keeps unit/time dummies explicit alongside
covariates, exposing the bug.

Two fixes in wild_bootstrap_se (diff_diff/utils.py):

1. Use solve_ols(return_fitted=True) to get NaN-safe fitted values
   from the kept columns; build y_star = fitted_restricted +
   residuals_restricted * obs_weights instead of X @ beta_restricted.
   fitted_restricted is computed from the kept columns by solve_ols,
   so dropped nuisance NaN doesn't propagate.

2. Replace bootstrap_t_stats[b] = 0.0 fallback for singular draws
   with np.nan + a finite_mask filter at the p-value step. Setting
   t* = 0 biased the p-value downward: a draw coerced to t* = 0
   cannot satisfy |t*| >= |t_original| for any nonzero t_original,
   so it enlarged the denominator without ever adding to the
   numerator, even though the draw is invalid rather than a genuine
   non-exceedance. The same nan-safe filter applies to
   bootstrap_coefs for the SE and percentile CI.
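
A minimal NumPy sketch of both failure modes, assuming a toy design
and a hand-picked `kept` column list (both are illustrative; this is
not how solve_ols selects columns internally):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 8
x1 = rng.normal(size=n)
# Column 2 is a multiple of the intercept, so the design is rank-deficient.
X = np.column_stack([np.ones(n), x1, 2.0 * np.ones(n)])
y = 1.0 + 0.5 * x1 + rng.normal(scale=0.1, size=n)

# Hypothetical: the columns a rank-aware solver would keep.
kept = [0, 1]
beta = np.full(X.shape[1], np.nan)  # NaN marks the dropped slot
beta[kept] = np.linalg.lstsq(X[:, kept], y, rcond=None)[0]

naive_fitted = X @ beta                # NaN reaches every row
safe_fitted = X[:, kept] @ beta[kept]  # kept columns only, stays finite

# Fix 2: invalid draws stay NaN and are filtered, not coerced to 0.
t_original = 2.0
t_star = np.array([2.5, -3.0, 1.0, np.nan, np.nan])
coerced = np.nan_to_num(t_star, nan=0.0)
p_coerced = float(np.mean(np.abs(coerced) >= abs(t_original)))
finite = t_star[np.isfinite(t_star)]
p_filtered = float(np.mean(np.abs(finite) >= abs(t_original)))
```

On this toy input the zero-coerced p-value is smaller than the
filtered one, because the two invalid draws pad the denominator.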

New regression test
`test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy`
fits TwoWayFixedEffects(vcov_type='hc2', inference='wild_bootstrap',
covariates=['x_invariant']) on a panel where x_invariant is time-
invariant (collinear with unit FE on the full-dummy design); asserts
finite ATT, SE, p-value, and CI. Pre-fix this test crashed with
all-NaN y_star.

No regression in the existing 53 wild_bootstrap tests across
test_wild_bootstrap, test_methodology_did, test_methodology_twfe,
test_conley_vcov, test_estimators_vcov_type, test_business_report,
test_replicate_weight_expansion, test_survey.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/utils.py                 | 151 ++++++++++++++++++++---------
 tests/test_estimators_vcov_type.py |  53 ++++++++++
 2 files changed, 156 insertions(+), 48 deletions(-)

diff --git a/diff_diff/utils.py b/diff_diff/utils.py
index e095b181..dced15da 100644
--- a/diff_diff/utils.py
+++ b/diff_diff/utils.py
@@ -572,16 +572,30 @@ def wild_bootstrap_se(
 
     # Fit restricted model (but we need to drop the column for the restricted coef)
     # Actually, for WCR bootstrap we keep all columns but impose the null via residuals
-    # Re-estimate with the restricted dependent variable
-    beta_restricted, residuals_restricted, _ = _solve_ols_linalg(X, y_restricted, return_vcov=False)
+    # Re-estimate with the restricted dependent variable.
+    #
+    # Use return_fitted=True so we get NaN-safe fitted values from the kept
+    # columns when solve_ols drops rank-deficient nuisance columns. Without
+    # this, building y_star via `X @ beta_restricted` would propagate NaN
+    # through every observation whenever a nuisance column was dropped
+    # (e.g. a time-invariant covariate collinear with the unit dummies on
+    # the full-dummy TWFE HC2/HC2-BM path), poisoning the entire bootstrap
+    # loop despite the ATT being analytically identified.
+    beta_restricted, residuals_restricted, fitted_restricted, _ = _solve_ols_linalg(
+        X, y_restricted, return_vcov=False, return_fitted=True
+    )
 
     # Create cluster-to-observation mapping for efficiency
     cluster_map = {c: np.where(cluster_ids == c)[0] for c in unique_clusters}
     cluster_indices = [cluster_map[c] for c in unique_clusters]
 
     # Step 3: Bootstrap loop
-    bootstrap_t_stats = np.zeros(n_bootstrap)
-    bootstrap_coefs = np.zeros(n_bootstrap)
+    # Use NaN for invalid draws (singular bootstrap SE) and filter at the
+    # p-value step, rather than coercing to t*=0, which biases the p-value
+    # downward (a t*=0 draw can never satisfy |t*| >= |t_original| for a
+    # nonzero t_original, yet it would still count in the denominator).
+    bootstrap_t_stats = np.full(n_bootstrap, np.nan)
+    bootstrap_coefs = np.full(n_bootstrap, np.nan)
 
     for b in range(n_bootstrap):
         # Generate cluster-level weights
@@ -592,8 +606,10 @@ def wild_bootstrap_se(
         for g, indices in enumerate(cluster_indices):
             obs_weights[indices] = cluster_weights[g]
 
-        # Construct bootstrap sample: y* = X @ beta_restricted + e_restricted * weights
-        y_star = np.dot(X, beta_restricted) + residuals_restricted * obs_weights
+        # Construct bootstrap sample: y* = fitted_restricted + e_restricted * weights
+        # (fitted_restricted comes from solve_ols's kept-columns reconstruction,
+        # so it's NaN-safe even when beta_restricted has NaN on dropped columns)
+        y_star = fitted_restricted + residuals_restricted * obs_weights
 
         # Estimate bootstrap coefficients with cluster-robust SE
         beta_star, residuals_star, vcov_star = _solve_ols_linalg(
@@ -603,28 +619,40 @@ def wild_bootstrap_se(
         assert vcov_star is not None
         se_star = np.sqrt(vcov_star[coefficient_index, coefficient_index])
 
-        # Compute bootstrap t-statistic (under null hypothesis)
-        if se_star > 0:
+        # Compute bootstrap t-statistic (under null hypothesis); invalid
+        # draws (singular SE) leave the NaN sentinel for filtering below.
+        if se_star > 0 and np.isfinite(beta_star[coefficient_index]):
             bootstrap_t_stats[b] = (beta_star[coefficient_index] - null_hypothesis) / se_star
-        else:
-            bootstrap_t_stats[b] = 0.0
-
-    # Step 4: Compute bootstrap p-value
-    # P-value is proportion of |t*| >= |t_original|
-    p_value = np.mean(np.abs(bootstrap_t_stats) >= np.abs(t_stat_original))
 
-    # Ensure p-value is at least 1/(n_bootstrap+1) to avoid exact zero
-    p_value = float(max(float(p_value), 1 / (n_bootstrap + 1)))
-
-    # Step 5: Compute bootstrap SE and confidence interval
-    # SE from standard deviation of bootstrap coefficient distribution
-    se_bootstrap = float(np.std(bootstrap_coefs, ddof=1))
+    # Step 4: Compute bootstrap p-value from VALID (finite) draws only
+    finite_mask = np.isfinite(bootstrap_t_stats)
+    n_valid = int(finite_mask.sum())
+    if n_valid == 0:
+        # All bootstrap draws were singular; fall back to a conservative
+        # p-value of 1.0 rather than silently returning a misleading value.
+        p_value = 1.0
+    else:
+        p_value = float(np.mean(np.abs(bootstrap_t_stats[finite_mask]) >= np.abs(t_stat_original)))
+        # Ensure p-value is at least 1/(n_valid+1) to avoid exact zero.
+        p_value = float(max(p_value, 1 / (n_valid + 1)))
+
+    # Step 5: Compute bootstrap SE and confidence interval from valid draws
+    # only (use nan-safe reductions, mirroring the p-value filtering above).
+    valid_coefs = bootstrap_coefs[np.isfinite(bootstrap_coefs)]
+    if valid_coefs.size >= 2:
+        se_bootstrap = float(np.std(valid_coefs, ddof=1))
+    else:
+        se_bootstrap = float("nan")
 
     # Percentile confidence interval from bootstrap distribution
     lower_percentile = alpha / 2 * 100
     upper_percentile = (1 - alpha / 2) * 100
-    ci_lower = float(np.percentile(bootstrap_coefs, lower_percentile))
-    ci_upper = float(np.percentile(bootstrap_coefs, upper_percentile))
+    if valid_coefs.size >= 1:
+        ci_lower = float(np.percentile(valid_coefs, lower_percentile))
+        ci_upper = float(np.percentile(valid_coefs, upper_percentile))
+    else:
+        ci_lower = float("nan")
+        ci_upper = float("nan")
 
     return WildBootstrapResults(
         se=se_bootstrap,
@@ -823,7 +851,11 @@ def check_parallel_trends_robust(
 
     # Compute outcome changes
     treated_changes, control_changes = _compute_outcome_changes(
-        pre_data, outcome, time, treatment_group, unit,
+        pre_data,
+        outcome,
+        time,
+        treatment_group,
+        unit,
         caller_label="check_parallel_trends_robust",
     )
 
@@ -1026,7 +1058,11 @@ def equivalence_test_trends(
 
     # Compute outcome changes
     treated_changes, control_changes = _compute_outcome_changes(
-        pre_data, outcome, time, treatment_group, unit,
+        pre_data,
+        outcome,
+        time,
+        treatment_group,
+        unit,
         caller_label="equivalence_test_trends",
     )
 
@@ -1367,15 +1403,9 @@ def _sc_weight_fw(
     """
     Y_c = np.ascontiguousarray(Y, dtype=np.float64)
     init_c = (
-        np.ascontiguousarray(init_weights, dtype=np.float64)
-        if init_weights is not None
-        else None
-    )
-    rw_c = (
-        np.ascontiguousarray(reg_weights, dtype=np.float64)
-        if reg_weights is not None
-        else None
+        np.ascontiguousarray(init_weights, dtype=np.float64) if init_weights is not None else None
     )
+    rw_c = np.ascontiguousarray(reg_weights, dtype=np.float64) if reg_weights is not None else None
 
     if rw_c is not None:
         # Validate reg_weights shape at the dispatcher so Rust and NumPy
@@ -1396,26 +1426,53 @@ def _sc_weight_fw(
         if reg_weights is not None:
             if return_convergence:
                 weights, converged = _rust_sc_weight_fw_weighted_with_convergence(
-                    Y_c, zeta, intercept, init_c, min_decrease, max_iter, rw_c,
+                    Y_c,
+                    zeta,
+                    intercept,
+                    init_c,
+                    min_decrease,
+                    max_iter,
+                    rw_c,
                 )
                 return np.asarray(weights), converged
             return np.asarray(
                 _rust_sc_weight_fw_weighted(
-                    Y_c, zeta, intercept, init_c, min_decrease, max_iter, rw_c,
+                    Y_c,
+                    zeta,
+                    intercept,
+                    init_c,
+                    min_decrease,
+                    max_iter,
+                    rw_c,
                 )
             )
         if return_convergence:
             weights, converged = _rust_sc_weight_fw_with_convergence(
-                Y_c, zeta, intercept, init_c, min_decrease, max_iter,
+                Y_c,
+                zeta,
+                intercept,
+                init_c,
+                min_decrease,
+                max_iter,
             )
             return np.asarray(weights), converged
         return np.asarray(
             _rust_sc_weight_fw(
-                Y_c, zeta, intercept, init_c, min_decrease, max_iter,
+                Y_c,
+                zeta,
+                intercept,
+                init_c,
+                min_decrease,
+                max_iter,
             )
         )
     return _sc_weight_fw_numpy(
-        Y, zeta, intercept, init_weights, min_decrease, max_iter,
+        Y,
+        zeta,
+        intercept,
+        init_weights,
+        min_decrease,
+        max_iter,
         return_convergence=return_convergence,
         reg_weights=reg_weights,
     )
@@ -1910,8 +1967,7 @@ def compute_sdid_unit_weights_survey(
 
     if rw_control.shape != (n_control,):
         raise ValueError(
-            f"rw_control shape {rw_control.shape} does not match expected "
-            f"({n_control},)"
+            f"rw_control shape {rw_control.shape} does not match expected " f"({n_control},)"
         )
 
     if n_control == 0:
@@ -1924,10 +1980,12 @@ def compute_sdid_unit_weights_survey(
     # Build the column-scaled Y matrix: each control column j is multiplied by
     # rw_control[j], so A·ω in the loss equals Σ_j rw_j·ω_j·Y_j,pre.
     rw = np.ascontiguousarray(rw_control, dtype=np.float64)
-    Y_scaled = np.column_stack([
-        Y_pre_control * rw[np.newaxis, :],
-        Y_pre_treated_mean.reshape(-1, 1),
-    ])
+    Y_scaled = np.column_stack(
+        [
+            Y_pre_control * rw[np.newaxis, :],
+            Y_pre_treated_mean.reshape(-1, 1),
+        ]
+    )
 
     if return_convergence:
         omega, conv1 = _sc_weight_fw(
@@ -2031,8 +2089,7 @@ def compute_time_weights_survey(
 
     if rw_control.shape != (n_control,):
         raise ValueError(
-            f"rw_control shape {rw_control.shape} does not match expected "
-            f"({n_control},)"
+            f"rw_control shape {rw_control.shape} does not match expected " f"({n_control},)"
         )
 
     if Y_post_control.shape[0] == 0:
@@ -2058,9 +2115,7 @@ def compute_time_weights_survey(
     # does not re-center on the row-scaled matrix.
     rw_sum = float(np.sum(rw_control))
     if intercept and rw_sum > 0:
-        col_weighted_means = (
-            (Y_time * rw_control[:, np.newaxis]).sum(axis=0) / rw_sum
-        )
+        col_weighted_means = (Y_time * rw_control[:, np.newaxis]).sum(axis=0) / rw_sum
         Y_time = Y_time - col_weighted_means[np.newaxis, :]
 
     # Row-scale by sqrt(rw): after weighted centering (if any), each
diff --git a/tests/test_estimators_vcov_type.py b/tests/test_estimators_vcov_type.py
index ffec660c..b25a577d 100644
--- a/tests/test_estimators_vcov_type.py
+++ b/tests/test_estimators_vcov_type.py
@@ -14,6 +14,8 @@
 
 from __future__ import annotations
 
+import warnings
+
 import numpy as np
 import pandas as pd
 import pytest
@@ -825,6 +827,57 @@ def test_twfe_hc2_explicit_no_auto_cluster_analytical(self):
         # No auto-cluster on explicit one-way hc2 + analytical.
         assert res.cluster_name is None
 
+    def test_twfe_hc2_wild_bootstrap_survives_rank_deficient_full_dummy(self):
+        """TWFE(vcov_type='hc2', inference='wild_bootstrap') stays finite when
+        the full-dummy design has a rank-deficient nuisance column.
+
+        Regression for a P1 bug in `wild_bootstrap_se()`: it previously built
+        `y_star = X @ beta_restricted`, which propagates NaN through every
+        observation whenever solve_ols dropped a nuisance column (e.g. a
+        time-invariant covariate collinear with the unit FE). The ATT was
+        analytically identified, but the bootstrap crashed because every
+        `y_star` was all-NaN. Reachable on the new TWFE HC2 full-dummy path
+        (the within-transform path absorbed time-invariant covariates so
+        the issue was hidden pre-PR).
+
+        Fix: `wild_bootstrap_se()` now uses solve_ols's kept-columns
+        `fitted_restricted` instead of `X @ beta_restricted`, so dropped
+        nuisance columns no longer poison `y_star`.
+        """
+        data = _make_did_panel(n_units=20).copy()
+        # x_invariant is time-invariant (only varies across units),
+        # so it's collinear with the unit fixed effect on the
+        # full-dummy design and gets dropped by solve_ols.
+        rng = np.random.default_rng(99)
+        unit_to_x = {u: rng.normal() for u in data["unit"].unique()}
+        data["x_invariant"] = data["unit"].map(unit_to_x).astype(float)
+        with warnings.catch_warnings():
+            # The expected rank-deficient column drop emits a UserWarning;
+            # we accept it as part of the documented full-dummy path.
+            warnings.simplefilter("ignore", UserWarning)
+            res = TwoWayFixedEffects(
+                vcov_type="hc2",
+                inference="wild_bootstrap",
+                n_bootstrap=50,
+                seed=1,
+            ).fit(
+                data,
+                outcome="y",
+                treatment="treated",
+                time="time",
+                unit="unit",
+                covariates=["x_invariant"],
+            )
+        # ATT remains identified despite the dropped nuisance column.
+        assert np.isfinite(res.att), "ATT should remain finite despite rank deficiency"
+        assert np.isfinite(res.se), (
+            "Bootstrap SE should be finite — if NaN, wild_bootstrap_se's "
+            "y_star construction is propagating NaN from beta_restricted."
+        )
+        assert res.se > 0
+        assert np.isfinite(res.p_value)
+        assert np.isfinite(res.conf_int[0]) and np.isfinite(res.conf_int[1])
+
     def test_twfe_hc2_wild_bootstrap_keeps_auto_cluster(self):
         """Wild-bootstrap inference on TWFE(vcov_type='hc2') must keep the
         unit auto-cluster (bootstrap resampling uses the cluster structure).

From 25c364f77c681db7320b8e652d6dfa3deae2fe7e Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 12:15:06 -0400
Subject: [PATCH 7/8] utils: enforce all-or-nothing NaN inference contract on
 degenerate bootstrap
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI review (R6) caught a new P0 in my R5 wild_bootstrap_se() fix:
the degenerate-bootstrap branches violated the all-or-nothing NaN
contract from feedback_bootstrap_nan_on_invalid_contract:

- n_valid == 0 returned p_value = 1.0 with se = NaN (split inference)
- valid_coefs.size == 1 returned a finite percentile CI from a single
  draw alongside se = NaN
- t_stat_original was always finite (analytical), surfacing alongside
  NaN bootstrap se when bootstrap was degenerate

Fix: when n_valid < 2 OR valid_coefs.size < 2, NaN-out the entire
inference quadruple (se, p_value, ci_lower, ci_upper) AND the
surfaced t_stat_original. The analytical t-stat from step 1 is still
computed for diagnostic use inside the helper but not propagated
to the user-facing result on a degenerate bootstrap — this prevents
the estimator wrapper from emitting an analytical t-stat alongside
NaN bootstrap fields, which would mix inference families on the
same coefficient.
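
A minimal sketch of the contract, assuming a hypothetical helper
`summarize_bootstrap` that returns a plain dict (the real code
returns WildBootstrapResults and carries more fields):

```python
import math

import numpy as np


def summarize_bootstrap(t_stats, coefs, t_original, alpha=0.05):
    """Return all-finite or all-NaN inference, never a mix."""
    t_stats = np.asarray(t_stats, dtype=float)
    coefs = np.asarray(coefs, dtype=float)
    valid_t = t_stats[np.isfinite(t_stats)]
    valid_coefs = coefs[np.isfinite(coefs)]

    if valid_t.size < 2 or valid_coefs.size < 2:
        # Degenerate bootstrap: every surfaced field moves to NaN.
        nan = float("nan")
        return {"se": nan, "p_value": nan, "t_stat": nan,
                "ci_lower": nan, "ci_upper": nan}

    p_value = float(np.mean(np.abs(valid_t) >= abs(t_original)))
    p_value = max(p_value, 1.0 / (valid_t.size + 1))  # avoid exact zero
    return {
        "se": float(np.std(valid_coefs, ddof=1)),
        "p_value": p_value,
        "t_stat": float(t_original),
        "ci_lower": float(np.percentile(valid_coefs, 100 * alpha / 2)),
        "ci_upper": float(np.percentile(valid_coefs, 100 * (1 - alpha / 2))),
    }


# One valid draw is not enough: all five fields come back NaN.
degenerate = summarize_bootstrap([np.nan, 1.2], [np.nan, 0.4], 2.0)
# Three valid draws: all five fields are finite.
healthy = summarize_bootstrap([0.5, -2.5, 1.0], [0.1, 0.3, 0.2], 2.0)
```

Checking `math.isnan` on every value of `degenerate` mirrors what
the two new regression tests assert on the real result object.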

New regression tests in tests/test_wild_bootstrap.py::
TestWildBootstrapDegenerateAllNaN:

- test_degenerate_n_valid_zero_returns_all_nan: monkeypatches
  solve_ols so every bootstrap draw has singular vcov; asserts
  ALL five user-surface fields are NaN.

- test_degenerate_single_valid_draw_returns_all_nan: forces exactly
  one valid draw (n_valid == 1); asserts ALL five fields NaN — no
  percentile CI from a single-point sample.

Neither branch was exercised by the existing analytical-design
tests, which is why the R5 fix passed CI and the R6 reviewer caught
the contract violation only through code inspection.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/utils.py           |  46 ++--
 tests/test_wild_bootstrap.py | 469 +++++++++++++++++++----------------
 2 files changed, 281 insertions(+), 234 deletions(-)

diff --git a/diff_diff/utils.py b/diff_diff/utils.py
index dced15da..41f2f574 100644
--- a/diff_diff/utils.py
+++ b/diff_diff/utils.py
@@ -624,40 +624,48 @@ def wild_bootstrap_se(
         if se_star > 0 and np.isfinite(beta_star[coefficient_index]):
             bootstrap_t_stats[b] = (beta_star[coefficient_index] - null_hypothesis) / se_star
 
-    # Step 4: Compute bootstrap p-value from VALID (finite) draws only
+    # Step 4: Compute bootstrap inference from VALID (finite) draws only.
+    #
+    # All-or-nothing NaN contract (per feedback_bootstrap_nan_on_invalid_contract):
+    # when bootstrap output is degenerate (fewer than 2 finite t-stats or
+    # 2 finite coefs), return NaN across the full inference surface (se,
+    # p_value, both CI endpoints, AND the surfaced t_stat_original). The
+    # original analytical t_stat is still computed in step 1 for diagnostic
+    # use but is NOT propagated to the user-facing result when bootstrap
+    # is degenerate — surfacing it alongside NaN se/p/CI would mix
+    # analytical and bootstrap inference families on the same coefficient.
     finite_mask = np.isfinite(bootstrap_t_stats)
     n_valid = int(finite_mask.sum())
-    if n_valid == 0:
-        # All bootstrap draws were singular; fall back to a conservative
-        # p-value of 1.0 rather than silently returning a misleading value.
-        p_value = 1.0
-    else:
-        p_value = float(np.mean(np.abs(bootstrap_t_stats[finite_mask]) >= np.abs(t_stat_original)))
-        # Ensure p-value is at least 1/(n_valid+1) to avoid exact zero.
-        p_value = float(max(p_value, 1 / (n_valid + 1)))
-
-    # Step 5: Compute bootstrap SE and confidence interval from valid draws
-    # only (use nan-safe reductions, mirroring the p-value filtering above).
     valid_coefs = bootstrap_coefs[np.isfinite(bootstrap_coefs)]
-    if valid_coefs.size >= 2:
-        se_bootstrap = float(np.std(valid_coefs, ddof=1))
-    else:
-        se_bootstrap = float("nan")
 
-    # Percentile confidence interval from bootstrap distribution
     lower_percentile = alpha / 2 * 100
     upper_percentile = (1 - alpha / 2) * 100
-    if valid_coefs.size >= 1:
+
+    if n_valid >= 2 and valid_coefs.size >= 2:
+        p_value = float(np.mean(np.abs(bootstrap_t_stats[finite_mask]) >= np.abs(t_stat_original)))
+        # Ensure p-value is at least 1/(n_valid+1) to avoid exact zero.
+        p_value = float(max(p_value, 1 / (n_valid + 1)))
+        se_bootstrap = float(np.std(valid_coefs, ddof=1))
         ci_lower = float(np.percentile(valid_coefs, lower_percentile))
         ci_upper = float(np.percentile(valid_coefs, upper_percentile))
+        surfaced_t_stat = t_stat_original
     else:
+        # Degenerate bootstrap (insufficient valid draws): NaN-out the
+        # entire inference tuple. Downstream consumers (estimator-level
+        # `_run_wild_bootstrap_inference`) map these fields directly onto
+        # the result object; this guarantees the (se, t_stat, p_value, ci)
+        # quadruple moves together rather than reporting analytical t_stat
+        # with NaN se.
+        p_value = float("nan")
+        se_bootstrap = float("nan")
         ci_lower = float("nan")
         ci_upper = float("nan")
+        surfaced_t_stat = float("nan")
 
     return WildBootstrapResults(
         se=se_bootstrap,
         p_value=p_value,
-        t_stat_original=t_stat_original,
+        t_stat_original=surfaced_t_stat,
         ci_lower=ci_lower,
         ci_upper=ci_upper,
         n_clusters=n_clusters,
diff --git a/tests/test_wild_bootstrap.py b/tests/test_wild_bootstrap.py
index a9ddae35..bcf6ae6b 100644
--- a/tests/test_wild_bootstrap.py
+++ b/tests/test_wild_bootstrap.py
@@ -4,7 +4,6 @@
 Tests the wild_bootstrap_se() function and its integration with DiD estimators.
 """
 
-
 import numpy as np
 import pandas as pd
 import pytest
@@ -49,14 +48,16 @@ def clustered_did_data():
                     y += 3.0  # True ATT = 3.0
                 y += np.random.normal(0, 1)  # Idiosyncratic error
 
-                data.append({
-                    "cluster": cluster,
-                    "unit": cluster * obs_per_cluster + obs,
-                    "period": period,
-                    "treated": int(is_treated),
-                    "post": period,
-                    "outcome": y,
-                })
+                data.append(
+                    {
+                        "cluster": cluster,
+                        "unit": cluster * obs_per_cluster + obs,
+                        "period": period,
+                        "treated": int(is_treated),
+                        "post": period,
+                        "outcome": y,
+                    }
+                )
 
     return pd.DataFrame(data)
 
@@ -84,14 +85,16 @@ def few_cluster_data():
                     y += 4.0  # True ATT = 4.0
                 y += np.random.normal(0, 1)
 
-                data.append({
-                    "cluster": cluster,
-                    "unit": cluster * obs_per_cluster + obs,
-                    "period": period,
-                    "treated": int(is_treated),
-                    "post": period,
-                    "outcome": y,
-                })
+                data.append(
+                    {
+                        "cluster": cluster,
+                        "unit": cluster * obs_per_cluster + obs,
+                        "period": period,
+                        "treated": int(is_treated),
+                        "post": period,
+                        "outcome": y,
+                    }
+                )
 
     return pd.DataFrame(data)
 
@@ -144,10 +147,16 @@ def test_webb_weights_values(self):
         rng = np.random.default_rng(42)
         weights = _generate_webb_weights(10000, rng)
 
-        expected_values = np.array([
-            -np.sqrt(3/2), -np.sqrt(2/2), -np.sqrt(1/2),
-            np.sqrt(1/2), np.sqrt(2/2), np.sqrt(3/2)
-        ])
+        expected_values = np.array(
+            [
+                -np.sqrt(3 / 2),
+                -np.sqrt(2 / 2),
+                -np.sqrt(1 / 2),
+                np.sqrt(1 / 2),
+                np.sqrt(2 / 2),
+                np.sqrt(3 / 2),
+            ]
+        )
 
         # Check all observed values are in expected set
         for w in weights:
@@ -200,10 +209,7 @@ def test_returns_wild_bootstrap_results(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert isinstance(results, WildBootstrapResults)
@@ -214,10 +220,7 @@ def test_se_is_positive(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert results.se > 0
@@ -228,10 +231,7 @@ def test_p_value_in_valid_range(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert 0 <= results.p_value <= 1
@@ -242,10 +242,7 @@ def test_ci_contains_reasonable_values(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(199)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert results.ci_lower < results.ci_upper
@@ -256,17 +253,11 @@ def test_reproducibility_with_seed(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results1 = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         results2 = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert results1.se == results2.se
@@ -279,17 +270,11 @@ def test_different_seeds_different_results(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results1 = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         results2 = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=123
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=123
         )
 
         # Should be different (not exactly equal)
@@ -302,11 +287,14 @@ def test_different_weight_types(self, ols_components, ci_params):
 
         for weight_type in ["rademacher", "webb", "mammen"]:
             results = wild_bootstrap_se(
-                X, y, residuals, cluster_ids,
+                X,
+                y,
+                residuals,
+                cluster_ids,
                 coefficient_index=3,
                 n_bootstrap=n_boot,
                 weight_type=weight_type,
-                seed=42
+                seed=42,
             )
 
             assert results.se > 0
@@ -319,9 +307,7 @@ def test_invalid_weight_type_raises(self, ols_components):
 
         with pytest.raises(ValueError, match="weight_type must be one of"):
             wild_bootstrap_se(
-                X, y, residuals, cluster_ids,
-                coefficient_index=3,
-                weight_type="invalid"
+                X, y, residuals, cluster_ids, coefficient_index=3, weight_type="invalid"
             )
 
     def test_few_clusters_warning(self, few_cluster_data, ci_params):
@@ -341,10 +327,7 @@ def test_few_clusters_warning(self, few_cluster_data, ci_params):
 
         with pytest.warns(UserWarning, match="Only 4 clusters detected"):
             wild_bootstrap_se(
-                X, y, residuals, cluster_ids,
-                coefficient_index=3,
-                n_bootstrap=n_boot,
-                seed=42
+                X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
             )
 
     def test_too_few_clusters_raises(self, ols_components):
@@ -355,10 +338,7 @@ def test_too_few_clusters_raises(self, ols_components):
         single_cluster = np.zeros(len(y))
 
         with pytest.raises(ValueError, match="at least 2 clusters"):
-            wild_bootstrap_se(
-                X, y, residuals, single_cluster,
-                coefficient_index=3
-            )
+            wild_bootstrap_se(X, y, residuals, single_cluster, coefficient_index=3)
 
     def test_n_clusters_reported_correctly(self, ols_components, ci_params):
         """Test n_clusters is reported correctly."""
@@ -366,10 +346,7 @@ def test_n_clusters_reported_correctly(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert results.n_clusters == 10
@@ -380,10 +357,7 @@ def test_n_bootstrap_reported_correctly(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(199)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         assert results.n_bootstrap == n_boot
@@ -401,18 +375,10 @@ def test_did_with_wild_bootstrap(self, clustered_did_data, ci_params):
         """Test DifferenceInDifferences with wild bootstrap."""
         n_boot = ci_params.bootstrap(99)
         did = DifferenceInDifferences(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
-        results = did.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results = did.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
         assert results.inference_method == "wild_bootstrap"
         assert results.n_bootstrap == n_boot
@@ -423,32 +389,16 @@ def test_did_wild_bootstrap_reproducibility(self, clustered_did_data, ci_params)
         """Test wild bootstrap results are reproducible with seed."""
         n_boot = ci_params.bootstrap(99)
         did1 = DifferenceInDifferences(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
         did2 = DifferenceInDifferences(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
-        results1 = did1.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results1 = did1.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
-        results2 = did2.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results2 = did2.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
         assert results1.se == results2.se
         assert results1.p_value == results2.p_value
@@ -458,24 +408,15 @@ def test_did_analytical_vs_bootstrap_att_same(self, clustered_did_data, ci_param
         n_boot = ci_params.bootstrap(99)
         did_analytical = DifferenceInDifferences(cluster="cluster")
         did_bootstrap = DifferenceInDifferences(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
         results_analytical = did_analytical.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
+            clustered_did_data, outcome="outcome", treatment="treated", time="post"
         )
 
         results_bootstrap = did_bootstrap.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
+            clustered_did_data, outcome="outcome", treatment="treated", time="post"
         )
 
         # ATT should be identical
@@ -489,15 +430,10 @@ def test_did_wild_bootstrap_with_webb_weights(self, clustered_did_data, ci_param
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="webb",
-            seed=42
+            seed=42,
         )
 
-        results = did.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results = did.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
         assert results.inference_method == "wild_bootstrap"
         assert results.se > 0
@@ -506,17 +442,10 @@ def test_did_wild_bootstrap_requires_cluster(self, clustered_did_data, ci_params
         """Test that wild bootstrap is only used when cluster is specified."""
         n_boot = ci_params.bootstrap(99)
         did = DifferenceInDifferences(
-            inference="wild_bootstrap",  # No cluster specified
-            n_bootstrap=n_boot,
-            seed=42
+            inference="wild_bootstrap", n_bootstrap=n_boot, seed=42  # No cluster specified
         )
 
-        results = did.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results = did.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
         # Should fall back to analytical since no cluster specified
         assert results.inference_method == "analytical"
@@ -525,18 +454,11 @@ def test_twfe_with_wild_bootstrap(self, clustered_did_data, ci_params):
         """Test TwoWayFixedEffects with wild bootstrap."""
         n_boot = ci_params.bootstrap(99)
         twfe = TwoWayFixedEffects(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
         results = twfe.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="period",
-            unit="unit"
+            clustered_did_data, outcome="outcome", treatment="treated", time="period", unit="unit"
         )
 
         assert results.inference_method == "wild_bootstrap"
@@ -547,18 +469,10 @@ def test_summary_shows_bootstrap_info(self, clustered_did_data, ci_params):
         """Test that summary shows bootstrap info."""
         n_boot = ci_params.bootstrap(99)
         did = DifferenceInDifferences(
-            cluster="cluster",
-            inference="wild_bootstrap",
-            n_bootstrap=n_boot,
-            seed=42
+            cluster="cluster", inference="wild_bootstrap", n_bootstrap=n_boot, seed=42
         )
 
-        results = did.fit(
-            clustered_did_data,
-            outcome="outcome",
-            treatment="treated",
-            time="post"
-        )
+        results = did.fit(clustered_did_data, outcome="outcome", treatment="treated", time="post")
 
         summary = results.summary()
 
@@ -569,10 +483,7 @@ def test_summary_shows_bootstrap_info(self, clustered_did_data, ci_params):
     def test_get_params_includes_bootstrap_params(self):
         """Test get_params includes bootstrap parameters."""
         did = DifferenceInDifferences(
-            inference="wild_bootstrap",
-            n_bootstrap=499,
-            bootstrap_weights="webb",
-            seed=123
+            inference="wild_bootstrap", n_bootstrap=499, bootstrap_weights="webb", seed=123
         )
 
         params = did.get_params()
@@ -586,11 +497,7 @@ def test_set_params_for_bootstrap(self):
         """Test set_params works for bootstrap parameters."""
         did = DifferenceInDifferences()
 
-        did.set_params(
-            inference="wild_bootstrap",
-            n_bootstrap=499,
-            bootstrap_weights="mammen"
-        )
+        did.set_params(inference="wild_bootstrap", n_bootstrap=499, bootstrap_weights="mammen")
 
         assert did.inference == "wild_bootstrap"
         assert did.n_bootstrap == 499
@@ -611,10 +518,7 @@ def test_summary_format(self, ols_components, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         summary = results.summary()
@@ -630,10 +534,7 @@ def test_print_summary(self, ols_components, capsys, ci_params):
         n_boot = ci_params.bootstrap(99)
 
         results = wild_bootstrap_se(
-            X, y, residuals, cluster_ids,
-            coefficient_index=3,
-            n_bootstrap=n_boot,
-            seed=42
+            X, y, residuals, cluster_ids, coefficient_index=3, n_bootstrap=n_boot, seed=42
         )
 
         results.print_summary()
@@ -672,14 +573,16 @@ def test_three_clusters_still_works(self, ci_params):
                         y += 3.0
                     y += np.random.normal(0, 1)
 
-                    data.append({
-                        "cluster": cluster,
-                        "unit": cluster * obs_per_cluster + obs,
-                        "period": period,
-                        "treated": int(is_treated),
-                        "post": period,
-                        "outcome": y,
-                    })
+                    data.append(
+                        {
+                            "cluster": cluster,
+                            "unit": cluster * obs_per_cluster + obs,
+                            "period": period,
+                            "treated": int(is_treated),
+                            "post": period,
+                            "outcome": y,
+                        }
+                    )
 
         df = pd.DataFrame(data)
 
@@ -688,17 +591,12 @@ def test_three_clusters_still_works(self, ci_params):
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="webb",  # Webb recommended for few clusters
-            seed=42
+            seed=42,
         )
 
         # Should warn about few clusters but still produce valid results
         with pytest.warns(UserWarning, match="Only 3 clusters"):
-            results = did.fit(
-                df,
-                outcome="outcome",
-                treatment="treated",
-                time="post"
-            )
+            results = did.fit(df, outcome="outcome", treatment="treated", time="post")
 
         assert results.se > 0
         assert results.inference_method == "wild_bootstrap"
@@ -726,14 +624,16 @@ def test_two_clusters_minimum(self, ci_params):
                         y += 3.0
                     y += np.random.normal(0, 1)
 
-                    data.append({
-                        "cluster": cluster,
-                        "unit": cluster * obs_per_cluster + obs,
-                        "period": period,
-                        "treated": int(is_treated),
-                        "post": period,
-                        "outcome": y,
-                    })
+                    data.append(
+                        {
+                            "cluster": cluster,
+                            "unit": cluster * obs_per_cluster + obs,
+                            "period": period,
+                            "treated": int(is_treated),
+                            "post": period,
+                            "outcome": y,
+                        }
+                    )
 
         df = pd.DataFrame(data)
 
@@ -742,17 +642,12 @@ def test_two_clusters_minimum(self, ci_params):
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="webb",
-            seed=42
+            seed=42,
         )
 
         # Should warn about few clusters
         with pytest.warns(UserWarning, match="Only 2 clusters"):
-            results = did.fit(
-                df,
-                outcome="outcome",
-                treatment="treated",
-                time="post"
-            )
+            results = did.fit(df, outcome="outcome", treatment="treated", time="post")
 
         # Results should still be valid (though may have high variance)
         assert results.se > 0
@@ -767,7 +662,7 @@ def test_few_clusters_webb_vs_rademacher(self, few_cluster_data, ci_params):
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="webb",
-            seed=42
+            seed=42,
         )
 
         did_rademacher = DifferenceInDifferences(
@@ -775,23 +670,17 @@ def test_few_clusters_webb_vs_rademacher(self, few_cluster_data, ci_params):
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="rademacher",
-            seed=42
+            seed=42,
         )
 
         with pytest.warns(UserWarning):
             results_webb = did_webb.fit(
-                few_cluster_data,
-                outcome="outcome",
-                treatment="treated",
-                time="post"
+                few_cluster_data, outcome="outcome", treatment="treated", time="post"
             )
 
         with pytest.warns(UserWarning):
             results_rademacher = did_rademacher.fit(
-                few_cluster_data,
-                outcome="outcome",
-                treatment="treated",
-                time="post"
+                few_cluster_data, outcome="outcome", treatment="treated", time="post"
             )
 
         # Both should produce valid results
@@ -810,18 +699,168 @@ def test_few_clusters_confidence_intervals_valid(self, few_cluster_data, ci_para
             inference="wild_bootstrap",
             n_bootstrap=n_boot,
             bootstrap_weights="webb",
-            seed=42
+            seed=42,
         )
 
         with pytest.warns(UserWarning):
-            results = did.fit(
-                few_cluster_data,
-                outcome="outcome",
-                treatment="treated",
-                time="post"
-            )
+            results = did.fit(few_cluster_data, outcome="outcome", treatment="treated", time="post")
 
         lower, upper = results.conf_int
         assert lower < upper
         # CI should contain the point estimate
         assert lower < results.att < upper
+
+
+# =============================================================================
+# Degenerate bootstrap: all-or-nothing NaN inference contract
+# =============================================================================
+
+
+class TestWildBootstrapDegenerateAllNaN:
+    """Verify wild_bootstrap_se() returns the full NaN inference tuple when
+    the bootstrap is degenerate (fewer than 2 valid coefficient draws),
+    per feedback_bootstrap_nan_on_invalid_contract.md.
+
+    Mocks the internal solve_ols path so we can force `se_star <= 0` on
+    every draw (n_valid == 0) or on all but one draw (n_valid == 1)
+    without relying on a pathological numerical design. These two
+    branches are not exercised by the analytical-design tests above.
+    """
+
+    def _make_ols_components(self, n: int = 40):
+        rng = np.random.default_rng(0)
+        cluster_ids = np.arange(n) // 5  # 5 obs per cluster; 8 clusters at n=40
+        X = np.column_stack(
+            [
+                np.ones(n),
+                rng.normal(size=n),
+            ]
+        )
+        y = X @ np.array([1.0, 0.5]) + rng.normal(scale=0.1, size=n)
+        return X, y, cluster_ids
+
+    def test_degenerate_n_valid_zero_returns_all_nan(self, monkeypatch):
+        """When every bootstrap draw is singular, se / t_stat / p_value / CI
+        are all NaN (full inference quadruple moves together).
+        """
+        from diff_diff import utils as utils_mod
+
+        X, y, cluster_ids = self._make_ols_components()
+        orig_solve = utils_mod._solve_ols_linalg
+        call_count = {"n": 0}
+
+        def fake_solve(X_, y_, cluster_ids=None, return_vcov=True, return_fitted=False, **kw):
+            call_count["n"] += 1
+            # Calls 1 (original) and 2 (restricted) succeed normally.
+            if call_count["n"] <= 2:
+                return orig_solve(
+                    X_,
+                    y_,
+                    cluster_ids=cluster_ids,
+                    return_vcov=return_vcov,
+                    return_fitted=return_fitted,
+                    **kw,
+                )
+            # Bootstrap draws: force a singular vcov so se_star == 0.
+            coefs, residuals, _ = orig_solve(
+                X_,
+                y_,
+                cluster_ids=cluster_ids,
+                return_vcov=True,
+                **kw,
+            )
+            singular_vcov = np.zeros((X_.shape[1], X_.shape[1]))
+            if return_fitted:
+                return coefs, residuals, X_ @ coefs, singular_vcov
+            return coefs, residuals, singular_vcov
+
+        monkeypatch.setattr(utils_mod, "_solve_ols_linalg", fake_solve)
+        # Compute residuals on the original design (needed for the helper signature).
+        from diff_diff.linalg import solve_ols as _solve_orig
+
+        coefs0, residuals0, _ = _solve_orig(X, y, cluster_ids=cluster_ids)
+        results = utils_mod.wild_bootstrap_se(
+            X=X,
+            y=y,
+            residuals=residuals0,
+            cluster_ids=cluster_ids,
+            coefficient_index=1,
+            n_bootstrap=20,
+            seed=1,
+        )
+        # All five user-surface fields must be NaN together.
+        assert np.isnan(results.se), f"se should be NaN, got {results.se}"
+        assert np.isnan(results.t_stat_original), (
+            f"t_stat_original should be NaN under degenerate bootstrap "
+            f"(analytical t-stat must not surface alongside NaN se), "
+            f"got {results.t_stat_original}"
+        )
+        assert np.isnan(results.p_value), f"p_value should be NaN, got {results.p_value}"
+        assert np.isnan(results.ci_lower), f"ci_lower should be NaN, got {results.ci_lower}"
+        assert np.isnan(results.ci_upper), f"ci_upper should be NaN, got {results.ci_upper}"
+
+    def test_degenerate_single_valid_draw_returns_all_nan(self, monkeypatch):
+        """When exactly one bootstrap draw is finite (insufficient for
+        ddof=1 std), the full inference tuple is NaN — we don't return a
+        finite percentile CI on a single-point sample with NaN se.
+        """
+        from diff_diff import utils as utils_mod
+
+        X, y, cluster_ids = self._make_ols_components()
+        orig_solve = utils_mod._solve_ols_linalg
+        call_count = {"n": 0}
+
+        def fake_solve(X_, y_, cluster_ids=None, return_vcov=True, return_fitted=False, **kw):
+            call_count["n"] += 1
+            if call_count["n"] <= 2:
+                return orig_solve(
+                    X_,
+                    y_,
+                    cluster_ids=cluster_ids,
+                    return_vcov=return_vcov,
+                    return_fitted=return_fitted,
+                    **kw,
+                )
+            # Bootstrap calls start at index 3. Let the FIRST bootstrap draw
+            # (call_count == 3) succeed; force every subsequent draw to be
+            # singular. n_valid ends at exactly 1.
+            if call_count["n"] == 3:
+                return orig_solve(
+                    X_,
+                    y_,
+                    cluster_ids=cluster_ids,
+                    return_vcov=return_vcov,
+                    return_fitted=return_fitted,
+                    **kw,
+                )
+            coefs, residuals, _ = orig_solve(
+                X_,
+                y_,
+                cluster_ids=cluster_ids,
+                return_vcov=True,
+                **kw,
+            )
+            singular_vcov = np.zeros((X_.shape[1], X_.shape[1]))
+            if return_fitted:
+                return coefs, residuals, X_ @ coefs, singular_vcov
+            return coefs, residuals, singular_vcov
+
+        monkeypatch.setattr(utils_mod, "_solve_ols_linalg", fake_solve)
+        # Compute residuals on the original design (needed for the helper signature).
+        from diff_diff.linalg import solve_ols as _solve_orig
+
+        coefs0, residuals0, _ = _solve_orig(X, y, cluster_ids=cluster_ids)
+        results = utils_mod.wild_bootstrap_se(
+            X=X,
+            y=y,
+            residuals=residuals0,
+            cluster_ids=cluster_ids,
+            coefficient_index=1,
+            n_bootstrap=20,
+            seed=1,
+        )
+        assert np.isnan(results.se)
+        assert np.isnan(results.t_stat_original)
+        assert np.isnan(results.p_value)
+        assert np.isnan(results.ci_lower)
+        assert np.isnan(results.ci_upper)

From a66b9ada5efde2836301123b6985fde6d449e1ce Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Tue, 19 May 2026 12:31:35 -0400
Subject: [PATCH 8/8] twfe: scope coefficients-dict broadening to HC2/HC2-BM
 only
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

CI review (R7) flagged a P2 scope creep in the R1 coefficients-dict
fix: I built `_twfe_var_names` on BOTH the full-dummy and within-
transform branches, which silently broadened
`result.coefficients` on HC1/classical/Conley paths from
`{"ATT": att}` to `{"const": c, "ATT": att, ...covariates}`. That's
a user-visible API change on unchanged TWFE paths that wasn't
documented in CHANGELOG/REGISTRY or regression-tested.

Per the reviewer's recommendation, this commit restores the historical
`{"ATT": att}` contract on within-transform paths by setting
`_twfe_var_names = None` on the else branch (the fallback at the
DiDResults construction site handles None via the existing
`{"ATT": float(att)}` literal). Only the HC2/HC2-BM full-dummy path
now broadens the dict, which is what the REGISTRY/CHANGELOG
surface-change disclosure documents and what the alignment-invariant
test and full-surface regression test pin.

Verified end-to-end on the keys of `result.coefficients`:
hc1/classical → `{'ATT'}`; hc2/hc2_bm →
`{'const', 'ATT', '_fe_unit_*', '_fe_time_*'}`.
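
A minimal sketch of the None fallback, for illustration only:
`build_coefficients` and the `_fe_unit_1` / `_fe_time_1` names are
hypothetical stand-ins, not the actual code at the DiDResults
construction site.

```python
import numpy as np


def build_coefficients(coefs, att_idx, var_names=None):
    """Hypothetical helper mirroring the fallback described above."""
    att = float(coefs[att_idx])
    if var_names is None:
        # Within-transform path: keep the historical single-key dict.
        return {"ATT": att}
    # Full-dummy path: one entry per design-matrix column.
    return {name: float(c) for name, c in zip(var_names, coefs)}


# Within-transform fit (e.g. hc1/classical): no var_names supplied.
within = build_coefficients(np.array([0.2, 1.5]), att_idx=1)

# Full-dummy fit (hc2/hc2_bm): names cover every column, FE dummies included.
full = build_coefficients(
    np.array([0.2, 1.5, -0.3, 0.7]),
    att_idx=1,
    var_names=["const", "ATT", "_fe_unit_1", "_fe_time_1"],
)
```

With `var_names=None` the dict keeps the single "ATT" key; with names
supplied it has one entry per design-matrix column, so its length
matches the number of coefficients.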

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/twfe.py | 10 +++++++---
 1 file changed, 7 insertions(+), 3 deletions(-)

diff --git a/diff_diff/twfe.py b/diff_diff/twfe.py
index 75253349..72948cb9 100644
--- a/diff_diff/twfe.py
+++ b/diff_diff/twfe.py
@@ -359,9 +359,13 @@ def fit(  # type: ignore[override]
                     X_list.append(data_demeaned[f"{cov}_demeaned"].values)
             X = np.column_stack([np.ones(len(y))] + X_list)
             df_adjustment = n_units + n_times - 2
-            # Within-transform path: FE dummies are NOT in X (they're absorbed
-            # by demeaning). var_names cover the visible columns only.
-            _twfe_var_names = ["const", "ATT"] + list(covariates or [])
+            # Within-transform path: preserve the historical
+            # `{"ATT": att}` user-facing `result.coefficients` contract.
+            # Broadening this dict here would silently change the
+            # API surface on HC1 / classical / Conley fits — the
+            # full-dummy `_twfe_var_names` exposure is scoped to the
+            # HC2 / HC2-BM paths only (the documented surface change).
+            _twfe_var_names = None
 
         # ATT is the coefficient on treatment_post (index 1) on both branches.
         att_idx = 1