From e0a2db9d82c552c6ab91a65a6c4c8897d447ff11 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:04:21 -0400
Subject: [PATCH 01/11] =?UTF-8?q?wooldridge:=20thread=20vcov=5Ftype=20?=
 =?UTF-8?q?=E2=88=88=20{classical,=20hc1,=20hc2,=20hc2=5Fbm}=20on=20OLS=20?=
 =?UTF-8?q?path=20(Phase=201b=203/8)?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

WooldridgeDiD now accepts `vcov_type` for the OLS path, mirroring the
SunAbraham PR #472 / StackedDiD PR #479 pattern:

- `hc1` (default) preserves the prior within-transform CR1 behavior to
  machine precision
- `hc2_bm` / `hc2` / `classical` auto-route to full-dummy saturated design
  (FWL doesn't preserve the hat matrix; HC2 leverage + BM DOF need the
  full FE projection). Matches `clubSandwich::vcovCR(lm(...), type="CR2")
  + coef_test()$df_Satt` at atol=1e-10 on the 6 R-parity tests in
  tests/test_methodology_wooldridge.py.
- Bell-McCaffrey Satterthwaite DOF threaded into overall ATT inference
  via `_compute_cr2_bm_contrast_dof`; fail-closed (all-NaN) when DOF
  unavailable, per feedback_bm_contrast_dof_fail_closed.
- One-way `hc2`/`classical` auto-drop the unit auto-cluster (one-way
  families don't compose with cluster_ids). Explicit `cluster="X"` +
  one-way raises at the linalg validator.
- `method ∈ {logit, poisson}` + `vcov_type != "hc1"` rejected at
  `__init__` (GLM CR2-BM derivation deferred to follow-up TODO row).
- `SurveyDesign` + `vcov_type != "hc1"` rejected at `fit()` (survey
  TSL overrides analytical sandwich).
- `n_bootstrap > 0` + one-way + `cluster=None` rejected at `fit()`
  (bootstrap is intrinsically clustered).
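
A minimal numpy sketch of why the non-`hc1` families need the full-dummy
design. It is illustrative only and not the library's code path; the toy
panel and the helper names `dummies` and `hat_diag` are assumptions made
for this example. FWL reproduces the slope coefficient, but the leverage
computed from the within-transformed regression omits the fixed-effect
projection's share of the hat-matrix diagonal.

```python
import numpy as np

rng = np.random.default_rng(0)
n_units, n_periods = 6, 4
unit = np.repeat(np.arange(n_units), n_periods)
time = np.tile(np.arange(n_periods), n_units)
x = rng.normal(size=n_units * n_periods)
y = 0.5 * x + 0.3 * unit + 0.1 * time + rng.normal(size=x.size)


def dummies(codes):
    """One-hot columns with the first level dropped (hypothetical helper)."""
    return (codes[:, None] == np.unique(codes)[1:]).astype(float)


def hat_diag(M):
    """Diagonal of the projection onto the column space of full-rank M."""
    Q, _ = np.linalg.qr(M)
    return np.sum(Q * Q, axis=1)


# Fixed-effect block: intercept + unit dummies + time dummies.
Z = np.column_stack([np.ones(x.size), dummies(unit), dummies(time)])
# Full-dummy design: regressor of interest plus the FE block.
X_full = np.column_stack([x, Z])

# Within transform (FWL): residualize x and y on the FE block.
P_Z = Z @ np.linalg.pinv(Z)
x_w = x - P_Z @ x
y_w = y - P_Z @ y

beta_full = np.linalg.lstsq(X_full, y, rcond=None)[0][0]
beta_within = (x_w @ y_w) / (x_w @ x_w)

h_full = hat_diag(X_full)         # leverage in the full-dummy regression
h_within = x_w**2 / (x_w @ x_w)   # leverage in the within regression
h_fe = hat_diag(Z)                # leverage contributed by the FE block

print("coefficients agree:", np.isclose(beta_full, beta_within))
print("h_full == h_fe + h_within:", np.allclose(h_full, h_fe + h_within))
print("max |h_full - h_within|:", np.max(np.abs(h_full - h_within)))
```

HC2 rescales squared residuals by 1 / (1 - h_ii) and CR2 uses cluster
blocks of the same hat matrix, so leverages taken from the
within-transformed regression would be too small by `h_fe`.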

WooldridgeDiDResults gains `vcov_type`, `cluster_name`, `n_clusters`
fields for downstream introspection. Third PR of the Phase 1b
standalone-estimator threading initiative (5 PRs remaining).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
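The overall-ATT contrast that the golden generator in this patch builds can
be sketched in numpy as follows. This is a hedged illustration, not the
library's `_compute_weighted_agg`: the cell sizes 11 and 9 are inferred from
`meta.contrast_weights` times `meta.n_post_total` in the golden JSON, the
helper name `contrast_se` is hypothetical, and the diagonal vcov is a
placeholder assumption.

```python
import numpy as np

# Post-treatment cell sizes n_{g,t} in gt_pairs order:
# (3,3) (3,4) (3,5) (3,6) (5,5) (5,6).
n_gt = np.array([11, 11, 11, 11, 9, 9], dtype=float)
weights = n_gt / n_gt.sum()  # cell-size weights, summing to 1

# Interaction coefficients copied from point_estimates.interaction_coefs.
beta = np.array([
    0.76614176414719071, 0.88267316960019337, 1.1986682582624895,
    1.4873820142516874, 0.39177980252905253, 0.63325934470689971,
])


def contrast_se(c, vcov):
    """Standard error of the linear contrast c'beta, i.e. sqrt(c' V c)."""
    c = np.asarray(c, dtype=float)
    return float(np.sqrt(c @ vcov @ c))


overall_att = float(weights @ beta)

# ASSUMPTION: diagonal placeholder built from the hc2_bm per-coefficient SEs.
# It drops the off-diagonal covariances, so it will NOT reproduce the golden
# overall_att_se; the real calculation needs the full CR2 vcov block.
se_hc2_bm = np.array([
    0.055501994847834073, 0.053348051350068385, 0.06354112948601906,
    0.058782844441069397, 0.058847181337849344, 0.075290767121827779,
])
placeholder_vcov = np.diag(se_hc2_bm**2)

print(overall_att, contrast_se(weights, placeholder_vcov))
```

The R script embeds the same weights in a full-length coefficient vector
(zeros for the intercept and FE dummies) so that the contrast can be applied
to the complete `vcovCR` matrix.
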
 CHANGELOG.md                              |   1 +
 TODO.md                                   |   6 +-
 benchmarks/R/generate_wooldridge_golden.R | 207 +++++++++
 benchmarks/data/wooldridge_golden.json    |  87 ++++
 benchmarks/data/wooldridge_test_panel.csv | 241 ++++++++++
 diff_diff/guides/llms-full.txt            |   7 +
 diff_diff/wooldridge.py                   | 510 ++++++++++++++++++----
 diff_diff/wooldridge_results.py           |  15 +
 docs/methodology/REGISTRY.md              |  11 +
 tests/test_methodology_wooldridge.py      | 161 +++++++
 tests/test_wooldridge.py                  | 253 +++++++++++
 11 files changed, 1403 insertions(+), 96 deletions(-)
 create mode 100644 benchmarks/R/generate_wooldridge_golden.R
 create mode 100644 benchmarks/data/wooldridge_golden.json
 create mode 100644 benchmarks/data/wooldridge_test_panel.csv
 create mode 100644 tests/test_methodology_wooldridge.py

diff --git a/CHANGELOG.md b/CHANGELOG.md
index c741b47e..852f55ad 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
+- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded into the overall ATT inference for hc2_bm** via `_compute_cr2_bm_contrast_dof` on the post-period-aggregation contrast (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). `n_bootstrap > 0` + explicit one-way (`hc2`/`classical`) + `cluster=None` raises at `fit()` (multiplier bootstrap is intrinsically clustered). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
 - **ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473) and ContinuousDiD precedent (PR #476). REGISTRY `## ChaisemartinDHaultfoeuille` gains a formal `### Deviations from the paper / from R / library extensions` block consolidating 7 documented deviations into a single AI-review-recognized labeled surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)"): (D1) equal-cell weighting (deviation from BOTH AER 2020 Equation 3 AND R `DIDmultiplegtDYN`); (D2) period-based vs cohort-based stable controls; (D3) balanced-baseline panel + interior-gap drops + terminal-missingness retention + cell-period-allocator targeted `ValueError`; (D4) SE normalization `N_l` vs R `G` (~4% smaller analytical SE); (D5) singleton-cohort degeneracy → NaN with `UserWarning`; (D6) `<50%` switcher warning at far horizons (library extension citing Favara-Imbs application, footnote 14 of NBER WP 29873); (D7) Phase 3 `DID^X` covariate first-stage equal-cell weights. R cross-language coverage holds at documented tolerance bands in `tests/test_chaisemartin_dhaultfoeuille_parity.py` (`POINT_RTOL = 1e-4` on pure-direction point estimates, `MIXED_POINT_RTOL = 0.025` on mixed-direction, `PURE_DIRECTION_SE_RTOL = 0.05` on pure-direction SE, `SE_RTOL = 0.10` on multi-horizon SE, `se_rtol=0.15` on the long-panel `L_max=5` joiners-only scenario where cell-count-weighting compounds). No source code changes, no new tests, no new docstrings — consolidation only against the existing 12 methodology tests (`tests/test_methodology_chaisemartin_dhaultfoeuille.py`), 26 R-parity tests (`tests/test_chaisemartin_dhaultfoeuille_parity.py`), 352 unit tests (`tests/test_chaisemartin_dhaultfoeuille.py`), survey suites (`tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, three cell-period coverage suites), and two primary-source DCDH paper reviews on disk (2020 AER + 2022/2023 NBER WP 29873 via PR #478; the `dechaisemartin-2026-review.md` on disk is HAD's primary source, not DCDH's, and is referenced as adjacent context only). The REGISTRY Deviations block uses semantic section-name anchors (rather than fragile line numbers) for back-references to other parts of the DCDH section — an intentional divergence from the PR #476 ContinuousDiD precedent reflecting PR-A wording-drift CI feedback that flagged line-number cross-references as drift-prone in long sections. `METHODOLOGY_REVIEW.md` DCDH row promoted **In Progress** → **Complete**; L27 In Progress example paragraph re-pointed to WooldridgeDiD; L1289 priority-order queue item #6 (DCDH) removed and items #7-#11 renumbered to #6-#10.
 
 ## [3.4.1] - 2026-05-21
diff --git a/TODO.md b/TODO.md
index a642394e..87dbed98 100644
--- a/TODO.md
+++ b/TODO.md
@@ -99,9 +99,11 @@ Deferred items from PR reviews that were not addressed before merge.
 | PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low |
 
 
-| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the standalone estimators that expose `cluster=` but not yet `vcov_type=`: `CallawaySantAnna`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham`; Phase 1b PR 2/8 added `StackedDiD` (this row tracks the remaining 6). | multiple | Phase 1b | Medium |
+| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the standalone estimators that expose `cluster=` but not yet `vcov_type=`: `CallawaySantAnna`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `EfficientDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham`; Phase 1b PR 2/8 added `StackedDiD`; Phase 1b PR 3/8 added `WooldridgeDiD` OLS path (this row tracks the remaining 5). | multiple | Phase 1b | Medium |
 | Extend `SunAbraham` with `vcov_type="conley"` (Conley spatial-HAC) as a first-class feature: thread `conley_coords` / `conley_cutoff_km` / `conley_metric` / `conley_kernel` / `conley_time` / `conley_unit` / `conley_lag_cutoff` through `_fit_saturated_regression`. Phase 1b PR 1/8 deferred this; SA currently rejects `vcov_type="conley"` at `__init__` with a deferral message. | `diff_diff/sun_abraham.py` | follow-up | Medium |
 | Extend `StackedDiD` with `vcov_type="conley"` (Conley spatial-HAC) — thread the six `conley_*` params through `solve_ols` at `stacked_did.py:419` (and the `_refit_stacked` closure at `:444`). Phase 1b PR 2/8 deferred this; StackedDiD currently rejects `vcov_type="conley"` at `__init__` with a deferral message. Same shape as the SunAbraham conley follow-up. | `diff_diff/stacked_did.py` | follow-up | Medium |
+| Extend `WooldridgeDiD` with `vcov_type="conley"` — thread the six `conley_*` params through `solve_ols` in `_fit_ols`. Phase 1b PR 3/8 deferred this; WooldridgeDiD currently rejects `vcov_type="conley"` at `__init__` with a deferral message. Same shape as the SunAbraham / StackedDiD conley follow-ups. | `diff_diff/wooldridge.py` | follow-up | Medium |
+| Extend `WooldridgeDiD` `method ∈ {"logit","poisson"}` paths with `vcov_type ∈ {classical, hc2, hc2_bm}`. The GLM QMLE sandwich uses pseudo-residuals (`weights=p(1-p)` for logit, `weights=μ_i` for Poisson, aweight semantics); composing HC2 leverage and Bell-McCaffrey Satterthwaite DOF with QMLE on canonical-link pseudo-residuals needs derivation + R parity against `clubSandwich::vcovCR(glm(...), type="CR2")`. Phase 1b PR 3/8 rejects `method != "ols" + vcov_type != "hc1"` at `__init__` with a deferral pointer here. | `diff_diff/wooldridge.py` (`_fit_logit`, `_fit_poisson`) | follow-up | Medium |
 | Harmonize SunAbraham's HC1 within-transform finite-sample correction with `fixest::sunab()`. SA's `solve_ols` applies `n / (n - k_dm)` (within-transform columns only); fixest applies `n / (n - k_total)` (counts absorbed FE). SE values differ by ~1-2% on typical panel sizes (documented in REGISTRY.md "Deviation from R"; pinned at `atol=5e-3` in `tests/test_methodology_sun_abraham.py`). Either thread `df_adjustment` into the vcov scaling or document as an intentional difference. | `diff_diff/sun_abraham.py`, `diff_diff/linalg.py::compute_robust_vcov` | follow-up | Low |
 <!-- Rows 104-105 LIFTED 2026-05-20 via the clubSandwich WLS-CR2 port. The diff-diff
      form matches clubSandwich's specific algebra (W not sqrt(W), W^2 in bias term,
@@ -194,7 +196,7 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,
 
 #### Tier B — Mid-size methodology (5-10 CI rounds expected, per memory cascade priors)
 
-- Thread `vcov_type` through 8 standalone estimators: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD` (none currently expose `self.vcov_type`)
+- Thread `vcov_type` through the 5 remaining standalone estimators: `CallawaySantAnna`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `EfficientDiD` (Phase 1b PR 1/8 added SunAbraham, PR 2/8 added StackedDiD, PR 3/8 added WooldridgeDiD-OLS)
 - SyntheticDiD: rename internal `placebo_effects` → `variance_effects` AND public `placebo_effects` field with deprecation alias retained for one release (`synthetic_did.py`, `results.py`)
 - StaggeredTripleDifference R parity: commit CSV fixtures + add covariate-adjusted scenarios + aggregation-SE assertions (`tests/test_methodology_staggered_triple_diff.py`, `benchmarks/R/benchmark_staggered_triplediff.R`)
 - StaggeredTripleDifference: per-cohort group-effect SE WIF override for exact R `triplediff` match (`staggered_triple_diff.py`)
diff --git a/benchmarks/R/generate_wooldridge_golden.R b/benchmarks/R/generate_wooldridge_golden.R
new file mode 100644
index 00000000..5115a50d
--- /dev/null
+++ b/benchmarks/R/generate_wooldridge_golden.R
@@ -0,0 +1,207 @@
+# Generate R-parity goldens for WooldridgeDiD OLS path vcov_type variants.
+#
+# Phase 1b PR 3/8: pins Python `WooldridgeDiD(method='ols', vcov_type=...)` SE
+# output against `lm()` + clubSandwich / sandwich on the fixed-seed staggered
+# panel from `benchmarks/data/wooldridge_test_panel.csv`.
+#
+# Variants generated:
+#   - hc1 (CR1 Liang-Zeger cluster-robust at unit; matches `type="CR1S"` —
+#     Stata-style G/(G-1) * (n-1)/(n-p) correction)
+#   - hc2_bm (CR2 Bell-McCaffrey at unit; per-coef DOF via coef_test()$df_Satt;
+#     overall ATT BM contrast DOF via Wald_test(test="HTZ")$df_denom)
+#   - classical (lm() summary's homoskedastic OLS SE)
+#   - hc2 (sandwich::vcovHC type="HC2"; no clustering)
+#
+# clubSandwich >= 0.7.0 required (matches PR #475 / PR #479 pin).
+
+suppressPackageStartupMessages({
+  library(clubSandwich)
+  library(sandwich)
+  library(jsonlite)
+})
+
+stopifnot(packageVersion("clubSandwich") >= "0.7.0")
+stopifnot(packageVersion("sandwich") >= "3.0.0")
+
+panel_path <- file.path("benchmarks", "data", "wooldridge_test_panel.csv")
+out_path <- file.path("benchmarks", "data", "wooldridge_golden.json")
+
+df <- read.csv(panel_path)
+stopifnot(all(c("unit", "time", "cohort", "y") %in% names(df)))
+# Force integer types on unit/time/cohort so the cluster formula resolves
+# cleanly (clubSandwich's `cluster = df$unit` calls `unique(model$unit)` which
+# fails on factor-coerced columns from intermediate model frames).
+df$unit <- as.integer(df$unit)
+df$time <- as.integer(df$time)
+df$cohort <- as.integer(df$cohort)
+
+# Build treated (g, t) interaction dummies, matching the Python OLS path's
+# `_build_interaction_matrix` (control_group="not_yet_treated", anticipation=0):
+# one indicator per treated (g, t) cell with g > 0 and t >= g.
+treated_cohorts <- sort(unique(df$cohort[df$cohort > 0]))
+times <- sort(unique(df$time))
+gt_pairs <- list()
+for (g in treated_cohorts) {
+  for (t in times) {
+    if (t >= g) {
+      gt_pairs[[length(gt_pairs) + 1L]] <- c(g, t)
+    }
+  }
+}
+gt_names <- vapply(gt_pairs, function(p) sprintf("D_%d_%d", p[1], p[2]), character(1))
+for (i in seq_along(gt_pairs)) {
+  g <- gt_pairs[[i]][1]
+  t <- gt_pairs[[i]][2]
+  df[[gt_names[i]]] <- as.integer((df$cohort == g) & (df$time == t))
+}
+n_int <- length(gt_names)
+
+# Fit lm(y ~ <interactions> + as.factor(unit) + as.factor(time)). The
+# `as.factor(...)` form drops the first level of each FE block, matching the
+# Python full-dummy build (`drop_first=True` on `pd.get_dummies(unit)` and
+# `pd.get_dummies(time)`), and adds a single intercept — matching Python's
+# `[intercept, X_design, unit_dummies, time_dummies]`.
+formula_str <- paste0(
+  "y ~ ", paste(gt_names, collapse = " + "),
+  " + as.factor(unit) + as.factor(time)"
+)
+fit <- lm(as.formula(formula_str), data = df)
+
+# Extract the (interaction) coefficient indices in fit$coefficients. R places
+# them right after the intercept (positions 2..(1+n_int) in 1-indexed R).
+coef_names <- names(coef(fit))
+int_idx <- match(gt_names, coef_names)
+stopifnot(!any(is.na(int_idx)))
+
+# Cell weights n_{g,t} for the overall ATT contrast (matches Python's
+# `_compute_weighted_agg` with default `weights=n_{g,t}`).
+n_gt <- vapply(seq_along(gt_pairs), function(i) {
+  g <- gt_pairs[[i]][1]
+  t <- gt_pairs[[i]][2]
+  sum(df$cohort == g & df$time == t)
+}, integer(1))
+n_post_total <- sum(n_gt)
+contrast_weights <- n_gt / n_post_total  # length n_int
+
+# Build the overall ATT contrast in full-coef space (intercept = 0, then n_int
+# weights, then 0 for FE dummies).
+n_total_coef <- length(coef_names)
+overall_contrast <- numeric(n_total_coef)
+overall_contrast[int_idx] <- contrast_weights
+
+# 1. hc1 + CR1S (Stata-style cluster-robust; matches diff-diff's hc1+cluster)
+vcov_cr1s <- vcovCR(fit, cluster = df$unit, type = "CR1S")
+se_hc1 <- sqrt(diag(vcov_cr1s)[int_idx])
+overall_se_hc1 <- sqrt(
+  t(overall_contrast) %*% vcov_cr1s %*% overall_contrast
+)[1, 1]
+
+# 2. hc2_bm + CR2 + BM Satterthwaite DOF
+vcov_cr2 <- vcovCR(fit, cluster = df$unit, type = "CR2")
+se_hc2_bm <- sqrt(diag(vcov_cr2)[int_idx])
+coef_test_out <- coef_test(fit, vcov = vcov_cr2, test = "Satterthwaite")
+df_satt_hc2_bm <- coef_test_out$df[int_idx]
+
+# Overall ATT BM contrast DOF via Wald_test (HTZ reduces to Satterthwaite on
+# 1-row constraint matrices; df_denom is the BM contrast DOF).
+constraint_matrix <- matrix(overall_contrast, nrow = 1)
+overall_dof_hc2_bm <- tryCatch(
+  {
+    wt <- Wald_test(
+      fit,
+      constraints = constrain_equal(int_idx, reg_ex = FALSE),
+      vcov = vcov_cr2,
+      test = "HTZ"
+    )
+    # Joint equality test (multi-row HTZ): one F row with df_num/df_denom. Kept
+    # as a diagnostic; `overall_dof_hc2_bm` is not written to the golden JSON.
+    wt$df_denom
+  },
+  error = function(e) NA_real_
+)
+
+# For the OVERALL ATT scalar contrast (1-row weights vector), pass the
+# constraint matrix directly: `Wald_test` accepts either `constrain_*` helper
+# output (clubSandwich >= 0.5.0) or, for an arbitrary linear contrast, a
+# matrix via `constraints = matrix(...)`. The resulting `df_denom` is the BM
+# Satterthwaite DOF for the scalar contrast.
+overall_wt <- tryCatch(
+  Wald_test(
+    fit,
+    constraints = constraint_matrix,
+    vcov = vcov_cr2,
+    test = "HTZ"
+  ),
+  error = function(e) NULL
+)
+overall_att_contrast_dof <- if (!is.null(overall_wt)) overall_wt$df_denom else NA_real_
+
+overall_se_hc2_bm <- sqrt(
+  t(overall_contrast) %*% vcov_cr2 %*% overall_contrast
+)[1, 1]
+
+# 3. classical (lm summary SE; OLS sigma^2 * (X'X)^-1)
+vcov_classical <- vcov(fit)
+se_classical <- sqrt(diag(vcov_classical)[int_idx])
+overall_se_classical <- sqrt(
+  t(overall_contrast) %*% vcov_classical %*% overall_contrast
+)[1, 1]
+
+# 4. hc2 (sandwich::vcovHC type="HC2"; no clustering)
+vcov_hc2 <- vcovHC(fit, type = "HC2")
+se_hc2 <- sqrt(diag(vcov_hc2)[int_idx])
+overall_se_hc2 <- sqrt(
+  t(overall_contrast) %*% vcov_hc2 %*% overall_contrast
+)[1, 1]
+
+# Coefficient point estimates (for cross-check; identical across all 4 variants
+# since they share the lm fit).
+beta_int <- coef(fit)[int_idx]
+
+golden <- list(
+  meta = list(
+    panel_csv = panel_path,
+    n_obs = nrow(df),
+    n_units = length(unique(df$unit)),
+    n_periods = length(times),
+    cohorts = sort(unique(df$cohort)),
+    gt_pairs = lapply(gt_pairs, function(p) list(g = p[1], t = p[2])),
+    n_int = n_int,
+    n_post_total = n_post_total,
+    contrast_weights = contrast_weights,
+    clubsandwich_version = as.character(packageVersion("clubSandwich")),
+    sandwich_version = as.character(packageVersion("sandwich"))
+  ),
+  point_estimates = list(
+    interaction_coefs = unname(beta_int),
+    gt_keys = lapply(gt_pairs, function(p) list(g = p[1], t = p[2]))
+  ),
+  hc1 = list(
+    per_coef_se = unname(se_hc1),
+    overall_att_se = overall_se_hc1
+  ),
+  hc2_bm = list(
+    per_coef_se = unname(se_hc2_bm),
+    per_coef_df_satt = unname(df_satt_hc2_bm),
+    overall_att_se = overall_se_hc2_bm,
+    overall_att_contrast_dof = overall_att_contrast_dof
+  ),
+  classical = list(
+    per_coef_se = unname(se_classical),
+    overall_att_se = overall_se_classical
+  ),
+  hc2 = list(
+    per_coef_se = unname(se_hc2),
+    overall_att_se = overall_se_hc2
+  )
+)
+
+write_json(golden, out_path, auto_unbox = TRUE, pretty = TRUE, digits = 18)
+cat(sprintf("Wrote %s\n", out_path))
+cat(sprintf("  n_obs=%d, n_int=%d, n_units=%d\n",
+            nrow(df), n_int, length(unique(df$unit))))
+cat(sprintf("  hc1 overall_se=%.10f\n", overall_se_hc1))
+cat(sprintf("  hc2_bm overall_se=%.10f, overall_dof=%.4f\n",
+            overall_se_hc2_bm, overall_att_contrast_dof))
+cat(sprintf("  classical overall_se=%.10f\n", overall_se_classical))
+cat(sprintf("  hc2 overall_se=%.10f\n", overall_se_hc2))
diff --git a/benchmarks/data/wooldridge_golden.json b/benchmarks/data/wooldridge_golden.json
new file mode 100644
index 00000000..46f5b328
--- /dev/null
+++ b/benchmarks/data/wooldridge_golden.json
@@ -0,0 +1,87 @@
+{
+  "meta": {
+    "panel_csv": "benchmarks/data/wooldridge_test_panel.csv",
+    "n_obs": 240,
+    "n_units": 40,
+    "n_periods": 6,
+    "cohorts": [0, 3, 5],
+    "gt_pairs": [
+      {
+        "g": 3,
+        "t": 3
+      },
+      {
+        "g": 3,
+        "t": 4
+      },
+      {
+        "g": 3,
+        "t": 5
+      },
+      {
+        "g": 3,
+        "t": 6
+      },
+      {
+        "g": 5,
+        "t": 5
+      },
+      {
+        "g": 5,
+        "t": 6
+      }
+    ],
+    "n_int": 6,
+    "n_post_total": 62,
+    "contrast_weights": [0.17741935483870969, 0.17741935483870969, 0.17741935483870969, 0.17741935483870969, 0.14516129032258066, 0.14516129032258066],
+    "clubsandwich_version": "0.7.0",
+    "sandwich_version": "3.1.1"
+  },
+  "point_estimates": {
+    "interaction_coefs": [0.76614176414719071, 0.88267316960019337, 1.1986682582624895, 1.4873820142516874, 0.39177980252905253, 0.63325934470689971],
+    "gt_keys": [
+      {
+        "g": 3,
+        "t": 3
+      },
+      {
+        "g": 3,
+        "t": 4
+      },
+      {
+        "g": 3,
+        "t": 5
+      },
+      {
+        "g": 3,
+        "t": 6
+      },
+      {
+        "g": 5,
+        "t": 5
+      },
+      {
+        "g": 5,
+        "t": 6
+      }
+    ]
+  },
+  "hc1": {
+    "per_coef_se": [0.06096060031056097, 0.059046218350942571, 0.069811491900727871, 0.064215951133206925, 0.064370685282068074, 0.081348726856405193],
+    "overall_att_se": 0.035044276020269757
+  },
+  "hc2_bm": {
+    "per_coef_se": [0.055501994847834073, 0.053348051350068385, 0.06354112948601906, 0.058782844441069397, 0.058847181337849344, 0.075290767121827779],
+    "per_coef_df_satt": [18.095161160028869, 18.095161160028777, 20.439187947573622, 20.439187947573593, 15.498545101842772, 15.49854510184279],
+    "overall_att_se": 0.031917611670264516,
+    "overall_att_contrast_dof": 28.533525200424727
+  },
+  "classical": {
+    "per_coef_se": [0.067005350609178102, 0.067005350609178199, 0.070375557382814979, 0.070375557382815007, 0.069334111291963985, 0.069334111291964068],
+    "overall_att_se": 0.039913974880197489
+  },
+  "hc2": {
+    "per_coef_se": [0.059361785012656779, 0.065622553464974559, 0.066582402900941334, 0.063499229589624784, 0.065073838228263084, 0.077454318870318936],
+    "overall_att_se": 0.038128767955083756
+  }
+}
diff --git a/benchmarks/data/wooldridge_test_panel.csv b/benchmarks/data/wooldridge_test_panel.csv
new file mode 100644
index 00000000..6ad871e0
--- /dev/null
+++ b/benchmarks/data/wooldridge_test_panel.csv
@@ -0,0 +1,241 @@
+unit,time,cohort,y
+0,1,0,0.8067243691809026
+0,2,0,0.802839369994431
+0,3,0,0.855235020967609
+0,4,0,1.0890019201669177
+0,5,0,1.078817761644982
+0,6,0,1.2402777330636348
+1,1,0,0.6716431807370137
+1,2,0,1.021413208108961
+1,3,0,0.9866159667451085
+1,4,0,1.4226053246303583
+1,5,0,1.0724850604695806
+1,6,0,1.1775976531321386
+2,1,3,0.8762255460096424
+2,2,3,1.169594443961727
+2,3,3,1.8733718050695736
+2,4,3,2.0739788511684867
+2,5,3,2.4888887853711053
+2,6,3,2.9374396979097903
+3,1,0,0.8434883362205124
+3,2,0,1.0765104108681542
+3,3,0,1.2951120739444533
+3,4,0,1.5575669913445345
+3,5,0,1.180262304259975
+3,6,0,1.4024583279441452
+4,1,5,0.9579968617291661
+4,2,5,1.0304790565980828
+4,3,5,1.1903900318457077
+4,4,5,1.4488597181997718
+4,5,5,1.9200982434610052
+4,6,5,1.7460518959593672
+5,1,5,1.0695071475108837
+5,2,5,1.0021690254170517
+5,3,5,0.9041565230984308
+5,4,5,1.3386064249680396
+5,5,5,1.7293879006098134
+5,6,5,2.3234654978556293
+6,1,3,1.1066970682266215
+6,2,3,1.0489085760994656
+6,3,3,2.004767735123629
+6,4,3,2.07911089361797
+6,5,3,2.8056578457035983
+6,6,3,3.127994745520973
+7,1,0,1.1912483521317636
+7,2,0,1.0771470284272093
+7,3,0,1.3884256287782524
+7,4,0,1.411551595944057
+7,5,0,1.4827697155522293
+7,6,0,1.8683085554556331
+8,1,0,1.0522526296904187
+8,2,0,1.2890049003130537
+8,3,0,1.4883629743237297
+8,4,0,1.389175181126923
+8,5,0,1.6044627803802873
+8,6,0,1.7374789381590487
+9,1,3,0.8548346845088668
+9,2,3,1.4259230225853687
+9,3,3,2.4157061712224026
+9,4,3,2.1076454935430453
+9,5,3,2.6693910170692225
+9,6,3,3.3890804359295283
+10,1,0,1.4187698282011565
+10,2,0,1.7233321908301873
+10,3,0,1.4686939095427263
+10,4,0,1.4215549990347567
+10,5,0,1.727398303011593
+10,6,0,1.9899618590390489
+11,1,3,1.125964093311145
+11,2,3,1.5558134804979564
+11,3,3,2.2972372074185725
+11,4,3,2.5763136270330245
+11,5,3,3.11789470881427
+11,6,3,3.29063314291348
+12,1,0,1.3710663729650512
+12,2,0,1.4587610647880944
+12,3,0,1.8198777443107352
+12,4,0,1.9045810611288103
+12,5,0,1.9596053065566634
+12,6,0,1.8659179197488536
+13,1,5,1.5835017429392555
+13,2,5,1.3782713390152497
+13,3,5,1.6330905624357686
+13,4,5,1.5921817415731745
+13,5,5,1.9828534348331546
+13,6,5,2.7889461393997337
+14,1,3,1.6812989088898902
+14,2,3,1.7307701628605074
+14,3,3,2.481101320579936
+14,4,3,2.736633099987141
+14,5,3,3.317693536108193
+14,6,3,3.9163375263402234
+15,1,0,1.767420496234671
+15,2,0,1.5521391694460966
+15,3,0,1.813937250948815
+15,4,0,1.4368040129154498
+15,5,0,1.8050260045666267
+15,6,0,2.055575672922338
+16,1,0,1.5932110332440816
+16,2,0,1.6176050768040053
+16,3,0,1.8788799404263752
+16,4,0,1.6640116983608908
+16,5,0,2.2993444467569564
+16,6,0,2.3468669005204563
+17,1,5,2.0757279871707
+17,2,5,1.690766715057971
+17,3,5,1.7046165176486634
+17,4,5,1.938550975346223
+17,5,5,2.499803918304378
+17,6,5,2.9181262344333545
+18,1,0,1.5601554115997491
+18,2,0,1.9530473676349007
+18,3,0,1.9286861678699514
+18,4,0,1.935012513065232
+18,5,0,1.9931475634198847
+18,6,0,2.269181165562252
+19,1,5,1.727567366717711
+19,2,5,1.9342552348894628
+19,3,5,2.186948450665394
+19,4,5,2.2443331740348778
+19,5,5,2.4891906308843086
+19,6,5,2.8496523415698793
+20,1,3,1.907488989202299
+20,2,3,1.9206221420442746
+20,3,3,2.904064365632254
+20,4,3,3.0995370491087666
+20,5,3,3.334052467296891
+20,6,3,3.8601621877716026
+21,1,0,1.9347807843987976
+21,2,0,1.7787291203107265
+21,3,0,2.0667761363599193
+21,4,0,2.46256113975946
+21,5,0,2.118620556868284
+21,6,0,2.4742879628751475
+22,1,3,1.9286798089466262
+22,2,3,1.702079115968752
+22,3,3,2.793999960913965
+22,4,3,2.9834513668545344
+22,5,3,3.232801447518672
+22,6,3,3.7357374931376564
+23,1,3,2.0395996264338176
+23,2,3,2.0474892916852867
+23,3,3,2.7642877447704852
+23,4,3,3.1247299872643763
+23,5,3,3.3060355175599416
+23,6,3,3.6708617550207507
+24,1,0,1.8906087745711637
+24,2,0,2.0252513935022782
+24,3,0,2.132037272865601
+24,4,0,2.318954844348794
+24,5,0,2.5894209448464096
+24,6,0,2.2901581876400448
+25,1,5,2.023051398770029
+25,2,5,2.215938315162985
+25,3,5,2.1883290075397666
+25,4,5,2.6059569099809647
+25,5,5,2.8244382785518956
+25,6,5,3.219242130062021
+26,1,5,2.211494592295367
+26,2,5,2.3263442939779257
+26,3,5,2.1344348369291852
+26,4,5,2.4011024220341435
+26,5,5,2.919155167724921
+26,6,5,3.3045539106625106
+27,1,0,1.9619555873412118
+27,2,0,2.2485454490615377
+27,3,0,2.4071903976056954
+27,4,0,2.3884115008926057
+27,5,0,2.3839568418620836
+27,6,0,2.807899913547411
+28,1,0,2.3287413469319267
+28,2,0,2.369558229320355
+28,3,0,2.3380981654088164
+28,4,0,2.479469742571976
+28,5,0,2.2592241470811785
+28,6,0,2.714174149619618
+29,1,0,2.4019432785921024
+29,2,0,2.338149504489332
+29,3,0,2.404166593645222
+29,4,0,2.4743477350079104
+29,5,0,2.8413292176143705
+29,6,0,2.778554170823987
+30,1,5,2.211895521223946
+30,2,5,2.5686060485179656
+30,3,5,2.5466027852322597
+30,4,5,2.6293281010398752
+30,5,5,3.2570240178847425
+30,6,5,3.561286100042884
+31,1,3,2.3147609109081664
+31,2,3,2.466060778917495
+31,3,3,3.352230805491938
+31,4,3,3.69716742136579
+31,5,3,3.9293239547696697
+31,6,3,4.342676542995517
+32,1,3,2.2960350983326996
+32,2,3,2.720151113055017
+32,3,3,3.342145810145472
+32,4,3,3.6069837250895147
+32,5,3,4.133231888118893
+32,6,3,4.366420640914075
+33,1,0,2.353034304419169
+33,2,0,2.705753848171819
+33,3,0,2.7946668359019466
+33,4,0,2.217724954850553
+33,5,0,2.7398923125097583
+33,6,0,2.95830932316208
+34,1,0,2.5892597901252503
+34,2,0,2.355144191789028
+34,3,0,2.863723617637387
+34,4,0,2.6783607124018385
+34,5,0,2.7019944218250656
+34,6,0,2.9009030301925174
+35,1,3,2.4945991047157885
+35,2,3,2.697995804255369
+35,3,3,3.473319935657297
+35,4,3,3.8141925844127353
+35,5,3,4.093173148750251
+35,6,3,4.531185240678102
+36,1,0,2.6156492042833954
+36,2,0,2.5970390134222696
+36,3,0,2.750153304449297
+36,4,0,2.713596083469214
+36,5,0,2.886633657359758
+36,6,0,3.208109627567911
+37,1,0,2.5100541057618346
+37,2,0,2.7887901703517057
+37,3,0,2.631222249129081
+37,4,0,3.0702947848052773
+37,5,0,3.346626122693915
+37,6,0,2.962480910764188
+38,1,5,2.8076502192828516
+38,2,5,2.785136937928184
+38,3,5,2.767473482687917
+38,4,5,3.07942723100164
+38,5,5,3.563915737168458
+38,6,5,3.927932048631433
+39,1,0,2.6517207063647357
+39,2,0,2.6001993802822025
+39,3,0,3.2102067663634184
+39,4,0,2.878204680000166
+39,5,0,3.2359236915320246
+39,6,0,3.1030460248345406
diff --git a/diff_diff/guides/llms-full.txt b/diff_diff/guides/llms-full.txt
index cd850ca2..86dbfac9 100644
--- a/diff_diff/guides/llms-full.txt
+++ b/diff_diff/guides/llms-full.txt
@@ -967,6 +967,13 @@ WooldridgeDiD(
     bootstrap_weights: str = "rademacher",
     seed: int | None = None,
     rank_deficient_action: str = "warn",
+    vcov_type: str = "hc1",                 # {"classical","hc1","hc2","hc2_bm"}; OLS path only.
+                                            # hc1 (default) preserves prior within-transform CR1 to machine precision.
+                                            # hc2_bm auto-routes to full-dummy + clubSandwich WLS-CR2 algebra.
+                                            # classical/hc2 auto-drop the unit auto-cluster (one-way only);
+                                            # explicit cluster="X" + one-way raises at the linalg validator.
+                                            # conley deferred. method != "ols" requires hc1 (QMLE CR2-BM TBD).
+                                            # survey_design= requires hc1 (survey TSL overrides analytical).
 )
 ```
 
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index fb63a29a..31aa1dca 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -82,8 +82,8 @@ def _resolve_survey_for_wooldridge(survey_design, sample, cluster_ids, cluster_n
         compute_survey_metadata,
     )
 
-    resolved, survey_weights, survey_weight_type, survey_metadata = (
-        _resolve_survey_for_fit(survey_design, sample)
+    resolved, survey_weights, survey_weight_type, survey_metadata = _resolve_survey_for_fit(
+        survey_design, sample
     )
     if resolved is not None and resolved.uses_replicate_variance:
         raise NotImplementedError(
@@ -97,9 +97,7 @@ def _resolve_survey_for_wooldridge(survey_design, sample, cluster_ids, cluster_n
             f"assumes probability weights (pweight)."
         )
     if resolved is not None:
-        effective_cluster = _resolve_effective_cluster(
-            resolved, cluster_ids, cluster_name
-        )
+        effective_cluster = _resolve_effective_cluster(resolved, cluster_ids, cluster_name)
         if effective_cluster is not None:
             resolved = _inject_cluster_as_psu(resolved, effective_cluster)
             if resolved.psu is not None and survey_metadata is not None:
@@ -297,6 +295,29 @@ class WooldridgeDiD:
         Random seed for reproducibility.
     rank_deficient_action : {"warn", "error", "silent"}
         How to handle rank-deficient design matrices.
+    vcov_type : {"classical", "hc1", "hc2", "hc2_bm"}, default "hc1"
+        Variance-covariance family for the analytical sandwich, OLS path only.
+        ``hc1`` (default) preserves the prior bit-equal CR1 Liang-Zeger
+        cluster-robust behavior via the within-transform path. ``hc2_bm``
+        auto-routes to a full-dummy saturated design (intercept + treatment
+        cells + unit dummies + time dummies) — FWL preserves cohort coefficients
+        but NOT the hat matrix, so HC2 leverage and Bell-McCaffrey Satterthwaite
+        DOF must be computed on the full FE projection (matches
+        ``clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt``).
+        ``classical`` / ``hc2`` are supported via the same full-dummy route AND
+        an auto-drop of the unit auto-cluster (one-way families don't compose
+        with cluster_ids per the linalg validator). Explicit ``cluster="X"`` +
+        one-way ``vcov_type`` raises at the validator.
+
+        ``conley`` is REJECTED at ``__init__`` (would require threading
+        ``conley_*`` params through ``solve_ols``; tracked in TODO.md).
+        ``method`` in ``{"logit","poisson"}`` + ``vcov_type != "hc1"`` is
+        REJECTED at ``__init__``: the GLM QMLE sandwich path uses pseudo-
+        residuals, and CR2-BM composition with QMLE on canonical-link pseudo-
+        residuals needs derivation + R parity (tracked in TODO.md). Survey
+        designs combined with ``vcov_type != "hc1"`` raise
+        ``NotImplementedError`` at ``fit()`` because the survey TSL / replicate-
+        refit variance overrides the analytical sandwich.
     """
 
     def __init__(
@@ -311,20 +332,15 @@ def __init__(
         bootstrap_weights: str = "rademacher",
         seed: Optional[int] = None,
         rank_deficient_action: str = "warn",
+        vcov_type: str = "hc1",
     ) -> None:
-        if method not in _VALID_METHODS:
-            raise ValueError(f"method must be one of {_VALID_METHODS}, got {method!r}")
-        if control_group not in _VALID_CONTROL_GROUPS:
-            raise ValueError(
-                f"control_group must be one of {_VALID_CONTROL_GROUPS}, got {control_group!r}"
-            )
-        if anticipation < 0:
-            raise ValueError(f"anticipation must be >= 0, got {anticipation}")
-        if bootstrap_weights not in _VALID_BOOTSTRAP_WEIGHTS:
-            raise ValueError(
-                f"bootstrap_weights must be one of {_VALID_BOOTSTRAP_WEIGHTS}, "
-                f"got {bootstrap_weights!r}"
-            )
+        self._validate_constructor_args(
+            method=method,
+            control_group=control_group,
+            anticipation=anticipation,
+            bootstrap_weights=bootstrap_weights,
+            vcov_type=vcov_type,
+        )
 
         self.method = method
         self.control_group = control_group
@@ -336,10 +352,72 @@ def __init__(
         self.bootstrap_weights = bootstrap_weights
         self.seed = seed
         self.rank_deficient_action = rank_deficient_action
+        self.vcov_type = vcov_type
+        # Track whether the user explicitly opted out of the "hc1" default.
+        # The auto-cluster-at-unit default in `_fit_ols` is suppressed only
+        # when the user explicitly opts into a one-way family (``hc2``,
+        # ``classical``). ``hc1`` and ``hc2_bm`` preserve the auto-cluster
+        # (route to CR1 / CR2 Bell-McCaffrey at unit respectively). Mirrors
+        # the SunAbraham PR #472 pattern at ``sun_abraham.py:572``.
+        self._vcov_type_explicit = vcov_type != "hc1"
 
         self.is_fitted_: bool = False
         self._results: Optional[WooldridgeDiDResults] = None
 
+    @staticmethod
+    def _validate_constructor_args(
+        *,
+        method: str,
+        control_group: str,
+        anticipation: int,
+        bootstrap_weights: str,
+        vcov_type: str,
+    ) -> None:
+        """Shared validation for both ``__init__`` and ``set_params``.
+
+        Catches the input-contract surface (allowed sets, ranges, and the
+        ``method`` × ``vcov_type`` interaction) without depending on instance
+        state, so ``set_params`` can re-run it after mutation.
+        """
+        if method not in _VALID_METHODS:
+            raise ValueError(f"method must be one of {_VALID_METHODS}, got {method!r}")
+        if control_group not in _VALID_CONTROL_GROUPS:
+            raise ValueError(
+                f"control_group must be one of {_VALID_CONTROL_GROUPS}, got {control_group!r}"
+            )
+        if anticipation < 0:
+            raise ValueError(f"anticipation must be >= 0, got {anticipation}")
+        if bootstrap_weights not in _VALID_BOOTSTRAP_WEIGHTS:
+            raise ValueError(
+                f"bootstrap_weights must be one of {_VALID_BOOTSTRAP_WEIGHTS}, "
+                f"got {bootstrap_weights!r}"
+            )
+        if vcov_type not in ("classical", "hc1", "hc2", "hc2_bm"):
+            if vcov_type == "conley":
+                raise ValueError(
+                    "vcov_type='conley' is not yet wired up for WooldridgeDiD: "
+                    "would require threading conley_coords / conley_cutoff_km / "
+                    "conley_metric / conley_kernel / conley_time / conley_unit / "
+                    "conley_lag_cutoff through the solve_ols call. "
+                    "Tracked in TODO.md (WooldridgeDiD Conley follow-up row)."
+                )
+            raise ValueError(
+                f"vcov_type must be one of "
+                f"{{'classical','hc1','hc2','hc2_bm'}}; got '{vcov_type}'"
+            )
+        if method != "ols" and vcov_type != "hc1":
+            raise NotImplementedError(
+                f"WooldridgeDiD(method={method!r}, vcov_type={vcov_type!r}) is "
+                "not yet supported. The logit / poisson paths use a QMLE "
+                "sandwich with pseudo-residuals (probs*(1-probs) or mu_hat "
+                "weights); composing HC2 leverage and Bell-McCaffrey "
+                "Satterthwaite DOF with QMLE on canonical-link pseudo-"
+                "residuals needs derivation + R parity against "
+                "clubSandwich::vcovCR(glm(...)). Tracked in TODO.md "
+                "(WooldridgeDiD logit/poisson vcov_type follow-up row). "
+                "Use vcov_type='hc1' (default) for non-OLS methods."
+            )
+
     @property
     def results_(self) -> WooldridgeDiDResults:
         if not self.is_fitted_:
@@ -359,6 +437,7 @@ def get_params(self) -> Dict[str, Any]:
             "bootstrap_weights": self.bootstrap_weights,
             "seed": self.seed,
             "rank_deficient_action": self.rank_deficient_action,
+            "vcov_type": self.vcov_type,
         }
 
     def set_params(self, **params: Any) -> "WooldridgeDiD":
@@ -367,21 +446,17 @@ def set_params(self, **params: Any) -> "WooldridgeDiD":
             if not hasattr(self, key):
                 raise ValueError(f"Unknown parameter: {key!r}")
             setattr(self, key, value)
-        # Re-run validation after setting params
-        if self.method not in _VALID_METHODS:
-            raise ValueError(f"method must be one of {_VALID_METHODS}, got {self.method!r}")
-        if self.control_group not in _VALID_CONTROL_GROUPS:
-            raise ValueError(
-                f"control_group must be one of {_VALID_CONTROL_GROUPS}, "
-                f"got {self.control_group!r}"
-            )
-        if self.anticipation < 0:
-            raise ValueError(f"anticipation must be >= 0, got {self.anticipation}")
-        if self.bootstrap_weights not in _VALID_BOOTSTRAP_WEIGHTS:
-            raise ValueError(
-                f"bootstrap_weights must be one of {_VALID_BOOTSTRAP_WEIGHTS}, "
-                f"got {self.bootstrap_weights!r}"
-            )
+        # Re-run validation (catches mutations into invalid sets AND the
+        # method × vcov_type interaction) using the shared validator.
+        self._validate_constructor_args(
+            method=self.method,
+            control_group=self.control_group,
+            anticipation=self.anticipation,
+            bootstrap_weights=self.bootstrap_weights,
+            vcov_type=self.vcov_type,
+        )
+        # Recompute the explicit-vcov flag after any vcov_type mutation.
+        self._vcov_type_explicit = self.vcov_type != "hc1"
         return self
 
     def fit(
@@ -444,6 +519,42 @@ def fit(
                 "Set n_bootstrap=0 for analytic survey SEs."
             )
 
+        # 0d. Reject survey_design + non-hc1 analytical family. The survey-
+        # design TSL (or replicate-weight refit) variance overrides the
+        # analytical sandwich, so the requested HC2/HC2-BM/classical family
+        # would be silently discarded. Mirrors the SunAbraham PR #472 pattern
+        # at ``sun_abraham.py:688-705``. Use vcov_type='hc1' (default) for
+        # survey designs.
+        if survey_design is not None and self.vcov_type != "hc1":
+            raise NotImplementedError(
+                f"WooldridgeDiD(vcov_type={self.vcov_type!r}) with "
+                "survey_design is not yet supported: the survey-design TSL "
+                "(or replicate-weight refit) variance overrides the analytical "
+                "sandwich, so the requested HC2/HC2-BM/classical family would "
+                "be silently discarded. Use vcov_type='hc1' (default) for "
+                "survey designs; the survey TSL machinery computes the "
+                "design-based variance independently."
+            )
+
+        # 0e. Reject bootstrap + explicit one-way vcov_type without user-set
+        # cluster. The multiplier bootstrap is fundamentally clustered (it
+        # draws per-cluster weights); under explicit ``vcov_type in {"hc2",
+        # "classical"}`` with ``self.cluster=None``, the OLS path drops the
+        # unit auto-cluster for the analytical sandwich (mirrors SA), which
+        # would leave the bootstrap with no cluster ID to draw weights at.
+        # The user must either provide an explicit ``cluster=X`` or use a
+        # cluster-compatible ``vcov_type`` ("hc1" or "hc2_bm").
+        if self.n_bootstrap > 0 and self.vcov_type in ("hc2", "classical") and self.cluster is None:
+            raise ValueError(
+                f"WooldridgeDiD(vcov_type={self.vcov_type!r}, "
+                f"n_bootstrap={self.n_bootstrap}, cluster=None) is not "
+                "supported: the multiplier bootstrap is intrinsically "
+                "clustered, but the one-way vcov_type drops the unit "
+                "auto-cluster. Either set cluster='unit' (or another column) "
+                "or use vcov_type='hc1' / 'hc2_bm' for the analytical "
+                "sandwich."
+            )
+
         # 1. Filter to analysis sample
         sample = _filter_sample(df, unit, time, cohort, self.control_group, self.anticipation)
 
@@ -644,12 +755,42 @@ def _fit_ols(
         groups: List[Any],
         survey_design=None,
     ) -> WooldridgeDiDResults:
-        """OLS path: within-transform FE, solve_ols, cluster SE."""
+        """OLS path: within-transform or full-dummy FE, solve_ols, cluster SE.
+
+        Branches on ``self.vcov_type``: ``hc1`` (default) preserves the prior
+        within-transform path bit-equally; ``hc2``/``hc2_bm``/``classical``
+        auto-route to a full-dummy saturated design because FWL preserves
+        cohort coefficients but NOT the hat matrix (HC2 leverage and
+        Bell-McCaffrey Satterthwaite DOF require the full FE projection).
+        Mirrors the SunAbraham PR #472 pattern at ``sun_abraham.py:1364``.
+        """
         # Reset index so numpy positional indexing matches pandas groupby
         sample = sample.reset_index(drop=True)
-        # Cluster IDs (default: unit level) — needed before survey resolution
-        cluster_col = self.cluster if self.cluster else unit
-        cluster_ids = sample[cluster_col].values
+        # Cluster IDs: default to unit level for hc1/hc2_bm; drop the auto-
+        # cluster when the user opts into one-way ``vcov_type in {"hc2",
+        # "classical"}`` explicitly (one-way families don't compose with
+        # cluster_ids per the linalg validator). Explicit ``self.cluster=X``
+        # always wins. Mirrors SunAbraham PR #472 at ``sun_abraham.py:792-797``.
+        if self.cluster is not None:
+            cluster_col: Optional[str] = self.cluster
+            cluster_ids: Optional[np.ndarray] = sample[cluster_col].values
+        elif self.vcov_type in ("hc2", "classical") and self._vcov_type_explicit:
+            cluster_col = None
+            cluster_ids = None
+        else:
+            cluster_col = unit
+            cluster_ids = sample[cluster_col].values
+        # Bootstrap cluster level: user's ``self.cluster`` if set, else unit
+        # (the panel's natural unit of variation). This preserves the prior
+        # behavior: on hc1/hc2_bm paths and under explicit ``cluster=X`` the
+        # bootstrap cluster equals the analytical cluster. The only case where
+        # the two differ is explicit one-way + ``cluster=None`` (analytical
+        # ``cluster_ids`` is None, bootstrap IDs are unit), and the fit()
+        # guard rejects ``n_bootstrap > 0`` there, so these IDs go unused in
+        # that case. In every combination that reaches the bootstrap, it
+        # draws weights at the analytical cluster level.
+        bootstrap_cluster_col = self.cluster if self.cluster else unit
+        cluster_ids_bootstrap = sample[bootstrap_cluster_col].values
 
         # Resolve survey design, inject cluster as PSU only when user explicitly set cluster=
         survey_cluster_ids = cluster_ids if self.cluster else None
@@ -657,57 +798,111 @@ def _fit_ols(
             _resolve_survey_for_wooldridge(survey_design, sample, survey_cluster_ids, self.cluster)
         )
 
-        # 4. Within-transform: absorb unit + time FE
-        all_vars = [outcome] + [f"_x{i}" for i in range(X_design.shape[1])]
-        tmp = sample[[unit, time]].copy()
-        tmp[outcome] = sample[outcome].values
-        for i in range(X_design.shape[1]):
-            tmp[f"_x{i}"] = X_design[:, i]
-
-        # Use iterative alternating projections for demeaning (exact for
-        # both balanced and unbalanced panels).  Survey weights change the
-        # weighted FWL projection — all columns (treatment interactions +
-        # covariates) are demeaned together.
-        wt_weights = survey_weights if survey_weights is not None else np.ones(len(tmp))
-
-        # Guard: zero-weight unit/time groups cause 0/0 in within_transform
-        if survey_weights is not None and np.any(survey_weights == 0):
-            sw_series = pd.Series(survey_weights, index=sample.index)
-            for grp_col, grp_label in [(unit, "unit"), (time, "time period")]:
-                grp_sums = sw_series.groupby(sample[grp_col]).sum()
-                zero_grps = grp_sums[grp_sums == 0].index.tolist()
-                if zero_grps:
-                    raise ValueError(
-                        f"Survey weights sum to zero for {grp_label}(s) "
-                        f"{zero_grps[:3]}. Cannot compute weighted "
-                        f"within-transformation. Remove zero-weight "
-                        f"{grp_label}s or use non-zero weights."
-                    )
-
-        transformed = within_transform(
-            tmp, all_vars, unit=unit, time=time, suffix="_demeaned",
-            weights=wt_weights,
-        )
+        # Branch design build on vcov_type. ``hc1`` keeps within-transform
+        # (FWL preserves the CR1 cluster-robust score → bit-equal to prior
+        # at atol=1e-14). The full-dummy branch handles hc2 / hc2_bm /
+        # classical, which need the hat matrix on the full FE projection
+        # (FWL does not preserve it). ``coef_offset`` shifts gt_effects
+        # indexing to account for the intercept under full-dummy.
+        use_full_dummy = self.vcov_type in ("hc2", "hc2_bm", "classical")
+
+        if use_full_dummy:
+            # Full-dummy build: [intercept, X_design, unit_dummies,
+            # time_dummies]. Survey + non-hc1 was rejected at fit(), so
+            # survey_weights / resolved are None here. ``coef_offset = 1``
+            # shifts the gt_effects loop to skip the intercept.
+            n_obs = len(sample)
+            n_units_fe = int(sample[unit].nunique())
+            n_times_fe = int(sample[time].nunique())
+            dense_cells = n_obs * (1 + X_design.shape[1] + (n_units_fe - 1) + (n_times_fe - 1))
+            if dense_cells > 50_000_000:
+                warnings.warn(
+                    f"WooldridgeDiD(vcov_type={self.vcov_type!r}) builds a "
+                    f"dense full-dummy saturated design (~{dense_cells:,} "
+                    "float64 cells, >50M). FWL preserves coefficients but not "
+                    "the hat matrix, so HC2/HC2-BM/classical requires the full-"
+                    "dummy projection (within-transform would produce a "
+                    "methodologically different statistic). For very high-"
+                    "cardinality panels, consider vcov_type='hc1' (within-"
+                    "transform) or reducing the panel size.",
+                    UserWarning,
+                    stacklevel=3,
+                )
+            intercept_col = np.ones((n_obs, 1))
+            unit_dummies = pd.get_dummies(
+                sample[unit], prefix=f"_fe_{unit}", drop_first=True
+            ).values.astype(float)
+            time_dummies = pd.get_dummies(
+                sample[time], prefix=f"_fe_{time}", drop_first=True
+            ).values.astype(float)
+            X = np.hstack([intercept_col, X_design, unit_dummies, time_dummies])
+            y = sample[outcome].values.astype(float)
+            coef_offset = 1
+        else:
+            # 4. Within-transform path (hc1 default; preserves prior bit-equal
+            # behavior): absorb unit + time FE.
+            all_vars = [outcome] + [f"_x{i}" for i in range(X_design.shape[1])]
+            tmp = sample[[unit, time]].copy()
+            tmp[outcome] = sample[outcome].values
+            for i in range(X_design.shape[1]):
+                tmp[f"_x{i}"] = X_design[:, i]
+
+            # Use iterative alternating projections for demeaning (exact for
+            # both balanced and unbalanced panels).  Survey weights change
+            # the weighted FWL projection — all columns (treatment
+            # interactions + covariates) are demeaned together.
+            wt_weights = survey_weights if survey_weights is not None else np.ones(len(tmp))
+
+            # Guard: zero-weight unit/time groups cause 0/0 in within_transform
+            if survey_weights is not None and np.any(survey_weights == 0):
+                sw_series = pd.Series(survey_weights, index=sample.index)
+                for grp_col, grp_label in [(unit, "unit"), (time, "time period")]:
+                    grp_sums = sw_series.groupby(sample[grp_col]).sum()
+                    zero_grps = grp_sums[grp_sums == 0].index.tolist()
+                    if zero_grps:
+                        raise ValueError(
+                            f"Survey weights sum to zero for {grp_label}(s) "
+                            f"{zero_grps[:3]}. Cannot compute weighted "
+                            f"within-transformation. Remove zero-weight "
+                            f"{grp_label}s or use non-zero weights."
+                        )
 
-        y = transformed[f"{outcome}_demeaned"].values
-        X_cols = [f"_x{i}_demeaned" for i in range(X_design.shape[1])]
-        X = transformed[X_cols].values
+            transformed = within_transform(
+                tmp,
+                all_vars,
+                unit=unit,
+                time=time,
+                suffix="_demeaned",
+                weights=wt_weights,
+            )
 
-        # 6. Solve OLS (skip cluster-robust vcov when survey will provide TSL vcov)
+            y = transformed[f"{outcome}_demeaned"].values
+            X_cols = [f"_x{i}_demeaned" for i in range(X_design.shape[1])]
+            X = transformed[X_cols].values
+            coef_offset = 0
+
+        # 6. Solve OLS (skip cluster-robust vcov when survey will provide TSL vcov).
+        # Pass ``column_names=col_names`` only on the within-transform branch;
+        # under full-dummy ``X`` has additional intercept + FE columns whose
+        # names aren't in ``col_names``, and ``solve_ols`` only uses names for
+        # rank-deficiency error messages (cosmetic). Omitting under full-dummy
+        # keeps rank-deficiency reporting consistent with the column count.
         coefs, resids, vcov = solve_ols(
             X,
             y,
             cluster_ids=cluster_ids,
             return_vcov=(resolved is None),
             rank_deficient_action=self.rank_deficient_action,
-            column_names=col_names,
+            column_names=col_names if not use_full_dummy else None,
             weights=survey_weights,
             weight_type=survey_weight_type,
+            vcov_type=self.vcov_type,
         )
 
         # Survey TSL vcov replaces cluster-robust vcov
         if resolved is not None:
             from diff_diff.survey import compute_survey_vcov
+
             nan_mask_ols = np.isnan(coefs)
             if np.any(nan_mask_ols):
                 kept = ~nan_mask_ols
@@ -718,17 +913,28 @@ def _fit_ols(
             else:
                 vcov = compute_survey_vcov(X, resids, resolved)
 
-        # 7. Extract β_{g,t} and build gt_effects dict
+        # 7. Extract β_{g,t} and build gt_effects dict. Under full-dummy
+        # (``coef_offset = 1``), treatment cells occupy columns
+        # ``1..n_int`` (intercept at 0); under within-transform
+        # (``coef_offset = 0``), treatment cells occupy columns
+        # ``0..n_int-1``. The shift only affects this loop's indexing into
+        # ``coefs`` / ``vcov`` — the (g, t) key space and the order of
+        # ``gt_keys`` are identical across branches.
         gt_effects: Dict[Tuple, Dict] = {}
         gt_weights: Dict[Tuple, int] = {}
         for idx, (g, t) in enumerate(gt_keys):
-            if idx >= len(coefs):
+            coef_idx = idx + coef_offset
+            if coef_idx >= len(coefs):
                 break
             # Skip cells whose coefficient was dropped (rank deficiency)
-            if np.isnan(coefs[idx]):
+            if np.isnan(coefs[coef_idx]):
                 continue
-            att = float(coefs[idx])
-            se = float(np.sqrt(max(vcov[idx, idx], 0.0))) if vcov is not None else float("nan")
+            att = float(coefs[coef_idx])
+            se = (
+                float(np.sqrt(max(vcov[coef_idx, coef_idx], 0.0)))
+                if vcov is not None
+                else float("nan")
+            )
             t_stat, p_value, conf_int = safe_inference(att, se, alpha=self.alpha, df=df_inf)
             gt_effects[(g, t)] = {
                 "att": att,
@@ -739,19 +945,111 @@ def _fit_ols(
             }
             gt_weights[(g, t)] = int(((sample[cohort] == g) & (sample[time] == t)).sum())
 
-        # Extract vcov submatrix for identified β_{g,t} only (skip NaN/dropped)
+        # Extract vcov submatrix for identified β_{g,t} only (skip NaN/dropped).
+        # Shift by coef_offset so the submatrix lands on treatment cells
+        # under full-dummy.
         gt_keys_ordered = list(gt_effects.keys())
         if vcov is not None and gt_keys_ordered:
-            # Map from gt_keys_ordered to original indices in the coef vector
-            orig_indices = [i for i, k in enumerate(gt_keys) if k in gt_effects]
+            orig_indices = [i + coef_offset for i, k in enumerate(gt_keys) if k in gt_effects]
             gt_vcov = vcov[np.ix_(orig_indices, orig_indices)]
         else:
             gt_vcov = None
 
-        # 8. Simple aggregation (always computed)
-        overall = _compute_weighted_agg(
-            gt_effects, gt_weights, gt_keys_ordered, gt_vcov, self.alpha, df=df_inf
-        )
+        # 8. Bell-McCaffrey contrast DOF threading for the overall ATT under
+        # ``vcov_type="hc2_bm"``. Per ``feedback_bm_contrast_dof_fail_closed``,
+        # when the BM DOF is unavailable (helper raises or returns non-finite)
+        # the user-facing aggregated inference must emit ALL-NaN
+        # (t_stat/p_value/conf_int) rather than fall back to
+        # ``safe_inference(df=None)`` which silently uses normal-theory.
+        # Mirrors the SunAbraham PR #472 pattern at
+        # ``sun_abraham.py:1008-1097`` and the StackedDiD PR #479 R3 fix.
+        overall_att_bm_dof: Optional[float] = None
+        if (
+            self.vcov_type == "hc2_bm"
+            and use_full_dummy
+            and resolved is None
+            and vcov is not None
+            and gt_keys_ordered
+            and cluster_ids is not None
+        ):
+            from diff_diff.linalg import _compute_cr2_bm_contrast_dof
+
+            n_coefs = X.shape[1]
+            # Build the overall-ATT post-period-average contrast in full-coef
+            # space. Post-period (g, t) cells have ``t >= g``; weights are
+            # ``gt_weights[k] / w_total_post`` where ``w_total_post`` is the
+            # sum of cell weights across post-period (g, t) keys present in
+            # ``gt_effects``. Non-post cells get zero weight; non-treatment
+            # columns (intercept, FE dummies) get zero weight.
+            post_keys = [(g, t) for (g, t) in gt_keys_ordered if t >= g]
+            w_total_post = sum(gt_weights.get(k, 0) for k in post_keys)
+            if w_total_post > 0:
+                contrast_vec = np.zeros(n_coefs)
+                for i, k in enumerate(gt_keys):
+                    if k in gt_effects and k in post_keys:
+                        contrast_vec[i + coef_offset] = gt_weights[k] / w_total_post
+                if np.any(contrast_vec != 0):
+                    bread_matrix = X.T @ X
+                    try:
+                        dof_vec = _compute_cr2_bm_contrast_dof(
+                            X,
+                            cluster_ids,
+                            bread_matrix,
+                            contrast_vec.reshape(-1, 1),
+                        )
+                        candidate = float(dof_vec[0])
+                        overall_att_bm_dof = candidate if np.isfinite(candidate) else float("nan")
+                    except (ValueError, np.linalg.LinAlgError) as exc:
+                        warnings.warn(
+                            f"WooldridgeDiD(vcov_type='hc2_bm') aggregated "
+                            f"inference could not compute Bell-McCaffrey "
+                            f"contrast DOF ({type(exc).__name__}: {exc}). "
+                            "Overall ATT inference (t_stat / p_value / "
+                            "conf_int) will be NaN to preserve the hc2_bm "
+                            "contract.",
+                            UserWarning,
+                            stacklevel=3,
+                        )
+                        overall_att_bm_dof = float("nan")
+
+        # 8a. Simple aggregation (always computed). Use BM contrast DOF for
+        # the overall ATT inference when ``vcov_type='hc2_bm'``; otherwise
+        # fall back to the shared df (survey df or None). Fail-closed: when
+        # BM DOF is NaN, the analytical sandwich inference fields are NaN
+        # too (see ``feedback_bm_contrast_dof_fail_closed``).
+        if self.vcov_type == "hc2_bm" and use_full_dummy and resolved is None:
+            if overall_att_bm_dof is not None and np.isfinite(overall_att_bm_dof):
+                overall = _compute_weighted_agg(
+                    gt_effects,
+                    gt_weights,
+                    gt_keys_ordered,
+                    gt_vcov,
+                    self.alpha,
+                    df=overall_att_bm_dof,
+                )
+            else:
+                # BM DOF unavailable: take att + se from a run with the shared
+                # ``df_inf`` (so the user-facing att/se still match the
+                # sandwich), then NaN-out the inference fields.
+                overall = _compute_weighted_agg(
+                    gt_effects,
+                    gt_weights,
+                    gt_keys_ordered,
+                    gt_vcov,
+                    self.alpha,
+                    df=df_inf,
+                )
+                overall = {
+                    "att": overall["att"],
+                    "se": overall["se"],
+                    "t_stat": float("nan"),
+                    "p_value": float("nan"),
+                    "conf_int": (float("nan"), float("nan")),
+                }
+        else:
+            overall = _compute_weighted_agg(
+                gt_effects, gt_weights, gt_keys_ordered, gt_vcov, self.alpha, df=df_inf
+            )
 
         # Metadata
         n_treated = int(sample[sample[cohort] > 0][unit].nunique())
@@ -775,17 +1073,26 @@ def _fit_ols(
             alpha=self.alpha,
             anticipation=self.anticipation,
             survey_metadata=survey_metadata,
+            vcov_type=self.vcov_type,
+            cluster_name=cluster_col,
+            n_clusters=(int(np.unique(cluster_ids).size) if cluster_ids is not None else None),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
             _df_survey=df_inf,
         )
 
-        # 9. Optional multiplier bootstrap (overrides analytic SE for overall ATT)
+        # 9. Optional multiplier bootstrap (overrides analytic SE for overall ATT).
+        # Clusters at ``self.cluster`` when set, else at the unit level (via
+        # ``cluster_ids_bootstrap``), so the bootstrap remains intrinsically
+        # clustered even when ``vcov_type in {"hc2","classical"}`` drops the
+        # auto-cluster for the analytical vcov. The fit() guard at the top
+        # rejects ``n_bootstrap > 0`` + one-way + ``cluster=None``, so under
+        # any combination that reaches here the bootstrap cluster level
+        # matches the analytical cluster level.
         if self.n_bootstrap > 0:
             rng = np.random.default_rng(self.seed)
-            # Draw weights at the analytic cluster level (not always unit)
-            unique_boot_clusters = np.unique(cluster_ids)
+            unique_boot_clusters = np.unique(cluster_ids_bootstrap)
             n_boot_clusters = len(unique_boot_clusters)
             post_keys = [(g, t) for (g, t) in gt_keys_ordered if t >= g]
             w_total_b = sum(gt_weights.get(k, 0) for k in post_keys)
@@ -805,21 +1112,29 @@ def _fit_ols(
                         p=[phi / np.sqrt(5), (phi - 1) / np.sqrt(5)],
                         size=n_boot_clusters,
                     )
-                obs_weights = cl_weights[np.searchsorted(unique_boot_clusters, cluster_ids)]
+                obs_weights = cl_weights[
+                    np.searchsorted(unique_boot_clusters, cluster_ids_bootstrap)
+                ]
                 y_boot = y + obs_weights * resids
+                # Thread vcov_type for grep consistency (no-op at runtime
+                # because ``return_vcov=False``). Pass the analytical
+                # ``cluster_ids``, which would be ``None`` for an explicit
+                # one-way family with ``cluster=None``; the fit() guard
+                # keeps that combination from reaching this point.
                 coefs_b, _, _ = solve_ols(
                     X,
                     y_boot,
                     cluster_ids=cluster_ids,
                     return_vcov=False,
                     rank_deficient_action="silent",
+                    vcov_type=self.vcov_type,
                 )
                 if w_total_b > 0:
                     att_b = (
                         sum(
-                            gt_weights.get(k, 0) * float(coefs_b[i])
+                            gt_weights.get(k, 0) * float(coefs_b[i + coef_offset])
                             for i, k in enumerate(gt_keys)
-                            if k in post_keys and i < len(coefs_b)
+                            if k in post_keys and i + coef_offset < len(coefs_b)
                         )
                         / w_total_b
                     )
@@ -906,6 +1221,7 @@ def _fit_logit(
             # Bread: (X_tilde'WX_tilde)^{-1} = (X'diag(w*V)X)^{-1}
             # Scores: w*X_tilde*r_tilde = w*X*(y-mu)
             from diff_diff.survey import compute_survey_vcov
+
             V = probs * (1 - probs)
             sqrt_V = np.sqrt(np.clip(V, 1e-20, None))
             X_tilde = X_with_intercept * sqrt_V[:, None]
@@ -1031,7 +1347,9 @@ def _avg_ax0(a, cell_mask):
             overall_att = sum(gt_weights[k] * gt_effects[k]["att"] for k in post_keys) / w_total
             agg_grad = sum((gt_weights[k] / w_total) * gt_grads[k] for k in post_keys)
             overall_se = float(np.sqrt(max(agg_grad @ _vcov_se @ agg_grad, 0.0)))
-            t_stat, p_value, conf_int = safe_inference(overall_att, overall_se, alpha=self.alpha, df=df_inf)
+            t_stat, p_value, conf_int = safe_inference(
+                overall_att, overall_se, alpha=self.alpha, df=df_inf
+            )
             overall = {
                 "att": overall_att,
                 "se": overall_se,
@@ -1118,7 +1436,8 @@ def _fit_poisson(
         _has_survey = resolved is not None
 
         beta, mu_hat = solve_poisson(
-            X_full, y,
+            X_full,
+            y,
             rank_deficient_action=self.rank_deficient_action,
             weights=survey_weights,
         )
@@ -1137,6 +1456,7 @@ def _fit_poisson(
         if _has_survey:
             # X_tilde trick for nonlinear survey vcov (V = mu for Poisson)
             from diff_diff.survey import compute_survey_vcov
+
             sqrt_V = np.sqrt(np.clip(mu_hat, 1e-20, None))
             X_tilde = X_full * sqrt_V[:, None]
             r_tilde = resids / sqrt_V
@@ -1268,7 +1588,9 @@ def _avg_ax0(a, cell_mask):
             overall_att = sum(gt_weights[k] * gt_effects[k]["att"] for k in post_keys) / w_total
             agg_grad = sum((gt_weights[k] / w_total) * gt_grads[k] for k in post_keys)
             overall_se = float(np.sqrt(max(agg_grad @ _vcov_se @ agg_grad, 0.0)))
-            t_stat, p_value, conf_int = safe_inference(overall_att, overall_se, alpha=self.alpha, df=df_inf)
+            t_stat, p_value, conf_int = safe_inference(
+                overall_att, overall_se, alpha=self.alpha, df=df_inf
+            )
             overall = {
                 "att": overall_att,
                 "se": overall_se,
diff --git a/diff_diff/wooldridge_results.py b/diff_diff/wooldridge_results.py
index 1425f61e..fb3e5f98 100644
--- a/diff_diff/wooldridge_results.py
+++ b/diff_diff/wooldridge_results.py
@@ -56,6 +56,21 @@ class WooldridgeDiDResults:
     anticipation: int = 0
     survey_metadata: Optional[Any] = field(default=None, repr=False)
 
+    # Variance-family metadata. ``vcov_type`` records the configured analytical
+    # family ("classical", "hc1", "hc2", or "hc2_bm"); when ``survey_design=``
+    # is supplied the survey TSL (or replicate-weight refit) variance overrides
+    # this — the field still records the configured value and
+    # ``survey_metadata`` indicates the survey path was active. On bootstrap
+    # fits (``n_bootstrap > 0``) the SE comes from the multiplier bootstrap,
+    # not the analytical family. ``cluster_name`` / ``n_clusters`` are
+    # populated when the fit was clustered (default unit cluster, or
+    # user-set ``cluster=X``); both are ``None`` on explicit one-way
+    # (``vcov_type in {"classical","hc2"}`` + no user cluster) fits where
+    # the auto-cluster was dropped.
+    vcov_type: str = "hc1"
+    cluster_name: Optional[str] = None
+    n_clusters: Optional[int] = None
+
     # ------------------------------------------------------------------ #
     # Internal — used by aggregate() for delta-method SEs                 #
     # ------------------------------------------------------------------ #
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 2fff61d2..2bc727d8 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1486,6 +1486,17 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 - **Deviation from R:** R's `etwfe` package uses `fixest` for nonlinear paths; this implementation uses direct QMLE via `compute_robust_vcov` to avoid a statsmodels/fixest dependency.
 - **Note:** QMLE sandwich uses `weight_type="aweight"` which applies `(G/(G-1)) * ((n-1)/(n-k))` small-sample adjustment. Stata `jwdid` uses `G/(G-1)` only. The `(n-1)/(n-k)` term is conservative (inflates SEs slightly). For typical ETWFE panels where n >> k, the difference is negligible.
 
+*Variance families (`vcov_type`, OLS path only):*
+- `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11%; on typical larger panels (n >> k_total) the gap shrinks to <2%. The user can recover the `lm + CR1S` SE by passing `vcov_type="hc2_bm"` (full-dummy auto-route) or by manually constructing a full-dummy `solve_ols` call. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
+- `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2") + coef_test()$df_Satt` at atol=1e-10 (pinned in `tests/test_methodology_wooldridge.py`). The overall ATT's BM contrast DOF uses `_compute_cr2_bm_contrast_dof` on the post-period-aggregation contrast (matches `Wald_test(test="HTZ")$df_denom`).
+- `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. Matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively.
+- `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
+- `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` — REJECTED at `__init__`. GLM QMLE sandwich with HC2 leverage on canonical-link pseudo-residuals (`w = p(1-p)` for logit, `w = μ_i` for Poisson) needs CR2-BM-on-GLM derivation + R parity against `clubSandwich::vcovCR(glm(...))`. Tracked in TODO.md (WooldridgeDiD logit/poisson follow-up row).
+- `survey_design=` + `vcov_type != "hc1"` — REJECTED at `fit()` with `NotImplementedError`. Survey TSL/replicate-refit overrides analytical sandwich. Use `vcov_type="hc1"` (default) for survey designs.
+- `n_bootstrap > 0` + `vcov_type ∈ {"hc2","classical"}` + `self.cluster=None` — REJECTED at `fit()`. The multiplier bootstrap is intrinsically clustered; under explicit one-way + no user cluster, the bootstrap has no cluster ID to draw weights at. User must provide explicit `cluster=X` or use `vcov_type='hc1'` / `'hc2_bm'`.
+- **Note:** This routing is a documented synthesis of two existing methodology ingredients: the full-dummy auto-route from the Phase 1b PR 1/8 SunAbraham pattern (PR #472, which itself reused the Phase 1a Gate 1 TWFE lift from PR #469), and the clubSandwich WLS-CR2 algebra from the Phase 1a port (PR #475). The BM contrast DOF threading reuses `_compute_cr2_bm_contrast_dof` from PR #465 (MPD). No new methodology choice is introduced — the change is purely surface: extending the existing pattern from SA-OLS to WooldridgeDiD-OLS.
+- **Note:** Bootstrap clusters at `self.cluster if self.cluster else unit` regardless of `vcov_type`. When the analytical sandwich is one-way + the user set an explicit `cluster=X`, the bootstrap matches the user's cluster. The bootstrap SE overrides the analytical SE for `overall_*` on `n_bootstrap > 0` paths; per-cell `(g, t)` SEs still come from the analytical vcov.
+
 *Aggregations (matching `jwdid_estat`):*
 - `simple`: Weighted average across all post-treatment (g, t) cells with weights `n_{g,t}`:
 
diff --git a/tests/test_methodology_wooldridge.py b/tests/test_methodology_wooldridge.py
new file mode 100644
index 00000000..b7007331
--- /dev/null
+++ b/tests/test_methodology_wooldridge.py
@@ -0,0 +1,161 @@
+"""R-parity tests for WooldridgeDiD OLS path vcov_type variants.
+
+Pins ``WooldridgeDiD(method='ols', vcov_type=...)`` against R's
+``clubSandwich`` (CR2 + Bell-McCaffrey Satterthwaite DOF) and ``sandwich``
+(HC2) on the fixed-seed panel at ``benchmarks/data/wooldridge_test_panel.csv``.
+Goldens are generated by ``benchmarks/R/generate_wooldridge_golden.R``
+(requires ``clubSandwich >= 0.7.0`` and ``sandwich >= 3.0.0``).
+
+The ``hc1`` variant is NOT pinned against R here because the diff-diff
+within-transform finite-sample correction ``(n-1)/(n-k_dm)`` differs from
+``lm + clubSandwich::vcovCR(type="CR1S")``'s ``(n-1)/(n-k_total)`` correction;
+see ``docs/methodology/REGISTRY.md`` "Variance families" → "Deviation from R"
+for the algebra. The hc1 path is locked instead by
+``tests/test_wooldridge.py::TestWooldridgeVcovType::test_hc1_se_bit_equal_to_pre_pr_baseline``
+at ``atol=1e-14``.
+"""
+
+from __future__ import annotations
+
+import json
+from pathlib import Path
+
+import pandas as pd
+import pytest
+from scipy import stats
+from scipy.optimize import brentq
+
+from diff_diff.wooldridge import WooldridgeDiD
+
+GOLDEN_PATH = Path(__file__).parent.parent / "benchmarks" / "data" / "wooldridge_golden.json"
+PANEL_PATH = Path(__file__).parent.parent / "benchmarks" / "data" / "wooldridge_test_panel.csv"
+
+_R_FIXTURE_AVAILABLE = GOLDEN_PATH.is_file() and PANEL_PATH.is_file()
+
+
+def _recover_dof_from_ci(att: float, se: float, ci_hi: float, alpha: float) -> float:
+    """Recover the t-distribution DOF used to build a CI from its half-width.
+
+    Inverts ``ci_hi = att + t.ppf(1 - alpha/2, df) * se`` for ``df``. Used to
+    cross-check the BM contrast DOF threaded into Python's aggregated
+    inference without requiring the dataclass to expose the DOF as a direct
+    field (mirrors the SunAbraham / StackedDiD R-parity pattern).
+    """
+    t_crit_implied = (ci_hi - att) / se
+    return brentq(
+        lambda df: stats.t.ppf(1 - alpha / 2, df) - t_crit_implied,
+        1.5,
+        10000.0,
+    )
+
+
+@pytest.fixture(scope="module")
+def golden() -> dict:
+    if not _R_FIXTURE_AVAILABLE:
+        pytest.skip(
+            "R-parity fixture not present. Run "
+            "`Rscript benchmarks/R/generate_wooldridge_golden.R` "
+            "to regenerate `benchmarks/data/wooldridge_golden.json`."
+        )
+    with GOLDEN_PATH.open("r") as f:
+        return json.load(f)
+
+
+@pytest.fixture(scope="module")
+def panel() -> pd.DataFrame:
+    if not _R_FIXTURE_AVAILABLE:
+        pytest.skip("R-parity fixture not present.")
+    return pd.read_csv(PANEL_PATH)
+
+
+@pytest.mark.skipif(not _R_FIXTURE_AVAILABLE, reason="R-parity fixture not present.")
+class TestWooldridgeParityR:
+    """Pin Python WooldridgeDiD OLS vcov_type output against R `lm` + clubSandwich / sandwich."""
+
+    def test_interaction_coefs_match_lm(self, golden: dict, panel: pd.DataFrame) -> None:
+        """Point estimates (treatment-cell coefficients) match R `lm()` at atol=1e-10.
+
+        Identical across all 4 vcov_type variants (only SE differs); pin via
+        `vcov_type='hc2_bm'` (full-dummy branch).
+        """
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_keys = [(d["g"], d["t"]) for d in golden["point_estimates"]["gt_keys"]]
+        r_coefs = golden["point_estimates"]["interaction_coefs"]
+        for i, (g, t) in enumerate(r_keys):
+            py_att = res.group_time_effects[(g, t)]["att"]
+            assert py_att == pytest.approx(
+                r_coefs[i], abs=1e-10
+            ), f"(g={g}, t={t}): Py={py_att:.10f} R={r_coefs[i]:.10f}"
+
+    def test_hc2_bm_per_coef_se_matches_clubsandwich_cr2(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Per-treatment-cell CR2-BM SE matches `clubSandwich::vcovCR(..., type="CR2")` at atol=1e-10."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_keys = [(d["g"], d["t"]) for d in golden["point_estimates"]["gt_keys"]]
+        r_ses = golden["hc2_bm"]["per_coef_se"]
+        for i, (g, t) in enumerate(r_keys):
+            py_se = res.group_time_effects[(g, t)]["se"]
+            assert py_se == pytest.approx(
+                r_ses[i], abs=1e-10
+            ), f"(g={g}, t={t}): Py SE={py_se:.10f} R SE={r_ses[i]:.10f}"
+
+    def test_hc2_bm_overall_att_se_matches_clubsandwich_cr2(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Overall ATT SE matches the linear-combination SE from `clubSandwich::vcovCR(..., type="CR2")`."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_se = golden["hc2_bm"]["overall_att_se"]
+        assert res.overall_se == pytest.approx(r_se, abs=1e-10)
+
+    def test_hc2_bm_overall_att_contrast_dof_matches_wald_test_htz(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Overall ATT BM contrast DOF matches `Wald_test(test="HTZ")$df_denom` at atol=1e-6.
+
+        Inverts the Python CI half-width to recover the t-distribution DOF
+        (the WooldridgeDiDResults dataclass does not expose the BM contrast
+        DOF as a direct field; same approach as SunAbraham PR #472 /
+        StackedDiD PR #479).
+        """
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        py_dof = _recover_dof_from_ci(
+            res.overall_att, res.overall_se, res.overall_conf_int[1], res.alpha
+        )
+        r_dof = golden["hc2_bm"]["overall_att_contrast_dof"]
+        # brentq inversion tolerance + scipy stats roundtrip: 1e-6 is comfortable
+        # for a DOF in the 1.5..10000 bracket. The underlying clubSandwich CR2
+        # vcov matches at machine precision (~6e-16 on per-coef SE).
+        assert py_dof == pytest.approx(r_dof, abs=1e-6)
+
+    def test_classical_se_matches_lm_summary(self, golden: dict, panel: pd.DataFrame) -> None:
+        """`vcov_type='classical'` (drops auto-cluster) matches `summary(lm(...))$coefficients` SE."""
+        res = WooldridgeDiD(method="ols", vcov_type="classical").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_keys = [(d["g"], d["t"]) for d in golden["point_estimates"]["gt_keys"]]
+        r_ses = golden["classical"]["per_coef_se"]
+        for i, (g, t) in enumerate(r_keys):
+            py_se = res.group_time_effects[(g, t)]["se"]
+            assert py_se == pytest.approx(r_ses[i], abs=1e-10)
+        assert res.overall_se == pytest.approx(golden["classical"]["overall_att_se"], abs=1e-10)
+
+    def test_hc2_se_matches_sandwich_vcovhc(self, golden: dict, panel: pd.DataFrame) -> None:
+        """`vcov_type='hc2'` (drops auto-cluster) matches `sandwich::vcovHC(type="HC2")` SE."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_keys = [(d["g"], d["t"]) for d in golden["point_estimates"]["gt_keys"]]
+        r_ses = golden["hc2"]["per_coef_se"]
+        for i, (g, t) in enumerate(r_keys):
+            py_se = res.group_time_effects[(g, t)]["se"]
+            assert py_se == pytest.approx(r_ses[i], abs=1e-10)
+        assert res.overall_se == pytest.approx(golden["hc2"]["overall_att_se"], abs=1e-10)
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 2cedbf56..3f609094 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1640,3 +1640,256 @@ def test_select_sample_helper_warns(self):
                 df, unit="unit", time="time", cohort="cohort",
                 control_group="never_treated", anticipation=0,
             )
+
+
+def _make_vcov_panel(n_units=40, n_periods=6, seed=202605211230):
+    """Fixed-seed staggered panel for vcov_type tests.
+
+    Three cohorts (0=never, 3, 5), heterogeneous treatment effects (stronger
+    for cohort=3), 40 units × 6 periods. Heterogeneous effects per
+    ``feedback_homogeneous_dgp_no_twfe_bias`` — required for meaningful
+    TWFE-style bias-vs-corrected comparisons. The fixed seed and panel
+    shape are pinned by
+    ``TestWooldridgeVcovType::test_hc1_se_bit_equal_to_pre_pr_baseline``
+    against a hardcoded SE captured on the Phase 1b PR 3/8 branch
+    (commit-SHA-equivalent to ``origin/main`` at fork time: ``24de9062``).
+    """
+    rng = np.random.default_rng(seed)
+    units = np.repeat(np.arange(n_units), n_periods)
+    periods = np.tile(np.arange(1, n_periods + 1), n_units)
+    cohort_choices = [0, 3, 5]
+    cohorts = rng.choice(cohort_choices, size=n_units, p=[0.4, 0.3, 0.3])
+    cohort_per_obs = cohorts[units]
+    tau = np.where(
+        (cohort_per_obs > 0) & (periods >= cohort_per_obs),
+        0.4 + 0.25 * (periods - cohort_per_obs) + 0.3 * (cohort_per_obs == 3),
+        0.0,
+    )
+    y = 0.7 + 0.1 * periods + 0.05 * units + tau + 0.15 * rng.normal(size=len(units))
+    return pd.DataFrame(
+        {"unit": units, "time": periods, "cohort": cohort_per_obs, "y": y}
+    )
+
+
+class TestWooldridgeVcovType:
+    """Phase 1b PR 3/8: vcov_type input contract + branching for OLS path."""
+
+    def test_default_vcov_type_is_hc1(self):
+        est = WooldridgeDiD()
+        assert est.vcov_type == "hc1"
+        assert est._vcov_type_explicit is False
+
+    def test_hc1_se_bit_equal_to_pre_pr_baseline(self):
+        """HC1 within-transform path must match pre-PR baseline at atol=1e-14.
+
+        Baseline captured on the Phase 1b PR 3/8 branch with
+        ``_make_vcov_panel(seed=202605211230)``. FWL preserves the CR1
+        cluster-robust score, so the new ``vcov_type`` branching keeps HC1
+        bit-equal to the prior hard-coded HC1 behavior.
+        """
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc1").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert res.overall_att == pytest.approx(0.9178849934516247, abs=1e-14)
+        assert res.overall_se == pytest.approx(0.03149488781317814, abs=1e-14)
+
+    def test_hc2_bm_finite_and_inflates_over_hc1(self):
+        df = _make_vcov_panel()
+        res_hc1 = WooldridgeDiD(method="ols", vcov_type="hc1").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res_bm = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        for eff in res_bm.group_time_effects.values():
+            assert np.isfinite(eff["se"])
+        assert np.isfinite(res_bm.overall_se)
+        assert res_bm.overall_se > res_hc1.overall_se
+        # ATT identity across vcov branches (only SE differs)
+        assert res_bm.overall_att == pytest.approx(res_hc1.overall_att, abs=1e-10)
+
+    def test_atts_identical_across_vcov_branches(self):
+        """Per-cell ATT estimates must be identical across all 4 vcov branches
+        (within-transform hc1 vs full-dummy hc2_bm/hc2/classical)."""
+        df = _make_vcov_panel()
+        results = {}
+        for vt in ("hc1", "hc2_bm", "hc2", "classical"):
+            results[vt] = WooldridgeDiD(method="ols", vcov_type=vt).fit(
+                df, outcome="y", unit="unit", time="time", cohort="cohort"
+            )
+        ref = results["hc1"]
+        for vt in ("hc2_bm", "hc2", "classical"):
+            assert results[vt].overall_att == pytest.approx(ref.overall_att, abs=1e-10)
+            for k in ref.group_time_effects:
+                assert results[vt].group_time_effects[k]["att"] == pytest.approx(
+                    ref.group_time_effects[k]["att"], abs=1e-10
+                ), f"per-cell ATT diverged for vcov_type={vt!r} at cell {k}"
+
+    def test_classical_with_explicit_user_cluster_rejected_by_linalg(self):
+        df = _make_vcov_panel()
+        est = WooldridgeDiD(method="ols", vcov_type="classical", cluster="unit")
+        with pytest.raises(ValueError):
+            est.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+
+    def test_classical_drops_auto_cluster(self):
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="classical").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert np.isfinite(res.overall_se)
+        assert res.cluster_name is None
+        assert res.n_clusters is None
+
+    def test_hc2_drops_auto_cluster(self):
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert np.isfinite(res.overall_se)
+        assert res.cluster_name is None
+        assert res.n_clusters is None
+
+    def test_conley_rejected_at_init_with_deferral(self):
+        with pytest.raises(ValueError, match="conley"):
+            WooldridgeDiD(vcov_type="conley")
+
+    def test_invalid_vcov_type_rejected(self):
+        with pytest.raises(ValueError, match="hc4"):
+            WooldridgeDiD(vcov_type="hc4")
+
+    def test_logit_plus_hc2_bm_rejected_at_init(self):
+        with pytest.raises(NotImplementedError, match=r"method='logit'"):
+            WooldridgeDiD(method="logit", vcov_type="hc2_bm")
+
+    def test_poisson_plus_hc2_bm_rejected_at_init(self):
+        with pytest.raises(NotImplementedError, match=r"method='poisson'"):
+            WooldridgeDiD(method="poisson", vcov_type="hc2_bm")
+
+    def test_logit_plus_hc1_default_preserved(self):
+        # method='logit' + vcov_type='hc1' (default) must NOT raise —
+        # preserves the prior nonlinear path bit-equally.
+        est = WooldridgeDiD(method="logit", vcov_type="hc1")
+        assert est.method == "logit"
+        assert est.vcov_type == "hc1"
+
+    def test_survey_design_plus_hc2_bm_rejected(self):
+        from diff_diff.survey import SurveyDesign
+
+        df = _make_vcov_panel()
+        df["w"] = 1.0
+        design = SurveyDesign(weights="w", weight_type="pweight")
+        est = WooldridgeDiD(method="ols", vcov_type="hc2_bm")
+        with pytest.raises(NotImplementedError, match=r"survey_design"):
+            est.fit(
+                df, outcome="y", unit="unit", time="time", cohort="cohort",
+                survey_design=design,
+            )
+
+    def test_survey_design_plus_classical_rejected(self):
+        from diff_diff.survey import SurveyDesign
+
+        df = _make_vcov_panel()
+        df["w"] = 1.0
+        design = SurveyDesign(weights="w", weight_type="pweight")
+        est = WooldridgeDiD(method="ols", vcov_type="classical")
+        with pytest.raises(NotImplementedError, match=r"survey_design"):
+            est.fit(
+                df, outcome="y", unit="unit", time="time", cohort="cohort",
+                survey_design=design,
+            )
+
+    def test_bootstrap_plus_one_way_no_user_cluster_rejected(self):
+        df = _make_vcov_panel()
+        est = WooldridgeDiD(
+            method="ols", vcov_type="classical", n_bootstrap=10, seed=0
+        )
+        with pytest.raises(ValueError, match=r"multiplier bootstrap"):
+            est.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+
+    def test_get_params_includes_vcov_type(self):
+        est = WooldridgeDiD(vcov_type="hc2_bm")
+        params = est.get_params()
+        assert params["vcov_type"] == "hc2_bm"
+        # Round-trip via get_params → __init__
+        est2 = WooldridgeDiD(**params)
+        assert est2.vcov_type == "hc2_bm"
+
+    def test_set_params_revalidates_vcov_type(self):
+        est = WooldridgeDiD()
+        with pytest.raises(ValueError, match="hc4"):
+            est.set_params(vcov_type="hc4")
+
+    def test_set_params_catches_method_vcov_interaction(self):
+        est = WooldridgeDiD(method="ols", vcov_type="hc1")
+        with pytest.raises(NotImplementedError):
+            est.set_params(method="logit", vcov_type="hc2_bm")
+
+    def test_set_params_updates_vcov_type_explicit_flag(self):
+        est = WooldridgeDiD(vcov_type="hc1")
+        assert est._vcov_type_explicit is False
+        est.set_params(vcov_type="hc2_bm")
+        assert est._vcov_type_explicit is True
+        est.set_params(vcov_type="hc1")
+        assert est._vcov_type_explicit is False
+
+    def test_results_carries_vcov_type(self):
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert res.vcov_type == "hc2_bm"
+
+    def test_results_carries_cluster_name_for_clustered_fit(self):
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc1").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert res.cluster_name == "unit"
+        assert res.n_clusters is not None
+        assert res.n_clusters > 0
+
+    def test_explicit_user_cluster_preserved_under_hc1(self):
+        df = _make_vcov_panel()
+        # Synthetic state column with 4 levels — 10 units per state on the
+        # 40-unit panel
+        df["state"] = (df["unit"] // 10).astype(int)
+        res = WooldridgeDiD(method="ols", vcov_type="hc1", cluster="state").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        assert res.cluster_name == "state"
+        assert res.n_clusters == 4
+
+    def test_fit_clone_idempotent_on_vcov_type(self):
+        """fit, clone via get_params, refit — SE must be bit-equal."""
+        df = _make_vcov_panel()
+        est = WooldridgeDiD(method="ols", vcov_type="hc2_bm")
+        res1 = est.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+        est2 = WooldridgeDiD(**est.get_params())
+        res2 = est2.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+        assert res1.overall_se == pytest.approx(res2.overall_se, abs=1e-14)
+        assert res1.overall_att == pytest.approx(res2.overall_att, abs=1e-14)
+
+    def test_bm_dof_nan_fails_closed(self, monkeypatch):
+        """When ``_compute_cr2_bm_contrast_dof`` returns NaN, the overall ATT
+        inference fields (t_stat / p_value / conf_int) MUST be NaN — do NOT
+        fall back to ``safe_inference(df=None)`` which silently uses
+        normal-theory. Per ``feedback_bm_contrast_dof_fail_closed``.
+        """
+        df = _make_vcov_panel()
+        import diff_diff.linalg as linalg_mod
+
+        def _fake_dof(X, cluster_ids, bread, contrasts):
+            return np.full(contrasts.shape[1], np.nan)
+
+        monkeypatch.setattr(linalg_mod, "_compute_cr2_bm_contrast_dof", _fake_dof)
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        # att and se preserved (sandwich is finite); inference fields NaN
+        assert np.isfinite(res.overall_att)
+        assert np.isfinite(res.overall_se)
+        assert np.isnan(res.overall_t_stat)
+        assert np.isnan(res.overall_p_value)
+        assert np.isnan(res.overall_conf_int[0])
+        assert np.isnan(res.overall_conf_int[1])

From ce18fca9a3f582bc600e9d089f993df3f8e146dc Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:19:25 -0400
Subject: [PATCH 02/11] wooldridge: thread BM Satterthwaite DOF across all
 hc2_bm inference surfaces

Address codex R1 P1 findings: per-cell group_time_effects and
.aggregate("group"/"calendar"/"event") were using df=None
(normal-theory) on the hc2_bm path. Both surfaces now use BM contrast
DOFs to match the documented clubSandwich parity contract:

- Per-cell df_Satt computed in one batched _compute_cr2_bm_contrast_dof
  call at fit time (one-hot contrasts per (g,t)); matches
  coef_test()$df_Satt at atol=1e-6 (CI inversion).
- Overall ATT DOF computed in the same batched call (post-period
  aggregation contrast).
- BM artifacts (X_full, cluster_ids, bread, coef_index_map) stashed on
  WooldridgeDiDResults so .aggregate("group"/"calendar"/"event")
  recomputes contrast-specific BM DOFs lazily.
- Fail-closed everywhere: NaN inference fields when BM DOF unavailable,
  never silent normal-theory fallback (per
  feedback_bm_contrast_dof_fail_closed).

New tests: per-cell df_Satt R-parity (1), aggregate(group/event/calendar)
BM DOF threading (3), aggregate fail-closed on helper failure (1). The
existing fail-closed test is extended to cover per-cell NaN propagation.
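
A minimal sketch of how the batched contrast matrix described above
could be laid out, assuming contrasts are stored as columns (one
one-hot column per (g,t) cell, plus a final weighted post-period
column for the overall ATT). `build_bm_contrasts` and its arguments
are hypothetical names for illustration, not the code in this patch;
plain lists are used so the sketch has no dependencies:

```python
def build_bm_contrasts(n_coefs, coef_index_map, gt_weights):
    """Hypothetical helper: build a (n_coefs x m+1) contrast matrix.

    coef_index_map maps each (g, t) cell to its coefficient row in the
    full-dummy design; gt_weights maps (g, t) to its aggregation weight.
    """
    keys = list(coef_index_map)
    n_contrasts = len(keys) + 1
    # Rows index coefficients, columns index contrasts.
    contrasts = [[0.0] * n_contrasts for _ in range(n_coefs)]

    # One-hot column per (g, t) cell: picks out that cell's coefficient.
    for j, key in enumerate(keys):
        contrasts[coef_index_map[key]][j] = 1.0

    # Last column: weighted average over post-treatment cells (t >= g).
    post_keys = [(g, t) for (g, t) in keys if t >= g]
    w_total = sum(gt_weights.get(k, 0.0) for k in post_keys)
    if w_total > 0:
        for k in post_keys:
            contrasts[coef_index_map[k]][-1] = gt_weights.get(k, 0.0) / w_total
    return keys, contrasts
```

Under this layout a single DOF-helper call would return one value per
column: the first len(keys) entries for the per-cell inference and the
last entry for the overall ATT.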

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                         |   2 +-
 diff_diff/wooldridge.py              | 147 +++++++++++++++++++--------
 diff_diff/wooldridge_results.py      | 139 ++++++++++++++++++++++---
 docs/methodology/REGISTRY.md         |   2 +-
 tests/test_methodology_wooldridge.py |  25 +++++
 tests/test_wooldridge.py             |  96 ++++++++++++++++-
 6 files changed, 348 insertions(+), 63 deletions(-)
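
Reviewer note (illustrative only, not part of the patch): a minimal
pure-Python sketch of how the batched contrast matrix described in the
commit message might be assembled, assuming one one-hot column per (g, t)
cell followed by a single weighted post-period overall-ATT column. The names
`build_contrasts` and `sanitize_dofs` are hypothetical; the patch itself uses
NumPy arrays and hands the stacked columns to `_compute_cr2_bm_contrast_dof`.

```python
import math
from typing import Dict, Hashable, List, Sequence, Tuple

Cell = Tuple[int, int]  # (cohort g, period t)


def build_contrasts(
    coef_index: Dict[Cell, int],
    cell_weights: Dict[Cell, int],
    n_coefs: int,
) -> List[List[float]]:
    """Return contrast columns: one one-hot per cell, then the overall ATT."""
    columns: List[List[float]] = []
    for idx in coef_index.values():
        col = [0.0] * n_coefs
        col[idx] = 1.0  # per-cell contrast picks out a single coefficient
        columns.append(col)

    # Overall ATT: weighted average over post-period cells (t >= g).
    post = [c for c in coef_index if c[1] >= c[0]]
    w_total = sum(cell_weights.get(c, 0) for c in post)
    if w_total > 0:
        overall = [0.0] * n_coefs
        for c in post:
            overall[coef_index[c]] = cell_weights.get(c, 0) / w_total
        columns.append(overall)
    return columns


def sanitize_dofs(keys: Sequence[Hashable], dof_vec: Sequence[float]) -> Dict[Hashable, float]:
    """Fail closed: any non-finite DOF becomes NaN, never a silent default."""
    return {
        k: (float(d) if math.isfinite(d) else float("nan"))
        for k, d in zip(keys, dof_vec)
    }


# Hypothetical layout: treatment-cell coefficients sit at columns 3..5.
coef_index = {(2, 1): 3, (2, 2): 4, (3, 3): 5}
cell_weights = {(2, 1): 10, (2, 2): 10, (3, 3): 30}
cols = build_contrasts(coef_index, cell_weights, n_coefs=6)
dofs = sanitize_dofs(["a", "b"], [12.5, float("inf")])
```

In this sketch the pre-period cell (2, 1) gets a per-cell contrast but zero
weight in the overall column, and a DOF of `inf` is mapped to NaN so that
downstream inference fields would be NaN rather than normal-theory values.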

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 852f55ad..cfed85c2 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded into the overall ATT inference for hc2_bm** via `_compute_cr2_bm_contrast_dof` on the post-period-aggregation contrast (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). `n_bootstrap > 0` + explicit one-way (`hc2`/`classical`) + `cluster=None` raises at `fit()` (multiplier bootstrap is intrinsically clustered). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). 
Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
+- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). 
`n_bootstrap > 0` + explicit one-way (`hc2`/`classical`) + `cluster=None` raises at `fit()` (multiplier bootstrap is intrinsically clustered). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
 - **ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473) and ContinuousDiD precedent (PR #476). REGISTRY `## ChaisemartinDHaultfoeuille` gains a formal `### Deviations from the paper / from R / library extensions` block consolidating 7 documented deviations into a single AI-review-recognized labeled surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)"): (D1) equal-cell weighting (deviation from BOTH AER 2020 Equation 3 AND R `DIDmultiplegtDYN`); (D2) period-based vs cohort-based stable controls; (D3) balanced-baseline panel + interior-gap drops + terminal-missingness retention + cell-period-allocator targeted `ValueError`; (D4) SE normalization `N_l` vs R `G` (~4% smaller analytical SE); (D5) singleton-cohort degeneracy → NaN with `UserWarning`; (D6) `<50%` switcher warning at far horizons (library extension citing Favara-Imbs application, footnote 14 of NBER WP 29873); (D7) Phase 3 `DID^X` covariate first-stage equal-cell weights. R cross-language coverage holds at documented tolerance bands in `tests/test_chaisemartin_dhaultfoeuille_parity.py` (`POINT_RTOL = 1e-4` on pure-direction point estimates, `MIXED_POINT_RTOL = 0.025` on mixed-direction, `PURE_DIRECTION_SE_RTOL = 0.05` on pure-direction SE, `SE_RTOL = 0.10` on multi-horizon SE, `se_rtol=0.15` on the long-panel `L_max=5` joiners-only scenario where cell-count-weighting compounds). 
No source code changes, no new tests, no new docstrings — consolidation only against the existing 12 methodology tests (`tests/test_methodology_chaisemartin_dhaultfoeuille.py`), 26 R-parity tests (`tests/test_chaisemartin_dhaultfoeuille_parity.py`), 352 unit tests (`tests/test_chaisemartin_dhaultfoeuille.py`), survey suites (`tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, three cell-period coverage suites), and two primary-source DCDH paper reviews on disk (2020 AER + 2022/2023 NBER WP 29873 via PR #478; the `dechaisemartin-2026-review.md` on disk is HAD's primary source, not DCDH's, and is referenced as adjacent context only). The REGISTRY Deviations block uses semantic section-name anchors (rather than fragile line numbers) for back-references to other parts of the DCDH section — an intentional divergence from the PR #476 ContinuousDiD precedent reflecting PR-A wording-drift CI feedback that flagged line-number cross-references as drift-prone in long sections. `METHODOLOGY_REVIEW.md` DCDH row promoted **In Progress** → **Complete**; L27 In Progress example paragraph re-pointed to WooldridgeDiD; L1289 priority-order queue item #6 (DCDH) removed and items #7-#11 renumbered to #6-#10.
 
 ## [3.4.1] - 2026-05-21
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index 31aa1dca..314a2a01 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -43,7 +43,7 @@ def _compute_weighted_agg(
     gt_keys: List,
     gt_vcov: Optional[np.ndarray],
     alpha: float,
-    df: Optional[int] = None,
+    df: Optional[float] = None,
 ) -> Dict:
     """Compute simple (overall) weighted average ATT and SE via delta method."""
     post_keys = [(g, t) for (g, t) in gt_keys if t >= g]
@@ -920,8 +920,14 @@ def _fit_ols(
         # ``0..n_int-1``. The shift only affects this loop's indexing into
         # ``coefs`` / ``vcov`` — the (g, t) key space and the order of
         # ``gt_keys`` are identical across branches.
+        #
+        # First pass: collect non-dropped (g, t) cells + att + se. Inference
+        # fields are NaN placeholders here and are filled in step 8a: hc2_bm
+        # uses the per-cell df_Satt from the batched BM DOF call in step 8,
+        # all other paths use the shared df_inf (survey df or None).
         gt_effects: Dict[Tuple, Dict] = {}
         gt_weights: Dict[Tuple, int] = {}
+        gt_coef_index_map: Dict[Tuple, int] = {}  # (g, t) -> full-coef-space index
         for idx, (g, t) in enumerate(gt_keys):
             coef_idx = idx + coef_offset
             if coef_idx >= len(coefs):
@@ -935,15 +941,15 @@ def _fit_ols(
                 if vcov is not None
                 else float("nan")
             )
-            t_stat, p_value, conf_int = safe_inference(att, se, alpha=self.alpha, df=df_inf)
             gt_effects[(g, t)] = {
                 "att": att,
                 "se": se,
-                "t_stat": t_stat,
-                "p_value": p_value,
-                "conf_int": conf_int,
+                "t_stat": float("nan"),
+                "p_value": float("nan"),
+                "conf_int": (float("nan"), float("nan")),
             }
             gt_weights[(g, t)] = int(((sample[cohort] == g) & (sample[time] == t)).sum())
+            gt_coef_index_map[(g, t)] = coef_idx
 
         # Extract vcov submatrix for identified β_{g,t} only (skip NaN/dropped).
         # Shift by coef_offset so the submatrix lands on treatment cells
@@ -955,15 +961,26 @@ def _fit_ols(
         else:
             gt_vcov = None
 
-        # 8. Bell-McCaffrey contrast DOF threading for the overall ATT under
-        # ``vcov_type="hc2_bm"``. Per ``feedback_bm_contrast_dof_fail_closed``,
-        # when the BM DOF is unavailable (helper raises or returns non-finite)
-        # the user-facing aggregated inference must emit ALL-NaN
-        # (t_stat/p_value/conf_int) rather than fall back to
-        # ``safe_inference(df=None)`` which silently uses normal-theory.
+        # 8. Bell-McCaffrey contrast DOF threading for hc2_bm. Computes (in
+        # one batched call) the per-coefficient ``df_Satt`` for every
+        # present ``(g, t)`` cell AND the post-period-average overall ATT
+        # contrast DOF. Per-cell DOFs are applied to ``gt_effects`` inference
+        # below (8a); the overall DOF is applied to the simple aggregation
+        # (8b). BM artifacts (X, cluster_ids, bread, coef_index_map) are
+        # also stashed on the Results object so that downstream
+        # ``aggregate("group" | "calendar" | "event")`` can compute contrast-
+        # specific DOFs lazily without recomputing the full-dummy fit.
         # Mirrors the SunAbraham PR #472 pattern at
         # ``sun_abraham.py:1008-1097`` and the StackedDiD PR #479 R3 fix.
+        # Per ``feedback_bm_contrast_dof_fail_closed``: when DOF is
+        # unavailable (helper raises or returns non-finite), affected
+        # user-facing inference is NaN rather than falling back to
+        # ``safe_inference(df=None)`` (silent normal-theory).
         overall_att_bm_dof: Optional[float] = None
+        per_cell_bm_dof: Dict[Tuple, float] = {}
+        bm_artifacts: Optional[
+            Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[Tuple, int]]
+        ] = None
         if (
             self.vcov_type == "hc2_bm"
             and use_full_dummy
@@ -975,44 +992,88 @@ def _fit_ols(
             from diff_diff.linalg import _compute_cr2_bm_contrast_dof
 
             n_coefs = X.shape[1]
-            # Build the overall-ATT post-period-average contrast in full-coef
-            # space. Post-period (g, t) cells have ``t >= g``; weights are
-            # ``gt_weights[k] / w_total_post`` where ``w_total_post`` is the
-            # sum of cell weights across post-period (g, t) keys present in
-            # ``gt_effects``. Non-post cells get zero weight; non-treatment
-            # columns (intercept, FE dummies) get zero weight.
+            bread_matrix = X.T @ X
+            # Per-cell one-hot contrasts (one column per present (g, t) cell).
+            per_cell_keys = list(gt_keys_ordered)
+            # Overall ATT post-period-average contrast (matches the default
+            # ``_compute_weighted_agg`` weights ``n_{g,t}``).
             post_keys = [(g, t) for (g, t) in gt_keys_ordered if t >= g]
             w_total_post = sum(gt_weights.get(k, 0) for k in post_keys)
+            overall_contrast = np.zeros(n_coefs)
             if w_total_post > 0:
-                contrast_vec = np.zeros(n_coefs)
-                for i, k in enumerate(gt_keys):
-                    if k in gt_effects and k in post_keys:
-                        contrast_vec[i + coef_offset] = gt_weights[k] / w_total_post
-                if np.any(contrast_vec != 0):
-                    bread_matrix = X.T @ X
-                    try:
-                        dof_vec = _compute_cr2_bm_contrast_dof(
-                            X,
-                            cluster_ids,
-                            bread_matrix,
-                            contrast_vec.reshape(-1, 1),
+                for k in post_keys:
+                    overall_contrast[gt_coef_index_map[k]] = (
+                        gt_weights[k] / w_total_post
+                    )
+            include_overall = w_total_post > 0 and bool(np.any(overall_contrast != 0))
+            cols: List[np.ndarray] = []
+            for k in per_cell_keys:
+                col = np.zeros(n_coefs)
+                col[gt_coef_index_map[k]] = 1.0
+                cols.append(col)
+            if include_overall:
+                cols.append(overall_contrast)
+            if cols:
+                contrasts_matrix = np.column_stack(cols)
+                try:
+                    dof_vec = _compute_cr2_bm_contrast_dof(
+                        X, cluster_ids, bread_matrix, contrasts_matrix
+                    )
+                    for i, k in enumerate(per_cell_keys):
+                        candidate = float(dof_vec[i])
+                        per_cell_bm_dof[k] = (
+                            candidate if np.isfinite(candidate) else float("nan")
                         )
-                        candidate = float(dof_vec[0])
-                        overall_att_bm_dof = candidate if np.isfinite(candidate) else float("nan")
-                    except (ValueError, np.linalg.LinAlgError) as exc:
-                        warnings.warn(
-                            f"WooldridgeDiD(vcov_type='hc2_bm') aggregated "
-                            f"inference could not compute Bell-McCaffrey "
-                            f"contrast DOF ({type(exc).__name__}: {exc}). "
-                            "Overall ATT inference (t_stat / p_value / "
-                            "conf_int) will be NaN to preserve the hc2_bm "
-                            "contract.",
-                            UserWarning,
-                            stacklevel=3,
+                    if include_overall:
+                        candidate = float(dof_vec[-1])
+                        overall_att_bm_dof = (
+                            candidate if np.isfinite(candidate) else float("nan")
                         )
+                except (ValueError, np.linalg.LinAlgError) as exc:
+                    warnings.warn(
+                        f"WooldridgeDiD(vcov_type='hc2_bm') analytical "
+                        f"inference could not compute Bell-McCaffrey "
+                        f"contrast DOF ({type(exc).__name__}: {exc}). "
+                        "Affected per-cell and overall inference (t_stat / "
+                        "p_value / conf_int) will be NaN to preserve the "
+                        "hc2_bm contract.",
+                        UserWarning,
+                        stacklevel=3,
+                    )
+                    for k in per_cell_keys:
+                        per_cell_bm_dof[k] = float("nan")
+                    if include_overall:
                         overall_att_bm_dof = float("nan")
+            # Stash artifacts for ``aggregate()`` regardless of whether the
+            # batched DOF call succeeded — the dataclass-side helper will
+            # retry contrast-specific DOFs lazily and fail-closed on its
+            # own errors.
+            bm_artifacts = (X, cluster_ids, bread_matrix, dict(gt_coef_index_map))
+
+        # 8a. Apply per-cell BM DOFs (or fail-closed NaN) to ``gt_effects``
+        # for hc2_bm; otherwise use the shared ``df_inf`` (survey df or None).
+        # Per ``feedback_bm_contrast_dof_fail_closed``: when per-cell DOF
+        # is NaN, the cell's inference fields are NaN.
+        for (g, t), eff in gt_effects.items():
+            if self.vcov_type == "hc2_bm" and use_full_dummy and resolved is None:
+                cell_dof = per_cell_bm_dof.get((g, t), float("nan"))
+                if np.isfinite(cell_dof):
+                    t_stat, p_value, conf_int = safe_inference(
+                        eff["att"], eff["se"], alpha=self.alpha, df=cell_dof
+                    )
+                else:
+                    t_stat = float("nan")
+                    p_value = float("nan")
+                    conf_int = (float("nan"), float("nan"))
+            else:
+                t_stat, p_value, conf_int = safe_inference(
+                    eff["att"], eff["se"], alpha=self.alpha, df=df_inf
+                )
+            eff["t_stat"] = t_stat
+            eff["p_value"] = p_value
+            eff["conf_int"] = conf_int
 
-        # 8a. Simple aggregation (always computed). Use BM contrast DOF for
+        # 8b. Simple aggregation (always computed). Use BM contrast DOF for
         # the overall ATT inference when ``vcov_type='hc2_bm'``; otherwise
         # fall back to the shared df (survey df or None). Fail-closed: when
         # BM DOF is NaN, the analytical sandwich inference fields are NaN
@@ -1080,6 +1141,8 @@ def _fit_ols(
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
             _df_survey=df_inf,
+            _bm_per_cell_dof=per_cell_bm_dof,
+            _bm_artifacts=bm_artifacts,
         )
 
         # 9. Optional multiplier bootstrap (overrides analytic SE for overall ATT).
diff --git a/diff_diff/wooldridge_results.py b/diff_diff/wooldridge_results.py
index fb3e5f98..bd7c9e45 100644
--- a/diff_diff/wooldridge_results.py
+++ b/diff_diff/wooldridge_results.py
@@ -2,6 +2,7 @@
 
 from __future__ import annotations
 
+import warnings
 from dataclasses import dataclass, field
 from typing import Any, Dict, List, Optional, Tuple
 
@@ -81,6 +82,14 @@ class WooldridgeDiDResults:
     """Ordered list of (g,t) keys corresponding to _gt_vcov columns."""
     _df_survey: Optional[int] = field(default=None, repr=False)
     """Survey degrees of freedom for t-distribution inference."""
+    _bm_per_cell_dof: Dict[Tuple[Any, Any], float] = field(default_factory=dict, repr=False)
+    """Per-cell Bell-McCaffrey Satterthwaite DOF (only populated for vcov_type='hc2_bm').
+    Used by group_time_effects[(g, t)] inference fields at fit time."""
+    _bm_artifacts: Optional[Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[Tuple[Any, Any], int]]] = field(
+        default=None, repr=False
+    )
+    """(X_full, cluster_ids, bread_matrix, gt_coef_index_map) for hc2_bm; enables
+    lazy BM contrast-DOF computation in aggregate()."""
 
     # ------------------------------------------------------------------ #
     # Public methods                                                      #
@@ -94,6 +103,15 @@ def aggregate(self, type: str) -> "WooldridgeDiDResults":  # noqa: A002
         type : "simple" | "group" | "calendar" | "event"
 
         Returns self for chaining.
+
+        Notes
+        -----
+        When ``vcov_type == "hc2_bm"``, aggregated inference (t_stat / p_value /
+        conf_int) uses Bell-McCaffrey Satterthwaite contrast-specific DOFs
+        rather than the survey/None default. The BM DOFs are computed lazily
+        from ``_bm_artifacts`` via ``_compute_cr2_bm_contrast_dof`` and fail
+        closed (NaN inference) when the helper raises or returns a non-finite
+        value, per ``feedback_bm_contrast_dof_fail_closed``.
         """
         valid = ("simple", "group", "calendar", "event")
         if type not in valid:
@@ -110,10 +128,91 @@ def _agg_se(w_vec: np.ndarray) -> float:
                 return float("nan")
             return float(np.sqrt(max(w_vec @ vcov @ w_vec, 0.0)))
 
-        def _build_effect(att: float, se: float) -> Dict[str, Any]:
-            t_stat, p_value, conf_int = safe_inference(
-                att, se, alpha=self.alpha, df=self._df_survey
-            )
+        # Compute BM contrast DOFs lazily for hc2_bm. ``cells_by_key`` is an
+        # ordered mapping of aggregation_key -> list of (g, t) cells; the
+        # contrast for each key sums the per-cell one-hot vectors weighted
+        # by ``weights[(g, t)] / w_total``. Returns a dict mapping
+        # aggregation_key -> df (or NaN on fail-closed). For non-hc2_bm,
+        # returns an empty dict (caller falls back to ``self._df_survey``).
+        def _bm_contrast_dofs_for(
+            cells_by_key: Dict[Any, List[Tuple[Any, Any]]],
+        ) -> Dict[Any, float]:
+            if self.vcov_type != "hc2_bm" or self._bm_artifacts is None:
+                return {}
+            X_full, cluster_ids_full, bread_matrix, coef_idx_map = self._bm_artifacts
+            n_total = X_full.shape[1]
+            contrast_cols: List[np.ndarray] = []
+            agg_keys: List[Any] = []
+            for agg_key, cells in cells_by_key.items():
+                if not cells:
+                    continue
+                w_total = sum(weights.get(c, 0) for c in cells)
+                if w_total == 0:
+                    continue
+                col = np.zeros(n_total)
+                contributed = False
+                for c in cells:
+                    if c not in coef_idx_map:
+                        continue
+                    col[coef_idx_map[c]] = weights.get(c, 0) / w_total
+                    contributed = True
+                if not contributed:
+                    continue
+                contrast_cols.append(col)
+                agg_keys.append(agg_key)
+            if not contrast_cols:
+                return {k: float("nan") for k in cells_by_key}
+            from diff_diff.linalg import _compute_cr2_bm_contrast_dof
+
+            contrasts_matrix = np.column_stack(contrast_cols)
+            dof_map: Dict[Any, float] = {}
+            try:
+                dof_vec = _compute_cr2_bm_contrast_dof(
+                    X_full, cluster_ids_full, bread_matrix, contrasts_matrix
+                )
+                for i, k in enumerate(agg_keys):
+                    candidate = float(dof_vec[i])
+                    dof_map[k] = candidate if np.isfinite(candidate) else float("nan")
+            except (ValueError, np.linalg.LinAlgError) as exc:
+                warnings.warn(
+                    f"WooldridgeDiDResults.aggregate({type!r}) could not "
+                    f"compute Bell-McCaffrey contrast DOF "
+                    f"({exc.__class__.__name__}: {exc}). "
+                    "Affected aggregated inference (t_stat / p_value / "
+                    "conf_int) will be NaN to preserve the hc2_bm contract.",
+                    UserWarning,
+                    stacklevel=3,
+                )
+                for k in agg_keys:
+                    dof_map[k] = float("nan")
+            # Fill non-computed keys with NaN to fail-closed.
+            for k in cells_by_key:
+                dof_map.setdefault(k, float("nan"))
+            return dof_map
+
+        def _build_effect(att: float, se: float, df_for_inference: Optional[float]) -> Dict[str, Any]:
+            """Build an effect dict using ``df_for_inference`` for the t-distribution.
+
+            For ``vcov_type == "hc2_bm"``, ``df_for_inference`` is the BM contrast
+            DOF (None or non-finite → NaN inference). Otherwise it is ignored and
+            ``self._df_survey`` is used (None → normal-theory).
+            """
+            if self.vcov_type == "hc2_bm":
+                if df_for_inference is None or not np.isfinite(df_for_inference):
+                    return {
+                        "att": att,
+                        "se": se,
+                        "t_stat": float("nan"),
+                        "p_value": float("nan"),
+                        "conf_int": (float("nan"), float("nan")),
+                    }
+                t_stat, p_value, conf_int = safe_inference(
+                    att, se, alpha=self.alpha, df=df_for_inference
+                )
+            else:
+                t_stat, p_value, conf_int = safe_inference(
+                    att, se, alpha=self.alpha, df=self._df_survey
+                )
             return {
                 "att": att,
                 "se": se,
@@ -128,27 +227,36 @@ def _build_effect(att: float, se: float) -> Dict[str, Any]:
             pass
 
         elif type == "group":
-            result: Dict[Any, Dict] = {}
+            cells_by_g: Dict[Any, List[Tuple[Any, Any]]] = {}
             for g in self.groups:
-                cells = [(g2, t) for (g2, t) in keys_ordered if g2 == g and t >= g]
+                cells_by_g[g] = [
+                    (g2, t) for (g2, t) in keys_ordered if g2 == g and t >= g
+                ]
+            dofs = _bm_contrast_dofs_for(cells_by_g)
+            result: Dict[Any, Dict] = {}
+            for g, cells in cells_by_g.items():
                 if not cells:
                     continue
                 w_total = sum(weights.get(c, 0) for c in cells)
                 if w_total == 0:
                     continue
                 att = sum(weights.get(c, 0) * gt[c]["att"] for c in cells) / w_total
-                # delta-method weights vector over all keys_ordered
                 w_vec = np.array(
                     [weights.get(c, 0) / w_total if c in cells else 0.0 for c in keys_ordered]
                 )
                 se = _agg_se(w_vec)
-                result[g] = _build_effect(att, se)
+                result[g] = _build_effect(att, se, dofs.get(g))
             self.group_effects = result
 
         elif type == "calendar":
-            result = {}
+            cells_by_t: Dict[Any, List[Tuple[Any, Any]]] = {}
             for t in self.time_periods:
-                cells = [(g, t2) for (g, t2) in keys_ordered if t2 == t and t >= g]
+                cells_by_t[t] = [
+                    (g, t2) for (g, t2) in keys_ordered if t2 == t and t >= g
+                ]
+            dofs = _bm_contrast_dofs_for(cells_by_t)
+            result = {}
+            for t, cells in cells_by_t.items():
                 if not cells:
                     continue
                 w_total = sum(weights.get(c, 0) for c in cells)
@@ -159,14 +267,17 @@ def _build_effect(att: float, se: float) -> Dict[str, Any]:
                     [weights.get(c, 0) / w_total if c in cells else 0.0 for c in keys_ordered]
                 )
                 se = _agg_se(w_vec)
-                result[t] = _build_effect(att, se)
+                result[t] = _build_effect(att, se, dofs.get(t))
             self.calendar_effects = result
 
         elif type == "event":
             all_k = sorted({t - g for (g, t) in keys_ordered})
-            result = {}
+            cells_by_k: Dict[int, List[Tuple[Any, Any]]] = {}
             for k in all_k:
-                cells = [(g, t) for (g, t) in keys_ordered if t - g == k]
+                cells_by_k[k] = [(g, t) for (g, t) in keys_ordered if t - g == k]
+            dofs = _bm_contrast_dofs_for(cells_by_k)
+            result = {}
+            for k, cells in cells_by_k.items():
                 if not cells:
                     continue
                 w_total = sum(weights.get(c, 0) for c in cells)
@@ -177,7 +288,7 @@ def _build_effect(att: float, se: float) -> Dict[str, Any]:
                     [weights.get(c, 0) / w_total if c in cells else 0.0 for c in keys_ordered]
                 )
                 se = _agg_se(w_vec)
-                result[k] = _build_effect(att, se)
+                result[k] = _build_effect(att, se, dofs.get(k))
             self.event_study_effects = result
 
         return self
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 2bc727d8..aa99435e 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1488,7 +1488,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 
 *Variance families (`vcov_type`, OLS path only):*
 - `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11% ; on typical larger panels (n >> k_total) the gap shrinks to <2%. The user can recover the `lm + CR1S` SE by passing `vcov_type="hc2_bm"` (full-dummy auto-route) or by manually constructing a full-dummy `solve_ols` call. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
-- `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2") + coef_test()$df_Satt` at atol=1e-10 (pinned in `tests/test_methodology_wooldridge.py`). The overall ATT's BM contrast DOF uses `_compute_cr2_bm_contrast_dof` on the post-period-aggregation contrast (matches `Wald_test(test="HTZ")$df_denom`).
+- `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Per-coefficient SE matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2")` at atol=1e-10. Per-cell `(g, t)` inference fields use `coef_test()$df_Satt` Bell-McCaffrey DOF (pinned at atol=1e-6 from CI half-width inversion). Aggregated inference (overall ATT + `.aggregate("group" | "calendar" | "event")`) uses contrast-specific BM DOFs from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(constraints=matrix(w, 1), vcov=vcov_CR2, test="HTZ")$df_denom`); the overall ATT contrast DOF is computed at fit time, the other three aggregations lazily on each `.aggregate(...)` call from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed across all surfaces: when BM DOF is unavailable (helper raises or returns non-finite), the affected inference fields are NaN — not normal-theory fallback (per `feedback_bm_contrast_dof_fail_closed`).
 - `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. Matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively.
 - `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
 - `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` — REJECTED at `__init__`. GLM QMLE sandwich with HC2 leverage on canonical-link pseudo-residuals (`w = p(1-p)` for logit, `w = μ_i` for Poisson) needs CR2-BM-on-GLM derivation + R parity against `clubSandwich::vcovCR(glm(...))`. Tracked in TODO.md (WooldridgeDiD logit/poisson follow-up row).
diff --git a/tests/test_methodology_wooldridge.py b/tests/test_methodology_wooldridge.py
index b7007331..34ad7263 100644
--- a/tests/test_methodology_wooldridge.py
+++ b/tests/test_methodology_wooldridge.py
@@ -104,6 +104,31 @@ def test_hc2_bm_per_coef_se_matches_clubsandwich_cr2(
                 r_ses[i], abs=1e-10
             ), f"(g={g}, t={t}): Py SE={py_se:.10f} R SE={r_ses[i]:.10f}"
 
+    def test_hc2_bm_per_coef_df_satt_matches_coef_test(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Per-treatment-cell Bell-McCaffrey Satterthwaite DOF matches R
+        ``clubSandwich::coef_test()$df_Satt`` at atol=1e-6.
+
+        Recovered from the Python CI half-width via t-distribution inversion
+        (the dataclass doesn't expose per-cell DOF directly). The underlying
+        BM DOF computation matches R at machine precision (~6e-16 on per-coef
+        SE); brentq inversion adds the only material tolerance.
+        """
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        r_keys = [(d["g"], d["t"]) for d in golden["point_estimates"]["gt_keys"]]
+        r_dfs = golden["hc2_bm"]["per_coef_df_satt"]
+        for i, (g, t) in enumerate(r_keys):
+            eff = res.group_time_effects[(g, t)]
+            py_df = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert py_df == pytest.approx(r_dfs[i], abs=1e-6), (
+                f"(g={g}, t={t}): Py df={py_df:.4f} R df={r_dfs[i]:.4f}"
+            )
+
     def test_hc2_bm_overall_att_se_matches_clubsandwich_cr2(
         self, golden: dict, panel: pd.DataFrame
     ) -> None:
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 3f609094..2d9c0e8d 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1871,10 +1871,10 @@ def test_fit_clone_idempotent_on_vcov_type(self):
         assert res1.overall_att == pytest.approx(res2.overall_att, abs=1e-14)
 
     def test_bm_dof_nan_fails_closed(self, monkeypatch):
-        """When ``_compute_cr2_bm_contrast_dof`` returns NaN, the overall ATT
-        inference fields (t_stat / p_value / conf_int) MUST be NaN — do NOT
-        fall back to ``safe_inference(df=None)`` which silently uses
-        normal-theory. Per ``feedback_bm_contrast_dof_fail_closed``.
+        """When ``_compute_cr2_bm_contrast_dof`` returns NaN, BOTH per-cell
+        AND overall ATT inference fields (t_stat / p_value / conf_int) MUST
+        be NaN — do NOT fall back to ``safe_inference(df=None)`` which
+        silently uses normal-theory. Per ``feedback_bm_contrast_dof_fail_closed``.
         """
         df = _make_vcov_panel()
         import diff_diff.linalg as linalg_mod
@@ -1886,10 +1886,96 @@ def _fake_dof(X, cluster_ids, bread, contrasts):
         res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
             df, outcome="y", unit="unit", time="time", cohort="cohort"
         )
-        # att and se preserved (sandwich is finite); inference fields NaN
+        # Overall: att + se preserved (sandwich is finite); inference NaN
         assert np.isfinite(res.overall_att)
         assert np.isfinite(res.overall_se)
         assert np.isnan(res.overall_t_stat)
         assert np.isnan(res.overall_p_value)
         assert np.isnan(res.overall_conf_int[0])
         assert np.isnan(res.overall_conf_int[1])
+        # Per-cell: same pattern (att + se preserved, inference NaN)
+        for (g, t), eff in res.group_time_effects.items():
+            assert np.isfinite(eff["att"]), f"cell ({g},{t}) att should be finite"
+            assert np.isfinite(eff["se"]), f"cell ({g},{t}) se should be finite"
+            assert np.isnan(eff["t_stat"]), f"cell ({g},{t}) t_stat should be NaN"
+            assert np.isnan(eff["p_value"]), f"cell ({g},{t}) p_value should be NaN"
+            assert np.isnan(eff["conf_int"][0]), f"cell ({g},{t}) conf_int[0] should be NaN"
+            assert np.isnan(eff["conf_int"][1]), f"cell ({g},{t}) conf_int[1] should be NaN"
+
+    def test_aggregate_group_under_hc2_bm_uses_bm_contrast_dof(self):
+        """aggregate('group') under hc2_bm produces finite p-values using
+        Bell-McCaffrey contrast DOFs; reverts to NaN under monkeypatch-
+        induced fail-closed."""
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("group")
+        assert res.group_effects is not None
+        for g, eff in res.group_effects.items():
+            assert np.isfinite(eff["att"])
+            assert np.isfinite(eff["se"])
+            assert np.isfinite(eff["t_stat"]), f"group {g} t_stat NaN — BM DOF threading regressed"
+            assert np.isfinite(eff["p_value"])
+            assert np.isfinite(eff["conf_int"][0])
+            assert np.isfinite(eff["conf_int"][1])
+
+    def test_aggregate_event_under_hc2_bm_uses_bm_contrast_dof(self):
+        """aggregate('event') under hc2_bm produces finite p-values using
+        Bell-McCaffrey contrast DOFs."""
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("event")
+        assert res.event_study_effects is not None
+        for k, eff in res.event_study_effects.items():
+            assert np.isfinite(eff["att"])
+            assert np.isfinite(eff["se"])
+            assert np.isfinite(eff["t_stat"]), f"event k={k} t_stat NaN — BM DOF threading regressed"
+            assert np.isfinite(eff["p_value"])
+            assert np.isfinite(eff["conf_int"][0])
+            assert np.isfinite(eff["conf_int"][1])
+
+    def test_aggregate_calendar_under_hc2_bm_uses_bm_contrast_dof(self):
+        """aggregate('calendar') under hc2_bm produces finite p-values using
+        Bell-McCaffrey contrast DOFs."""
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("calendar")
+        assert res.calendar_effects is not None
+        for t, eff in res.calendar_effects.items():
+            assert np.isfinite(eff["att"])
+            assert np.isfinite(eff["se"])
+            assert np.isfinite(eff["t_stat"]), f"calendar t={t} t_stat NaN — BM DOF threading regressed"
+            assert np.isfinite(eff["p_value"])
+            assert np.isfinite(eff["conf_int"][0])
+            assert np.isfinite(eff["conf_int"][1])
+
+    def test_aggregate_under_hc2_bm_fail_closed_on_dof_helper_error(self, monkeypatch):
+        """When _compute_cr2_bm_contrast_dof raises in aggregate(), the
+        affected aggregate inference fields are NaN (fail-closed),
+        att + se preserved."""
+        df = _make_vcov_panel()
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        # Patch the helper AFTER fit so that aggregate() retry fails.
+        import diff_diff.linalg as linalg_mod
+
+        def _raise(X, cluster_ids, bread, contrasts):
+            raise ValueError("induced failure for fail-closed test")
+
+        monkeypatch.setattr(linalg_mod, "_compute_cr2_bm_contrast_dof", _raise)
+        with pytest.warns(UserWarning, match=r"could not compute Bell-McCaffrey"):
+            res.aggregate("group")
+        assert res.group_effects is not None
+        for g, eff in res.group_effects.items():
+            assert np.isfinite(eff["att"])
+            assert np.isfinite(eff["se"])
+            assert np.isnan(eff["t_stat"])
+            assert np.isnan(eff["p_value"])
+            assert np.isnan(eff["conf_int"][0])
+            assert np.isnan(eff["conf_int"][1])

From 21910137fb14a204c1cc6282b7b306eabcd43f11 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:31:29 -0400
Subject: [PATCH 03/11] =?UTF-8?q?wooldridge:=20address=20codex=20R2=20P2/P?=
 =?UTF-8?q?3=20=E2=80=94=20bootstrap=20contract,=20metadata,=20R-parity?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address codex R2 findings:

P2 (Code Quality): Bootstrap + one-way analytical vcov_type was rejected
only under cluster=None. The guard now rejects the combination whether
cluster is None or set: under cluster=X the linalg validator would
reject one-way + cluster_ids with a less-informative downstream message,
so the estimator preempts it at its own boundary.

P3 (Maintainability): _fit_logit and _fit_poisson now thread vcov_type,
cluster_name, n_clusters into WooldridgeDiDResults (locked to vcov_type=
"hc1" by the __init__ guard on non-OLS + non-hc1, but the metadata
contract is now consistent across all three method paths).

P2 (Doc): REGISTRY hc1 note incorrectly claimed users could recover the
lm + CR1S SE via vcov_type="hc2_bm". hc2_bm is the CR2 Bell-McCaffrey
sandwich, not CR1S. Rewrote to state explicitly that no public
WooldridgeDiD path exposes lm + CR1S parity.

P2 (Tests): aggregate("group"/"calendar"/"event") under hc2_bm previously
only asserted finite p-values, which would have passed under the prior
normal-theory fallback bug. Added 3 R-parity tests that invert the CI
half-width to recover Python's BM contrast DOF and assert atol=1e-6
parity with R clubSandwich::Wald_test(test="HTZ")$df_denom. Extended
benchmarks/R/generate_wooldridge_golden.R to compute the per-key BM DOF
for each aggregation type.
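
A minimal sketch of the CI half-width inversion those tests rely on,
assuming a symmetric t-based interval (ci_upper = att + t_crit * se).
The names t_cdf and recover_dof_from_ci are hypothetical stand-ins, not
the test module's _recover_dof_from_ci; this version uses Simpson
integration and bisection from the standard library rather than a SciPy
root finder.

```python
import math


def t_cdf(c: float, df: float, n: int = 2000) -> float:
    """Student-t CDF at c >= 0 via composite Simpson integration of the pdf."""
    log_norm = (
        math.lgamma((df + 1.0) / 2.0)
        - math.lgamma(df / 2.0)
        - 0.5 * math.log(df * math.pi)
    )

    def pdf(x: float) -> float:
        return math.exp(log_norm - 0.5 * (df + 1.0) * math.log1p(x * x / df))

    h = c / n  # n must be even for Simpson's rule
    total = pdf(0.0) + pdf(c)
    for i in range(1, n):
        total += (4.0 if i % 2 else 2.0) * pdf(i * h)
    return 0.5 + total * h / 3.0


def recover_dof_from_ci(att, se, ci_upper, alpha, lo=0.1, hi=1e6):
    """Find df such that the t critical value equals (ci_upper - att) / se.

    Hypothetical helper: returns NaN for a degenerate interval and inf when
    the implied critical value is at or below the normal-theory one.
    """
    if not se > 0:
        return math.nan
    crit = (ci_upper - att) / se
    if not crit > 0:
        return math.nan
    target = 1.0 - alpha / 2.0
    # For fixed crit > 0 the t CDF increases with df, so bisection applies.
    if t_cdf(crit, hi) < target:
        return math.inf
    for _ in range(100):
        mid = 0.5 * (lo + hi)
        if t_cdf(crit, mid) < target:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)
```

With att=0, se=1, and ci_upper=2.228139 at alpha=0.05, the recovered
value is close to 10, the DOF whose 97.5% t quantile is that half-width.
The inversion, not the DOF computation itself, sets the tolerance of
such a check.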

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/R/generate_wooldridge_golden.R | 72 ++++++++++++++++++++++-
 benchmarks/data/wooldridge_golden.json    | 21 ++++++-
 diff_diff/wooldridge.py                   | 46 ++++++++++-----
 docs/methodology/REGISTRY.md              |  2 +-
 tests/test_methodology_wooldridge.py      | 67 +++++++++++++++++++++
 tests/test_wooldridge.py                  | 20 +++++--
 6 files changed, 206 insertions(+), 22 deletions(-)

diff --git a/benchmarks/R/generate_wooldridge_golden.R b/benchmarks/R/generate_wooldridge_golden.R
index 5115a50d..9b70d3c6 100644
--- a/benchmarks/R/generate_wooldridge_golden.R
+++ b/benchmarks/R/generate_wooldridge_golden.R
@@ -154,6 +154,72 @@ overall_se_hc2 <- sqrt(
   t(overall_contrast) %*% vcov_hc2 %*% overall_contrast
 )[1, 1]
 
+# 5. Aggregate hc2_bm BM contrast DOFs for group / calendar / event
+# aggregations. These mirror WooldridgeDiDResults.aggregate(...) at fit time:
+# each aggregation key gets a 1-row constraint matrix in full-coef space whose
+# entries are the per-cell `n_{g,t} / w_total` weights at the (g, t) coefficient
+# columns. Compute the BM Satterthwaite DOF via Wald_test(test="HTZ"). diff-diff
+# uses lazy contrast-DOF computation in aggregate() with the same algebra;
+# pinning here proves R-parity across all three non-simple aggregation surfaces.
+build_contrast_for_cells <- function(cells, weights_by_pair) {
+  col <- numeric(n_total_coef)
+  if (length(cells) == 0L) return(NULL)
+  w_total <- sum(vapply(cells, function(p) weights_by_pair[[paste(p, collapse = "_")]], numeric(1)))
+  if (w_total == 0) return(NULL)
+  for (p in cells) {
+    key <- paste(p, collapse = "_")
+    cell_w <- weights_by_pair[[key]]
+    # find the lm coef index for D_{g}_{t}
+    nm <- sprintf("D_%d_%d", p[1], p[2])
+    pos <- match(nm, names(coef(fit)))
+    if (!is.na(pos)) {
+      col[pos] <- cell_w / w_total
+    }
+  }
+  col
+}
+weights_by_pair <- setNames(as.list(n_gt), vapply(gt_pairs, function(p) paste(p, collapse = "_"), character(1)))
+
+compute_bm_dof_for_contrast <- function(col) {
+  if (is.null(col)) return(NA_real_)
+  cm <- matrix(col, nrow = 1)
+  wt <- tryCatch(
+    Wald_test(fit, constraints = cm, vcov = vcov_cr2, test = "HTZ"),
+    error = function(e) NULL
+  )
+  if (is.null(wt)) NA_real_ else wt$df_denom
+}
+
+# group: one contrast per treated cohort g, cells = (g, t) for t >= g
+agg_group_dofs <- list()
+agg_group_keys <- treated_cohorts
+for (g in treated_cohorts) {
+  cells <- lapply(gt_pairs, function(p) if (p[1] == g && p[2] >= g) p else NULL)
+  cells <- Filter(Negate(is.null), cells)
+  col <- build_contrast_for_cells(cells, weights_by_pair)
+  agg_group_dofs[[as.character(g)]] <- compute_bm_dof_for_contrast(col)
+}
+
+# calendar: one contrast per time period t, cells = (g, t) for g > 0 and t >= g
+agg_calendar_dofs <- list()
+agg_calendar_keys <- times
+for (t in times) {
+  cells <- lapply(gt_pairs, function(p) if (p[2] == t && p[1] <= t) p else NULL)
+  cells <- Filter(Negate(is.null), cells)
+  col <- build_contrast_for_cells(cells, weights_by_pair)
+  agg_calendar_dofs[[as.character(t)]] <- compute_bm_dof_for_contrast(col)
+}
+
+# event: one contrast per relative period k = t - g
+all_k <- sort(unique(vapply(gt_pairs, function(p) p[2] - p[1], numeric(1))))
+agg_event_dofs <- list()
+for (k in all_k) {
+  cells <- lapply(gt_pairs, function(p) if ((p[2] - p[1]) == k) p else NULL)
+  cells <- Filter(Negate(is.null), cells)
+  col <- build_contrast_for_cells(cells, weights_by_pair)
+  agg_event_dofs[[as.character(k)]] <- compute_bm_dof_for_contrast(col)
+}
+
 # Coefficient point estimates (for cross-check; identical across all 4 variants
 # since they share the lm fit).
 beta_int <- coef(fit)[int_idx]
@@ -184,7 +250,11 @@ golden <- list(
     per_coef_se = unname(se_hc2_bm),
     per_coef_df_satt = unname(df_satt_hc2_bm),
     overall_att_se = overall_se_hc2_bm,
-    overall_att_contrast_dof = overall_att_contrast_dof
+    overall_att_contrast_dof = overall_att_contrast_dof,
+    aggregate_group_dof = agg_group_dofs,
+    aggregate_calendar_dof = agg_calendar_dofs,
+    aggregate_event_dof = agg_event_dofs,
+    aggregate_event_keys = all_k
   ),
   classical = list(
     per_coef_se = unname(se_classical),
diff --git a/benchmarks/data/wooldridge_golden.json b/benchmarks/data/wooldridge_golden.json
index 46f5b328..1b6b491f 100644
--- a/benchmarks/data/wooldridge_golden.json
+++ b/benchmarks/data/wooldridge_golden.json
@@ -74,7 +74,26 @@
     "per_coef_se": [0.055501994847834073, 0.053348051350068385, 0.06354112948601906, 0.058782844441069397, 0.058847181337849344, 0.075290767121827779],
     "per_coef_df_satt": [18.095161160028869, 18.095161160028777, 20.439187947573622, 20.439187947573593, 15.498545101842772, 15.49854510184279],
     "overall_att_se": 0.031917611670264516,
-    "overall_att_contrast_dof": 28.533525200424727
+    "overall_att_contrast_dof": 28.533525200424727,
+    "aggregate_group_dof": {
+      "3": 18.95430345302632,
+      "5": 15.498545101842698
+    },
+    "aggregate_calendar_dof": {
+      "1": "NA",
+      "2": "NA",
+      "3": 18.095161160029459,
+      "4": 18.095161160029345,
+      "5": 36.778155481749216,
+      "6": 36.778155481749209
+    },
+    "aggregate_event_dof": {
+      "0": 28.756457678280594,
+      "1": 28.75645767828054,
+      "2": 20.439187947573629,
+      "3": 20.439187947573586
+    },
+    "aggregate_event_keys": [0, 1, 2, 3]
   },
   "classical": {
     "per_coef_se": [0.067005350609178102, 0.067005350609178199, 0.070375557382814979, 0.070375557382815007, 0.069334111291963985, 0.069334111291964068],
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index 314a2a01..de49269c 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -536,23 +536,26 @@ def fit(
                 "design-based variance independently."
             )
 
-        # 0e. Reject bootstrap + explicit one-way vcov_type without user-set
-        # cluster. The multiplier bootstrap is fundamentally clustered (it
-        # draws per-cluster weights); under explicit ``vcov_type in {"hc2",
-        # "classical"}`` with ``self.cluster=None``, the OLS path drops the
-        # unit auto-cluster for the analytical sandwich (mirrors SA), which
-        # would leave the bootstrap with no cluster ID to draw weights at.
-        # The user must either provide an explicit ``cluster=X`` or use a
-        # cluster-compatible ``vcov_type`` ("hc1" or "hc2_bm").
-        if self.n_bootstrap > 0 and self.vcov_type in ("hc2", "classical") and self.cluster is None:
+        # 0e. Reject bootstrap + one-way analytical vcov_type. The multiplier
+        # bootstrap is intrinsically clustered (draws per-cluster weights);
+        # one-way ``vcov_type in {"hc2","classical"}`` either drops the unit
+        # auto-cluster (``cluster=None`` → bootstrap has no cluster to draw
+        # at) OR is rejected by the linalg validator (``cluster=X`` + one-way
+        # + cluster_ids). Both fail paths produce a less-informative downstream
+        # error, so reject at the estimator boundary across both. The user
+        # must drop bootstrap (``n_bootstrap=0``) or pick a cluster-compatible
+        # ``vcov_type`` (``hc1`` or ``hc2_bm``).
+        if self.n_bootstrap > 0 and self.vcov_type in ("hc2", "classical"):
             raise ValueError(
                 f"WooldridgeDiD(vcov_type={self.vcov_type!r}, "
-                f"n_bootstrap={self.n_bootstrap}, cluster=None) is not "
-                "supported: the multiplier bootstrap is intrinsically "
-                "clustered, but the one-way vcov_type drops the unit "
-                "auto-cluster. Either set cluster='unit' (or another column) "
-                "or use vcov_type='hc1' / 'hc2_bm' for the analytical "
-                "sandwich."
+                f"n_bootstrap={self.n_bootstrap}) is not supported: the "
+                "multiplier bootstrap is intrinsically clustered, but the "
+                "one-way vcov_type does not compose with cluster_ids — "
+                "either the unit auto-cluster is dropped (when cluster=None) "
+                "leaving the bootstrap with no cluster to draw weights at, "
+                "or the linalg validator rejects one-way + cluster_ids at "
+                "fit (when cluster=X). Use vcov_type='hc1' / 'hc2_bm' for "
+                "the analytical sandwich, or set n_bootstrap=0."
             )
 
         # 1. Filter to analysis sample
@@ -1442,6 +1445,13 @@ def _avg_ax0(a, cell_mask):
             alpha=self.alpha,
             anticipation=self.anticipation,
             survey_metadata=survey_metadata,
+            # vcov_type is locked to "hc1" on the nonlinear paths (the
+            # __init__ guard rejects non-hc1 + method != "ols"). Surface
+            # cluster_name / n_clusters for shared introspection contract
+            # with the OLS path.
+            vcov_type=self.vcov_type,
+            cluster_name=cluster_col,
+            n_clusters=int(np.unique(cluster_ids).size),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
@@ -1683,6 +1693,12 @@ def _avg_ax0(a, cell_mask):
             alpha=self.alpha,
             anticipation=self.anticipation,
             survey_metadata=survey_metadata,
+            # vcov_type locked to "hc1" on Poisson path (the __init__ guard
+            # rejects non-hc1 + method != "ols"). Surface cluster_name /
+            # n_clusters for shared introspection contract.
+            vcov_type=self.vcov_type,
+            cluster_name=cluster_col,
+            n_clusters=int(np.unique(cluster_ids).size),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index aa99435e..82d0fc13 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1487,7 +1487,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 - **Note:** QMLE sandwich uses `weight_type="aweight"` which applies `(G/(G-1)) * ((n-1)/(n-k))` small-sample adjustment. Stata `jwdid` uses `G/(G-1)` only. The `(n-1)/(n-k)` term is conservative (inflates SEs slightly). For typical ETWFE panels where n >> k, the difference is negligible.
 
 *Variance families (`vcov_type`, OLS path only):*
-- `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11% ; on typical larger panels (n >> k_total) the gap shrinks to <2%. The user can recover the `lm + CR1S` SE by passing `vcov_type="hc2_bm"` (full-dummy auto-route) or by manually constructing a full-dummy `solve_ols` call. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
+- `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11%; on typical larger panels (n >> k_total) the gap shrinks to <2%. No public WooldridgeDiD code path exposes the `lm + CR1S` (CR1 cluster-robust on the full-dummy design) finite-sample correction — `vcov_type="hc2_bm"` routes to the CR2 Bell-McCaffrey sandwich on the full-dummy design (different variance estimator entirely), not CR1S. Users who need exact `lm + clubSandwich::vcovCR(type="CR1S")` parity must call `solve_ols` directly on a full-dummy design or fit via R. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
 - `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Per-coefficient SE matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2")` at atol=1e-10. Per-cell `(g, t)` inference fields use `coef_test()$df_Satt` Bell-McCaffrey DOF (pinned at atol=1e-6 from CI half-width inversion). Aggregated inference (overall ATT + `.aggregate("group" | "calendar" | "event")`) uses contrast-specific BM DOFs from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(constraints=matrix(w, 1), vcov=vcov_CR2, test="HTZ")$df_denom`); the overall ATT contrast DOF is computed at fit time, the other three aggregations lazily on each `.aggregate(...)` call from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed across all surfaces: when BM DOF is unavailable (helper raises or returns non-finite), the affected inference fields are NaN — not normal-theory fallback (per `feedback_bm_contrast_dof_fail_closed`).
 - `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. Matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively.
 - `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
diff --git a/tests/test_methodology_wooldridge.py b/tests/test_methodology_wooldridge.py
index 34ad7263..854270a3 100644
--- a/tests/test_methodology_wooldridge.py
+++ b/tests/test_methodology_wooldridge.py
@@ -184,3 +184,70 @@ def test_hc2_se_matches_sandwich_vcovhc(self, golden: dict, panel: pd.DataFrame)
             py_se = res.group_time_effects[(g, t)]["se"]
             assert py_se == pytest.approx(r_ses[i], abs=1e-10)
         assert res.overall_se == pytest.approx(golden["hc2"]["overall_att_se"], abs=1e-10)
+
+    def test_aggregate_group_bm_dof_matches_wald_test_htz(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """``aggregate('group')`` BM contrast DOF per cohort matches R
+        ``clubSandwich::Wald_test(test="HTZ")$df_denom`` at atol=1e-6 (CI
+        inversion tolerance)."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("group")
+        r_dofs = golden["hc2_bm"]["aggregate_group_dof"]
+        assert res.group_effects is not None
+        for g, eff in res.group_effects.items():
+            r_key = str(g)
+            if r_key not in r_dofs or r_dofs[r_key] in (None, "NA"):
+                continue
+            py_dof = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert py_dof == pytest.approx(float(r_dofs[r_key]), abs=1e-6), (
+                f"group g={g}: Py df={py_dof:.4f} R df={r_dofs[r_key]}"
+            )
+
+    def test_aggregate_calendar_bm_dof_matches_wald_test_htz(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """``aggregate('calendar')`` BM contrast DOF per treated time period
+        matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-6."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("calendar")
+        r_dofs = golden["hc2_bm"]["aggregate_calendar_dof"]
+        assert res.calendar_effects is not None
+        for t, eff in res.calendar_effects.items():
+            r_key = str(t)
+            if r_key not in r_dofs or r_dofs[r_key] in (None, "NA"):
+                continue
+            py_dof = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert py_dof == pytest.approx(float(r_dofs[r_key]), abs=1e-6), (
+                f"calendar t={t}: Py df={py_dof:.4f} R df={r_dofs[r_key]}"
+            )
+
+    def test_aggregate_event_bm_dof_matches_wald_test_htz(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """``aggregate('event')`` BM contrast DOF per relative-period k
+        matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-6."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        res.aggregate("event")
+        r_dofs = golden["hc2_bm"]["aggregate_event_dof"]
+        assert res.event_study_effects is not None
+        for k, eff in res.event_study_effects.items():
+            r_key = str(k)
+            if r_key not in r_dofs or r_dofs[r_key] in (None, "NA"):
+                continue
+            py_dof = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert py_dof == pytest.approx(float(r_dofs[r_key]), abs=1e-6), (
+                f"event k={k}: Py df={py_dof:.4f} R df={r_dofs[r_key]}"
+            )
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 2d9c0e8d..7260beae 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1799,13 +1799,25 @@ def test_survey_design_plus_classical_rejected(self):
                 survey_design=design,
             )
 
-    def test_bootstrap_plus_one_way_no_user_cluster_rejected(self):
+    def test_bootstrap_plus_one_way_rejected_regardless_of_cluster(self):
+        """Bootstrap + one-way analytical vcov_type is rejected at the
+        estimator boundary regardless of ``self.cluster`` — under
+        ``cluster=None`` the auto-cluster is dropped (no cluster for the
+        bootstrap to draw at); under ``cluster=X`` the linalg validator
+        rejects one-way + cluster_ids. Both fail paths produce a less-
+        informative downstream error, so the estimator rejects up front."""
         df = _make_vcov_panel()
-        est = WooldridgeDiD(
-            method="ols", vcov_type="classical", n_bootstrap=10, seed=0
-        )
+        # Case 1: cluster=None (default) — bootstrap reject fires
+        est = WooldridgeDiD(method="ols", vcov_type="classical", n_bootstrap=10, seed=0)
         with pytest.raises(ValueError, match=r"multiplier bootstrap"):
             est.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+        # Case 2: cluster=X — also rejected at the estimator boundary (would
+        # otherwise hit the linalg validator with a less-informative message)
+        est_cl = WooldridgeDiD(
+            method="ols", vcov_type="hc2", n_bootstrap=10, cluster="unit", seed=0
+        )
+        with pytest.raises(ValueError, match=r"multiplier bootstrap"):
+            est_cl.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
 
     def test_get_params_includes_vcov_type(self):
         est = WooldridgeDiD(vcov_type="hc2_bm")

From 155fdcf994ccada2c8a51431badab435d924a1d1 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:42:25 -0400
Subject: [PATCH 04/11] =?UTF-8?q?wooldridge:=20address=20codex=20R3=20P1/P?=
 =?UTF-8?q?2=20=E2=80=94=20rank-deficient=20hc2=5Fbm=20+=20doc=20alignment?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P1 (Methodology, codex R3): hc2_bm BM contrast DOF computation operated
on the unreduced full-dummy design (X / X.T @ X). When solve_ols dropped
collinear columns under rank-deficient ETWFE specs (all-eventually-treated
panels with not_yet_treated controls, time-invariant exovar collinear
with unit FE, etc.), _compute_cr2_bm_contrast_dof raised LinAlgError on
the singular bread, and the fail-closed branch set every per-cell +
aggregate inference field to NaN, even for the still-identified cells
whose coefficients solve_ols correctly returned.

Fix: subset X / bread / contrasts to the kept-column space (where
solve_ols produced non-NaN coefs) before passing to
_compute_cr2_bm_contrast_dof. Reduced artifacts (X_red, cluster_ids,
bread_red, reduced_coef_idx_map) are stashed on the Results object so
the lazy aggregate() path also operates in the reduced subspace.
Mirrors the MPD pattern at estimators.py:1860-1913. Cells whose
treatment-cell coefficient was dropped retain NaN inference fields
(fail-closed correctly for them).
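
The kept-column reduction can be sketched in isolation as below. This
is a minimal illustration, not the code in wooldridge.py: the toy
design, the NaN-for-dropped-column convention, and the cell_to_full map
are assumptions made for the example.

```python
import numpy as np

# Toy design: the third column duplicates the second, so X.T @ X is singular.
X = np.array(
    [
        [1.0, 0.0, 0.0],
        [1.0, 1.0, 1.0],
        [1.0, 2.0, 2.0],
        [1.0, 3.0, 3.0],
    ]
)
# Assumed solver convention: a dropped column is reported as a NaN coefficient.
coefs = np.array([0.5, 1.2, np.nan])

# Keep only the columns whose coefficient survived.
kept = np.where(~np.isnan(coefs))[0]
X_red = X[:, kept]
bread_red = X_red.T @ X_red

# Map full-design column index -> position in the reduced design.
full_to_reduced = {int(full): red for red, full in enumerate(kept)}

# Hypothetical (cohort, period) -> full-design column map; the cell that
# pointed at the dropped column is simply absent from the reduced map.
cell_to_full = {("g3", 3): 1, ("g3", 4): 2}
cell_to_reduced = {
    cell: full_to_reduced[idx]
    for cell, idx in cell_to_full.items()
    if idx in full_to_reduced
}

full_rank = np.linalg.matrix_rank(X.T @ X)   # rank of the 3x3 full bread
red_rank = np.linalg.matrix_rank(bread_red)  # rank of the 2x2 reduced bread
```

The reduced bread is full rank in its own dimension, so contrasts built
in the reduced index space can be evaluated without hitting the
singular full-design matrix.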

P2 (Docs): the CHANGELOG + REGISTRY bootstrap-reject prose described
the reject as applying to cluster=None only, which is narrower than the
actual implementation; updated to reflect the broader reject regardless
of cluster=.

Two new regression tests in tests/test_wooldridge.py exercise the
previously-broken rank-deficient hc2_bm paths:
- All-eventually-treated panel with not_yet_treated control (late
  cohort cells dropped by solve_ols)
- Unit-invariant exovar covariate (collinear with unit FE)

Both assert per-cell + aggregate inference is finite on identified cells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                 |  2 +-
 diff_diff/wooldridge.py      | 74 ++++++++++++++++++++++---------
 docs/methodology/REGISTRY.md |  2 +-
 tests/test_wooldridge.py     | 84 ++++++++++++++++++++++++++++++++++++
 4 files changed, 139 insertions(+), 23 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index cfed85c2..d9eee16b 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). 
`n_bootstrap > 0` + explicit one-way (`hc2`/`classical`) + `cluster=None` raises at `fit()` (multiplier bootstrap is intrinsically clustered). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
+- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). 
`n_bootstrap > 0` + one-way (`hc2`/`classical`) raises at `fit()` regardless of `cluster=` setting (multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with cluster_ids — either the auto-cluster is dropped when `cluster=None` leaving the bootstrap with no cluster to draw at, or the linalg validator rejects one-way + cluster_ids when `cluster=X`). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
 - **ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473) and ContinuousDiD precedent (PR #476). REGISTRY `## ChaisemartinDHaultfoeuille` gains a formal `### Deviations from the paper / from R / library extensions` block consolidating 7 documented deviations into a single AI-review-recognized labeled surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)"): (D1) equal-cell weighting (deviation from BOTH AER 2020 Equation 3 AND R `DIDmultiplegtDYN`); (D2) period-based vs cohort-based stable controls; (D3) balanced-baseline panel + interior-gap drops + terminal-missingness retention + cell-period-allocator targeted `ValueError`; (D4) SE normalization `N_l` vs R `G` (~4% smaller analytical SE); (D5) singleton-cohort degeneracy → NaN with `UserWarning`; (D6) `<50%` switcher warning at far horizons (library extension citing Favara-Imbs application, footnote 14 of NBER WP 29873); (D7) Phase 3 `DID^X` covariate first-stage equal-cell weights. R cross-language coverage holds at documented tolerance bands in `tests/test_chaisemartin_dhaultfoeuille_parity.py` (`POINT_RTOL = 1e-4` on pure-direction point estimates, `MIXED_POINT_RTOL = 0.025` on mixed-direction, `PURE_DIRECTION_SE_RTOL = 0.05` on pure-direction SE, `SE_RTOL = 0.10` on multi-horizon SE, `se_rtol=0.15` on the long-panel `L_max=5` joiners-only scenario where cell-count-weighting compounds). 
No source code changes, no new tests, no new docstrings — consolidation only against the existing 12 methodology tests (`tests/test_methodology_chaisemartin_dhaultfoeuille.py`), 26 R-parity tests (`tests/test_chaisemartin_dhaultfoeuille_parity.py`), 352 unit tests (`tests/test_chaisemartin_dhaultfoeuille.py`), survey suites (`tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, three cell-period coverage suites), and two primary-source DCDH paper reviews on disk (2020 AER + 2022/2023 NBER WP 29873 via PR #478; the `dechaisemartin-2026-review.md` on disk is HAD's primary source, not DCDH's, and is referenced as adjacent context only). The REGISTRY Deviations block uses semantic section-name anchors (rather than fragile line numbers) for back-references to other parts of the DCDH section — an intentional divergence from the PR #476 ContinuousDiD precedent reflecting PR-A wording-drift CI feedback that flagged line-number cross-references as drift-prone in long sections. `METHODOLOGY_REVIEW.md` DCDH row promoted **In Progress** → **Complete**; L27 In Progress example paragraph re-pointed to WooldridgeDiD; L1289 priority-order queue item #6 (DCDH) removed and items #7-#11 renumbered to #6-#10.
 
 ## [3.4.1] - 2026-05-21
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index de49269c..93cce3d6 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -994,25 +994,54 @@ def _fit_ols(
         ):
             from diff_diff.linalg import _compute_cr2_bm_contrast_dof
 
-            n_coefs = X.shape[1]
-            bread_matrix = X.T @ X
-            # Per-cell one-hot contrasts (one column per present (g, t) cell).
-            per_cell_keys = list(gt_keys_ordered)
-            # Overall ATT post-period-average contrast (matches the default
-            # ``_compute_weighted_agg`` weights ``n_{g,t}``).
+            # Honor rank deficiency: solve_ols sets coefs[i] = NaN for dropped
+            # columns. The full-design bread (X.T @ X) is singular on the
+            # dropped columns, so _compute_cr2_bm_contrast_dof would
+            # LinAlgError on it. Reduce X / bread / contrasts to the kept-
+            # column subspace before computing BM DOF (matches the existing
+            # MPD pattern at estimators.py:1860-1913 and SA's full-dummy
+            # behavior). Identified (g, t) cells survive; cells whose
+            # treatment-cell coefficient was dropped get per_cell_bm_dof=NaN
+            # and the gt_effects loop fail-closes their inference.
+            nan_mask = np.isnan(coefs)
+            kept_indices = np.where(~nan_mask)[0]
+            kept_set = set(int(i) for i in kept_indices.tolist())
+            X_red = X[:, kept_indices]
+            bread_red = X_red.T @ X_red
+            # Map full-coef index → reduced-coef index for kept columns only.
+            full_to_reduced: Dict[int, int] = {
+                int(full_idx): red_pos for red_pos, full_idx in enumerate(kept_indices)
+            }
+            # Reduced coef-index map for (g, t) cells whose coefficient was
+            # kept; cells with dropped coefficients are absent here and will
+            # be fail-closed at gt_effects inference + aggregate() time.
+            reduced_coef_idx_map: Dict[Tuple, int] = {
+                k: full_to_reduced[v]
+                for k, v in gt_coef_index_map.items()
+                if int(v) in kept_set
+            }
+            n_red = X_red.shape[1]
+            # Per-cell one-hot contrasts (kept cells only). Dropped cells get
+            # NaN per_cell_bm_dof (caller fail-closes inference fields).
+            per_cell_keys_kept = [k for k in gt_keys_ordered if k in reduced_coef_idx_map]
+            per_cell_keys_dropped = [
+                k for k in gt_keys_ordered if k not in reduced_coef_idx_map
+            ]
+            # Overall ATT contrast across post-period kept cells.
             post_keys = [(g, t) for (g, t) in gt_keys_ordered if t >= g]
-            w_total_post = sum(gt_weights.get(k, 0) for k in post_keys)
-            overall_contrast = np.zeros(n_coefs)
+            post_keys_kept = [k for k in post_keys if k in reduced_coef_idx_map]
+            w_total_post = sum(gt_weights.get(k, 0) for k in post_keys_kept)
+            overall_contrast = np.zeros(n_red)
             if w_total_post > 0:
-                for k in post_keys:
-                    overall_contrast[gt_coef_index_map[k]] = (
+                for k in post_keys_kept:
+                    overall_contrast[reduced_coef_idx_map[k]] = (
                         gt_weights[k] / w_total_post
                     )
             include_overall = w_total_post > 0 and bool(np.any(overall_contrast != 0))
             cols: List[np.ndarray] = []
-            for k in per_cell_keys:
-                col = np.zeros(n_coefs)
-                col[gt_coef_index_map[k]] = 1.0
+            for k in per_cell_keys_kept:
+                col = np.zeros(n_red)
+                col[reduced_coef_idx_map[k]] = 1.0
                 cols.append(col)
             if include_overall:
                 cols.append(overall_contrast)
@@ -1020,9 +1049,9 @@ def _fit_ols(
                 contrasts_matrix = np.column_stack(cols)
                 try:
                     dof_vec = _compute_cr2_bm_contrast_dof(
-                        X, cluster_ids, bread_matrix, contrasts_matrix
+                        X_red, cluster_ids, bread_red, contrasts_matrix
                     )
-                    for i, k in enumerate(per_cell_keys):
+                    for i, k in enumerate(per_cell_keys_kept):
                         candidate = float(dof_vec[i])
                         per_cell_bm_dof[k] = (
                             candidate if np.isfinite(candidate) else float("nan")
@@ -1043,15 +1072,18 @@ def _fit_ols(
                         UserWarning,
                         stacklevel=3,
                     )
-                    for k in per_cell_keys:
+                    for k in per_cell_keys_kept:
                         per_cell_bm_dof[k] = float("nan")
                     if include_overall:
                         overall_att_bm_dof = float("nan")
-            # Stash artifacts for ``aggregate()`` regardless of whether the
-            # batched DOF call succeeded — the dataclass-side helper will
-            # retry contrast-specific DOFs lazily and fail-closed on its
-            # own errors.
-            bm_artifacts = (X, cluster_ids, bread_matrix, dict(gt_coef_index_map))
+            # Cells whose coefficient was dropped get NaN regardless of
+            # whether the batched call succeeded.
+            for k in per_cell_keys_dropped:
+                per_cell_bm_dof[k] = float("nan")
+            # Stash REDUCED artifacts for ``aggregate()`` so the lazy
+            # contrast DOF computation operates in the same reduced
+            # coefficient space and avoids the singular full-design bread.
+            bm_artifacts = (X_red, cluster_ids, bread_red, reduced_coef_idx_map)
 
         # 8a. Apply per-cell BM DOFs (or fail-closed NaN) to ``gt_effects``
         # for hc2_bm; otherwise use the shared ``df_inf`` (survey df or None).
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 82d0fc13..ebc3bba9 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1493,7 +1493,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 - `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
 - `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` — REJECTED at `__init__`. GLM QMLE sandwich with HC2 leverage on canonical-link pseudo-residuals (`w = p(1-p)` for logit, `w = μ_i` for Poisson) needs CR2-BM-on-GLM derivation + R parity against `clubSandwich::vcovCR(glm(...))`. Tracked in TODO.md (WooldridgeDiD logit/poisson follow-up row).
 - `survey_design=` + `vcov_type != "hc1"` — REJECTED at `fit()` with `NotImplementedError`. Survey TSL/replicate-refit overrides analytical sandwich. Use `vcov_type="hc1"` (default) for survey designs.
-- `n_bootstrap > 0` + `vcov_type ∈ {"hc2","classical"}` + `self.cluster=None` — REJECTED at `fit()`. The multiplier bootstrap is intrinsically clustered; under explicit one-way + no user cluster, the bootstrap has no cluster ID to draw weights at. User must provide explicit `cluster=X` or use `vcov_type='hc1'` / `'hc2_bm'`.
+- `n_bootstrap > 0` + `vcov_type ∈ {"hc2","classical"}` — REJECTED at `fit()` regardless of `self.cluster` setting. The multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with `cluster_ids`: with `cluster=None` the auto-cluster is dropped (bootstrap has no cluster to draw weights at); with `cluster=X` the linalg validator rejects one-way + cluster_ids downstream with a less-informative error. User must drop bootstrap (`n_bootstrap=0`) or pick a cluster-compatible `vcov_type` (`hc1` or `hc2_bm`).
 - **Note:** This routing is a documented synthesis of two existing methodology ingredients: the full-dummy auto-route from the Phase 1b PR 1/8 SunAbraham pattern (PR #472, which itself reused the Phase 1a Gate 1 TWFE lift from PR #469), and the clubSandwich WLS-CR2 algebra from the Phase 1a port (PR #475). The BM contrast DOF threading reuses `_compute_cr2_bm_contrast_dof` from PR #465 (MPD). No new methodology choice is introduced — the change is purely surface: extending the existing pattern from SA-OLS to WooldridgeDiD-OLS.
 - **Note:** Bootstrap clusters at `self.cluster if self.cluster else unit` regardless of `vcov_type`. When the analytical sandwich is one-way + the user set an explicit `cluster=X`, the bootstrap matches the user's cluster. The bootstrap SE overrides the analytical SE for `overall_*` on `n_bootstrap > 0` paths; per-cell `(g, t)` SEs still come from the analytical vcov.
 
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 7260beae..b84e305c 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1,5 +1,7 @@
 """Tests for WooldridgeDiD estimator and WooldridgeDiDResults."""
 
+import warnings
+
 import numpy as np
 import pandas as pd
 import pytest
@@ -1966,6 +1968,88 @@ def test_aggregate_calendar_under_hc2_bm_uses_bm_contrast_dof(self):
             assert np.isfinite(eff["conf_int"][0])
             assert np.isfinite(eff["conf_int"][1])
 
+    def test_hc2_bm_handles_rank_deficient_all_eventually_treated(self):
+        """All-eventually-treated panel with not_yet_treated control: late
+        cohorts have no valid post-treatment comparison and get dropped by
+        solve_ols's rank-deficiency handling. hc2_bm must compute BM DOF
+        on the REDUCED design (kept-column subspace) — operating on the
+        unreduced full-dummy bread would LinAlgError and fail-close every
+        inference field to NaN (codex R3 P1). Per-cell + aggregate
+        inference on identified cells must remain finite."""
+        rng = np.random.default_rng(42)
+        n_units, n_periods = 20, 8
+        units = np.repeat(np.arange(n_units), n_periods)
+        periods = np.tile(np.arange(1, n_periods + 1), n_units)
+        cohorts = rng.choice([3, 5, 7], size=n_units)
+        cohort_per_obs = cohorts[units]
+        tau = np.where(
+            periods >= cohort_per_obs, 0.5 + 0.2 * (periods - cohort_per_obs), 0.0
+        )
+        y = 1.0 + 0.1 * periods + tau + 0.1 * rng.normal(size=len(units))
+        df = pd.DataFrame(
+            {"unit": units, "time": periods, "cohort": cohort_per_obs, "y": y}
+        )
+        # Expect a rank-deficient warning from solve_ols (late-cohort drop).
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+                df, outcome="y", unit="unit", time="time", cohort="cohort"
+            )
+        # Per-cell inference: all identified cells finite (att + se + p +
+        # CI). solve_ols already excluded the dropped cells from
+        # group_time_effects, so every key here is identified.
+        assert len(res.group_time_effects) > 0
+        for k, eff in res.group_time_effects.items():
+            assert np.isfinite(eff["att"]), f"({k}) att NaN"
+            assert np.isfinite(eff["se"]), f"({k}) se NaN"
+            assert np.isfinite(eff["t_stat"]), f"({k}) t_stat NaN — BM DOF not threaded on reduced design"
+            assert np.isfinite(eff["p_value"]), f"({k}) p_value NaN"
+            assert np.isfinite(eff["conf_int"][0])
+            assert np.isfinite(eff["conf_int"][1])
+        # Overall ATT inference: finite end-to-end.
+        assert np.isfinite(res.overall_t_stat)
+        assert np.isfinite(res.overall_p_value)
+        # Aggregate("event"): finite p-values across event-time bins
+        res.aggregate("event")
+        assert res.event_study_effects is not None
+        for k, eff in res.event_study_effects.items():
+            assert np.isfinite(eff["t_stat"]), f"event k={k} t_stat NaN — aggregate BM DOF on reduced design regressed"
+            assert np.isfinite(eff["p_value"])
+
+    def test_hc2_bm_handles_rank_deficient_with_unit_invariant_exovar(self):
+        """Unit-invariant exovar covariate is collinear with unit FE under
+        full-dummy: solve_ols drops it as rank-deficient. hc2_bm must
+        compute BM DOF on the reduced design (P1 codex R3 regression)."""
+        df = _make_vcov_panel(n_units=30, n_periods=6, seed=20260521)
+        # Unit-invariant covariate: x = f(unit) only → collinear with unit FE
+        df["x_unit"] = df["unit"].astype(float)
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            res = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+                df,
+                outcome="y",
+                unit="unit",
+                time="time",
+                cohort="cohort",
+                exovar=["x_unit"],
+            )
+        # Per-cell + overall inference finite on identified cells
+        assert len(res.group_time_effects) > 0
+        for k, eff in res.group_time_effects.items():
+            assert np.isfinite(eff["att"]), f"({k}) att NaN"
+            assert np.isfinite(eff["se"]), f"({k}) se NaN"
+            assert np.isfinite(eff["t_stat"]), f"({k}) t_stat NaN under rank-deficient exovar — BM DOF not threaded"
+            assert np.isfinite(eff["p_value"])
+        assert np.isfinite(res.overall_t_stat)
+        assert np.isfinite(res.overall_p_value)
+        # Group + event aggregates should also produce finite inference
+        for agg_type in ("group", "event"):
+            res.aggregate(agg_type)
+        for g, eff in (res.group_effects or {}).items():
+            assert np.isfinite(eff["t_stat"]), f"group g={g} t_stat NaN under rank-deficient exovar"
+        for k, eff in (res.event_study_effects or {}).items():
+            assert np.isfinite(eff["t_stat"]), f"event k={k} t_stat NaN under rank-deficient exovar"
+
     def test_aggregate_under_hc2_bm_fail_closed_on_dof_helper_error(self, monkeypatch):
         """When _compute_cr2_bm_contrast_dof raises in aggregate(), the
         affected aggregate inference fields are NaN (fail-closed),

From 5ae458f873e54fa3860cfb8f9bcbd93718198fe6 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:47:52 -0400
Subject: [PATCH 05/11] wooldridge: extend rank-deficient hc2_bm regression to
 cover aggregate('calendar')
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address codex R4 P2: the rank-deficient hc2_bm regression tests
exercised group + event aggregates but not calendar. Extends both
regressions to also call aggregate('calendar') and assert finite BM
inference on surviving calendar contrasts — locks the reduced-space
BM DOF threading across all three non-simple aggregation surfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 tests/test_wooldridge.py | 24 ++++++++++++++++++++----
 1 file changed, 20 insertions(+), 4 deletions(-)

diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index b84e305c..2030d300 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -2009,12 +2009,25 @@ def test_hc2_bm_handles_rank_deficient_all_eventually_treated(self):
         # Overall ATT inference: finite end-to-end.
         assert np.isfinite(res.overall_t_stat)
         assert np.isfinite(res.overall_p_value)
-        # Aggregate("event"): finite p-values across event-time bins
-        res.aggregate("event")
+        # All three aggregations (group/calendar/event) must produce finite
+        # inference on identified contrasts under the reduced-design BM path.
+        for agg_type in ("group", "calendar", "event"):
+            res.aggregate(agg_type)
         assert res.event_study_effects is not None
         for k, eff in res.event_study_effects.items():
             assert np.isfinite(eff["t_stat"]), f"event k={k} t_stat NaN — aggregate BM DOF on reduced design regressed"
             assert np.isfinite(eff["p_value"])
+        assert res.group_effects is not None
+        for g, eff in res.group_effects.items():
+            assert np.isfinite(eff["t_stat"]), f"group g={g} t_stat NaN — aggregate BM DOF on reduced design regressed"
+            assert np.isfinite(eff["p_value"])
+        assert res.calendar_effects is not None
+        # Calendar entries with at least one identified treated cell should
+        # have finite BM inference; entirely-pre-treatment calendar periods
+        # are absent from calendar_effects (their cells aren't post-treatment).
+        for t, eff in res.calendar_effects.items():
+            assert np.isfinite(eff["t_stat"]), f"calendar t={t} t_stat NaN — aggregate BM DOF on reduced design regressed"
+            assert np.isfinite(eff["p_value"])
 
     def test_hc2_bm_handles_rank_deficient_with_unit_invariant_exovar(self):
         """Unit-invariant exovar covariate is collinear with unit FE under
@@ -2042,11 +2055,14 @@ def test_hc2_bm_handles_rank_deficient_with_unit_invariant_exovar(self):
             assert np.isfinite(eff["p_value"])
         assert np.isfinite(res.overall_t_stat)
         assert np.isfinite(res.overall_p_value)
-        # Group + event aggregates should also produce finite inference
-        for agg_type in ("group", "event"):
+        # Group + calendar + event aggregates should all produce finite
+        # inference under the reduced-design BM path.
+        for agg_type in ("group", "calendar", "event"):
             res.aggregate(agg_type)
         for g, eff in (res.group_effects or {}).items():
             assert np.isfinite(eff["t_stat"]), f"group g={g} t_stat NaN under rank-deficient exovar"
+        for t, eff in (res.calendar_effects or {}).items():
+            assert np.isfinite(eff["t_stat"]), f"calendar t={t} t_stat NaN under rank-deficient exovar"
         for k, eff in (res.event_study_effects or {}).items():
             assert np.isfinite(eff["t_stat"]), f"event k={k} t_stat NaN under rank-deficient exovar"
 

From 9c4db17ff9f93aed5255f1396f6eb26f623faa24 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 15:58:12 -0400
Subject: [PATCH 06/11] wooldridge: atomic set_params + survey-path cluster
 metadata cleanup (R5)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P1 (Code Quality, codex R5): set_params() mutated self BEFORE running
validation. A rejected method × vcov_type combination left the
estimator half-mutated: a caller that caught the exception could then
run fit() on a partially-applied configuration (e.g., logit HC1 fit
while self.vcov_type silently reads "hc2_bm"). Reworked set_params()
to compute pending values on locals, validate the full overlay first,
and only mutate self after validation succeeds. Mirrors the
DifferenceInDifferences.set_params atomic pattern at
estimators.py:995-1023.
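
The validate-then-mutate ordering can be sketched on a stand-in class.
ToyEstimator and its _validate helper are hypothetical names used only
for illustration; they are not the WooldridgeDiD implementation, and
the rules they enforce are a simplified subset.

```python
class ToyEstimator:
    """Hypothetical stand-in used to illustrate an atomic set_params."""

    _METHODS = {"ols", "logit", "poisson"}
    _VCOV_TYPES = {"classical", "hc1", "hc2", "hc2_bm"}

    def __init__(self, method="ols", vcov_type="hc1"):
        self._validate(method=method, vcov_type=vcov_type)
        self.method = method
        self.vcov_type = vcov_type

    @classmethod
    def _validate(cls, method, vcov_type):
        if method not in cls._METHODS:
            raise ValueError(f"Unknown method: {method!r}")
        if vcov_type not in cls._VCOV_TYPES:
            raise ValueError(f"Unknown vcov_type: {vcov_type!r}")
        # Interaction guard: non-OLS methods only support hc1 here.
        if method != "ols" and vcov_type != "hc1":
            raise NotImplementedError("non-hc1 vcov_type requires method='ols'")

    def get_params(self):
        return {"method": self.method, "vcov_type": self.vcov_type}

    def set_params(self, **params):
        current = self.get_params()
        # Reject unknown keys before touching any state.
        for key in params:
            if key not in current:
                raise ValueError(f"Unknown parameter: {key!r}")
        # Validate the overlay on locals; self is still unchanged here.
        pending = {**current, **params}
        self._validate(**pending)
        # Validation passed, so it is now safe to mutate.
        for key, value in params.items():
            setattr(self, key, value)
        return self


est = ToyEstimator(method="logit")
before = est.get_params()
try:
    est.set_params(vcov_type="hc2_bm")  # rejected by the interaction guard
    rejected = False
except NotImplementedError:
    rejected = True
after = est.get_params()
```

Because the overlay is validated before any setattr call, a caller that
catches the exception still holds an estimator in its previous, valid
configuration.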

P2 (Maintainability, codex R5): under survey TSL, the analytical
sandwich (including its cluster_ids) is replaced by the survey-design
variance — but the WooldridgeDiDResults dataclass was still surfacing
cluster_name='unit' / n_clusters=N_units from the analytical-path
default, misrepresenting what was actually computed. Set both to None
when survey_metadata is active; the survey-design stratification
lives in survey_metadata. Applied across _fit_ols, _fit_logit, and
_fit_poisson.

New regression tests:
- test_set_params_is_atomic_on_validation_failure (3 reject scenarios:
  method×vcov interaction, unknown vcov_type batch, unknown param key —
  each asserts get_params() unchanged after the rejection).
- test_survey_design_clears_cluster_metadata (survey + hc1 fit on OLS
  path asserts cluster_name/n_clusters are None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/wooldridge.py  | 73 +++++++++++++++++++++++++++++-----------
 tests/test_wooldridge.py | 52 ++++++++++++++++++++++++++++
 2 files changed, 106 insertions(+), 19 deletions(-)

diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index 93cce3d6..9e60cf81 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -441,20 +441,39 @@ def get_params(self) -> Dict[str, Any]:
         }
 
     def set_params(self, **params: Any) -> "WooldridgeDiD":
-        """Set estimator parameters (sklearn-compatible). Returns self."""
-        for key, value in params.items():
+        """Set estimator parameters (sklearn-compatible). Returns self.
+
+        Atomic: if validation rejects the incoming combination (unknown
+        parameter, invalid value, or the ``method`` × ``vcov_type``
+        interaction guard fires), ``self`` is unchanged so a caller that
+        catches ``ValueError`` / ``NotImplementedError`` can keep using
+        the estimator with its previous configuration. Mirrors the
+        ``DifferenceInDifferences.set_params`` pattern at
+        ``estimators.py:995-1023``.
+        """
+        # First pass: validate all incoming keys are known attributes so
+        # we don't partially apply a batch that ends in "Unknown parameter".
+        for key in params:
             if not hasattr(self, key):
                 raise ValueError(f"Unknown parameter: {key!r}")
+
+        # Compute pending values by overlaying ``params`` on the current
+        # configuration; validate on those locals (catches invalid sets +
+        # the method × vcov_type interaction) BEFORE mutating ``self``.
+        pending = {
+            "method": params.get("method", self.method),
+            "control_group": params.get("control_group", self.control_group),
+            "anticipation": params.get("anticipation", self.anticipation),
+            "bootstrap_weights": params.get(
+                "bootstrap_weights", self.bootstrap_weights
+            ),
+            "vcov_type": params.get("vcov_type", self.vcov_type),
+        }
+        self._validate_constructor_args(**pending)
+
+        # All validation passed — apply mutations atomically.
+        for key, value in params.items():
             setattr(self, key, value)
-        # Re-run validation (catches mutations into invalid sets AND the
-        # method × vcov_type interaction) using the shared validator.
-        self._validate_constructor_args(
-            method=self.method,
-            control_group=self.control_group,
-            anticipation=self.anticipation,
-            bootstrap_weights=self.bootstrap_weights,
-            vcov_type=self.vcov_type,
-        )
         # Recompute the explicit-vcov flag after any vcov_type mutation.
         self._vcov_type_explicit = self.vcov_type != "hc1"
         return self
@@ -1170,8 +1189,17 @@ def _fit_ols(
             anticipation=self.anticipation,
             survey_metadata=survey_metadata,
             vcov_type=self.vcov_type,
-            cluster_name=cluster_col,
-            n_clusters=(int(np.unique(cluster_ids).size) if cluster_ids is not None else None),
+            # Cluster metadata: ``None`` under survey TSL because the
+            # analytical sandwich (and its cluster_ids) was overridden by
+            # the survey variance; ``survey_metadata`` carries the
+            # design-side stratification/PSU instead. Under non-survey, the
+            # analytical cluster (default unit, dropped for explicit one-way).
+            cluster_name=(None if resolved is not None else cluster_col),
+            n_clusters=(
+                None
+                if resolved is not None
+                else (int(np.unique(cluster_ids).size) if cluster_ids is not None else None)
+            ),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
@@ -1480,10 +1508,14 @@ def _avg_ax0(a, cell_mask):
             # vcov_type is locked to "hc1" on the nonlinear paths (the
             # __init__ guard rejects non-hc1 + method != "ols"). Surface
             # cluster_name / n_clusters for shared introspection contract
-            # with the OLS path.
+            # with the OLS path. Under survey TSL the analytical sandwich
+            # (including its cluster_ids) is replaced — surface ``None``
+            # so downstream introspection doesn't claim a unit-cluster
+            # that wasn't actually applied; survey_metadata carries the
+            # design-side structure instead.
             vcov_type=self.vcov_type,
-            cluster_name=cluster_col,
-            n_clusters=int(np.unique(cluster_ids).size),
+            cluster_name=(None if _has_survey else cluster_col),
+            n_clusters=(None if _has_survey else int(np.unique(cluster_ids).size)),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
@@ -1727,10 +1759,13 @@ def _avg_ax0(a, cell_mask):
             survey_metadata=survey_metadata,
             # vcov_type locked to "hc1" on Poisson path (the __init__ guard
             # rejects non-hc1 + method != "ols"). Surface cluster_name /
-            # n_clusters for shared introspection contract.
+            # n_clusters for shared introspection contract. Under survey
+            # TSL the analytical sandwich is replaced — surface ``None``
+            # so downstream introspection doesn't claim a unit-cluster
+            # that wasn't actually applied.
             vcov_type=self.vcov_type,
-            cluster_name=cluster_col,
-            n_clusters=int(np.unique(cluster_ids).size),
+            cluster_name=(None if _has_survey else cluster_col),
+            n_clusters=(None if _has_survey else int(np.unique(cluster_ids).size)),
             _gt_weights=gt_weights,
             _gt_vcov=gt_vcov,
             _gt_keys=gt_keys_ordered,
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 2030d300..6d0f86b4 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1839,6 +1839,58 @@ def test_set_params_catches_method_vcov_interaction(self):
         with pytest.raises(NotImplementedError):
             est.set_params(method="logit", vcov_type="hc2_bm")
 
+    def test_set_params_is_atomic_on_validation_failure(self):
+        """Per codex R5 P1: rejected set_params must leave the estimator
+        unchanged so subsequent fit() runs on the validated configuration,
+        not a half-mutated one. Without atomicity, a caller that catches
+        the exception could later run e.g. a logit HC1 fit while
+        ``self.vcov_type`` silently reads ``'hc2_bm'``."""
+        est = WooldridgeDiD(method="ols", vcov_type="hc1")
+        original_params = est.get_params()
+        # Reject: method=logit + vcov_type=hc2_bm (interaction guard)
+        with pytest.raises(NotImplementedError):
+            est.set_params(method="logit", vcov_type="hc2_bm")
+        # Estimator must be unchanged
+        assert est.get_params() == original_params
+        assert est.method == "ols"
+        assert est.vcov_type == "hc1"
+        assert est._vcov_type_explicit is False
+        # Reject: unknown vcov_type. Try changing multiple params at once
+        # to verify atomicity catches partial application.
+        with pytest.raises(ValueError, match="hc4"):
+            est.set_params(method="poisson", vcov_type="hc4")
+        # method must NOT have changed to "poisson" — the validator rejected
+        # the batch before any setattr() ran.
+        assert est.method == "ols"
+        assert est.vcov_type == "hc1"
+        # Unknown parameter key: same atomicity guarantee.
+        with pytest.raises(ValueError, match="bogus_param"):
+            est.set_params(vcov_type="hc2_bm", bogus_param=42)
+        assert est.vcov_type == "hc1"
+        assert est._vcov_type_explicit is False
+
+    def test_survey_design_clears_cluster_metadata(self):
+        """Per codex R5 P2: under survey TSL the analytical sandwich (and
+        its cluster_ids) is replaced — cluster_name / n_clusters should be
+        ``None`` (the survey design's stratification lives in
+        ``survey_metadata``), not a misleading echo of the default unit
+        cluster."""
+        from diff_diff.survey import SurveyDesign
+
+        df = _make_vcov_panel()
+        df["w"] = 1.0
+        design = SurveyDesign(weights="w", weight_type="pweight")
+        # OLS + survey + default hc1: the analytical fall-through would
+        # have surfaced cluster_name='unit', n_clusters=N — but survey TSL
+        # replaces that vcov, so the dataclass must report None.
+        res = WooldridgeDiD(method="ols", vcov_type="hc1").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort",
+            survey_design=design,
+        )
+        assert res.survey_metadata is not None
+        assert res.cluster_name is None
+        assert res.n_clusters is None
+
     def test_set_params_updates_vcov_type_explicit_flag(self):
         est = WooldridgeDiD(vcov_type="hc1")
         assert est._vcov_type_explicit is False

From 6510250c09700cc9fe7144737141c3606c09b464 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 16:05:29 -0400
Subject: [PATCH 07/11] wooldridge: rename BM artifact bindings to
 X_red/bread_red (R6 P3)
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

Address codex R6 P3: internal naming in aggregate() and the
_bm_artifacts docstring used the misleading X_full / bread_matrix
labels even though the stored artifacts are the REDUCED kept-column
design (post rank-deficient drops). Renamed locals to X_red /
bread_red and expanded the dataclass field docstring to spell out
that the artifacts are reduced — using the singular full-design
bread is exactly the failure mode the rank-deficient threading was
introduced to avoid. REGISTRY entry updated to match.
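
A minimal sketch of why the reduced artifacts matter, assuming a toy
design with one duplicated column. The `kept` mask and `full_to_red`
map are hypothetical stand-ins for what `_fit_ols` derives from the
rank-deficient drops; they are not the library's actual bindings.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 12
x1 = rng.normal(size=n)
x2 = rng.normal(size=n)

# Column 2 duplicates column 1, so the full design is rank-deficient
# and X_full' X_full has no ordinary inverse.
X_full = np.column_stack([np.ones(n), x1, x1, x2])
rank_full = np.linalg.matrix_rank(X_full)

# Hypothetical kept-column mask (illustration only).
kept = np.array([True, True, False, True])
X_red = X_full[:, kept]

# The reduced Gram matrix is full rank, so the bread is well defined.
gram_red = X_red.T @ X_red
bread_red = np.linalg.inv(gram_red)

# Map full-design column positions to reduced-design positions, in the
# spirit of the reduced-space coef-index map described above.
full_to_red = {int(j): i for i, j in enumerate(np.flatnonzero(kept))}
```

Contrast vectors built for the BM DOF helper then need length
`X_red.shape[1]`, not `X_full.shape[1]`, which is what the rename is
meant to make obvious.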

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 diff_diff/wooldridge_results.py | 24 ++++++++++++++++++------
 docs/methodology/REGISTRY.md    |  2 +-
 2 files changed, 19 insertions(+), 7 deletions(-)

diff --git a/diff_diff/wooldridge_results.py b/diff_diff/wooldridge_results.py
index bd7c9e45..77765140 100644
--- a/diff_diff/wooldridge_results.py
+++ b/diff_diff/wooldridge_results.py
@@ -88,8 +88,16 @@ class WooldridgeDiDResults:
     _bm_artifacts: Optional[Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[Tuple[Any, Any], int]]] = field(
         default=None, repr=False
     )
-    """(X_full, cluster_ids, bread_matrix, gt_coef_index_map) for hc2_bm; enables
-    lazy BM contrast-DOF computation in aggregate()."""
+    """(X_red, cluster_ids, bread_red, coef_idx_map) for hc2_bm; enables
+    lazy BM contrast-DOF computation in aggregate().
+
+    ``X_red`` / ``bread_red`` are the REDUCED (kept-column) design and bread
+    matrix produced by ``_fit_ols`` after rank-deficient column drops — the
+    same subspace ``solve_ols`` returned non-NaN coefficients in.
+    ``coef_idx_map`` maps each ``(g, t)`` cell present in
+    ``group_time_effects`` to its column index in ``X_red``. Storing reduced
+    artifacts avoids the singular full-design bread that
+    ``_compute_cr2_bm_contrast_dof`` would otherwise reject."""
 
     # ------------------------------------------------------------------ #
     # Public methods                                                      #
@@ -139,8 +147,12 @@ def _bm_contrast_dofs_for(
         ) -> Dict[Any, float]:
             if self.vcov_type != "hc2_bm" or self._bm_artifacts is None:
                 return {}
-            X_full, cluster_ids_full, bread_matrix, coef_idx_map = self._bm_artifacts
-            n_total = X_full.shape[1]
+            # ``X_red`` / ``bread_red`` are the REDUCED kept-column artifacts
+            # from ``_fit_ols`` (post rank-deficient drops). ``coef_idx_map``
+            # maps (g, t) → column index in ``X_red``. See
+            # ``_bm_artifacts`` docstring above for the rationale.
+            X_red, cluster_ids_full, bread_red, coef_idx_map = self._bm_artifacts
+            n_red = X_red.shape[1]
             contrast_cols: List[np.ndarray] = []
             agg_keys: List[Any] = []
             for agg_key, cells in cells_by_key.items():
@@ -149,7 +161,7 @@ def _bm_contrast_dofs_for(
                 w_total = sum(weights.get(c, 0) for c in cells)
                 if w_total == 0:
                     continue
-                col = np.zeros(n_total)
+                col = np.zeros(n_red)
                 contributed = False
                 for c in cells:
                     if c not in coef_idx_map:
@@ -168,7 +180,7 @@ def _bm_contrast_dofs_for(
             dof_map: Dict[Any, float] = {}
             try:
                 dof_vec = _compute_cr2_bm_contrast_dof(
-                    X_full, cluster_ids_full, bread_matrix, contrasts_matrix
+                    X_red, cluster_ids_full, bread_red, contrasts_matrix
                 )
                 for i, k in enumerate(agg_keys):
                     candidate = float(dof_vec[i])
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index ebc3bba9..59eb177b 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1488,7 +1488,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 
 *Variance families (`vcov_type`, OLS path only):*
 - `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11%; on typical larger panels (n >> k_total) the gap shrinks to <2%. No public WooldridgeDiD code path exposes the `lm + CR1S` (CR1 cluster-robust on the full-dummy design) finite-sample correction — `vcov_type="hc2_bm"` routes to the CR2 Bell-McCaffrey sandwich on the full-dummy design (different variance estimator entirely), not CR1S. Users who need exact `lm + clubSandwich::vcovCR(type="CR1S")` parity must call `solve_ols` directly on a full-dummy design or fit via R. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
-- `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Per-coefficient SE matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2")` at atol=1e-10. Per-cell `(g, t)` inference fields use `coef_test()$df_Satt` Bell-McCaffrey DOF (pinned at atol=1e-6 from CI half-width inversion). Aggregated inference (overall ATT + `.aggregate("group" | "calendar" | "event")`) uses contrast-specific BM DOFs from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(constraints=matrix(w, 1), vcov=vcov_CR2, test="HTZ")$df_denom`); the overall ATT contrast DOF is computed at fit time, the other three aggregations lazily on each `.aggregate(...)` call from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed across all surfaces: when BM DOF is unavailable (helper raises or returns non-finite), the affected inference fields are NaN — not normal-theory fallback (per `feedback_bm_contrast_dof_fail_closed`).
+- `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Per-coefficient SE matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2")` at atol=1e-10. Per-cell `(g, t)` inference fields use `coef_test()$df_Satt` Bell-McCaffrey DOF (pinned at atol=1e-6 from CI half-width inversion). Aggregated inference (overall ATT + `.aggregate("group" | "calendar" | "event")`) uses contrast-specific BM DOFs from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(constraints=matrix(w, 1), vcov=vcov_CR2, test="HTZ")$df_denom`); the overall ATT contrast DOF is computed at fit time, the other three aggregations lazily on each `.aggregate(...)` call from BM artifacts (the REDUCED kept-column `X` / `cluster_ids` / bread matrix + the reduced-space coef-index map) stored on the Results object — using the reduced design after rank-deficient drops keeps the bread non-singular and matches the subspace `solve_ols` actually estimated in. Fail-closed across all surfaces: when BM DOF is unavailable (helper raises or returns non-finite), the affected inference fields are NaN — not normal-theory fallback (per `feedback_bm_contrast_dof_fail_closed`).
 - `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. Matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively.
 - `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
 - `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` — REJECTED at `__init__`. GLM QMLE sandwich with HC2 leverage on canonical-link pseudo-residuals (`w = p(1-p)` for logit, `w = μ_i` for Poisson) needs CR2-BM-on-GLM derivation + R parity against `clubSandwich::vcovCR(glm(...))`. Tracked in TODO.md (WooldridgeDiD logit/poisson follow-up row).

From e4ccbdf4e9e61fe1835ad2bca0ac096f054ea84c Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 16:12:33 -0400
Subject: [PATCH 08/11] =?UTF-8?q?wooldridge:=20R7=20P3=20wording=20?=
 =?UTF-8?q?=E2=80=94=20bootstrap=20cluster=20comment=20+=20R=20script=20he?=
 =?UTF-8?q?ader?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P3 #1 (Maintainability, codex R7): bootstrap comment in _fit_ols said
"Always clusters at the unit level" but the implementation uses
``self.cluster if self.cluster else unit``. Updated the comment to
match the actual behavior (and the registry wording).

P3 #2 (Docs/Tests, codex R7): R golden-generator header claimed hc1
"matches CR1S" and described classical as "heteroskedasticity-only".
Both contradicted the registry deviation note + the methodology test
header. Rewrote: hc1 is REFERENCE-ONLY (not pinned at parity because the
(n-1)/(n-k) correction counts k differently on the full-dummy and
within-transformed designs); classical is homoskedastic OLS SE.
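
A rough sketch of the size of that gap, taking the registry's stated
factor sqrt((n - k_within) / (n - k_total)) as given. The function name
is hypothetical, and `k_within=7` is an assumed value chosen only for
illustration; the patch states n=240 and k_total=51 for the R-parity
fixture but not k_within.

```python
import math


def cr1s_se_ratio(n: int, k_within: int, k_total: int) -> float:
    """Ratio of the full-dummy CR1S SE to the within-transform hc1 SE.

    Both variants share the G/(G-1) term, so only the (n-1)/(n-k)
    terms differ and the ratio is sqrt((n - k_within) / (n - k_total)).
    """
    if not 0 < k_within <= k_total < n:
        raise ValueError("expected 0 < k_within <= k_total < n")
    return math.sqrt((n - k_within) / (n - k_total))


# n and k_total from the fixture description; k_within is an assumption.
ratio_small_panel = cr1s_se_ratio(n=240, k_within=7, k_total=51)

# A larger panel (all three values hypothetical) where n >> k_total.
ratio_large_panel = cr1s_se_ratio(n=50_000, k_within=7, k_total=551)
```

Under these assumed inputs the small-panel ratio is on the order of the
~11% gap quoted in the registry, and the large-panel ratio is much
closer to 1.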

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 benchmarks/R/generate_wooldridge_golden.R | 22 +++++++++++++++-------
 diff_diff/wooldridge.py                   | 16 +++++++++-------
 2 files changed, 24 insertions(+), 14 deletions(-)

diff --git a/benchmarks/R/generate_wooldridge_golden.R b/benchmarks/R/generate_wooldridge_golden.R
index 9b70d3c6..86f3b435 100644
--- a/benchmarks/R/generate_wooldridge_golden.R
+++ b/benchmarks/R/generate_wooldridge_golden.R
@@ -5,12 +5,19 @@
 # panel from `benchmarks/data/wooldridge_test_panel.csv`.
 #
 # Variants generated:
-#   - hc1 (CR1 Liang-Zeger cluster-robust at unit; matches `type="CR1S"` —
-#     Stata-style G/(G-1) * (n-1)/(n-p) correction)
-#   - hc2_bm (CR2 Bell-McCaffrey at unit; per-coef DOF via coef_test()$df_Satt;
-#     overall ATT BM contrast DOF via Wald_test(test="HTZ")$df_denom)
-#   - classical (lm() summary's heteroskedasticity-only SE)
-#   - hc2 (sandwich::vcovHC type="HC2"; no clustering)
+#   - hc1: CR1 Liang-Zeger cluster-robust at unit via clubSandwich `type="CR1S"`
+#     (Stata-style G/(G-1) * (n-1)/(n-p) correction on the full-dummy lm design).
+#     REFERENCE ONLY — diff-diff's WooldridgeDiD(vcov_type='hc1') uses the
+#     within-transformed design and is NOT pinned at parity here. See REGISTRY
+#     "Variance families" → "Deviation from R" for the (n-1)/(n-k) factor
+#     difference. The hc1 SE in this JSON is for diagnostic comparison only;
+#     do NOT add a Python parity test against it.
+#   - hc2_bm: CR2 Bell-McCaffrey cluster-robust at unit (per-coef DOF via
+#     coef_test()$df_Satt; overall ATT BM contrast DOF via Wald_test(test="HTZ")$df_denom).
+#   - classical: lm() summary's homoskedastic OLS SE (no robust correction).
+#     Python's vcov_type='classical' drops the unit auto-cluster to match this.
+#   - hc2: sandwich::vcovHC type="HC2" with NO clustering. Python's
+#     vcov_type='hc2' also drops the unit auto-cluster to match.
 #
 # clubSandwich >= 0.7.0 required (matches PR #475 / PR #479 pin).
 
@@ -140,7 +147,8 @@ overall_se_hc2_bm <- sqrt(
   t(overall_contrast) %*% vcov_cr2 %*% overall_contrast
 )[1, 1]
 
-# 3. classical (lm summary SE; OLS sigma^2 * (X'X)^-1)
+# 3. classical (lm summary SE — homoskedastic OLS sigma^2 * (X'X)^-1; no
+#    robust correction. Python WooldridgeDiD(vcov_type='classical') matches.)
 vcov_classical <- vcov(fit)
 se_classical <- sqrt(diag(vcov_classical)[int_idx])
 overall_se_classical <- sqrt(
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index 9e60cf81..5a77cb71 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -1209,13 +1209,15 @@ def _fit_ols(
         )
 
         # 9. Optional multiplier bootstrap (overrides analytic SE for overall ATT).
-        # Always clusters at the unit level (via ``cluster_ids_bootstrap``)
-        # regardless of the analytical sandwich's cluster setting, so the
-        # bootstrap remains intrinsically clustered even when ``vcov_type in
-        # {"hc2","classical"}`` drops the auto-cluster for the analytical
-        # vcov. The fit() guard at the top rejects ``n_bootstrap > 0`` +
-        # one-way + ``cluster=None``, so under any combination that reaches
-        # here, clustering at the unit level matches user intent.
+        # Clusters at ``self.cluster if self.cluster else unit`` (via the
+        # ``cluster_ids_bootstrap`` variable set just below the cluster-
+        # handling block at the top of _fit_ols) — i.e., the bootstrap
+        # honors the user's explicit ``cluster=X`` and falls back to unit
+        # only when ``cluster=None`` (the panel's natural unit of variation).
+        # The fit() guard at the top rejects ``n_bootstrap > 0`` +
+        # ``vcov_type in {"hc2","classical"}`` regardless of cluster, so
+        # under any combination that reaches here, the bootstrap cluster
+        # matches the analytical cluster on hc1/hc2_bm paths.
         if self.n_bootstrap > 0:
             rng = np.random.default_rng(self.seed)
             unique_boot_clusters = np.unique(cluster_ids_bootstrap)

From d38178055de4288811d11146911e64b0ac29d051 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 16:19:04 -0400
Subject: [PATCH 09/11] =?UTF-8?q?wooldridge:=20R8=20P3=20=E2=80=94=20remov?=
 =?UTF-8?q?e=20dead=20R=20script=20block=20+=20sync=20CHANGELOG=20to=20X?=
 =?UTF-8?q?=5Fred?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P3 (Maintainability, codex R8): the R script computed an unused
overall_dof_hc2_bm via constrain_equal(int_idx) before the actual
scalar-contrast Wald_test path that produces overall_att_contrast_dof.
Removed the dead block so the generator has a single authoritative
path for the overall ATT BM DOF, matching what Python computes.

P3 (Doc, codex R8): the R script inline hc1 comment still implied
CR1S "matches diff-diff's hc1+cluster"; expanded to mark hc1 output
as reference-only with a pointer to the registry deviation note.

P3 (CHANGELOG, codex R8): Wooldridge entry still referenced X_full as
the stored BM artifact even though the implementation now stores the
REDUCED kept-column design (X_red). Updated CHANGELOG to match the
implementation and the dataclass docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                              |  2 +-
 benchmarks/R/generate_wooldridge_golden.R | 35 ++++++++---------------
 2 files changed, 13 insertions(+), 24 deletions(-)

diff --git a/CHANGELOG.md b/CHANGELOG.md
index d9eee16b..39a39c65 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts (`X_full`, `cluster_ids`, bread matrix, coef-index map) stored on the Results object. Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). 
`n_bootstrap > 0` + one-way (`hc2`/`classical`) raises at `fit()` regardless of `cluster=` setting (multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with cluster_ids — either the auto-cluster is dropped when `cluster=None` leaving the bootstrap with no cluster to draw at, or the linalg validator rejects one-way + cluster_ids when `cluster=X`). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
+- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts stored on the Results object — the REDUCED kept-column design (`X_red`), cluster_ids, reduced bread matrix, and reduced-space coef-index map (using the reduced kept-column design after rank-deficient drops keeps the bread non-singular and matches the subspace `solve_ols` actually estimated in). Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. 
`method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). `n_bootstrap > 0` + one-way (`hc2`/`classical`) raises at `fit()` regardless of `cluster=` setting (multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with cluster_ids — either the auto-cluster is dropped when `cluster=None` leaving the bootstrap with no cluster to draw at, or the linalg validator rejects one-way + cluster_ids when `cluster=X`). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
 - **ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473) and ContinuousDiD precedent (PR #476). REGISTRY `## ChaisemartinDHaultfoeuille` gains a formal `### Deviations from the paper / from R / library extensions` block consolidating 7 documented deviations into a single AI-review-recognized labeled surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)"): (D1) equal-cell weighting (deviation from BOTH AER 2020 Equation 3 AND R `DIDmultiplegtDYN`); (D2) period-based vs cohort-based stable controls; (D3) balanced-baseline panel + interior-gap drops + terminal-missingness retention + cell-period-allocator targeted `ValueError`; (D4) SE normalization `N_l` vs R `G` (~4% smaller analytical SE); (D5) singleton-cohort degeneracy → NaN with `UserWarning`; (D6) `<50%` switcher warning at far horizons (library extension citing Favara-Imbs application, footnote 14 of NBER WP 29873); (D7) Phase 3 `DID^X` covariate first-stage equal-cell weights. R cross-language coverage holds at documented tolerance bands in `tests/test_chaisemartin_dhaultfoeuille_parity.py` (`POINT_RTOL = 1e-4` on pure-direction point estimates, `MIXED_POINT_RTOL = 0.025` on mixed-direction, `PURE_DIRECTION_SE_RTOL = 0.05` on pure-direction SE, `SE_RTOL = 0.10` on multi-horizon SE, `se_rtol=0.15` on the long-panel `L_max=5` joiners-only scenario where cell-count-weighting compounds). 
No source code changes, no new tests, no new docstrings — consolidation only against the existing 12 methodology tests (`tests/test_methodology_chaisemartin_dhaultfoeuille.py`), 26 R-parity tests (`tests/test_chaisemartin_dhaultfoeuille_parity.py`), 352 unit tests (`tests/test_chaisemartin_dhaultfoeuille.py`), survey suites (`tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, three cell-period coverage suites), and two primary-source DCDH paper reviews on disk (2020 AER + 2022/2023 NBER WP 29873 via PR #478; the `dechaisemartin-2026-review.md` on disk is HAD's primary source, not DCDH's, and is referenced as adjacent context only). The REGISTRY Deviations block uses semantic section-name anchors (rather than fragile line numbers) for back-references to other parts of the DCDH section — an intentional divergence from the PR #476 ContinuousDiD precedent reflecting PR-A wording-drift CI feedback that flagged line-number cross-references as drift-prone in long sections. `METHODOLOGY_REVIEW.md` DCDH row promoted **In Progress** → **Complete**; L27 In Progress example paragraph re-pointed to WooldridgeDiD; L1289 priority-order queue item #6 (DCDH) removed and items #7-#11 renumbered to #6-#10.
 
 ## [3.4.1] - 2026-05-21
diff --git a/benchmarks/R/generate_wooldridge_golden.R b/benchmarks/R/generate_wooldridge_golden.R
index 86f3b435..df292b05 100644
--- a/benchmarks/R/generate_wooldridge_golden.R
+++ b/benchmarks/R/generate_wooldridge_golden.R
@@ -96,7 +96,13 @@ n_total_coef <- length(coef_names)
 overall_contrast <- numeric(n_total_coef)
 overall_contrast[int_idx] <- contrast_weights
 
-# 1. hc1 + CR1S (Stata-style cluster-robust; matches diff-diff's hc1+cluster)
+# 1. hc1 + CR1S (Stata-style cluster-robust on the full-dummy `lm` design).
+#    REFERENCE ONLY — see header: diff-diff's WooldridgeDiD(vcov_type='hc1')
+#    uses the within-transformed design with a different (n-1)/(n-k)
+#    correction and is NOT pinned at parity against these numbers. The hc1
+#    JSON output is retained for diagnostic comparison; tests never assert
+#    parity. See REGISTRY "Variance families" → "Deviation from R" for the
+#    derivation of the gap.
 vcov_cr1s <- vcovCR(fit, cluster = df$unit, type = "CR1S")
 se_hc1 <- sqrt(diag(vcov_cr1s)[int_idx])
 overall_se_hc1 <- sqrt(
@@ -109,29 +115,12 @@ se_hc2_bm <- sqrt(diag(vcov_cr2)[int_idx])
 coef_test_out <- coef_test(fit, vcov = vcov_cr2, test = "Satterthwaite")
 df_satt_hc2_bm <- coef_test_out$df[int_idx]
 
-# Overall ATT BM contrast DOF via Wald_test (HTZ reduces to Satterthwaite on
-# 1-row constraint matrices; df_denom is the BM contrast DOF).
+# Overall ATT BM contrast DOF via Wald_test (HTZ on a 1-row constraint matrix
+# reduces to the Satterthwaite t-test; df_denom is the scalar-contrast BM
+# DOF). For an arbitrary linear contrast we pass the matrix directly via
+# `constraints = matrix(...)`; this is the form Python's
+# _compute_cr2_bm_contrast_dof emits for the post-period overall ATT.
 constraint_matrix <- matrix(overall_contrast, nrow = 1)
-overall_dof_hc2_bm <- tryCatch(
-  {
-    wt <- Wald_test(
-      fit,
-      constraints = constrain_equal(int_idx, reg_ex = FALSE),
-      vcov = vcov_cr2,
-      test = "HTZ"
-    )
-    # HTZ test on multi-row constraints reports a single F + df_num/df_denom
-    # row; df_denom is the Bell-McCaffrey-style aggregated DOF.
-    wt$df_denom
-  },
-  error = function(e) NA_real_
-)
-
-# For the OVERALL ATT scalar contrast (1-row weights vector), build directly:
-# Wald_test with `constraints` requiring a list of `constrain_*` calls
-# (clubSandwich >= 0.5.0); for an arbitrary linear contrast pass the matrix
-# directly via `constraints = matrix(...)`. The `df_denom` is the BM
-# Satterthwaite DOF for the scalar contrast.
 overall_wt <- tryCatch(
   Wald_test(
     fit,

From c7107d7a2a445dc1cbff4957682f94ca12d6c753 Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 17:48:21 -0400
Subject: [PATCH 10/11] =?UTF-8?q?wooldridge:=20address=20CI=20codex=20R1?=
 =?UTF-8?q?=20P2=20+=20P3=20=E2=80=94=20residual=20df=20+=20hc2=5Fbm=20boo?=
 =?UTF-8?q?tstrap=20test?=
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

P2 (Methodology, CI codex R1): for non-survey `classical` and `hc2` fits,
per-cell + aggregate user-facing inference flowed through
`safe_inference(df=None)`, i.e. normal-theory, even though the registry
text said the SE matches `summary(lm(...))$coefficients` /
`sandwich::vcovHC`. R's lm() / coef_test() use the t-distribution with
residual DOF = n - rank(X), not normal-theory.
Threaded `df_one_way = X.shape[0] - n_kept_columns` through both per-cell
inference and `aggregate()` for the classical/hc2 paths. New
`_df_one_way` field on WooldridgeDiDResults makes it accessible to the
lazy aggregate. New R-parity tests
`test_classical_per_cell_inference_uses_residual_df` and
`test_hc2_per_cell_inference_uses_residual_df` pin recovered DOF to
`n - rank(X) = 189` on the existing 240-obs / 51-column fixture via CI
half-width inversion.

P3 (Tests, CI codex R1): no positive `hc2_bm + n_bootstrap > 0`
regression test existed. The new full-dummy branch's bootstrap closure
(coef_offset=1 indexing under the rank-deficient kept-column logic)
was therefore covered only indirectly, by the analytical-path tests.
Added two positive bootstrap tests:
`test_hc2_bm_plus_bootstrap_finite_inference` (full-rank panel; asserts
ATT bit-equality vs analytical fit, finite bootstrap SE, finite
event-study aggregation) and `test_hc2_bm_plus_bootstrap_rank_deficient`
(all-eventually-treated panel where solve_ols drops late-cohort
columns; locks that the bootstrap loop survives rank deficiency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 CHANGELOG.md                         |  2 +-
 diff_diff/wooldridge.py              | 59 ++++++++++++++++++----
 diff_diff/wooldridge_results.py      | 21 +++++++-
 docs/methodology/REGISTRY.md         |  2 +-
 tests/test_methodology_wooldridge.py | 39 +++++++++++++++
 tests/test_wooldridge.py             | 73 ++++++++++++++++++++++++++++
 6 files changed, 183 insertions(+), 13 deletions(-)
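
Reviewer note (illustrative only, not part of the change): the P2 tests pin
the residual DOF by inverting the CI half-width. A minimal sketch of that
idea follows, assuming SciPy is available. The helper names `residual_dof`,
`t_conf_int` and `recover_dof_from_ci` are hypothetical stand-ins and are
not the repository's `_recover_dof_from_ci` or `safe_inference`. The sketch
uses `np.linalg.matrix_rank`, whereas the patch counts the non-NaN (kept)
coefficients returned by the solver. The 240 x 51 shape mirrors the fixture
described above; the `att` and `se` values are arbitrary.

```python
import numpy as np
from scipy import stats
from scipy.optimize import brentq


def residual_dof(X: np.ndarray) -> float:
    """Residual DOF n - rank(X); NaN when no residual DOF remain."""
    df = X.shape[0] - np.linalg.matrix_rank(X)
    return float(df) if df > 0 else float("nan")


def t_conf_int(att: float, se: float, alpha: float, df: float):
    """Two-sided t-based confidence interval with the given DOF."""
    crit = stats.t.ppf(1.0 - alpha / 2.0, df)
    return att - crit * se, att + crit * se


def recover_dof_from_ci(att, se, ci_upper, alpha, lo=1.0, hi=1e6):
    """Find the DOF whose t critical value reproduces the CI half-width."""
    target = (ci_upper - att) / se  # critical value implied by the CI
    return brentq(
        lambda df: stats.t.ppf(1.0 - alpha / 2.0, df) - target,
        lo,
        hi,
        xtol=1e-10,
        maxiter=500,
    )


# Hypothetical stand-in design: 240 rows, 51 linearly independent columns.
rng = np.random.default_rng(0)
X = rng.normal(size=(240, 51))
df_one_way = residual_dof(X)

# Arbitrary point estimate and SE, used only to exercise the round trip.
att, se, alpha = 0.5, 0.1, 0.05
ci_lower, ci_upper = t_conf_int(att, se, alpha, df_one_way)
recovered = recover_dof_from_ci(att, se, ci_upper, alpha)
print(df_one_way, round(recovered, 6))
```

The round trip works because the t critical value is strictly decreasing in
the DOF, so a given half-width maps back to exactly one DOF. A normal-theory
interval would have no finite root in the bracket.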

diff --git a/CHANGELOG.md b/CHANGELOG.md
index 9c3a064d..8feaf634 100644
--- a/CHANGELOG.md
+++ b/CHANGELOG.md
@@ -8,7 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
 ## [Unreleased]
 
 ### Added
-- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts stored on the Results object — the REDUCED kept-column design (`X_red`), cluster_ids, reduced bread matrix, and reduced-space coef-index map (using the reduced kept-column design after rank-deficient drops keeps the bread non-singular and matches the subspace `solve_ols` actually estimated in). Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. 
`method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). `n_bootstrap > 0` + one-way (`hc2`/`classical`) raises at `fit()` regardless of `cluster=` setting (multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with cluster_ids — either the auto-cluster is dropped when `cluster=None` leaving the bootstrap with no cluster to draw at, or the linalg validator rejects one-way + cluster_ids when `cluster=X`). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
+- **WooldridgeDiD `vcov_type` parameter, OLS path (Phase 1b PR 3/8).** `WooldridgeDiD(vcov_type=...)` now accepts `{"classical","hc1","hc2","hc2_bm"}` on `method="ols"` (defaults to `"hc1"`, preserves prior behavior at machine precision — the WLS-CR1 sandwich is algebraically invariant between the prior within-transform path and the new branched path, differing only by float64 multiplication ordering at sub-ULP scale; the full 106-test `tests/test_wooldridge.py` baseline still passes unchanged). `hc2_bm` auto-routes to a full-dummy saturated design (`[intercept, X_design, unit_dummies, time_dummies]`) + clubSandwich WLS-CR2 algebra (PR #475) — matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/wooldridge_golden.json` fixture. `classical`/`hc2` supported via full-dummy + auto-drop of the unit auto-cluster (one-way families); explicit `cluster="X"` + one-way family raises at the linalg validator. Per-cell + aggregate p-values/CIs on `classical`/`hc2` paths use the residual DOF `n - rank(X)` (matches R `lm()` / `coef_test()` t-distribution), not normal-theory. **Bell-McCaffrey Satterthwaite DOF is threaded across ALL hc2_bm user-facing inference surfaces**: (1) per-cell `group_time_effects[(g, t)]` use `coef_test()$df_Satt` (matches R at atol=1e-6 from CI inversion); (2) overall ATT uses the post-period-aggregation contrast DOF from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(test="HTZ")$df_denom` at atol=1e-10); (3) `.aggregate("group" | "calendar" | "event")` recomputes contrast-specific BM DOFs lazily from BM artifacts stored on the Results object — the REDUCED kept-column design (`X_red`), cluster_ids, reduced bread matrix, and reduced-space coef-index map (using the reduced kept-column design after rank-deficient drops keeps the bread non-singular and matches the subspace `solve_ols` actually estimated in). 
Fail-closed (all-NaN inference) when BM DOF unavailable, mirrors PR #475 R7 and PR #479 R3. `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` raises `NotImplementedError` at `__init__` (GLM CR2-BM-on-pseudo-residuals composition needs derivation; deferred to follow-up TODO row). `SurveyDesign` + `vcov_type != "hc1"` raises `NotImplementedError` at `fit()` (survey TSL overrides analytical sandwich). `n_bootstrap > 0` + one-way (`hc2`/`classical`) raises at `fit()` regardless of `cluster=` setting (multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with cluster_ids — either the auto-cluster is dropped when `cluster=None` leaving the bootstrap with no cluster to draw at, or the linalg validator rejects one-way + cluster_ids when `cluster=X`). `conley` rejected at `__init__` with a deferral pointer. `vcov_type`, `cluster_name`, `n_clusters` added to `WooldridgeDiDResults` for downstream introspection (per `feedback_results_vcov_label_cluster_metadata`). Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).
 - **`SpilloverDiD(survey_design=SurveyDesign.subpopulation(...))` full-design retention via zero-pad scores (Wave E.3).** Closes the Wave E.1/E.2/follow-up documented limitation at `REGISTRY.md:3249`: `SurveyDesign.subpopulation()`-derived designs AND warn-and-drop fits now preserve the full-domain resolved survey design — `n_psu` / `n_strata` / `df_survey` / Binder TSL per-stratum centering reflect the FULL domain rather than the post-`finite_mask` fit sample. **Documented synthesis (library-convention adoption, NOT new methodology):** Wave E.3 adopts the canonical "zero-pad scores to full panel + retain full-design resolved survey" pattern from R `survey::svyrecvar(subset())` (Lumley 2010 §2.5) already established in `diff_diff/imputation.py:2175-2183` (PreTrendsImputation lead regression — Omega_0 scores zero-padded back to full panel length) and `diff_diff/prep.py:1401-1432` (DCDH cell variance — IF zero-padded outside the cell). Wave E.3 propagates the same convention to SpilloverDiD's Wave E.1 Binder TSL × Wave D Gardner GMM × Wave E.2/follow-up stratified-Conley + serial Bartlett meat. **Mechanical realization (one new `_compute_gmm_corrected_meat` kwarg):** the gamma_hat / Psi build stays on SURVEY-FINITE-MASK inputs (`X_1_sparse_fit`, `X_10_sparse_fit`, `eps_10_fit` built on `survey_finite_mask = finite_mask & survey_weights > 0`; `X_2_kept_gamma`, `eps_2_fit_gamma`, `survey_weights_fit_gamma` projected from the fit-sample frame down to survey_finite_mask) so the drop-first stage-1 FE column space is bit-identical to the pre-E.3 path. `_compute_gmm_corrected_meat` gains a new optional kwarg `score_pad_mask: Optional[np.ndarray] = None`: when supplied, the helper zero-pads the fit-sample `Psi` to full panel length AFTER construction but BEFORE kernel dispatch via `Psi_padded[score_pad_mask] = Psi`. 
Kernel-dispatch arrays (`cluster_ids`, `conley_coords`, `conley_time`, `conley_unit`, `resolved_survey`) are passed at FULL length so the meat helpers (Binder TSL / stratified-Conley / serial Bartlett) see the full-domain PSU / strata / centroid / time geometry. The `_validate_conley_kwargs` call inside the helper reads `n_for_conley = len(score_pad_mask)` when the kwarg is set so the Conley shape checks see the full-length geometry. **`gamma_hat` invariance:** the gamma_hat solve operates on fit-sample inputs throughout — bit-identical to the pre-E.3 path (critical for the case where `_build_butts_fe_design_csr`'s `pd.factorize` re-compaction would drop a different unit's column under a full-length FE build than under a fit-length one). **Bread invariance:** `A_22 = X_2_kept' W X_2_kept` at `spillover.py:3187-3214` still uses fit-length `X_2_kept` because `A_22_full = X_2_full' W_full X_2_full` equals `A_22_kept` when zero-weight rows contribute zero. **A2 invariant:** warn-and-drop and `SurveyDesign.subpopulation()` drops are treated identically — both apply the zero-pad mechanism. The "both mechanisms compose cleanly" case (subpop-excluded row that is ALSO warn-and-dropped) produces `Psi = 0` from either cause; the PSU still counts toward `n_psu_full`. Hand-computation methodology anchor at `_scratch/wave_e3_smoke.py` codifies the A2 invariant on 4 PSU × 4 period × 3 obs synthetic. **Subpopulation parity vs upstream-subset:** `df_survey` matches the full domain regardless of how many rows the subpopulation mask excludes (mirrors R `svyglm(design=subset(d, mask))` vs `svyglm(design=svydesign(data=data[mask], ...))`). SE may differ by design — subpopulation retains zero-padded PSU geometry; upstream-subset drops PSUs entirely. 
**Pre-E.3 baseline parity:** when `finite_mask.all() == True` AND all weights `> 0`, the Wave E.3 zero-pad is a no-op — ATT + SE + n_psu + df_survey match pre-E.3 baseline values via FIXED GOLDEN values at `test_c` (`rtol=1e-12, atol=1e-12`). **Cross-surface n_psu consistency:** top-level `res.n_psu` reads from `len(resolved_survey_fit.weights)` on the implicit-PSU branch (was `int(finite_mask.sum())` pre-codex-R1-P2-fix); this keeps `res.n_psu == res.survey_metadata.n_psu` on weights-only / strata-only survey designs under warn-and-drop. Regression at `test_c2`. **Restrictions inherited:** replicate-weight variance + subpopulation continues to raise `NotImplementedError` at the Wave E.1 gate. TwoStageDiD's analogous `finite_mask + design-subset` pattern at `two_stage.py:567-601` is NOT yet adopted to Wave E.3 — separate parity follow-up tracked in `TODO.md` (an expected-divergence test was attempted but TwoStageDiD's always-treated handling at `two_stage.py:294-336` differs from SpilloverDiD's per-unit Omega_0 check, so the divergence didn't materialize on the standard fixture; the parity follow-up should add its own targeted regression). **Implementation:** `spillover.py:2845-2896` design-subset block deleted; `survey_weights_fit = survey_weights[finite_mask]` retained for the stage-2 OLS solve which still operates on the fit sample; `cluster_ids_full[finite_mask]` subset dropped on the survey path. `_compute_gmm_corrected_meat` call at `spillover.py:3163` now receives FIT-LENGTH gamma_hat-construction inputs (unchanged) plus FULL-LENGTH kernel-dispatch arrays (`cluster_ids_for_meat`, `conley_*_for_meat`, `resolved_survey_fit`) plus the new `score_pad_mask=survey_finite_mask` kwarg; no-survey path passes `score_pad_mask=None` and uses fit-length variables throughout (bit-identical to pre-E.3). 
`_compute_gmm_corrected_meat` at `two_stage.py:62-80` adds one new optional kwarg `score_pad_mask: Optional[np.ndarray] = None` and one post-Psi-construction zero-pad block; the `_validate_conley_kwargs` call uses `n_for_conley = len(score_pad_mask)` when the kwarg is set. Within-unit-constancy validator at `spillover.py:2913` updated to operate on full-length unit array. Second `compute_survey_metadata` recompute at `spillover.py:2954-2959` uses full-length `raw_w`. No `_compute_stratified_meat_from_psu_scores` / `_compute_stratified_conley_meat` / `_compute_stratified_serial_bartlett_meat` signature changes. **Tests:** new `TestSpilloverDiDWaveE3SubpopulationFullDesign` and `TestSpilloverDiDWaveE3SubpopulationFullDesignEventStudy` classes in `tests/test_spillover.py` (19 tests: pre-E.3 baseline parity via pinned goldens, n_psu cross-surface consistency on implicit-PSU branch, A2 invariant (zero-pad mechanics via mock-spy), subpopulation × explicit-PSU parity, conley + lag>0 + subpopulation × explicit-PSU / cluster-injection / weights-only branches, cluster-as-PSU + subpopulation parity, unit with BOTH zero weight AND no Omega_0 support, gamma_hat-build sample excludes zero-weight rows, n_obs / n_treated / n_control / n_far_away_obs reflect count_mask, warn-drop SE drift golden, ATT bit-equality under PSU-last-sort exclusion, exact event-study n_obs propagation, event-study on both is_staggered branches with analytical + conley+lag variants). Pre-existing Wave E.1 `test_p2_finite_mask_forces_drop_under_survey` assertion flipped from `n_psu=8` (subset) to `n_psu=10` (full domain) to reflect the new contract.
 - **ChaisemartinDHaultfoeuille (DCDH) methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473) and ContinuousDiD precedent (PR #476). REGISTRY `## ChaisemartinDHaultfoeuille` gains a formal `### Deviations from the paper / from R / library extensions` block consolidating 7 documented deviations into a single AI-review-recognized labeled surface (per CLAUDE.md "Documenting Deviations (AI Review Compatibility)"): (D1) equal-cell weighting (deviation from BOTH AER 2020 Equation 3 AND R `DIDmultiplegtDYN`); (D2) period-based vs cohort-based stable controls; (D3) balanced-baseline panel + interior-gap drops + terminal-missingness retention + cell-period-allocator targeted `ValueError`; (D4) SE normalization `N_l` vs R `G` (~4% smaller analytical SE); (D5) singleton-cohort degeneracy → NaN with `UserWarning`; (D6) `<50%` switcher warning at far horizons (library extension citing Favara-Imbs application, footnote 14 of NBER WP 29873); (D7) Phase 3 `DID^X` covariate first-stage equal-cell weights. R cross-language coverage holds at documented tolerance bands in `tests/test_chaisemartin_dhaultfoeuille_parity.py` (`POINT_RTOL = 1e-4` on pure-direction point estimates, `MIXED_POINT_RTOL = 0.025` on mixed-direction, `PURE_DIRECTION_SE_RTOL = 0.05` on pure-direction SE, `SE_RTOL = 0.10` on multi-horizon SE, `se_rtol=0.15` on the long-panel `L_max=5` joiners-only scenario where cell-count-weighting compounds). 
No source code changes, no new tests, no new docstrings — consolidation only against the existing 12 methodology tests (`tests/test_methodology_chaisemartin_dhaultfoeuille.py`), 26 R-parity tests (`tests/test_chaisemartin_dhaultfoeuille_parity.py`), 352 unit tests (`tests/test_chaisemartin_dhaultfoeuille.py`), survey suites (`tests/test_survey_dcdh.py`, `tests/test_survey_dcdh_replicate_psu.py`, three cell-period coverage suites), and two primary-source DCDH paper reviews on disk (2020 AER + 2022/2023 NBER WP 29873 via PR #478; the `dechaisemartin-2026-review.md` on disk is HAD's primary source, not DCDH's, and is referenced as adjacent context only). The REGISTRY Deviations block uses semantic section-name anchors (rather than fragile line numbers) for back-references to other parts of the DCDH section — an intentional divergence from the PR #476 ContinuousDiD precedent reflecting PR-A wording-drift CI feedback that flagged line-number cross-references as drift-prone in long sections. `METHODOLOGY_REVIEW.md` DCDH row promoted **In Progress** → **Complete**; L27 In Progress example paragraph re-pointed to WooldridgeDiD; L1289 priority-order queue item #6 (DCDH) removed and items #7-#11 renumbered to #6-#10.
 
diff --git a/diff_diff/wooldridge.py b/diff_diff/wooldridge.py
index 5a77cb71..897bd7b4 100644
--- a/diff_diff/wooldridge.py
+++ b/diff_diff/wooldridge.py
@@ -1003,6 +1003,23 @@ def _fit_ols(
         bm_artifacts: Optional[
             Tuple[np.ndarray, np.ndarray, np.ndarray, Dict[Tuple, int]]
         ] = None
+        # Residual DOF for one-way ``vcov_type in {"classical","hc2"}`` paths
+        # (full-dummy, no survey). Matches R's ``lm()`` / ``coef_test()`` use
+        # of ``n - rank(X)`` for the t-distribution under both classical OLS
+        # SE and ``sandwich::vcovHC(type="HC2")``. ``None`` on hc1 /
+        # hc2_bm / surveyed paths (those use their own DOF threading or
+        # df_inf). Mirrors R's t-distribution convention so per-cell +
+        # aggregate p-values/CIs are not normal-theory under small samples.
+        df_one_way: Optional[float] = None
+        if (
+            self.vcov_type in ("classical", "hc2")
+            and use_full_dummy
+            and resolved is None
+            and vcov is not None
+        ):
+            n_kept = int((~np.isnan(coefs)).sum())
+            df_candidate = X.shape[0] - n_kept
+            df_one_way = float(df_candidate) if df_candidate > 0 else float("nan")
         if (
             self.vcov_type == "hc2_bm"
             and use_full_dummy
@@ -1104,10 +1121,13 @@ def _fit_ols(
             # coefficient space and avoids the singular full-design bread.
             bm_artifacts = (X_red, cluster_ids, bread_red, reduced_coef_idx_map)
 
-        # 8a. Apply per-cell BM DOFs (or fail-closed NaN) to ``gt_effects``
-        # for hc2_bm; otherwise use the shared ``df_inf`` (survey df or None).
-        # Per ``feedback_bm_contrast_dof_fail_closed``: when per-cell DOF
-        # is NaN, the cell's inference fields are NaN.
+        # 8a. Apply DOF threading (or fail-closed NaN) to ``gt_effects``:
+        # hc2_bm uses per-cell BM Satterthwaite DOF; classical/hc2 (one-way,
+        # no survey) use the residual ``df_one_way = n - rank(X)`` so
+        # p-values/CIs match R ``lm()`` / ``coef_test()`` t-distribution
+        # instead of normal-theory; hc1 / surveyed paths use ``df_inf``
+        # (survey df or None). Per ``feedback_bm_contrast_dof_fail_closed``:
+        # NaN BM DOF emits NaN inference fields (never normal-theory).
         for (g, t), eff in gt_effects.items():
             if self.vcov_type == "hc2_bm" and use_full_dummy and resolved is None:
                 cell_dof = per_cell_bm_dof.get((g, t), float("nan"))
@@ -1119,6 +1139,16 @@ def _fit_ols(
                     t_stat = float("nan")
                     p_value = float("nan")
                     conf_int = (float("nan"), float("nan"))
+            elif (
+                self.vcov_type in ("classical", "hc2")
+                and use_full_dummy
+                and resolved is None
+                and df_one_way is not None
+                and np.isfinite(df_one_way)
+            ):
+                t_stat, p_value, conf_int = safe_inference(
+                    eff["att"], eff["se"], alpha=self.alpha, df=df_one_way
+                )
             else:
                 t_stat, p_value, conf_int = safe_inference(
                     eff["att"], eff["se"], alpha=self.alpha, df=df_inf
@@ -1127,11 +1157,11 @@ def _fit_ols(
             eff["p_value"] = p_value
             eff["conf_int"] = conf_int
 
-        # 8b. Simple aggregation (always computed). Use BM contrast DOF for
-        # the overall ATT inference when ``vcov_type='hc2_bm'``; otherwise
-        # fall back to the shared df (survey df or None). Fail-closed: when
-        # BM DOF is NaN, the analytical sandwich inference fields are NaN
-        # too (see ``feedback_bm_contrast_dof_fail_closed``).
+        # 8b. Simple aggregation (always computed). DOF threading mirrors 8a:
+        # hc2_bm uses the overall ATT BM contrast DOF (fail-closed NaN if
+        # unavailable); classical/hc2 (one-way, no survey) use the residual
+        # ``df_one_way``; hc1 / surveyed paths use ``df_inf`` (survey df or
+        # None).
         if self.vcov_type == "hc2_bm" and use_full_dummy and resolved is None:
             if overall_att_bm_dof is not None and np.isfinite(overall_att_bm_dof):
                 overall = _compute_weighted_agg(
@@ -1161,6 +1191,16 @@ def _fit_ols(
                     "p_value": float("nan"),
                     "conf_int": (float("nan"), float("nan")),
                 }
+        elif (
+            self.vcov_type in ("classical", "hc2")
+            and use_full_dummy
+            and resolved is None
+            and df_one_way is not None
+            and np.isfinite(df_one_way)
+        ):
+            overall = _compute_weighted_agg(
+                gt_effects, gt_weights, gt_keys_ordered, gt_vcov, self.alpha, df=df_one_way
+            )
         else:
             overall = _compute_weighted_agg(
                 gt_effects, gt_weights, gt_keys_ordered, gt_vcov, self.alpha, df=df_inf
@@ -1206,6 +1246,7 @@ def _fit_ols(
             _df_survey=df_inf,
             _bm_per_cell_dof=per_cell_bm_dof,
             _bm_artifacts=bm_artifacts,
+            _df_one_way=df_one_way,
         )
 
         # 9. Optional multiplier bootstrap (overrides analytic SE for overall ATT).
diff --git a/diff_diff/wooldridge_results.py b/diff_diff/wooldridge_results.py
index 77765140..df51ebd9 100644
--- a/diff_diff/wooldridge_results.py
+++ b/diff_diff/wooldridge_results.py
@@ -98,6 +98,12 @@ class WooldridgeDiDResults:
     ``group_time_effects`` to its column index in ``X_red``. Storing reduced
     artifacts avoids the singular full-design bread that
     ``_compute_cr2_bm_contrast_dof`` would otherwise reject."""
+    _df_one_way: Optional[float] = field(default=None, repr=False)
+    """Residual DOF (``n - rank(X)``) for one-way ``vcov_type in
+    {"classical","hc2"}`` paths (full-dummy, no survey). ``aggregate()``
+    uses this to thread R's ``lm()`` t-distribution into per-key
+    inference. ``None`` on hc1 / hc2_bm / surveyed paths (which use BM
+    DOF or ``_df_survey`` instead)."""
 
     # ------------------------------------------------------------------ #
     # Public methods                                                      #
@@ -206,8 +212,11 @@ def _build_effect(att: float, se: float, df_for_inference: Optional[float]) -> D
             """Build an effect dict using ``df_for_inference`` for the t-distribution.
 
             When ``self.vcov_type == "hc2_bm"``, ``df_for_inference`` should be
-            the BM contrast DOF (NaN → fail-closed). Otherwise it falls back
-            to ``self._df_survey`` (None → normal-theory).
+            the BM contrast DOF (NaN → fail-closed). For ``classical`` /
+            ``hc2`` (one-way, no survey) the residual DOF ``self._df_one_way``
+            is used so per-key inference matches R ``lm()`` /
+            ``coef_test()`` t-distribution. For hc1 / surveyed paths,
+            ``self._df_survey`` (None → normal-theory) is used.
             """
             if self.vcov_type == "hc2_bm":
                 if df_for_inference is None or not np.isfinite(df_for_inference):
@@ -221,6 +230,14 @@ def _build_effect(att: float, se: float, df_for_inference: Optional[float]) -> D
                 t_stat, p_value, conf_int = safe_inference(
                     att, se, alpha=self.alpha, df=df_for_inference
                 )
+            elif (
+                self.vcov_type in ("classical", "hc2")
+                and self._df_one_way is not None
+                and np.isfinite(self._df_one_way)
+            ):
+                t_stat, p_value, conf_int = safe_inference(
+                    att, se, alpha=self.alpha, df=self._df_one_way
+                )
             else:
                 t_stat, p_value, conf_int = safe_inference(
                     att, se, alpha=self.alpha, df=self._df_survey
diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index 40fa82f2..e8986c0b 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1489,7 +1489,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 *Variance families (`vcov_type`, OLS path only):*
 - `hc1` (default) — CR1 Liang-Zeger cluster-robust on the within-transformed design. Bit-equal to prior behavior (FWL preserves the score). The natural R anchor is `fixest::feols(y ~ <interactions> | unit + time, cluster=~unit)` or Stata `jwdid` (both within-transform). **Deviation from R `lm + clubSandwich::vcovCR(type="CR1S")`:** the full-dummy `lm` SE differs by a factor of `sqrt((n - k_within) / (n - k_total))` because clubSandwich's `(n-1)/(n-p)` finite-sample correction counts ALL columns (intercept + treatment + unit dummies + time dummies = `k_total`) while WooldridgeDiD's `solve_ols` on the within-transformed design counts only the treatment-cell columns (`k_within`). On the 240-obs / 51-column R-parity fixture this is ~11%; on typical larger panels (n >> k_total) the gap shrinks to <2%. No public WooldridgeDiD code path exposes the `lm + CR1S` (CR1 cluster-robust on the full-dummy design) finite-sample correction — `vcov_type="hc2_bm"` routes to the CR2 Bell-McCaffrey sandwich on the full-dummy design (different variance estimator entirely), not CR1S. Users who need exact `lm + clubSandwich::vcovCR(type="CR1S")` parity must call `solve_ols` directly on a full-dummy design or fit via R. Same deviation pattern as SunAbraham PR #472 (`fixest::sunab` vs `lm + clubSandwich`).
 - `hc2_bm` — CR2 Bell-McCaffrey via auto-route to full-dummy design (`[intercept, X_design, unit_dummies, time_dummies]`), then `solve_ols(..., vcov_type="hc2_bm")` through the clubSandwich port (PR #475). FWL does NOT preserve the hat matrix; HC2 leverage + BM DOF require the full-projection design. Per-coefficient SE matches `clubSandwich::vcovCR(lm(...), cluster=~unit, type="CR2")` at atol=1e-10. Per-cell `(g, t)` inference fields use `coef_test()$df_Satt` Bell-McCaffrey DOF (pinned at atol=1e-6 from CI half-width inversion). Aggregated inference (overall ATT + `.aggregate("group" | "calendar" | "event")`) uses contrast-specific BM DOFs from `_compute_cr2_bm_contrast_dof` (matches R `Wald_test(constraints=matrix(w, 1), vcov=vcov_CR2, test="HTZ")$df_denom`); the overall ATT contrast DOF is computed at fit time, the other three aggregations lazily on each `.aggregate(...)` call from BM artifacts (the REDUCED kept-column `X` / `cluster_ids` / bread matrix + the reduced-space coef-index map) stored on the Results object — using the reduced design after rank-deficient drops keeps the bread non-singular and matches the subspace `solve_ols` actually estimated in. Fail-closed across all surfaces: when BM DOF is unavailable (helper raises or returns non-finite), the affected inference fields are NaN — not normal-theory fallback (per `feedback_bm_contrast_dof_fail_closed`).
-- `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. Matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively.
+- `classical`, `hc2` — supported via auto-route to full-dummy AND auto-drop of the unit auto-cluster (one-way families don't compose with `cluster_ids` per the linalg validator). Set `self.cluster=None` (default) for these; explicit `cluster="state"` + one-way family raises at the linalg validator. SE matches `summary(lm(...))$coefficients` (classical) and `sandwich::vcovHC(type="HC2")` respectively. Per-cell + aggregate p-values/CIs use the residual DOF `n - rank(X)` (matches R `lm()` / `coef_test()` t-distribution under both classical OLS SE and `sandwich::vcovHC(type="HC2")`), not normal-theory, so small-sample inference uses t rather than normal critical values.
 - `conley` — REJECTED at `__init__` (deferral; would require threading `conley_*` params through `solve_ols`; tracked in TODO.md).
 - `method ∈ {"logit","poisson"}` + `vcov_type != "hc1"` — REJECTED at `__init__`. GLM QMLE sandwich with HC2 leverage on canonical-link pseudo-residuals (`w = p(1-p)` for logit, `w = μ_i` for Poisson) needs CR2-BM-on-GLM derivation + R parity against `clubSandwich::vcovCR(glm(...))`. Tracked in TODO.md (WooldridgeDiD logit/poisson follow-up row).
 - `survey_design=` + `vcov_type != "hc1"` — REJECTED at `fit()` with `NotImplementedError`. Survey TSL/replicate-refit overrides analytical sandwich. Use `vcov_type="hc1"` (default) for survey designs.
diff --git a/tests/test_methodology_wooldridge.py b/tests/test_methodology_wooldridge.py
index 854270a3..f1172b7a 100644
--- a/tests/test_methodology_wooldridge.py
+++ b/tests/test_methodology_wooldridge.py
@@ -185,6 +185,45 @@ def test_hc2_se_matches_sandwich_vcovhc(self, golden: dict, panel: pd.DataFrame)
             assert py_se == pytest.approx(r_ses[i], abs=1e-10)
         assert res.overall_se == pytest.approx(golden["hc2"]["overall_att_se"], abs=1e-10)
 
+    def test_classical_per_cell_inference_uses_residual_df(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Per-cell ``vcov_type="classical"`` inference uses ``n - rank(X)``
+        residual DOF (matches R ``summary(lm(...))$coefficients`` t-distribution)
+        rather than normal-theory.
+
+        n_obs=240, full-dummy design has intercept (1) + treatment cells (6) +
+        unit dummies (drop_first=True, 39) + time dummies (drop_first=True, 5)
+        = 51 columns, all kept (full rank). Residual df = 240 - 51 = 189.
+        """
+        res = WooldridgeDiD(method="ols", vcov_type="classical").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        expected_df = float(panel.shape[0] - 51)  # 189
+        for (g, t), eff in res.group_time_effects.items():
+            recovered_df = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert recovered_df == pytest.approx(expected_df, abs=1e-6), (
+                f"(g={g}, t={t}): recovered df={recovered_df:.4f} expected={expected_df}"
+            )
+
+    def test_hc2_per_cell_inference_uses_residual_df(
+        self, golden: dict, panel: pd.DataFrame
+    ) -> None:
+        """Per-cell ``vcov_type="hc2"`` inference uses ``n - rank(X)`` residual
+        DOF (matches R ``coef_test(fit, vcov=vcovHC(type="HC2"))`` t-distribution
+        default) rather than normal-theory."""
+        res = WooldridgeDiD(method="ols", vcov_type="hc2").fit(
+            panel, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        expected_df = float(panel.shape[0] - 51)
+        for (g, t), eff in res.group_time_effects.items():
+            recovered_df = _recover_dof_from_ci(
+                eff["att"], eff["se"], eff["conf_int"][1], res.alpha
+            )
+            assert recovered_df == pytest.approx(expected_df, abs=1e-6)
+
     def test_aggregate_group_bm_dof_matches_wald_test_htz(
         self, golden: dict, panel: pd.DataFrame
     ) -> None:
diff --git a/tests/test_wooldridge.py b/tests/test_wooldridge.py
index 6d0f86b4..a2d1de7f 100644
--- a/tests/test_wooldridge.py
+++ b/tests/test_wooldridge.py
@@ -1821,6 +1821,79 @@ def test_bootstrap_plus_one_way_rejected_regardless_of_cluster(self):
         with pytest.raises(ValueError, match=r"multiplier bootstrap"):
             est_cl.fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
 
+    def test_hc2_bm_plus_bootstrap_finite_inference(self):
+        """Positive regression: ``vcov_type='hc2_bm'`` + ``n_bootstrap > 0``
+        runs through the new full-dummy branch's bootstrap closure (with
+        ``coef_offset=1`` for the post-period ATT reconstruction) without
+        regressing. Asserts finite ``overall_se`` (overridden by the
+        multiplier bootstrap), stable ``overall_att`` (matches the
+        analytical fit at machine precision since the bootstrap only
+        overrides SE), and finite event-study aggregation."""
+        df = _make_vcov_panel()
+        # Analytical hc2_bm fit for ATT reference.
+        res_analytical = WooldridgeDiD(method="ols", vcov_type="hc2_bm").fit(
+            df, outcome="y", unit="unit", time="time", cohort="cohort"
+        )
+        # Bootstrap fit on the same data (seeded for reproducibility).
+        res_boot = WooldridgeDiD(
+            method="ols", vcov_type="hc2_bm", n_bootstrap=50, seed=0
+        ).fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+        # ATT is unchanged by the bootstrap (only SE is overridden)
+        assert res_boot.overall_att == pytest.approx(
+            res_analytical.overall_att, abs=1e-10
+        )
+        # SE finite + sensible (positive, smaller than the panel SD of y)
+        assert np.isfinite(res_boot.overall_se)
+        assert res_boot.overall_se > 0
+        assert res_boot.overall_se < df["y"].std()
+        # Bootstrap overrides analytical inference for overall ATT
+        assert np.isfinite(res_boot.overall_t_stat)
+        assert np.isfinite(res_boot.overall_p_value)
+        # Per-cell SEs still come from the analytical full-dummy CR2-BM path
+        # (bootstrap only overrides overall_*); this locks in that the
+        # coef_offset bootstrap indexing did not regress the per-cell path.
+        for k, eff in res_boot.group_time_effects.items():
+            assert np.isfinite(eff["se"])
+            assert eff["att"] == pytest.approx(
+                res_analytical.group_time_effects[k]["att"], abs=1e-10
+            )
+        # Event-study aggregate also produces finite inference under bootstrap
+        res_boot.aggregate("event")
+        assert res_boot.event_study_effects is not None
+        for k, eff in res_boot.event_study_effects.items():
+            assert np.isfinite(eff["att"])
+            assert np.isfinite(eff["se"])
+            assert np.isfinite(eff["t_stat"])
+
+    def test_hc2_bm_plus_bootstrap_rank_deficient(self):
+        """hc2_bm + bootstrap on a rank-deficient design (all-eventually-
+        treated panel where late cohorts drop out of solve_ols) — bootstrap
+        loop must still run because cluster_ids_bootstrap defaults to unit
+        (cluster_ids itself is non-None on hc2_bm). Locks that the
+        coef_offset + dropped-cell indexing in the bootstrap closure
+        survives rank deficiency."""
+        rng = np.random.default_rng(42)
+        n_units, n_periods = 20, 8
+        units = np.repeat(np.arange(n_units), n_periods)
+        periods = np.tile(np.arange(1, n_periods + 1), n_units)
+        cohorts = rng.choice([3, 5, 7], size=n_units)
+        cohort_per_obs = cohorts[units]
+        tau = np.where(
+            periods >= cohort_per_obs, 0.5 + 0.2 * (periods - cohort_per_obs), 0.0
+        )
+        y = 1.0 + 0.1 * periods + tau + 0.1 * rng.normal(size=len(units))
+        df = pd.DataFrame(
+            {"unit": units, "time": periods, "cohort": cohort_per_obs, "y": y}
+        )
+        with warnings.catch_warnings():
+            warnings.simplefilter("ignore", UserWarning)
+            res = WooldridgeDiD(
+                method="ols", vcov_type="hc2_bm", n_bootstrap=50, seed=0
+            ).fit(df, outcome="y", unit="unit", time="time", cohort="cohort")
+        assert np.isfinite(res.overall_att)
+        assert np.isfinite(res.overall_se)
+        assert res.overall_se > 0
+
     def test_get_params_includes_vcov_type(self):
         est = WooldridgeDiD(vcov_type="hc2_bm")
         params = est.get_params()

From 8d96f3e6f7b2e590561d8baf0212014fdf034b0d Mon Sep 17 00:00:00 2001
From: igerber <isaac.gerber@gmail.com>
Date: Thu, 21 May 2026 17:55:41 -0400
Subject: [PATCH 11/11] wooldridge: REGISTRY bootstrap-cluster note scope (CI
 codex R2 P3)

CI codex R2 flagged that the bootstrap-cluster note described a
one-way + explicit-cluster bootstrap scenario as if it were part of the
supported contract, even though an earlier bullet already rejects
``n_bootstrap > 0`` + ``vcov_type in {hc2, classical}`` regardless of
cluster. Rewrote the note to scope it to the supported paths (``hc1`` /
``hc2_bm``) and to reference the rejection bullet explicitly.
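
For reviewers, the contract the rewritten note describes can be sketched
roughly as below. This is an illustration only: ``resolve_bootstrap_cluster``
and the two set constants are hypothetical names, not functions in the
codebase, and the error text is an assumption beyond the ``multiplier
bootstrap`` phrase the existing test matches on.

```python
# Hypothetical sketch of the documented bootstrap/vcov_type contract.
ONE_WAY_VCOV = {"classical", "hc2"}   # do not compose with cluster_ids
CLUSTER_VCOV = {"hc1", "hc2_bm"}      # cluster-compatible families


def resolve_bootstrap_cluster(vcov_type, n_bootstrap, cluster, unit):
    """Return the column the bootstrap would cluster on, or None.

    Raises ValueError for bootstrap + one-way vcov_type, regardless of
    whether an explicit cluster was given.
    """
    if vcov_type not in ONE_WAY_VCOV | CLUSTER_VCOV:
        raise ValueError(f"unsupported vcov_type: {vcov_type!r}")
    if n_bootstrap > 0 and vcov_type in ONE_WAY_VCOV:
        # Assumed wording; only "multiplier bootstrap" is taken from the test.
        raise ValueError(
            "multiplier bootstrap requires vcov_type 'hc1' or 'hc2_bm', "
            f"got {vcov_type!r}"
        )
    if n_bootstrap == 0:
        return None  # no bootstrap, so no bootstrap cluster to resolve
    # Explicit cluster column if set, otherwise fall back to the unit column.
    return cluster if cluster else unit
```

Under this sketch, ``hc1`` or ``hc2_bm`` with ``cluster=None`` resolves to the
unit column, an explicit ``cluster="state"`` resolves to ``"state"``, and
``hc2`` or ``classical`` with ``n_bootstrap > 0`` raises in both cases.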

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
---
 docs/methodology/REGISTRY.md | 2 +-
 1 file changed, 1 insertion(+), 1 deletion(-)

diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md
index e8986c0b..b6fd2608 100644
--- a/docs/methodology/REGISTRY.md
+++ b/docs/methodology/REGISTRY.md
@@ -1495,7 +1495,7 @@ where `g(·)` is the link inverse (logistic or exp), `η_i` is the individual li
 - `survey_design=` + `vcov_type != "hc1"` — REJECTED at `fit()` with `NotImplementedError`. Survey TSL/replicate-refit overrides analytical sandwich. Use `vcov_type="hc1"` (default) for survey designs.
 - `n_bootstrap > 0` + `vcov_type ∈ {"hc2","classical"}` — REJECTED at `fit()` regardless of `self.cluster` setting. The multiplier bootstrap is intrinsically clustered, but one-way vcov_type does not compose with `cluster_ids`: with `cluster=None` the auto-cluster is dropped (bootstrap has no cluster to draw weights at); with `cluster=X` the linalg validator rejects one-way + cluster_ids downstream with a less-informative error. User must drop bootstrap (`n_bootstrap=0`) or pick a cluster-compatible `vcov_type` (`hc1` or `hc2_bm`).
 - **Note:** This routing is a documented synthesis of two existing methodology ingredients: the full-dummy auto-route from the Phase 1b PR 1/8 SunAbraham pattern (PR #472, which itself reused the Phase 1a Gate 1 TWFE lift from PR #469), and the clubSandwich WLS-CR2 algebra from the Phase 1a port (PR #475). The BM contrast DOF threading reuses `_compute_cr2_bm_contrast_dof` from PR #465 (MPD). No new methodology choice is introduced — the change is purely surface: extending the existing pattern from SA-OLS to WooldridgeDiD-OLS.
-- **Note:** Bootstrap clusters at `self.cluster if self.cluster else unit` regardless of `vcov_type`. When the analytical sandwich is one-way + the user set an explicit `cluster=X`, the bootstrap matches the user's cluster. The bootstrap SE overrides the analytical SE for `overall_*` on `n_bootstrap > 0` paths; per-cell `(g, t)` SEs still come from the analytical vcov.
+- **Note:** Bootstrap is supported only with `vcov_type ∈ {"hc1","hc2_bm"}`; one-way `classical`/`hc2` + bootstrap is rejected at `fit()` per the `n_bootstrap > 0` rejection bullet above. On the supported paths, the bootstrap clusters at `self.cluster if self.cluster else unit`: it matches the user's explicit cluster column if set and falls back to unit otherwise (the panel's natural unit of variation). With `n_bootstrap > 0`, the bootstrap SE overrides the analytical SE for `overall_*`; per-cell `(g, t)` SEs still come from the analytical vcov.
 
 *Aggregations (matching `jwdid_estat`):*
 - `simple`: Weighted average across all post-treatment (g, t) cells with weights `n_{g,t}`: