Skip to content

wooldridge: thread vcov_type through OLS path (Phase 1b 3/8)#483

Merged
igerber merged 12 commits into
mainfrom
wooldridge-vcov-type-phase1b-3
May 22, 2026
Merged

wooldridge: thread vcov_type through OLS path (Phase 1b 3/8)#483
igerber merged 12 commits into
mainfrom
wooldridge-vcov-type-phase1b-3

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 21, 2026

Summary

  • WooldridgeDiD(vcov_type=...) now accepts {"classical","hc1","hc2","hc2_bm"} on the OLS path, mirroring the SunAbraham PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472 / StackedDiD PR Thread vcov_type through StackedDiD (Phase 1b 2/8) #479 pattern. hc1 (default) preserves bit-equal prior within-transform CR1 behavior; hc2_bm/hc2/classical auto-route to a full-dummy saturated design where the hat matrix matters.
  • hc2_bm end-to-end: per-coefficient SE matches clubSandwich::vcovCR(lm(...), type="CR2") at atol=1e-10. Bell-McCaffrey Satterthwaite DOF is threaded across ALL user-facing inference surfaces — per-cell group_time_effects[(g,t)] use coef_test()$df_Satt (matches R at atol=1e-6), overall ATT + .aggregate("group"|"calendar"|"event") use contrast-specific BM DOFs from _compute_cr2_bm_contrast_dof (matches Wald_test(test="HTZ")$df_denom). Aggregate DOFs are recomputed lazily from BM artifacts stashed on the Results object (REDUCED kept-column design — survives rank-deficient drops).
  • Fail-closed: when BM DOF is unavailable (helper raises or returns non-finite), affected inference fields are NaN — never silent normal-theory fallback.
  • One-way classical/hc2 auto-drop the unit auto-cluster (one-way families don't compose with cluster_ids); explicit cluster="X" + one-way raises at the linalg validator.
  • New rejects at the estimator boundary: method ∈ {logit, poisson} + vcov_type != "hc1" (GLM CR2-BM-on-pseudo-residuals deferred), survey_design + vcov_type != "hc1" (survey TSL overrides analytical), n_bootstrap > 0 + one-way vcov_type (intrinsically clustered bootstrap doesn't compose).
  • set_params() is now atomic: rejected combinations leave the estimator unchanged.
  • WooldridgeDiDResults gains vcov_type, cluster_name, n_clusters (None under survey TSL since the analytical sandwich is replaced).
  • Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs to follow: CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD).

Methodology references (required if estimator / math changes)

Validation

  • Tests added/updated: tests/test_wooldridge.py (28 new TestWooldridgeVcovType tests including bit-equal hc1 regression, 2 rank-deficient hc2_bm regressions, atomic set_params, survey cluster metadata cleanup, fail-closed BM DOF), tests/test_methodology_wooldridge.py (new file: 10 R-parity tests covering per-coef SE, per-cell df_Satt, overall ATT BM contrast DOF, group/calendar/event aggregate BM DOFs, classical/hc2 against lm + sandwich::vcovHC)
  • Backtest / simulation / notebook evidence: R parity goldens regenerated via benchmarks/R/generate_wooldridge_golden.R on the fixed-seed benchmarks/data/wooldridge_test_panel.csv (240 obs, 40 units, 6 periods, 3 cohorts, heterogeneous treatment effects per feedback_homogeneous_dgp_no_twfe_bias)
  • Existing 106-test tests/test_wooldridge.py baseline still passes unchanged; 148 total WooldridgeDiD tests pass

Security / privacy

  • Confirm no secrets/PII in this PR: Yes

Generated with Claude Code

igerber and others added 9 commits May 21, 2026 15:04
…ath (Phase 1b 3/8)

WooldridgeDiD now accepts `vcov_type` for the OLS path, mirroring the
SunAbraham PR #472 / StackedDiD PR #479 pattern:

- `hc1` (default) preserves bit-equal within-transform CR1 behavior
- `hc2_bm` / `hc2` / `classical` auto-route to full-dummy saturated design
  (FWL doesn't preserve the hat matrix; HC2 leverage + BM DOF need the
  full FE projection). Matches `clubSandwich::vcovCR(lm(...), type="CR2")
  + coef_test()$df_Satt` at atol=1e-10 on the 6 R-parity tests in
  tests/test_methodology_wooldridge.py.
- Bell-McCaffrey Satterthwaite DOF threaded into overall ATT inference
  via `_compute_cr2_bm_contrast_dof`; fail-closed (all-NaN) when DOF
  unavailable, per feedback_bm_contrast_dof_fail_closed.
- One-way `hc2`/`classical` auto-drop the unit auto-cluster (one-way
  families don't compose with cluster_ids). Explicit `cluster="X"` +
  one-way raises at the linalg validator.
- `method ∈ {logit, poisson}` + `vcov_type != "hc1"` rejected at
  `__init__` (GLM CR2-BM derivation deferred to follow-up TODO row).
- `SurveyDesign` + `vcov_type != "hc1"` rejected at `fit()` (survey
  TSL overrides analytical sandwich).
- `n_bootstrap > 0` + one-way + `cluster=None` rejected at `fit()`
  (bootstrap is intrinsically clustered).

WooldridgeDiDResults gains `vcov_type`, `cluster_name`, `n_clusters`
fields for downstream introspection. Third PR of the Phase 1b
standalone-estimator threading initiative (5 PRs remaining).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urfaces

Address codex R1 P1 findings: per-cell group_time_effects and
.aggregate("group"/"calendar"/"event") were using df=None
(normal-theory) on the hc2_bm path. Both surfaces now use BM contrast
DOFs to match the documented clubSandwich parity contract:

- Per-cell df_Satt computed in one batched _compute_cr2_bm_contrast_dof
  call at fit time (one-hot contrasts per (g,t)); matches
  coef_test()$df_Satt at atol=1e-6 (CI inversion).
- Overall ATT DOF computed in the same batched call (post-period
  aggregation contrast).
- BM artifacts (X_full, cluster_ids, bread, coef_index_map) stashed on
  WooldridgeDiDResults so .aggregate("group"/"calendar"/"event")
  recomputes contrast-specific BM DOFs lazily.
- Fail-closed everywhere: NaN inference fields when BM DOF unavailable,
  never silent normal-theory fallback (per
  feedback_bm_contrast_dof_fail_closed).

New tests: per-cell df_Satt R-parity (1), aggregate(group/event/calendar)
BM DOF threading (3), aggregate fail-closed on helper failure (1),
extended existing fail-closed test to cover per-cell NaN propagation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…parity

Address codex R2 findings:

P2 (Code Quality): Bootstrap + one-way analytical vcov_type was rejected
only under cluster=None. Tightened to reject the full Cartesian (cluster
None OR set) — under cluster=X the linalg validator would reject one-way
+ cluster_ids with a less-informative downstream message, so preempt at
the estimator boundary.

P3 (Maintainability): _fit_logit and _fit_poisson now thread vcov_type,
cluster_name, n_clusters into WooldridgeDiDResults (locked to vcov_type=
"hc1" by the __init__ guard on non-OLS + non-hc1, but the metadata
contract is now consistent across all three method paths).

P2 (Doc): REGISTRY hc1 note incorrectly claimed users could recover the
lm + CR1S SE via vcov_type="hc2_bm". hc2_bm is the CR2 Bell-McCaffrey
sandwich, not CR1S. Rewrote to state explicitly that no public
WooldridgeDiD path exposes lm + CR1S parity.

P2 (Tests): aggregate("group"/"calendar"/"event") under hc2_bm previously
only asserted finite p-values, which would have passed under the prior
normal-theory fallback bug. Added 3 R-parity tests that invert the CI
half-width to recover Python's BM contrast DOF and assert atol=1e-6
parity with R clubSandwich::Wald_test(test="HTZ")$df_denom. Extended
benchmarks/R/generate_wooldridge_golden.R to compute the per-key BM DOF
for each aggregation type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nment

P1 (Methodology, codex R3): hc2_bm BM contrast DOF computation operated
on the unreduced full-dummy design (X / X.T @ X). When solve_ols dropped
collinear columns under rank-deficient ETWFE specs (all-eventually-
treated not_yet_treated, time-invariant exovar collinear with unit FE,
etc.), _compute_cr2_bm_contrast_dof would LinAlgError on the singular
bread and the fail-closed branch zeroed every per-cell + aggregate
inference field to NaN — even for the still-identified cells that
solve_ols correctly returned.

Fix: subset X / bread / contrasts to the kept-column space (where
solve_ols produced non-NaN coefs) before passing to
_compute_cr2_bm_contrast_dof. Reduced artifacts (X_red, cluster_ids,
bread_red, reduced_coef_idx_map) are stashed on the Results object so
the lazy aggregate() path also operates in the reduced subspace.
Mirrors the MPD pattern at estimators.py:1860-1913. Cells whose
treatment-cell coefficient was dropped retain NaN inference fields
(fail-closed correctly for them).

P2 (Docs): CHANGELOG + REGISTRY bootstrap-reject prose was narrower than
the actual implementation (cluster=None only); updated to reflect the
broader reject regardless of cluster=.

Two new regression tests in tests/test_wooldridge.py exercise the
previously-broken rank-deficient hc2_bm paths:
- All-eventually-treated panel with not_yet_treated control (late
  cohort cells dropped by solve_ols)
- Unit-invariant exovar covariate (collinear with unit FE)

Both assert per-cell + aggregate inference is finite on identified cells.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e('calendar')

Address codex R4 P2: the rank-deficient hc2_bm regression tests
exercised group + event aggregates but not calendar. Extends both
regressions to also call aggregate('calendar') and assert finite BM
inference on surviving calendar contrasts — locks the reduced-space
BM DOF threading across all three non-simple aggregation surfaces.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(R5)

P1 (Code Quality, codex R5): set_params() mutated self BEFORE running
validation. A rejected method × vcov_type combination left the
estimator half-mutated: a caller that caught the exception could then
run fit() on a partially-applied configuration (e.g., logit HC1 fit
while self.vcov_type silently reads "hc2_bm"). Reworked set_params()
to compute pending values on locals, validate the full overlay first,
and only mutate self after validation succeeds. Mirrors the
DifferenceInDifferences.set_params atomic pattern at
estimators.py:995-1023.

P2 (Maintainability, codex R5): under survey TSL, the analytical
sandwich (including its cluster_ids) is replaced by the survey-design
variance — but the WooldridgeDiDResults dataclass was still surfacing
cluster_name='unit' / n_clusters=N_units from the analytical-path
default, misrepresenting what was actually computed. Set both to None
when survey_metadata is active; the survey-design stratification
lives in survey_metadata. Applied across _fit_ols, _fit_logit, and
_fit_poisson.

New regression tests:
- test_set_params_is_atomic_on_validation_failure (3 reject scenarios:
  method×vcov interaction, unknown vcov_type batch, unknown param key —
  each asserts get_params() unchanged after the rejection).
- test_survey_design_clears_cluster_metadata (survey + hc1 fit on OLS
  path asserts cluster_name/n_clusters are None).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address codex R6 P3: internal naming in aggregate() and the
_bm_artifacts docstring used the misleading X_full / bread_matrix
labels even though the stored artifacts are the REDUCED kept-column
design (post rank-deficient drops). Renamed locals to X_red /
bread_red and expanded the dataclass field docstring to spell out
that the artifacts are reduced — using the singular full-design
bread is exactly the failure mode the rank-deficient threading was
introduced to avoid. REGISTRY entry updated to match.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 #1 (Maintainability, codex R7): bootstrap comment in _fit_ols said
"Always clusters at the unit level" but the implementation uses
``self.cluster if self.cluster else unit``. Updated the comment to
match the actual behavior (and the registry wording).

P3 #2 (Docs/Tests, codex R7): R golden-generator header claimed hc1
"matches CR1S" and described classical as "heteroskedasticity-only".
Both contradicted the registry deviation note + the methodology test
header. Rewrote: hc1 is REFERENCE-ONLY (not pinned at parity due to
the (n-1)/(n-k) deviation); classical is homoskedastic OLS SE.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 (Maintainability, codex R8): the R script computed an unused
overall_dof_hc2_bm via constrain_equal(int_idx) before the actual
scalar-contrast Wald_test path that produces overall_att_contrast_dof.
Removed the dead block so the generator has a single authoritative
path for the overall ATT BM DOF, matching what Python computes.

P3 (Doc, codex R8): the R script inline hc1 comment still implied
CR1S "matches diff-diff's hc1+cluster"; expanded to mark hc1 output
as reference-only with a pointer to the registry deviation note.

P3 (CHANGELOG, codex R8): Wooldridge entry still referenced X_full as
the stored BM artifact even though the implementation now stores the
REDUCED kept-column design (X_red). Updated CHANGELOG to match the
implementation and the dataclass docstring.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall Assessment

✅ Looks good

Executive Summary

  • Methods affected: Wooldridge ETWFE OLS variance routing for vcov_type in {"classical","hc1","hc2","hc2_bm"}.
  • The hc2_bm methodology itself looks aligned with docs/methodology/REGISTRY.md: full-dummy routing, reduced-design BM contrast DOF handling, and fail-closed NaN inference are all documented and regression-tested against R.
  • I did not find a P0/P1 defect in the new CR2-BM path.
  • The main remaining issue is lower-severity: the new one-way classical/hc2 surfaces only lock SE parity, while non-survey p-values/CIs still use normal-theory inference.
  • There is also a lower-severity test gap on the supported hc2_bm bootstrap/full-dummy refit path.

Methodology
Methods checked: Wooldridge (2025) ETWFE OLS path plus the clubSandwich/PT2018 CR2 Bell-McCaffrey extension described in the registry.

  • P2 Impact: for non-survey classical and hc2 fits, user-facing inference still flows through safe_inference(..., df=None), so per-cell and overall p-values/CIs are normal-theory even though the registry text says classical matches summary(lm(...))$coefficients and the new parity tests only lock SEs. This does not change coefficients or SEs, but it does make the new inference surface optimistic in smaller samples and leaves the R-reference claim incomplete. Concrete fix: either thread finite-sample df into the one-way paths (for classical, n_eff - k_effective on the reduced full-dummy design; for hc2, explicitly choose and document the intended reference), or narrow the docs/tests to “SE parity only.” References: diff_diff/wooldridge.py:L1111-L1167, docs/methodology/REGISTRY.md:L1489-L1493, tests/test_methodology_wooldridge.py:L164-L186.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No untracked blocker. The conley deferral and the non-OLS vcov_type != "hc1" deferral are explicitly tracked in TODO.md, so those are properly mitigated for this PR.

Security

  • No findings.

Documentation/Tests

  • P3 Impact: the changed bootstrap code path for supported vcov_type="hc2_bm" fits is not positively covered. The new suite exercises analytical hc2_bm, one-way bootstrap rejections, and BM fail-closed behavior, but not a successful hc2_bm + n_bootstrap > 0 refit through the new full-dummy branch. That leaves the new coef_offset/full-dummy bootstrap plumbing exposed to regression. Concrete fix: add a positive bootstrap regression test for WooldridgeDiD(method="ols", vcov_type="hc2_bm", n_bootstrap>0) asserting finite overall_se, stable overall_att, and one rank-deficient fixture. References: diff_diff/wooldridge.py:L1230-L1276, tests/test_wooldridge.py:L1699-L2145.

igerber and others added 2 commits May 21, 2026 17:40
…rap test

P2 (Methodology, CI codex R1): for non-survey `classical` and `hc2` fits,
per-cell + aggregate user-facing inference flowed through
`safe_inference(df=None)` even though the registry text said SE matches
`summary(lm(...))$coefficients` / `sandwich::vcovHC`. R's lm() / coef_test()
use the t-distribution with residual DOF = n - rank(X), not normal-theory.
Threaded `df_one_way = X.shape[0] - n_kept_columns` through both per-cell
inference and `aggregate()` for the classical/hc2 paths. New
`_df_one_way` field on WooldridgeDiDResults makes it accessible to the
lazy aggregate. New R-parity tests
`test_classical_per_cell_inference_uses_residual_df` and
`test_hc2_per_cell_inference_uses_residual_df` pin recovered DOF to
`n - rank(X) = 189` on the existing 240-obs / 51-column fixture via CI
half-width inversion.

P3 (Tests, CI codex R1): no positive `hc2_bm + n_bootstrap > 0`
regression existed. The new full-dummy branch's bootstrap closure
(coef_offset=1 indexing under the rank-deficient kept-column logic)
was thus only exposed to the regression-via-failure direction (the
analytical path tests). Added two positive bootstrap tests:
`test_hc2_bm_plus_bootstrap_finite_inference` (full-rank panel; asserts
ATT bit-equality vs analytical fit, finite bootstrap SE, finite
event-study aggregation) and `test_hc2_bm_plus_bootstrap_rank_deficient`
(all-eventually-treated panel where solve_ols drops late-cohort
columns; locks that the bootstrap loop survives rank deficiency).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: c7107d7a2a445dc1cbff4957682f94ca12d6c753


Overall Assessment

✅ Looks good

Executive Summary

  • Methods affected: Wooldridge ETWFE OLS vcov_type routing for classical, hc1, hc2, and hc2_bm.
  • The substantive methodology changes look aligned with docs/methodology/REGISTRY.md: full-dummy routing for hat-matrix-dependent paths, residual-DF inference for one-way classical/hc2, reduced-design BM contrast DOFs, and fail-closed NaN inference when BM DOF is unavailable.
  • The prior lower-severity review items appear addressed: one-way non-survey inference is no longer normal-theory, and the supported hc2_bm + n_bootstrap > 0 path now has positive coverage.
  • I did not find an unmitigated P0/P1 issue in the changed estimator/results code.
  • One minor documentation inconsistency remains in the new registry text.
  • I could not execute the test suite in this shell because the project Python dependencies are unavailable, so this is a static review.

Methodology

  • P3 Severity: informational. Impact: the registry correctly says n_bootstrap > 0 with one-way vcov_type in {"hc2","classical"} is rejected, but the next note still describes a one-way + explicit-cluster bootstrap case as if it were part of the supported contract. This does not affect runtime behavior, but it leaves the documented methodology surface internally inconsistent for readers. Concrete fix: rewrite the bootstrap note so it is scoped only to supported bootstrap paths (hc1/hc2_bm), or explicitly state that the one-way case is rejected before bootstrap. References: docs/methodology/REGISTRY.md:L1496-L1498, diff_diff/wooldridge.py:L558-L578, tests/test_wooldridge.py:L1811-L1822.

Code Quality

  • No findings.

Performance

  • No findings.

Maintainability

  • No findings.

Tech Debt

  • No findings. The deferred conley and non-OLS vcov_type work is properly tracked in TODO.md:L102-L106, so those follow-ups are non-blocking for this PR.

Security

  • No findings.

Documentation/Tests

  • No findings from static review of the changed tests. The prior bootstrap coverage gap appears closed by tests/test_wooldridge.py:L1824-L1895, and the new R-parity suite materially improves methodology locking in tests/test_methodology_wooldridge.py:L75-L292.

CI codex R2 flagged that the bootstrap-cluster note described a
one-way + explicit-cluster bootstrap scenario as if it were part of the
supported contract, but the previous bullet rejects ``n_bootstrap > 0`` +
``vcov_type in {hc2, classical}`` regardless of cluster. Rewrote the
note to scope it to the supported paths (``hc1`` / ``hc2_bm``) and
explicitly reference the rejection bullet.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: 8d96f3e6f7b2e590561d8baf0212014fdf034b0d


Overall Assessment

✅ Looks good

Executive Summary

  • Re-review scope: the prior informational registry inconsistency is resolved; docs/methodology/REGISTRY.md:L1496-L1498 now matches the estimator-level one-way-bootstrap rejection in diff_diff/wooldridge.py:L558-L578.
  • Methods affected: Wooldridge ETWFE OLS vcov_type routing for classical, hc1, hc2, and hc2_bm, plus BM DOF threading, one-way residual-DF inference, bootstrap gating, and results metadata in diff_diff/wooldridge.py:L766-L1319 and diff_diff/wooldridge_results.py:L60-L323.
  • I did not find an unmitigated P0/P1. The changed behavior matches the documented methodology surface in docs/methodology/REGISTRY.md:L1489-L1498 and the estimator docstrings/references in diff_diff/wooldridge.py:L262-L320.
  • The edge cases that mattered in prior review are covered: fail-closed BM inference, reduced-design BM DOF under rank deficiency, survey + non-hc1 rejection, and one-way + bootstrap rejection (diff_diff/wooldridge.py:L986-L1319, tests/test_wooldridge.py:L1804-L2218).
  • The new R-parity suite materially strengthens methodology lock-in for CR2/BM and one-way classical/hc2 inference (tests/test_methodology_wooldridge.py:L75-L292).
  • Static review only: I could not execute the tests in this shell because numpy and pytest are unavailable.

Methodology

  • No findings. hc1 stays on the within-transform path; hc2_bm correctly routes to full-dummy CR2/BM with per-cell and contrast-specific DOFs; classical/hc2 use one-way full-dummy inference with residual n - rank(X) df (diff_diff/wooldridge.py:L823-L1250, diff_diff/wooldridge_results.py:L145-L323, docs/methodology/REGISTRY.md:L1489-L1498).

Code Quality

  • No findings. The PR uses safe_inference() rather than reintroducing inline inference, and vcov_type is propagated through __init__, get_params(), set_params(), solve/bootstrap paths, and results metadata (diff_diff/wooldridge.py:L323-L478, L912-L921, L1293-L1317).

Performance

  • No findings. The dense full-dummy route is a methodological requirement for hat-matrix-dependent vcov families, and the PR at least warns on very large designs (diff_diff/wooldridge.py:L831-L852).

Maintainability

  • No findings. The shared constructor validator and atomic set_params() reduce drift risk on the new public surface (diff_diff/wooldridge.py:L367-L478).

Tech Debt

  • Severity: P3
  • Impact: None. The deliberate deferrals for vcov_type="conley" and non-OLS vcov_type != "hc1" are explicitly tracked, so they are non-blocking.
  • Concrete fix: None required in this PR; follow-up work is already recorded in TODO.md:L102-L106.

Security

  • No findings.

Documentation/Tests

  • No findings from static review. The docs, changelog, guide text, and tests are internally consistent for the new OLS vcov surface (CHANGELOG.md:L10-L11, diff_diff/guides/llms-full.txt:L965-L973, docs/methodology/REGISTRY.md:L1489-L1498, tests/test_wooldridge.py:L1679-L2218, tests/test_methodology_wooldridge.py:L75-L292).
  • I could not run the suite here because the environment is missing numpy and pytest.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 21, 2026
@igerber igerber merged commit 0058693 into main May 22, 2026
33 of 34 checks passed
@igerber igerber deleted the wooldridge-vcov-type-phase1b-3 branch May 22, 2026 11:39
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant