wooldridge: thread vcov_type through OLS path (Phase 1b 3/8)#483
Merged
Conversation
…ath (Phase 1b 3/8) WooldridgeDiD now accepts `vcov_type` for the OLS path, mirroring the SunAbraham PR #472 / StackedDiD PR #479 pattern: - `hc1` (default) preserves bit-equal within-transform CR1 behavior - `hc2_bm` / `hc2` / `classical` auto-route to full-dummy saturated design (FWL doesn't preserve the hat matrix; HC2 leverage + BM DOF need the full FE projection). Matches `clubSandwich::vcovCR(lm(...), type="CR2") + coef_test()$df_Satt` at atol=1e-10 on the 6 R-parity tests in tests/test_methodology_wooldridge.py. - Bell-McCaffrey Satterthwaite DOF threaded into overall ATT inference via `_compute_cr2_bm_contrast_dof`; fail-closed (all-NaN) when DOF unavailable, per feedback_bm_contrast_dof_fail_closed. - One-way `hc2`/`classical` auto-drop the unit auto-cluster (one-way families don't compose with cluster_ids). Explicit `cluster="X"` + one-way raises at the linalg validator. - `method ∈ {logit, poisson}` + `vcov_type != "hc1"` rejected at `__init__` (GLM CR2-BM derivation deferred to follow-up TODO row). - `SurveyDesign` + `vcov_type != "hc1"` rejected at `fit()` (survey TSL overrides analytical sandwich). - `n_bootstrap > 0` + one-way + `cluster=None` rejected at `fit()` (bootstrap is intrinsically clustered). WooldridgeDiDResults gains `vcov_type`, `cluster_name`, `n_clusters` fields for downstream introspection. Third PR of the Phase 1b standalone-estimator threading initiative (5 PRs remaining). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…urfaces
Address codex R1 P1 findings: per-cell group_time_effects and
.aggregate("group"/"calendar"/"event") were using df=None
(normal-theory) on the hc2_bm path. Both surfaces now use BM contrast
DOFs to match the documented clubSandwich parity contract:
- Per-cell df_Satt computed in one batched _compute_cr2_bm_contrast_dof
call at fit time (one-hot contrasts per (g,t)); matches
coef_test()$df_Satt at atol=1e-6 (CI inversion).
- Overall ATT DOF computed in the same batched call (post-period
aggregation contrast).
- BM artifacts (X_full, cluster_ids, bread, coef_index_map) stashed on
WooldridgeDiDResults so .aggregate("group"/"calendar"/"event")
recomputes contrast-specific BM DOFs lazily.
- Fail-closed everywhere: NaN inference fields when BM DOF unavailable,
never silent normal-theory fallback (per
feedback_bm_contrast_dof_fail_closed).
New tests: per-cell df_Satt R-parity (1), aggregate(group/event/calendar)
BM DOF threading (3), aggregate fail-closed on helper failure (1),
extended existing fail-closed test to cover per-cell NaN propagation.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…parity
Address codex R2 findings:
P2 (Code Quality): Bootstrap + one-way analytical vcov_type was rejected
only under cluster=None. Tightened to reject the full Cartesian (cluster
None OR set) — under cluster=X the linalg validator would reject one-way
+ cluster_ids with a less-informative downstream message, so preempt at
the estimator boundary.
P3 (Maintainability): _fit_logit and _fit_poisson now thread vcov_type,
cluster_name, n_clusters into WooldridgeDiDResults (locked to vcov_type=
"hc1" by the __init__ guard on non-OLS + non-hc1, but the metadata
contract is now consistent across all three method paths).
P2 (Doc): REGISTRY hc1 note incorrectly claimed users could recover the
lm + CR1S SE via vcov_type="hc2_bm". hc2_bm is the CR2 Bell-McCaffrey
sandwich, not CR1S. Rewrote to state explicitly that no public
WooldridgeDiD path exposes lm + CR1S parity.
P2 (Tests): aggregate("group"/"calendar"/"event") under hc2_bm previously
only asserted finite p-values, which would have passed under the prior
normal-theory fallback bug. Added 3 R-parity tests that invert the CI
half-width to recover Python's BM contrast DOF and assert atol=1e-6
parity with R clubSandwich::Wald_test(test="HTZ")$df_denom. Extended
benchmarks/R/generate_wooldridge_golden.R to compute the per-key BM DOF
for each aggregation type.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nment P1 (Methodology, codex R3): hc2_bm BM contrast DOF computation operated on the unreduced full-dummy design (X / X.T @ X). When solve_ols dropped collinear columns under rank-deficient ETWFE specs (all-eventually- treated not_yet_treated, time-invariant exovar collinear with unit FE, etc.), _compute_cr2_bm_contrast_dof would LinAlgError on the singular bread and the fail-closed branch zeroed every per-cell + aggregate inference field to NaN — even for the still-identified cells that solve_ols correctly returned. Fix: subset X / bread / contrasts to the kept-column space (where solve_ols produced non-NaN coefs) before passing to _compute_cr2_bm_contrast_dof. Reduced artifacts (X_red, cluster_ids, bread_red, reduced_coef_idx_map) are stashed on the Results object so the lazy aggregate() path also operates in the reduced subspace. Mirrors the MPD pattern at estimators.py:1860-1913. Cells whose treatment-cell coefficient was dropped retain NaN inference fields (fail-closed correctly for them). P2 (Docs): CHANGELOG + REGISTRY bootstrap-reject prose was narrower than the actual implementation (cluster=None only); updated to reflect the broader reject regardless of cluster=. Two new regression tests in tests/test_wooldridge.py exercise the previously-broken rank-deficient hc2_bm paths: - All-eventually-treated panel with not_yet_treated control (late cohort cells dropped by solve_ols) - Unit-invariant exovar covariate (collinear with unit FE) Both assert per-cell + aggregate inference is finite on identified cells. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…e('calendar')
Address codex R4 P2: the rank-deficient hc2_bm regression tests
exercised group + event aggregates but not calendar. Extends both
regressions to also call aggregate('calendar') and assert finite BM
inference on surviving calendar contrasts — locks the reduced-space
BM DOF threading across all three non-simple aggregation surfaces.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…(R5) P1 (Code Quality, codex R5): set_params() mutated self BEFORE running validation. A rejected method × vcov_type combination left the estimator half-mutated: a caller that caught the exception could then run fit() on a partially-applied configuration (e.g., logit HC1 fit while self.vcov_type silently reads "hc2_bm"). Reworked set_params() to compute pending values on locals, validate the full overlay first, and only mutate self after validation succeeds. Mirrors the DifferenceInDifferences.set_params atomic pattern at estimators.py:995-1023. P2 (Maintainability, codex R5): under survey TSL, the analytical sandwich (including its cluster_ids) is replaced by the survey-design variance — but the WooldridgeDiDResults dataclass was still surfacing cluster_name='unit' / n_clusters=N_units from the analytical-path default, misrepresenting what was actually computed. Set both to None when survey_metadata is active; the survey-design stratification lives in survey_metadata. Applied across _fit_ols, _fit_logit, and _fit_poisson. New regression tests: - test_set_params_is_atomic_on_validation_failure (3 reject scenarios: method×vcov interaction, unknown vcov_type batch, unknown param key — each asserts get_params() unchanged after the rejection). - test_survey_design_clears_cluster_metadata (survey + hc1 fit on OLS path asserts cluster_name/n_clusters are None). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Address codex R6 P3: internal naming in aggregate() and the _bm_artifacts docstring used the misleading X_full / bread_matrix labels even though the stored artifacts are the REDUCED kept-column design (post rank-deficient drops). Renamed locals to X_red / bread_red and expanded the dataclass field docstring to spell out that the artifacts are reduced — using the singular full-design bread is exactly the failure mode the rank-deficient threading was introduced to avoid. REGISTRY entry updated to match. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 #1 (Maintainability, codex R7): bootstrap comment in _fit_ols said "Always clusters at the unit level" but the implementation uses ``self.cluster if self.cluster else unit``. Updated the comment to match the actual behavior (and the registry wording). P3 #2 (Docs/Tests, codex R7): R golden-generator header claimed hc1 "matches CR1S" and described classical as "heteroskedasticity-only". Both contradicted the registry deviation note + the methodology test header. Rewrote: hc1 is REFERENCE-ONLY (not pinned at parity due to the (n-1)/(n-k) deviation); classical is homoskedastic OLS SE. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
P3 (Maintainability, codex R8): the R script computed an unused overall_dof_hc2_bm via constrain_equal(int_idx) before the actual scalar-contrast Wald_test path that produces overall_att_contrast_dof. Removed the dead block so the generator has a single authoritative path for the overall ATT BM DOF, matching what Python computes. P3 (Doc, codex R8): the R script inline hc1 comment still implied CR1S "matches diff-diff's hc1+cluster"; expanded to mark hc1 output as reference-only with a pointer to the registry deviation note. P3 (CHANGELOG, codex R8): Wooldridge entry still referenced X_full as the stored BM artifact even though the implementation now stores the REDUCED kept-column design (X_red). Updated CHANGELOG to match the implementation and the dataclass docstring. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
…phase1b-3 # Conflicts: # CHANGELOG.md
…rap test P2 (Methodology, CI codex R1): for non-survey `classical` and `hc2` fits, per-cell + aggregate user-facing inference flowed through `safe_inference(df=None)` even though the registry text said SE matches `summary(lm(...))$coefficients` / `sandwich::vcovHC`. R's lm() / coef_test() use the t-distribution with residual DOF = n - rank(X), not normal-theory. Threaded `df_one_way = X.shape[0] - n_kept_columns` through both per-cell inference and `aggregate()` for the classical/hc2 paths. New `_df_one_way` field on WooldridgeDiDResults makes it accessible to the lazy aggregate. New R-parity tests `test_classical_per_cell_inference_uses_residual_df` and `test_hc2_per_cell_inference_uses_residual_df` pin recovered DOF to `n - rank(X) = 189` on the existing 240-obs / 51-column fixture via CI half-width inversion. P3 (Tests, CI codex R1): no positive `hc2_bm + n_bootstrap > 0` regression existed. The new full-dummy branch's bootstrap closure (coef_offset=1 indexing under the rank-deficient kept-column logic) was thus only exposed to the regression-via-failure direction (the analytical path tests). Added two positive bootstrap tests: `test_hc2_bm_plus_bootstrap_finite_inference` (full-rank panel; asserts ATT bit-equality vs analytical fit, finite bootstrap SE, finite event-study aggregation) and `test_hc2_bm_plus_bootstrap_rank_deficient` (all-eventually-treated panel where solve_ols drops late-cohort columns; locks that the bootstrap loop survives rank deficiency). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
CI codex R2 flagged that the bootstrap-cluster note described a
one-way + explicit-cluster bootstrap scenario as if it were part of the
supported contract, but the previous bullet rejects ``n_bootstrap > 0`` +
``vcov_type in {hc2, classical}`` regardless of cluster. Rewrote the
note to scope it to the supported paths (``hc1`` / ``hc2_bm``) and
explicitly reference the rejection bullet.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
WooldridgeDiD(vcov_type=...)now accepts{"classical","hc1","hc2","hc2_bm"}on the OLS path, mirroring the SunAbraham PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472 / StackedDiD PR Thread vcov_type through StackedDiD (Phase 1b 2/8) #479 pattern.hc1(default) preserves bit-equal prior within-transform CR1 behavior;hc2_bm/hc2/classicalauto-route to a full-dummy saturated design where the hat matrix matters.hc2_bmend-to-end: per-coefficient SE matchesclubSandwich::vcovCR(lm(...), type="CR2")atatol=1e-10. Bell-McCaffrey Satterthwaite DOF is threaded across ALL user-facing inference surfaces — per-cellgroup_time_effects[(g,t)]usecoef_test()$df_Satt(matches R atatol=1e-6), overall ATT +.aggregate("group"|"calendar"|"event")use contrast-specific BM DOFs from_compute_cr2_bm_contrast_dof(matchesWald_test(test="HTZ")$df_denom). Aggregate DOFs are recomputed lazily from BM artifacts stashed on the Results object (REDUCED kept-column design — survives rank-deficient drops).classical/hc2auto-drop the unit auto-cluster (one-way families don't compose withcluster_ids); explicitcluster="X"+ one-way raises at the linalg validator.method ∈ {logit, poisson}+vcov_type != "hc1"(GLM CR2-BM-on-pseudo-residuals deferred),survey_design+vcov_type != "hc1"(survey TSL overrides analytical),n_bootstrap > 0+ one-wayvcov_type(intrinsically clustered bootstrap doesn't compose).set_params()is now atomic: rejected combinations leave the estimator unchanged.WooldridgeDiDResultsgainsvcov_type,cluster_name,n_clusters(None under survey TSL since the analytical sandwich is replaced).Methodology references (required if estimator / math changes)
lm + clubSandwich::vcovCR(type="CR1S")— diff-diff'ssolve_ols(n-1)/(n-k_within)correction counts only the demeaned columns while clubSandwich's(n-1)/(n-p)counts all FE; on small panels (n ~ 240, k_total ~ 50) the gap is ~11%, on typical larger panels (n >> k_total) <2%. Same deviation pattern as SunAbraham PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472 vsfixest::sunab. Documented indocs/methodology/REGISTRY.md"Variance families" → "Deviation from R"; no public WooldridgeDiD path exposeslm + CR1Sparity. The natural R anchor for diff-diff's hc1 isfixest::feols(..., cluster=~unit)or Statajwdid(both within-transform).Validation
tests/test_wooldridge.py(28 newTestWooldridgeVcovTypetests including bit-equal hc1 regression, 2 rank-deficient hc2_bm regressions, atomicset_params, survey cluster metadata cleanup, fail-closed BM DOF),tests/test_methodology_wooldridge.py(new file: 10 R-parity tests covering per-coef SE, per-celldf_Satt, overall ATT BM contrast DOF, group/calendar/event aggregate BM DOFs, classical/hc2 againstlm+sandwich::vcovHC)benchmarks/R/generate_wooldridge_golden.Ron the fixed-seedbenchmarks/data/wooldridge_test_panel.csv(240 obs, 40 units, 6 periods, 3 cohorts, heterogeneous treatment effects perfeedback_homogeneous_dgp_no_twfe_bias)tests/test_wooldridge.pybaseline still passes unchanged; 148 total WooldridgeDiD tests passSecurity / privacy
Generated with Claude Code