Thread vcov_type through StackedDiD (Phase 1b 2/8) by igerber · Pull Request #479 · igerber/diff-diff

igerber · 2026-05-21T13:10:26Z

Summary

Threads vcov_type ∈ {"hc1","hc2_bm"} through StackedDiD (Phase 1b 2/8 of the standalone-estimator threading initiative). HC1 default preserves prior behavior; hc2_bm routes CR2 Bell-McCaffrey through the clubSandwich WLS-CR2 port merged in PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475.
classical/hc2 rejected at __init__ (StackedDiD is intrinsically clustered, no cluster=None opt-out). conley rejected with a deferral message.
Bell-McCaffrey contrast DOF threaded into both event-study (event_study_effects[h]) and overall (overall_p_value/overall_conf_int) aggregated inference via _compute_cr2_bm_contrast_dof; mirrors the SunAbraham aggregated-inference pattern from PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472.
Rank-deficient design handling: BM contrast DOF computed on the identified-column reduced design (mirrors MPD's PR Lift Gate 6: cluster-aware CR2 Bell-McCaffrey contrast DOF for MultiPeriodDiD avg_att #465 pattern at estimators.py:1860-1913).
Fail-closed inference when BM DOF unavailable: emits all-NaN inference fields rather than silent normal-theory fallback (mirrors LinearRegression.get_inference() from PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475 R7).
survey_design= combined with vcov_type != "hc1" raises NotImplementedError. Reject order locked: fweight/aweight check fires first (Q-weight semantics), then vcov check.
set_params(vcov_type=...) re-validates via shared _validate_vcov_type helper (no bypass of __init__ guard).
6 R-parity tests pinning event-study SE + overall_se + per-event BM DOF + overall BM DOF at atol=1e-10 against clubSandwich::vcovCR(...) + coef_test()$df_Satt + Wald_test(test="HTZ")$df_denom on 4 panel variants: default-aggregate, heterogeneous-population, sample_share, anticipation=1.

Methodology references

Wing, Freedman & Hollingsworth (2024) "Stacked Difference-in-Differences" (NBER WP 32054): the core estimator + Q-weight formulas (aggregate / population / sample_share).
clubSandwich (Pustejovsky 2024) R package ≥ 0.7.0: the WLS-CR2 algebra is inherited from the Phase 1a port (PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475 / REGISTRY Phase 1a hc2_bm + weights row); no new methodology choice is introduced in this PR.
Pustejovsky & Tipton (2018) JBES + Imbens & Kolesar (2016) ReStat + Bell & McCaffrey (2002): foundational small-sample CR2 / Bell-McCaffrey derivations.
Intentional deviation: HC1 + cluster matches clubSandwich::vcovCR(type="CR1S") (Stata-style G/(G-1) * (n-1)/(n-p) correction) — NOT plain CR1 which omits the (n-1)/(n-p) factor and diverges by ~1.4% on the test fixture. Documented in REGISTRY.

Validation

Tests added/updated: tests/test_methodology_stacked_did.py (NEW, 12 tests covering R parity on 4 variants × hc1/hc2_bm surfaces + DOF parity); tests/test_stacked_did.py extended with TestStackedDiDVcovType (29 tests: defaults + rejects + clone idempotency + replicate-refit drift lock + rank-deficient handling + fail-closed inference + parameter-interaction smoke tests).
89 tests pass post-rebase.
R goldens regenerated via benchmarks/R/generate_stacked_did_golden.R (clubSandwich v0.7.0). 4 committed pre-stacked CSV panels.
Local AI review iterated 10 rounds before clean (R1 P0 BM DOF propagation → R2 P1 rank-deficient → R3 P1 fail-closed → R4-R6 polish → R7-R9 parameter-coverage iteration → R10 clean).

Security / privacy

Confirm no secrets/PII in this PR: Yes (canonical secret scan clean)

Test plan

Verify R-parity tests pass: pytest tests/test_methodology_stacked_did.py
Verify unit/regression tests pass: pytest tests/test_stacked_did.py
Verify no SA / TWFE / wls_cr2 regressions: pytest tests/test_methodology_sun_abraham.py tests/test_methodology_twfe.py tests/test_methodology_wls_cr2.py
CI codex review (local convergence is necessary but not sufficient per feedback_local_codex_vs_ci_codex_divergence)

🤖 Generated with Claude Code

Adds vcov_type ∈ {"classical","hc1","hc2","hc2_bm"} to StackedDiD with clubSandwich R parity at atol=1e-10 on the hc2_bm path. Defaults to hc1 (prior behavior at machine precision). Source edits: - __init__ accepts vcov_type=, rejects classical/hc2 with cluster- incompatibility ValueError (StackedDiD clusters intrinsically at unit or unit_subexp; one-way families don't compose with cluster_ids per the linalg validator), rejects conley with a deferral message. - fit() rejects survey_design + vcov_type != "hc1" with NotImplementedError (survey TSL/replicate-refit overrides analytical sandwich; SA precedent). Reject order locked: fweight/aweight check fires first, then vcov check (pinned by test_aweight_plus_hc2_bm_*). - Main solve_ols call at :419: switched from bake-Q-into-X (X_t = X*sqrt(Q)) to explicit `solve_ols(X, Y, weights=composed_weights, vcov_type=self.vcov_type)`. solve_ols internally bakes Q for the coef solve AND back-transforms for vcov on original-scale data via clubSandwich's WLS-CR2 algebra for hc2_bm (PR #475). - _refit_stacked closure at :444: mirror switch (cosmetic per return_vcov=False but grep-consistency). - StackedDiDResults gains vcov_type field; get_params() includes it. - llms-full.txt agent-facing entry updated. R parity: - benchmarks/data/stacked_did_test_panel.csv (NEW) — pre-stacked panel generated by extracting StackedDiD's internal stacked_data via the fixed-seed Python fixture (n_units=50, n_periods=8, cohorts=[3,5,7], seed=20260521; 325 rows after stacking). - benchmarks/R/generate_stacked_did_golden.R (NEW) — loads CSV, fits lm(weights=Q,...) with the same event-study design as Python, computes CR1S (Stata-style) + CR2-BM SE + BM df_Satt at cluster=unit AND cluster=unit_subexp via clubSandwich >= 0.7.0. - benchmarks/data/stacked_did_golden.json (NEW) — committed goldens. Tests: - tests/test_stacked_did.py::TestStackedDiDVcovType (19 tests): default + bit-equality + reject paths + survey precedence + get_params/set_params + result-class field + clone idempotency + replicate-refit smoke + reject-order regression + unit_subexp + survey symmetric. - tests/test_methodology_stacked_did.py (4 tests, NEW): hc1 vs CR1S, hc2_bm vs CR2 (unit), BM DOF vs coef_test()$df_Satt (via t-dist inversion of CI half-width), hc2_bm vs CR2 (unit_subexp). - Tolerance note: hc1 vs prior bake-Q-into-X is bit-equal up to ~2 ULPs at SE scale (multiplication ordering in solve_ols(weights=) vs prior user-side bake-w); pinned by test_hc1_se_bit_equal_to_pre_pr_baseline at atol=1e-13. Docs: - REGISTRY.md StackedDiD section: new "Variance families" subsection documenting hc1/hc2_bm routing + rejects, with **Note:** deferring methodology framing to Phase 1a clubSandwich port (PR #475) — no new methodology synthesis introduced in this PR. - CHANGELOG.md: Unreleased Added bullet at top. - TODO.md: Phase 1b row updated to track remaining 6 estimators; new StackedDiD conley follow-up row mirroring SA precedent. 234 tests pass across stacked + methodology_stacked_did + methodology_sun_abraham + methodology_twfe + methodology_wls_cr2 + estimators_vcov_type. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

… (R1 P0) Local codex R1 caught a P0: StackedDiD(vcov_type="hc2_bm") computed CR2 vcov correctly but never propagated the Bell-McCaffrey Satterthwaite DOF into safe_inference() calls. event_study_effects[h]['p_value']/['conf_int'] and overall_p_value/overall_conf_int silently fell back to normal-theory inference (df=None ⇒ scipy.norm), contradicting the registry contract. Fix mirrors the SunAbraham aggregated-inference pattern from PR #472 (sun_abraham.py:997-1097). After solve_ols(), if vcov_type=="hc2_bm" and not on the survey replicate-refit path, build contrast matrix: - Per-event-time: unit vector at each interaction_indices[h] - Overall ATT: 1/K average across post-period interaction columns Call _compute_cr2_bm_contrast_dof(X, cluster_ids, bread, contrasts, weights=composed_weights). Apply per-event-time DOFs to event_study_effects inference, overall DOF to overall_* inference. Wrap in try/except so any rank-deficient or linalg failure emits a UserWarning and falls back to normal-theory (visible deviation, not silent). R fixture extended with the post-period-average ATT contrast DOF via Wald_test(constraints=row_avg, vcov=CR2, test="HTZ")$df_denom (mirrors PR #465's MPD avg_att approach). New goldens at both cluster=unit and cluster=unit_subexp. Test additions / strengthening (addresses R1 P2 + P3): - test_hc2_bm_per_event_dof_matches_coef_test_df_satt_unit_cluster: uses brentq inversion of CI half-width to recover the DOF safe_inference actually used. If propagation failed (the R1 P0 bug), the inversion raises ValueError → test FAILS instead of silently skipping (replaces the prior `continue`-on-failure pattern which could vacuously pass). Hard-asserts validated_count == len(event_times). - test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_cluster (NEW): pins overall ATT DOF at atol=1e-6 against R Wald_test(HTZ)$df_denom. - test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_subexp_cluster (NEW): symmetric coverage at alternate cluster level. - Renamed CR1 → CR1S throughout docs/tests/REGISTRY for consistency (diff-diff's HC1+cluster uses Stata-style G/(G-1)*(n-1)/(n-p); plain CR1 omits the (n-1)/(n-p) term and diverges by ~1.4%). 192 tests pass (74 stacked + 19 wls_cr2 + 47 SA + 52 estimators_vcov_type). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Local codex R2 caught a P1: on rank-deficient stacked designs, the BM contrast DOF block called `_compute_cr2_bm_contrast_dof()` on the singular full bread matrix, which raised LinAlgError and triggered the catch-and- fallback path — downgrading the aggregated inference to normal-theory even when the target event-time delta_h coefficients were still identified. Fix mirrors the MultiPeriodDiD rank-deficient pattern (PR #465, estimators.py:1860-1913): - Derive `_identified = ~np.isnan(coef)` (solve_ols emits NaN for dropped columns under R-style rank handling). - Subset X, bread, and contrasts to the identified-column block BEFORE calling `_compute_cr2_bm_contrast_dof`. - Only build per-event-time contrasts for event-times whose target delta_h column is identified; only build the overall ATT contrast when ALL post-period delta_h are identified (otherwise the contrast is undefined). - The remaining try/except is a defensive guard for genuine singularities on the identified-column design (rare; the rank-deficient handling already removes the problematic columns). The earlier behavior was too aggressive: a single dropped nuisance column would knock out BM DOF for ALL contrasts. The new behavior NaN-guards only contrasts that target a dropped column. New regression test `test_hc2_bm_rank_deficient_design_keeps_bm_dof_*` — verifies the fit doesn't emit fallback warnings on a well-conditioned design + checks that CI half-widths use t(BM DOF) instead of z=1.96 (catches the R1/R2 normal-fallback failure mode end-to-end). 75 tests pass across stacked + methodology_stacked_did. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Local codex R3 caught a P1 on the same surface: when BM contrast DOF was unavailable (helper raised OR noise-floor NaN guard fired), StackedDiD either fell back to normal-theory inference (silent wrong CIs/p-values under the hc2_bm contract) or passed NaN df through safe_inference (which guards df<=0 but NOT NaN, producing finite t_stat with NaN p/CI — mixed inconsistent inference fields). Both fail-open patterns contradict the registry contract that hc2_bm aggregated inference uses BM DOF. Fix mirrors the LinearRegression.get_inference() pattern from PR #475 R7 (linalg.py:3689-3706): on the hc2_bm path, when the per-contrast BM DOF is None or non-finite, emit ALL-NaN inference fields for that contrast rather than falling back to safe_inference(df=None or NaN). Effect and SE remain finite — only the inference fields downstream of the DOF are suppressed. Applied at both inference sites: - Event-study per-event-time inference (event_study_effects[h]): inline check before safe_inference call. - Overall ATT inference (overall_t/p/conf_int): same guard. New regression tests: - test_hc2_bm_nan_dof_fails_closed_with_all_nan_inference: monkeypatches `_compute_cr2_bm_contrast_dof` to return a NaN-only vector and asserts ALL event-study + overall t_stat/p_value/conf_int are NaN while effect and se remain finite. - test_hc2_bm_helper_raises_fails_closed_with_all_nan_inference: monkeypatches the helper to raise LinAlgError and asserts the fallback path NaN-closes inference (does NOT use normal-theory). 77 tests pass. Pattern is consistent with the analytical-surface fail-closed contract added in PR #475 R7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…drift lock) R4 verdict was ✅ with 3 polish items: P2: helper-exception warning text was stale (said "Falling back to normal distribution" but the code now NaN-closes per the R3 P1 fix). Rewrote the warning to accurately describe the fail-closed contract: "Aggregated p-values, t-statistics, and confidence intervals will be returned as NaN to preserve the hc2_bm contract". P3: set_params(vcov_type=...) bypassed the __init__ validation guard — sklearn-style mutation could set arbitrary values and fail later in the linalg layer with a different message. Fix: factored the vcov_type validation into `StackedDiD._validate_vcov_type(staticmethod)` and call it from BOTH __init__ AND set_params. New test `test_set_params_revalidates_vcov_type` exercises all 4 reject paths. P3: test_replicate_refit_coef_bit_equal_vs_bake_w_baseline was described as a "drift lock" but only checked finiteness — wouldn't catch numeric drift. Captured baseline overall_att=2.0807833161293945 + overall_se=0.14697256517699428 on the post-PR branch and asserted with atol=1e-13 (coef) / 1e-10 (SE, compounded through replicate-refit variance aggregation). Now the test actually catches drift. 78 tests pass (4 set_params reject paths + strengthened drift lock). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eficient test) R5 verdict was ✅ with 2 P3 polish items: P3 (Maintainability): the new solve_ols(weights=...) call sites relied on the default weight_type="pweight". Locked the WLS-CR2 weight convention locally by passing weight_type="pweight" explicitly in BOTH solve_ols calls (main fit + _refit_stacked replicate closure). Prevents silent variance-contract drift if solve_ols's default ever changes. P3 (Docs/Tests): test_hc2_bm_rank_deficient_design_keeps_bm_dof_on_ identified_contrasts didn't actually create a rank-deficient design — it only verified the t-distribution was used on a well-conditioned fixture. Two-part fix: 1. Renamed the existing test to test_hc2_bm_uses_t_distribution_not_normal_on_well_conditioned_design (honest description: it's an end-to-end smoke check that BM DOF threading reaches safe_inference). 2. NEW test test_hc2_bm_rank_deficient_keeps_bm_dof_on_identified_ contrasts: monkeypatches solve_ols to inject NaN at the intercept column, simulating rank deficiency. Exercises the reduced-design code path at stacked_did.py:529-577 added in the R2 P1 fix. Verifies: (i) no fallback warning (reduced-design helper succeeds); (ii) identified delta_h coefficients keep finite t-distribution inference (CI half-width > z_0.975 * SE); (iii) overall ATT also keeps t-distribution. 79 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R6 verdict was ✅ with 2 P3 test-coverage polish items: P3 (test): prior R fixture didn't pin the overall delta-method SE for the post-period-average ATT contrast — only BM DOF. Post-period covariance regressions could slip through. Fix: added `se_overall_cr1` and `se_overall_cr2` to the R fixture (computed as `c @ vcov @ c.T` for the post-period 1/K row contrast at both cluster levels). Added overall_se parity assertions at atol=1e-10 to the three existing hc1/hc2_bm parity tests. P3 (test): the Python parity test regenerated the panel via `generate_staggered_data(...)` instead of cross-checking against the committed CSV. Risk: future DGP-helper drift would silently invalidate R-parity assertions. Fix: added a DGP-drift check in the `panel` fixture that loads `benchmarks/data/stacked_did_test_panel.csv`, merges with the regenerated panel on (unit, period), and asserts the `outcome` column matches at atol=1e-12. Catches any drift in the DGP helper with a clear error message rather than wrong parity output. (first_treat column not checked because inf-vs-0 recoding differs between CSV/pandas-read and regenerated panel; outcome is what actually drives the R-side fit.) 79 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ning R7 verdict was ⚠️ Needs changes with one P1 + one P3: P1: BM-DOF logic was only pinned on default (anticipation=0, weighting='aggregate'). The BM-DOF code path uses anticipation in post-period selection (`h >= -self.anticipation`) and consumes the composed Q-weights directly — both surfaces were untested. Added 4 parameter-interaction tests: - test_anticipation_plus_hc2_bm_threads_bm_dof: anticipation=1, asserts shifted reference period (e=-2 has SE=0), overall ATT contrast uses t(BM DOF) (CI half-width > z_0.975 * SE), at least one event-study CI uses t-distribution. - test_population_weighting_plus_hc2_bm_finite_threads_bm_dof: weighting='population' (different Q-weight formula), verifies finite output + BM DOF propagation via t-distribution check. - test_sample_share_weighting_plus_hc2_bm_finite_threads_bm_dof: weighting='sample_share' (third Q-weight formula), symmetric coverage. - test_hc1_vs_hc2_bm_differ_under_anticipation: sanity check that hc1 vs hc2_bm SEs actually differ under anticipation=1 (proves the leverage/DOF adjustment fires under non-default params). These tests don't require R goldens (no methodology change vs default case; just verifies the BM-DOF code path remains active under the parameter variants). R parity on the default path is already strong. P3: fixture-independence concern from R6 was only partially closed. Now also pins the `first_treat` column equality after normalizing inf/0 between CSV (pandas reads inf for never-treated) and regenerated (0 for never-treated in some pandas versions). Catches cohort-assignment drift independently of outcome drift. 83 tests pass (+4 new R7 tests). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…ion') R8 verdict was ⚠️ still Needs changes: R7's parameter-interaction tests were smoke checks (population test used constant pop=1.0 which is numerically degenerate with aggregate weighting on a balanced panel); no R goldens for non-default variants. Closes the gap with numerical parity at atol=1e-10. Source-tree additions: - benchmarks/data/stacked_did_anticipation1_panel.csv (NEW): pre-stacked fixture with anticipation=1, generated from same base panel via StackedDiD(anticipation=1).fit(). Event times [-3,-2,-1,0,1,2] with ref=-2 (shifted by anticipation). - benchmarks/data/stacked_did_population_panel.csv (NEW): pre-stacked fixture with weighting='population' AND heterogeneous unit populations (rng.uniform(1,100), seed=99). Q-weights range [0.68, 1.54] vs uniform ~1.0 under aggregate weighting; forces the population-formula branch. - benchmarks/R/generate_stacked_did_golden.R: refactored to loop over 3 variants (default, population, anticipation1), producing variant-keyed entries in `stacked_did_golden.json`. Post-period filter now generalized to `h >= -anticipation` (matches Python's `_post_event_times_preview` at stacked_did.py:546-548). - benchmarks/data/stacked_did_golden.json: regenerated with all 3 variants. Test additions (tests/test_methodology_stacked_did.py): - new class TestStackedDiDParityRVariants (4 tests): - test_population_weighting_hc1_matches_clubsandwich_cr1s - test_population_weighting_hc2_bm_matches_clubsandwich_cr2 - test_anticipation1_hc1_matches_clubsandwich_cr1s - test_anticipation1_hc2_bm_matches_clubsandwich_cr2 All assert event-study SE AND overall_se at atol=1e-10. - new fixture `population_panel` reproduces the heterogeneous-population seed (np.random.default_rng(99)) used by the R variant CSV generator. 87 tests pass (+4 variant parity). Note: weighting='sample_share' parity not added because its Q-weight formula structurally mirrors population weighting (both differ from aggregate only in the Q-weight calculation, which feeds into the same solve_ols(weights=) path). The population test exercises that code path; sample_share remains covered by the smoke test from R7. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

R9 verdict was ⚠️ still on parameter-coverage P1: sample_share had no R-parity fixture, and population/anticipation1 tests asserted SE only (not BM DOF). Source-tree additions: - benchmarks/data/stacked_did_sample_share_panel.csv (NEW): pre-stacked fixture for weighting='sample_share'. - benchmarks/R/generate_stacked_did_golden.R: added sample_share variant to the variants list; loops over all 4 variants now. - benchmarks/data/stacked_did_golden.json: regenerated with sample_share. Test additions (tests/test_methodology_stacked_did.py): - test_sample_share_weighting_hc1_matches_clubsandwich_cr1s - test_sample_share_weighting_hc2_bm_matches_clubsandwich_cr2 (+ DOF) - Extended test_population_weighting_hc2_bm_matches_clubsandwich_cr2: added per-event-time + overall BM DOF parity via CI-width inversion. - Extended test_anticipation1_hc2_bm_matches_clubsandwich_cr2: same DOF parity additions. 89 tests pass. All 3 non-default variants now have full R parity: event-study SE + overall_se + per-event BM DOF + overall BM DOF. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T13:17:46Z

Overall assessment

✅ Looks good

Executive summary

No unmitigated methodology or correctness issues stood out in the StackedDiD vcov_type threading. The new hc2_bm route and the CR1S-vs-CR1 deviation are explicitly documented in docs/methodology/REGISTRY.md:L1367-L1374, so I do not treat those as defects.
The new Bell-McCaffrey DOF path in diff_diff/stacked_did.py:L520-L610 and the fail-closed NaN gating in diff_diff/stacked_did.py:L730-L807 correctly avoid the silent normal-theory fallback that would have been a P0/P1 issue.
Parameter propagation looks complete through constructor validation, fit-time branching, survey/replicate guards, get_params() / set_params(), and the results object.
There is one moderate user-surface gap: StackedDiDResults.summary() still does not reveal which variance family was used, even though vcov_type is now a first-class result field.
There is one minor performance inefficiency: overall-only hc2_bm fits still compute per-event contrast DOFs that are never surfaced.
I could not execute the new tests in this environment because pytest and numpy are not installed here.

Methodology

No unmitigated findings. Cross-checking the change against docs/methodology/REGISTRY.md:L1367-L1374 and the implementation in diff_diff/stacked_did.py:L500-L807, the PR stays within documented methodology: hc1 remains the existing CR1S-style clustered path, hc2_bm is explicitly documented as the clubSandwich-based CR2 Bell-McCaffrey route, and the new BM-DOF fail-closed behavior NaN-closes inference instead of emitting misleading normal-theory output.

Code Quality

No unmitigated findings. The new parameter is validated at construction and re-validation time (diff_diff/stacked_did.py:L203-L229, diff_diff/stacked_did.py:L1197-L1212), threaded into the main solve path (diff_diff/stacked_did.py:L508-L516), threaded through survey/replicate branches (diff_diff/stacked_did.py:L622-L651), and persisted on the results object (diff_diff/stacked_did.py:L809-L833).

Performance

Severity: P3. Impact: on vcov_type="hc2_bm" fits with aggregate=None / "simple", the code still builds per-event Bell-McCaffrey contrast DOFs even though only the overall contrast is user-visible, so the default overall-only path does avoidable CR2 work. See diff_diff/stacked_did.py:L530-L597. Concrete fix: gate es_cols_full construction on aggregate=="event_study" and compute only the overall contrast DOF for simple fits.

Maintainability

Severity: P2. Impact: the PR adds vcov_type to StackedDiDResults, but the primary textual output still omits the variance family, so summary() / print_summary() cannot distinguish HC1 from HC2-BM despite those now being materially different user-facing inference modes. See diff_diff/stacked_did_results.py:L97-L100, diff_diff/stacked_did_results.py:L144-L210; other result classes already print a variance label via diff_diff/results.py:L213-L235. Concrete fix: add a Variance: line to StackedDiDResults.summary() using the shared label helper pattern, suppressing it when survey_metadata is present.

Tech Debt

Severity: P3 informational. Impact: the remaining vcov_type="conley" gap is properly tracked rather than silently deferred, so it is not a blocker under the project rules. See TODO.md:L102-L104. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

No unmitigated documentation/test findings in the PR itself. The added coverage is broad across hc1 / hc2_bm, survey rejection, replicate-refit, rank deficiency, fail-closed BM inference, weighting variants, and anticipation=1.

Verification note: I could not run tests/test_stacked_did.py or tests/test_methodology_stacked_did.py here because this review environment does not have pytest or numpy installed.

CI codex R1 verdict was ✅ with 2 polish items: P2 (Maintainability): StackedDiDResults.summary() didn't surface the analytical variance family (vcov_type), even though hc1 vs hc2_bm are now materially different user-facing inference modes. Other result classes (DiD/MPD/TWFE) already print a Variance: line via the shared `_format_vcov_label` helper at results.py:49-89. Fix: added a Variance: line to StackedDiDResults.summary() using the shared helper, suppressed under survey_metadata (the survey block already surfaces inference source). My earlier plan note ("mirror SA: don't add Variance: row") was based on a wrong assumption — SA-results uses the same shared helper, and StackedDiD inherits the same precedent. P3 (Performance): on `vcov_type="hc2_bm"` fits with `aggregate=None` or `"simple"`, the code built per-event Bell-McCaffrey contrast DOFs that were never surfaced (only overall ATT is exposed). Avoidable CR2 cost on every overall-only hc2_bm fit. Fix: gated the per-event contrast construction (es_keys, es_cols_full) on `aggregate == "event_study"`. The overall ATT contrast still gets computed unconditionally (it's the user-facing scalar). 89 tests still pass — no behavioral regression on the event-study path; overall-only fits are now slightly faster. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T13:24:45Z

🔁 AI review rerun (requested by @igerber)

Head SHA: f0aa06096b9b56ad00112e52cf8cd707c2fa4cdb

Overall assessment

✅ Looks good

Executive summary

No unmitigated P0/P1 methodology or correctness issues stood out in the StackedDiD vcov_type threading. The hc1/hc2_bm routes, BM-DOF fail-closed behavior, survey-design rejection, and Conley deferral match docs/methodology/REGISTRY.md:L1367-L1374.
The previous performance nit appears resolved: per-event Bell-McCaffrey contrast DOFs are now only built on the aggregate=="event_study" path.
The previous summary-output gap is only partially resolved. StackedDiDResults.summary() now prints a Variance: line, but it mislabels clustered StackedDiD fits as one-way HC/HC2 because cluster metadata is not carried onto the results object.
Parameter propagation looks complete through constructor validation, fit(), replicate-refit, get_params() / set_params(), and StackedDiDResults.vcov_type.
I could not execute the targeted tests here because this environment does not have pytest, numpy, pandas, or scipy installed.

Methodology

Severity: P2. Impact: the new summary surface still reports the wrong inference family. StackedDiDResults.summary() calls _format_vcov_label(..., cluster_name=None), which renders hc1 as "HC1 heteroskedasticity-robust" and hc2_bm as "HC2 + Bell-McCaffrey DOF (one-way, ...)", even though StackedDiD is intrinsically cluster-robust at unit or unit_subexp. That makes the prior summary fix only partially correct and can mislead users about the variance method actually used. diff_diff/stacked_did_results.py:L179-L199, diff_diff/results.py:L49-L76, diff_diff/stacked_did.py:L493-L517, diff_diff/stacked_did.py:L813-L839
Concrete fix: persist the cluster setting, and ideally the realized cluster count, on StackedDiDResults; pass that metadata into _format_vcov_label; add a regression test that summary() shows clustered CR1 / CR2-BM labels for both cluster="unit" and cluster="unit_subexp".

Code Quality

No unmitigated findings. vcov_type propagation and revalidation look complete.

Performance

No unmitigated findings. The earlier overall-only CR2-BM extra-work issue appears addressed by gating per-event DOF construction to the event-study path in diff_diff/stacked_did.py:L553-L559.

Maintainability

No additional unmitigated findings beyond the summary-metadata issue above.

Tech Debt

Severity: P3 informational. Impact: vcov_type="conley" remains unavailable on StackedDiD, but the limitation is explicitly tracked in TODO.md and documented in the Methodology Registry, so it is not blocking. TODO.md:L102-L104, docs/methodology/REGISTRY.md:L1370-L1374
Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

No additional unmitigated findings in the PR content. I could not run tests/test_stacked_did.py or tests/test_methodology_stacked_did.py in this environment because pytest, numpy, pandas, and scipy are not installed.

R2 verdict was ✅ + 1 P2: the summary's Variance: line (added in R1) mislabelled clustered StackedDiD fits as one-way ("HC1 heteroskedasticity-robust" / "HC2 + Bell-McCaffrey DOF (one-way, n=N)") because I passed cluster_name=None to _format_vcov_label. StackedDiD is intrinsically clustered. Fix: - Added `cluster_name` and `n_clusters` fields to StackedDiDResults dataclass. - Populated both at fit() time: cluster_name=self.cluster, n_clusters= np.unique(cluster_ids).size (matches the actual realized count, not config). - summary() now passes cluster_name + n_clusters into _format_vcov_label, rendering "CR1 cluster-robust at unit, G=50" for hc1 and "CR2 Bell-McCaffrey cluster-robust at unit, G=50" for hc2_bm. New regression test test_summary_renders_clustered_variance_label verifies: - hc1 → label contains "CR1 cluster-robust at unit" - hc2_bm → label contains "CR2 Bell-McCaffrey cluster-robust at unit" - Neither contains "(one-way" (catches the R2 regression directly) - G= cluster count present - cluster="unit_subexp" → label contains "at unit_subexp" 90 tests pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-21T13:34:21Z

🔁 AI review rerun (requested by @igerber)

Head SHA: af10cc4718ba667f90ee538b4eab1026cf1f11e4

Overall Assessment

✅ Looks good

Executive Summary

No unmitigated P0/P1 issues found in the StackedDiD vcov_type threading.
The methodology surface now lines up across code and registry: hc1 default, hc2_bm routed through the existing weighted clubSandwich CR2 port, classical/hc2 rejected up front, and survey_design restricted to hc1. diff_diff/stacked_did.py:L202-L229, diff_diff/stacked_did.py:L330-L346, docs/methodology/REGISTRY.md:L1367-L1374
The previous review’s summary-label issue appears resolved: StackedDiDResults.summary() now carries cluster metadata and renders clustered CR1/CR2-BM labels instead of one-way HC labels. diff_diff/stacked_did.py:L813-L840, diff_diff/stacked_did_results.py:L186-L202, diff_diff/results.py:L49-L76
The new hc2_bm aggregated inference path is fail-closed and computes BM contrast DOF on the identified-column design, which addresses the earlier DOF propagation and rank-deficiency risks. diff_diff/stacked_did.py:L520-L619, diff_diff/stacked_did.py:L730-L810
Parameter propagation looks complete through constructor validation, get_params() / set_params(), results objects, and the survey replicate-refit closure. diff_diff/stacked_did.py:L185-L229, diff_diff/stacked_did.py:L637-L651, diff_diff/stacked_did.py:L1187-L1218
I could not execute the targeted tests here because this environment does not have numpy, pandas, scipy, or pytest.

Methodology

No unmitigated findings. The hc1/hc2_bm routing, survey precedence, and BM-DOF threading in diff_diff/stacked_did.py:L500-L840 match the StackedDiD registry entry at docs/methodology/REGISTRY.md:L1367-L1374.

Code Quality

No unmitigated findings.

Performance

No unmitigated findings. The extra per-event Bell-McCaffrey DOF work is gated to the aggregate=="event_study" path, so the simple/overall surface only computes the overall contrast DOF. diff_diff/stacked_did.py:L549-L581

Maintainability

No unmitigated findings.

Tech Debt

Severity: P3 informational. Impact: vcov_type="conley" remains deferred on StackedDiD, but the limitation is explicitly documented in the methodology registry and tracked in TODO.md, so it is not blocking. docs/methodology/REGISTRY.md:L1371-L1374, TODO.md:L102-L104
Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

No unmitigated findings by inspection. The new tests cover R parity, parameter interactions (population, sample_share, anticipation), rank-deficient BM-DOF handling, fail-closed inference, summary labeling, and set_params() revalidation. tests/test_methodology_stacked_did.py:L1-L669, tests/test_stacked_did.py:L927-L1717
I could not run tests/test_methodology_stacked_did.py or tests/test_stacked_did.py in this environment because numpy, pandas, scipy, and pytest are unavailable. I did verify that the touched Python files parse successfully with ast.parse.

igerber and others added 10 commits May 21, 2026 09:09

igerber added the ready-for-ci Triggers CI test workflows label May 21, 2026

igerber merged commit d1543b7 into main May 21, 2026
33 of 34 checks passed

igerber deleted the stacked-did-vcov-type-phase1b-2 branch May 21, 2026 17:08

igerber mentioned this pull request May 21, 2026

wooldridge: thread vcov_type through OLS path (Phase 1b 3/8) #483

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Thread vcov_type through StackedDiD (Phase 1b 2/8)#479

Thread vcov_type through StackedDiD (Phase 1b 2/8)#479
igerber merged 12 commits into
mainfrom
stacked-did-vcov-type-phase1b-2

igerber commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant

Conversation

igerber commented May 21, 2026

Summary

Methodology references

Validation

Security / privacy

Test plan

Uh oh!

github-actions Bot commented May 21, 2026

Overall assessment

Executive summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 21, 2026

Overall assessment

Executive summary

Methodology

Code Quality

Performance

Maintainability

Tech Debt

Security

Documentation/Tests

Uh oh!

github-actions Bot commented May 21, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

1 participant