Skip to content

Thread vcov_type through StackedDiD (Phase 1b 2/8)#479

Merged
igerber merged 12 commits into
mainfrom
stacked-did-vcov-type-phase1b-2
May 21, 2026
Merged

Thread vcov_type through StackedDiD (Phase 1b 2/8)#479
igerber merged 12 commits into
mainfrom
stacked-did-vcov-type-phase1b-2

Conversation

@igerber
Copy link
Copy Markdown
Owner

@igerber igerber commented May 21, 2026

Summary

  • Threads vcov_type ∈ {"hc1","hc2_bm"} through StackedDiD (Phase 1b 2/8 of the standalone-estimator threading initiative). HC1 default preserves prior behavior; hc2_bm routes CR2 Bell-McCaffrey through the clubSandwich WLS-CR2 port merged in PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475.
  • classical/hc2 rejected at __init__ (StackedDiD is intrinsically clustered, no cluster=None opt-out). conley rejected with a deferral message.
  • Bell-McCaffrey contrast DOF threaded into both event-study (event_study_effects[h]) and overall (overall_p_value/overall_conf_int) aggregated inference via _compute_cr2_bm_contrast_dof; mirrors the SunAbraham aggregated-inference pattern from PR Thread vcov_type through SunAbraham (Phase 1b 1/8) #472.
  • Rank-deficient design handling: BM contrast DOF computed on the identified-column reduced design (mirrors MPD's PR Lift Gate 6: cluster-aware CR2 Bell-McCaffrey contrast DOF for MultiPeriodDiD avg_att #465 pattern at estimators.py:1860-1913).
  • Fail-closed inference when BM DOF unavailable: emits all-NaN inference fields rather than silent normal-theory fallback (mirrors LinearRegression.get_inference() from PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475 R7).
  • survey_design= combined with vcov_type != "hc1" raises NotImplementedError. Reject order locked: fweight/aweight check fires first (Q-weight semantics), then vcov check.
  • set_params(vcov_type=...) re-validates via shared _validate_vcov_type helper (no bypass of __init__ guard).
  • 6 R-parity tests pinning event-study SE + overall_se + per-event BM DOF + overall BM DOF at atol=1e-10 against clubSandwich::vcovCR(...) + coef_test()$df_Satt + Wald_test(test="HTZ")$df_denom on 4 panel variants: default-aggregate, heterogeneous-population, sample_share, anticipation=1.

Methodology references

  • Wing, Freedman & Hollingsworth (2024) "Stacked Difference-in-Differences" (NBER WP 32054): the core estimator + Q-weight formulas (aggregate / population / sample_share).
  • clubSandwich (Pustejovsky 2024) R package ≥ 0.7.0: the WLS-CR2 algebra is inherited from the Phase 1a port (PR Lift hc2_bm + weights gates via clubSandwich WLS-CR2 port #475 / REGISTRY Phase 1a hc2_bm + weights row); no new methodology choice is introduced in this PR.
  • Pustejovsky & Tipton (2018) JBES + Imbens & Kolesar (2016) ReStat + Bell & McCaffrey (2002): foundational small-sample CR2 / Bell-McCaffrey derivations.
  • Intentional deviation: HC1 + cluster matches clubSandwich::vcovCR(type="CR1S") (Stata-style G/(G-1) * (n-1)/(n-p) correction) — NOT plain CR1 which omits the (n-1)/(n-p) factor and diverges by ~1.4% on the test fixture. Documented in REGISTRY.

Validation

  • Tests added/updated: tests/test_methodology_stacked_did.py (NEW, 12 tests covering R parity on 4 variants × hc1/hc2_bm surfaces + DOF parity); tests/test_stacked_did.py extended with TestStackedDiDVcovType (29 tests: defaults + rejects + clone idempotency + replicate-refit drift lock + rank-deficient handling + fail-closed inference + parameter-interaction smoke tests).
  • 89 tests pass post-rebase.
  • R goldens regenerated via benchmarks/R/generate_stacked_did_golden.R (clubSandwich v0.7.0). 4 committed pre-stacked CSV panels.
  • Local AI review iterated 10 rounds before clean (R1 P0 BM DOF propagation → R2 P1 rank-deficient → R3 P1 fail-closed → R4-R6 polish → R7-R9 parameter-coverage iteration → R10 clean).

Security / privacy

  • Confirm no secrets/PII in this PR: Yes (canonical secret scan clean)

Test plan

  • Verify R-parity tests pass: pytest tests/test_methodology_stacked_did.py
  • Verify unit/regression tests pass: pytest tests/test_stacked_did.py
  • Verify no SA / TWFE / wls_cr2 regressions: pytest tests/test_methodology_sun_abraham.py tests/test_methodology_twfe.py tests/test_methodology_wls_cr2.py
  • CI codex review (local convergence is necessary but not sufficient per feedback_local_codex_vs_ci_codex_divergence)

🤖 Generated with Claude Code

igerber and others added 10 commits May 21, 2026 09:09
Adds vcov_type ∈ {"classical","hc1","hc2","hc2_bm"} to StackedDiD with
clubSandwich R parity at atol=1e-10 on the hc2_bm path. Defaults to hc1
(prior behavior at machine precision).

Source edits:
- __init__ accepts vcov_type=, rejects classical/hc2 with cluster-
  incompatibility ValueError (StackedDiD clusters intrinsically at unit
  or unit_subexp; one-way families don't compose with cluster_ids per
  the linalg validator), rejects conley with a deferral message.
- fit() rejects survey_design + vcov_type != "hc1" with
  NotImplementedError (survey TSL/replicate-refit overrides analytical
  sandwich; SA precedent). Reject order locked: fweight/aweight check
  fires first, then vcov check (pinned by test_aweight_plus_hc2_bm_*).
- Main solve_ols call at :419: switched from bake-Q-into-X
  (X_t = X*sqrt(Q)) to explicit `solve_ols(X, Y, weights=composed_weights,
  vcov_type=self.vcov_type)`. solve_ols internally bakes Q for the coef
  solve AND back-transforms for vcov on original-scale data via
  clubSandwich's WLS-CR2 algebra for hc2_bm (PR #475).
- _refit_stacked closure at :444: mirror switch (cosmetic per
  return_vcov=False but grep-consistency).
- StackedDiDResults gains vcov_type field; get_params() includes it.
- llms-full.txt agent-facing entry updated.

R parity:
- benchmarks/data/stacked_did_test_panel.csv (NEW) — pre-stacked panel
  generated by extracting StackedDiD's internal stacked_data via the
  fixed-seed Python fixture (n_units=50, n_periods=8, cohorts=[3,5,7],
  seed=20260521; 325 rows after stacking).
- benchmarks/R/generate_stacked_did_golden.R (NEW) — loads CSV, fits
  lm(weights=Q,...) with the same event-study design as Python, computes
  CR1S (Stata-style) + CR2-BM SE + BM df_Satt at cluster=unit AND
  cluster=unit_subexp via clubSandwich >= 0.7.0.
- benchmarks/data/stacked_did_golden.json (NEW) — committed goldens.

Tests:
- tests/test_stacked_did.py::TestStackedDiDVcovType (19 tests): default
  + bit-equality + reject paths + survey precedence + get_params/set_params
  + result-class field + clone idempotency + replicate-refit smoke +
  reject-order regression + unit_subexp + survey symmetric.
- tests/test_methodology_stacked_did.py (4 tests, NEW): hc1 vs CR1S,
  hc2_bm vs CR2 (unit), BM DOF vs coef_test()$df_Satt (via t-dist
  inversion of CI half-width), hc2_bm vs CR2 (unit_subexp).
- Tolerance note: hc1 vs prior bake-Q-into-X is bit-equal up to ~2 ULPs
  at SE scale (multiplication ordering in solve_ols(weights=) vs prior
  user-side bake-w); pinned by test_hc1_se_bit_equal_to_pre_pr_baseline
  at atol=1e-13.

Docs:
- REGISTRY.md StackedDiD section: new "Variance families" subsection
  documenting hc1/hc2_bm routing + rejects, with **Note:** deferring
  methodology framing to Phase 1a clubSandwich port (PR #475) — no new
  methodology synthesis introduced in this PR.
- CHANGELOG.md: Unreleased Added bullet at top.
- TODO.md: Phase 1b row updated to track remaining 6 estimators; new
  StackedDiD conley follow-up row mirroring SA precedent.

234 tests pass across stacked + methodology_stacked_did +
methodology_sun_abraham + methodology_twfe + methodology_wls_cr2 +
estimators_vcov_type.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… (R1 P0)

Local codex R1 caught a P0: StackedDiD(vcov_type="hc2_bm") computed CR2
vcov correctly but never propagated the Bell-McCaffrey Satterthwaite DOF
into safe_inference() calls. event_study_effects[h]['p_value']/['conf_int']
and overall_p_value/overall_conf_int silently fell back to normal-theory
inference (df=None ⇒ scipy.norm), contradicting the registry contract.

Fix mirrors the SunAbraham aggregated-inference pattern from PR #472
(sun_abraham.py:997-1097). After solve_ols(), if vcov_type=="hc2_bm" and
not on the survey replicate-refit path, build contrast matrix:
  - Per-event-time: unit vector at each interaction_indices[h]
  - Overall ATT: 1/K average across post-period interaction columns
Call _compute_cr2_bm_contrast_dof(X, cluster_ids, bread, contrasts,
weights=composed_weights). Apply per-event-time DOFs to event_study_effects
inference, overall DOF to overall_* inference. Wrap in try/except so any
rank-deficient or linalg failure emits a UserWarning and falls back to
normal-theory (visible deviation, not silent).

R fixture extended with the post-period-average ATT contrast DOF via
Wald_test(constraints=row_avg, vcov=CR2, test="HTZ")$df_denom (mirrors PR
#465's MPD avg_att approach). New goldens at both cluster=unit and
cluster=unit_subexp.

Test additions / strengthening (addresses R1 P2 + P3):
- test_hc2_bm_per_event_dof_matches_coef_test_df_satt_unit_cluster:
  uses brentq inversion of CI half-width to recover the DOF safe_inference
  actually used. If propagation failed (the R1 P0 bug), the inversion
  raises ValueError → test FAILS instead of silently skipping (replaces
  the prior `continue`-on-failure pattern which could vacuously pass).
  Hard-asserts validated_count == len(event_times).
- test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_cluster (NEW):
  pins overall ATT DOF at atol=1e-6 against R Wald_test(HTZ)$df_denom.
- test_hc2_bm_overall_att_dof_matches_wald_test_htz_unit_subexp_cluster
  (NEW): symmetric coverage at alternate cluster level.
- Renamed CR1 → CR1S throughout docs/tests/REGISTRY for consistency
  (diff-diff's HC1+cluster uses Stata-style G/(G-1)*(n-1)/(n-p); plain
  CR1 omits the (n-1)/(n-p) term and diverges by ~1.4%).

192 tests pass (74 stacked + 19 wls_cr2 + 47 SA + 52 estimators_vcov_type).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local codex R2 caught a P1: on rank-deficient stacked designs, the BM
contrast DOF block called `_compute_cr2_bm_contrast_dof()` on the singular
full bread matrix, which raised LinAlgError and triggered the catch-and-
fallback path — downgrading the aggregated inference to normal-theory
even when the target event-time delta_h coefficients were still identified.

Fix mirrors the MultiPeriodDiD rank-deficient pattern (PR #465,
estimators.py:1860-1913):
- Derive `_identified = ~np.isnan(coef)` (solve_ols emits NaN for dropped
  columns under R-style rank handling).
- Subset X, bread, and contrasts to the identified-column block BEFORE
  calling `_compute_cr2_bm_contrast_dof`.
- Only build per-event-time contrasts for event-times whose target delta_h
  column is identified; only build the overall ATT contrast when ALL
  post-period delta_h are identified (otherwise the contrast is undefined).
- The remaining try/except is a defensive guard for genuine singularities
  on the identified-column design (rare; the rank-deficient handling
  already removes the problematic columns).

The earlier behavior was too aggressive: a single dropped nuisance column
would knock out BM DOF for ALL contrasts. The new behavior NaN-guards
only contrasts that target a dropped column.

New regression test `test_hc2_bm_rank_deficient_design_keeps_bm_dof_*` —
verifies the fit doesn't emit fallback warnings on a well-conditioned
design + checks that CI half-widths use t(BM DOF) instead of z=1.96
(catches the R1/R2 normal-fallback failure mode end-to-end).

75 tests pass across stacked + methodology_stacked_did.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Local codex R3 caught a P1 on the same surface: when BM contrast DOF was
unavailable (helper raised OR noise-floor NaN guard fired), StackedDiD
either fell back to normal-theory inference (silent wrong CIs/p-values
under the hc2_bm contract) or passed NaN df through safe_inference (which
guards df<=0 but NOT NaN, producing finite t_stat with NaN p/CI — mixed
inconsistent inference fields). Both fail-open patterns contradict the
registry contract that hc2_bm aggregated inference uses BM DOF.

Fix mirrors the LinearRegression.get_inference() pattern from PR #475 R7
(linalg.py:3689-3706): on the hc2_bm path, when the per-contrast BM DOF
is None or non-finite, emit ALL-NaN inference fields for that contrast
rather than falling back to safe_inference(df=None or NaN). Effect and SE
remain finite — only the inference fields downstream of the DOF are
suppressed.

Applied at both inference sites:
- Event-study per-event-time inference (event_study_effects[h]):
  inline check before safe_inference call.
- Overall ATT inference (overall_t/p/conf_int): same guard.

New regression tests:
- test_hc2_bm_nan_dof_fails_closed_with_all_nan_inference: monkeypatches
  `_compute_cr2_bm_contrast_dof` to return a NaN-only vector and asserts
  ALL event-study + overall t_stat/p_value/conf_int are NaN while effect
  and se remain finite.
- test_hc2_bm_helper_raises_fails_closed_with_all_nan_inference:
  monkeypatches the helper to raise LinAlgError and asserts the
  fallback path NaN-closes inference (does NOT use normal-theory).

77 tests pass. Pattern is consistent with the analytical-surface
fail-closed contract added in PR #475 R7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…drift lock)

R4 verdict was ✅ with 3 polish items:

P2: helper-exception warning text was stale (said "Falling back to normal
distribution" but the code now NaN-closes per the R3 P1 fix). Rewrote
the warning to accurately describe the fail-closed contract: "Aggregated
p-values, t-statistics, and confidence intervals will be returned as NaN
to preserve the hc2_bm contract".

P3: set_params(vcov_type=...) bypassed the __init__ validation guard —
sklearn-style mutation could set arbitrary values and fail later in the
linalg layer with a different message. Fix: factored the vcov_type
validation into `StackedDiD._validate_vcov_type(staticmethod)` and call
it from BOTH __init__ AND set_params. New test
`test_set_params_revalidates_vcov_type` exercises all 4 reject paths.

P3: test_replicate_refit_coef_bit_equal_vs_bake_w_baseline was described
as a "drift lock" but only checked finiteness — wouldn't catch numeric
drift. Captured baseline overall_att=2.0807833161293945 +
overall_se=0.14697256517699428 on the post-PR branch and asserted with
atol=1e-13 (coef) / 1e-10 (SE, compounded through replicate-refit
variance aggregation). Now the test actually catches drift.

78 tests pass (4 set_params reject paths + strengthened drift lock).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…eficient test)

R5 verdict was ✅ with 2 P3 polish items:

P3 (Maintainability): the new solve_ols(weights=...) call sites relied on
the default weight_type="pweight". Locked the WLS-CR2 weight convention
locally by passing weight_type="pweight" explicitly in BOTH solve_ols
calls (main fit + _refit_stacked replicate closure). Prevents silent
variance-contract drift if solve_ols's default ever changes.

P3 (Docs/Tests): test_hc2_bm_rank_deficient_design_keeps_bm_dof_on_
identified_contrasts didn't actually create a rank-deficient design — it
only verified the t-distribution was used on a well-conditioned fixture.
Two-part fix:
  1. Renamed the existing test to
     test_hc2_bm_uses_t_distribution_not_normal_on_well_conditioned_design
     (honest description: it's an end-to-end smoke check that BM DOF
     threading reaches safe_inference).
  2. NEW test test_hc2_bm_rank_deficient_keeps_bm_dof_on_identified_
     contrasts: monkeypatches solve_ols to inject NaN at the intercept
     column, simulating rank deficiency. Exercises the reduced-design
     code path at stacked_did.py:529-577 added in the R2 P1 fix.
     Verifies: (i) no fallback warning (reduced-design helper succeeds);
     (ii) identified delta_h coefficients keep finite t-distribution
     inference (CI half-width > z_0.975 * SE); (iii) overall ATT also
     keeps t-distribution.

79 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 verdict was ✅ with 2 P3 test-coverage polish items:

P3 (test): prior R fixture didn't pin the overall delta-method SE for
the post-period-average ATT contrast — only BM DOF. Post-period
covariance regressions could slip through. Fix: added `se_overall_cr1`
and `se_overall_cr2` to the R fixture (computed as
`c @ vcov @ c.T` for the post-period 1/K row contrast at both
cluster levels). Added overall_se parity assertions at atol=1e-10 to
the three existing hc1/hc2_bm parity tests.

P3 (test): the Python parity test regenerated the panel via
`generate_staggered_data(...)` instead of cross-checking against the
committed CSV. Risk: future DGP-helper drift would silently invalidate
R-parity assertions. Fix: added a DGP-drift check in the `panel`
fixture that loads `benchmarks/data/stacked_did_test_panel.csv`,
merges with the regenerated panel on (unit, period), and asserts the
`outcome` column matches at atol=1e-12. Catches any drift in the DGP
helper with a clear error message rather than wrong parity output.
(first_treat column not checked because inf-vs-0 recoding differs
between CSV/pandas-read and regenerated panel; outcome is what
actually drives the R-side fit.)

79 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ning

R7 verdict was ⚠️ Needs changes with one P1 + one P3:

P1: BM-DOF logic was only pinned on default (anticipation=0,
weighting='aggregate'). The BM-DOF code path uses anticipation in
post-period selection (`h >= -self.anticipation`) and consumes the
composed Q-weights directly — both surfaces were untested. Added 4
parameter-interaction tests:

- test_anticipation_plus_hc2_bm_threads_bm_dof: anticipation=1, asserts
  shifted reference period (e=-2 has SE=0), overall ATT contrast uses
  t(BM DOF) (CI half-width > z_0.975 * SE), at least one event-study CI
  uses t-distribution.
- test_population_weighting_plus_hc2_bm_finite_threads_bm_dof:
  weighting='population' (different Q-weight formula), verifies finite
  output + BM DOF propagation via t-distribution check.
- test_sample_share_weighting_plus_hc2_bm_finite_threads_bm_dof:
  weighting='sample_share' (third Q-weight formula), symmetric coverage.
- test_hc1_vs_hc2_bm_differ_under_anticipation: sanity check that hc1
  vs hc2_bm SEs actually differ under anticipation=1 (proves the
  leverage/DOF adjustment fires under non-default params).

These tests don't require R goldens (no methodology change vs default
case; just verifies the BM-DOF code path remains active under the
parameter variants). R parity on the default path is already strong.

P3: fixture-independence concern from R6 was only partially closed.
Now also pins the `first_treat` column equality after normalizing
inf/0 between CSV (pandas reads inf for never-treated) and regenerated
(0 for never-treated in some pandas versions). Catches cohort-assignment
drift independently of outcome drift.

83 tests pass (+4 new R7 tests).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ion')

R8 verdict was ⚠️ still Needs changes: R7's parameter-interaction tests
were smoke checks (population test used constant pop=1.0 which is
numerically degenerate with aggregate weighting on a balanced panel);
no R goldens for non-default variants. Closes the gap with numerical
parity at atol=1e-10.

Source-tree additions:
- benchmarks/data/stacked_did_anticipation1_panel.csv (NEW): pre-stacked
  fixture with anticipation=1, generated from same base panel via
  StackedDiD(anticipation=1).fit(). Event times [-3,-2,-1,0,1,2] with
  ref=-2 (shifted by anticipation).
- benchmarks/data/stacked_did_population_panel.csv (NEW): pre-stacked
  fixture with weighting='population' AND heterogeneous unit populations
  (rng.uniform(1,100), seed=99). Q-weights range [0.68, 1.54] vs uniform
  ~1.0 under aggregate weighting; forces the population-formula branch.
- benchmarks/R/generate_stacked_did_golden.R: refactored to loop over
  3 variants (default, population, anticipation1), producing
  variant-keyed entries in `stacked_did_golden.json`. Post-period filter
  now generalized to `h >= -anticipation` (matches Python's
  `_post_event_times_preview` at stacked_did.py:546-548).
- benchmarks/data/stacked_did_golden.json: regenerated with all 3 variants.

Test additions (tests/test_methodology_stacked_did.py):
- new class TestStackedDiDParityRVariants (4 tests):
  - test_population_weighting_hc1_matches_clubsandwich_cr1s
  - test_population_weighting_hc2_bm_matches_clubsandwich_cr2
  - test_anticipation1_hc1_matches_clubsandwich_cr1s
  - test_anticipation1_hc2_bm_matches_clubsandwich_cr2
  All assert event-study SE AND overall_se at atol=1e-10.
- new fixture `population_panel` reproduces the heterogeneous-population
  seed (np.random.default_rng(99)) used by the R variant CSV generator.

87 tests pass (+4 variant parity).

Note: weighting='sample_share' parity not added because its Q-weight
formula structurally mirrors population weighting (both differ from
aggregate only in the Q-weight calculation, which feeds into the same
solve_ols(weights=) path). The population test exercises that code
path; sample_share remains covered by the smoke test from R7.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R9 verdict was ⚠️ still on parameter-coverage P1: sample_share had no
R-parity fixture, and population/anticipation1 tests asserted SE only
(not BM DOF).

Source-tree additions:
- benchmarks/data/stacked_did_sample_share_panel.csv (NEW): pre-stacked
  fixture for weighting='sample_share'.
- benchmarks/R/generate_stacked_did_golden.R: added sample_share variant
  to the variants list; loops over all 4 variants now.
- benchmarks/data/stacked_did_golden.json: regenerated with sample_share.

Test additions (tests/test_methodology_stacked_did.py):
- test_sample_share_weighting_hc1_matches_clubsandwich_cr1s
- test_sample_share_weighting_hc2_bm_matches_clubsandwich_cr2 (+ DOF)
- Extended test_population_weighting_hc2_bm_matches_clubsandwich_cr2:
  added per-event-time + overall BM DOF parity via CI-width inversion.
- Extended test_anticipation1_hc2_bm_matches_clubsandwich_cr2: same DOF
  parity additions.

89 tests pass. All 3 non-default variants now have full R parity:
event-study SE + overall_se + per-event BM DOF + overall BM DOF.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

Overall assessment

✅ Looks good

Executive summary

  • No unmitigated methodology or correctness issues stood out in the StackedDiD vcov_type threading. The new hc2_bm route and the CR1S-vs-CR1 deviation are explicitly documented in docs/methodology/REGISTRY.md:L1367-L1374, so I do not treat those as defects.
  • The new Bell-McCaffrey DOF path in diff_diff/stacked_did.py:L520-L610 and the fail-closed NaN gating in diff_diff/stacked_did.py:L730-L807 correctly avoid the silent normal-theory fallback that would have been a P0/P1 issue.
  • Parameter propagation looks complete through constructor validation, fit-time branching, survey/replicate guards, get_params() / set_params(), and the results object.
  • There is one moderate user-surface gap: StackedDiDResults.summary() still does not reveal which variance family was used, even though vcov_type is now a first-class result field.
  • There is one minor performance inefficiency: overall-only hc2_bm fits still compute per-event contrast DOFs that are never surfaced.
  • I could not execute the new tests in this environment because pytest and numpy are not installed here.

Methodology

No unmitigated findings. Cross-checking the change against docs/methodology/REGISTRY.md:L1367-L1374 and the implementation in diff_diff/stacked_did.py:L500-L807, the PR stays within documented methodology: hc1 remains the existing CR1S-style clustered path, hc2_bm is explicitly documented as the clubSandwich-based CR2 Bell-McCaffrey route, and the new BM-DOF fail-closed behavior NaN-closes inference instead of emitting misleading normal-theory output.

Code Quality

No unmitigated findings. The new parameter is validated at construction and re-validation time (diff_diff/stacked_did.py:L203-L229, diff_diff/stacked_did.py:L1197-L1212), threaded into the main solve path (diff_diff/stacked_did.py:L508-L516), threaded through survey/replicate branches (diff_diff/stacked_did.py:L622-L651), and persisted on the results object (diff_diff/stacked_did.py:L809-L833).

Performance

  • Severity: P3. Impact: on vcov_type="hc2_bm" fits with aggregate=None / "simple", the code still builds per-event Bell-McCaffrey contrast DOFs even though only the overall contrast is user-visible, so the default overall-only path does avoidable CR2 work. See diff_diff/stacked_did.py:L530-L597. Concrete fix: gate es_cols_full construction on aggregate=="event_study" and compute only the overall contrast DOF for simple fits.

Maintainability

  • Severity: P2. Impact: the PR adds vcov_type to StackedDiDResults, but the primary textual output still omits the variance family, so summary() / print_summary() cannot distinguish HC1 from HC2-BM despite those now being materially different user-facing inference modes. See diff_diff/stacked_did_results.py:L97-L100, diff_diff/stacked_did_results.py:L144-L210; other result classes already print a variance label via diff_diff/results.py:L213-L235. Concrete fix: add a Variance: line to StackedDiDResults.summary() using the shared label helper pattern, suppressing it when survey_metadata is present.

Tech Debt

  • Severity: P3 informational. Impact: the remaining vcov_type="conley" gap is properly tracked rather than silently deferred, so it is not a blocker under the project rules. See TODO.md:L102-L104. Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

No unmitigated documentation/test findings in the PR itself. The added coverage is broad across hc1 / hc2_bm, survey rejection, replicate-refit, rank deficiency, fail-closed BM inference, weighting variants, and anticipation=1.

Verification note: I could not run tests/test_stacked_did.py or tests/test_methodology_stacked_did.py here because this review environment does not have pytest or numpy installed.

CI codex R1 verdict was ✅ with 2 polish items:

P2 (Maintainability): StackedDiDResults.summary() didn't surface the
analytical variance family (vcov_type), even though hc1 vs hc2_bm are
now materially different user-facing inference modes. Other result
classes (DiD/MPD/TWFE) already print a Variance: line via the shared
`_format_vcov_label` helper at results.py:49-89. Fix: added a Variance:
line to StackedDiDResults.summary() using the shared helper, suppressed
under survey_metadata (the survey block already surfaces inference
source). My earlier plan note ("mirror SA: don't add Variance: row")
was based on a wrong assumption — SA-results uses the same shared
helper, and StackedDiD inherits the same precedent.

P3 (Performance): on `vcov_type="hc2_bm"` fits with `aggregate=None`
or `"simple"`, the code built per-event Bell-McCaffrey contrast DOFs
that were never surfaced (only overall ATT is exposed). Avoidable
CR2 cost on every overall-only hc2_bm fit. Fix: gated the per-event
contrast construction (es_keys, es_cols_full) on
`aggregate == "event_study"`. The overall ATT contrast still gets
computed unconditionally (it's the user-facing scalar).

89 tests still pass — no behavioral regression on the event-study
path; overall-only fits are now slightly faster.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: f0aa06096b9b56ad00112e52cf8cd707c2fa4cdb


Overall assessment

✅ Looks good

Executive summary

  • No unmitigated P0/P1 methodology or correctness issues stood out in the StackedDiD vcov_type threading. The hc1/hc2_bm routes, BM-DOF fail-closed behavior, survey-design rejection, and Conley deferral match docs/methodology/REGISTRY.md:L1367-L1374.
  • The previous performance nit appears resolved: per-event Bell-McCaffrey contrast DOFs are now only built on the aggregate=="event_study" path.
  • The previous summary-output gap is only partially resolved. StackedDiDResults.summary() now prints a Variance: line, but it mislabels clustered StackedDiD fits as one-way HC/HC2 because cluster metadata is not carried onto the results object.
  • Parameter propagation looks complete through constructor validation, fit(), replicate-refit, get_params() / set_params(), and StackedDiDResults.vcov_type.
  • I could not execute the targeted tests here because this environment does not have pytest, numpy, pandas, or scipy installed.

Methodology

  • Severity: P2. Impact: the new summary surface still reports the wrong inference family. StackedDiDResults.summary() calls _format_vcov_label(..., cluster_name=None), which renders hc1 as "HC1 heteroskedasticity-robust" and hc2_bm as "HC2 + Bell-McCaffrey DOF (one-way, ...)", even though StackedDiD is intrinsically cluster-robust at unit or unit_subexp. That makes the prior summary fix only partially correct and can mislead users about the variance method actually used. diff_diff/stacked_did_results.py:L179-L199, diff_diff/results.py:L49-L76, diff_diff/stacked_did.py:L493-L517, diff_diff/stacked_did.py:L813-L839
  • Concrete fix: persist the cluster setting, and ideally the realized cluster count, on StackedDiDResults; pass that metadata into _format_vcov_label; add a regression test that summary() shows clustered CR1 / CR2-BM labels for both cluster="unit" and cluster="unit_subexp".

Code Quality

No unmitigated findings. vcov_type propagation and revalidation look complete.

Performance

No unmitigated findings. The earlier overall-only CR2-BM extra-work issue appears addressed by gating per-event DOF construction to the event-study path in diff_diff/stacked_did.py:L553-L559.

Maintainability

No additional unmitigated findings beyond the summary-metadata issue above.

Tech Debt

  • Severity: P3 informational. Impact: vcov_type="conley" remains unavailable on StackedDiD, but the limitation is explicitly tracked in TODO.md and documented in the Methodology Registry, so it is not blocking. TODO.md:L102-L104, docs/methodology/REGISTRY.md:L1370-L1374
  • Concrete fix: none required in this PR.

Security

No findings.

Documentation/Tests

No additional unmitigated findings in the PR content. I could not run tests/test_stacked_did.py or tests/test_methodology_stacked_did.py in this environment because pytest, numpy, pandas, and scipy are not installed.

R2 verdict was ✅ + 1 P2: the summary's Variance: line (added in R1)
mislabelled clustered StackedDiD fits as one-way ("HC1
heteroskedasticity-robust" / "HC2 + Bell-McCaffrey DOF (one-way, n=N)")
because I passed cluster_name=None to _format_vcov_label. StackedDiD
is intrinsically clustered.

Fix:
- Added `cluster_name` and `n_clusters` fields to StackedDiDResults
  dataclass.
- Populated both at fit() time: cluster_name=self.cluster, n_clusters=
  np.unique(cluster_ids).size (matches the actual realized count, not
  config).
- summary() now passes cluster_name + n_clusters into
  _format_vcov_label, rendering "CR1 cluster-robust at unit, G=50"
  for hc1 and "CR2 Bell-McCaffrey cluster-robust at unit, G=50" for
  hc2_bm.

New regression test test_summary_renders_clustered_variance_label
verifies:
- hc1 → label contains "CR1 cluster-robust at unit"
- hc2_bm → label contains "CR2 Bell-McCaffrey cluster-robust at unit"
- Neither contains "(one-way" (catches the R2 regression directly)
- G= cluster count present
- cluster="unit_subexp" → label contains "at unit_subexp"

90 tests pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown

🔁 AI review rerun (requested by @igerber)

Head SHA: af10cc4718ba667f90ee538b4eab1026cf1f11e4


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 issues found in the StackedDiD vcov_type threading.
  • The methodology surface now lines up across code and registry: hc1 default, hc2_bm routed through the existing weighted clubSandwich CR2 port, classical/hc2 rejected up front, and survey_design restricted to hc1. diff_diff/stacked_did.py:L202-L229, diff_diff/stacked_did.py:L330-L346, docs/methodology/REGISTRY.md:L1367-L1374
  • The previous review’s summary-label issue appears resolved: StackedDiDResults.summary() now carries cluster metadata and renders clustered CR1/CR2-BM labels instead of one-way HC labels. diff_diff/stacked_did.py:L813-L840, diff_diff/stacked_did_results.py:L186-L202, diff_diff/results.py:L49-L76
  • The new hc2_bm aggregated inference path is fail-closed and computes BM contrast DOF on the identified-column design, which addresses the earlier DOF propagation and rank-deficiency risks. diff_diff/stacked_did.py:L520-L619, diff_diff/stacked_did.py:L730-L810
  • Parameter propagation looks complete through constructor validation, get_params() / set_params(), results objects, and the survey replicate-refit closure. diff_diff/stacked_did.py:L185-L229, diff_diff/stacked_did.py:L637-L651, diff_diff/stacked_did.py:L1187-L1218
  • I could not execute the targeted tests here because this environment does not have numpy, pandas, scipy, or pytest.

Methodology

  • No unmitigated findings. The hc1/hc2_bm routing, survey precedence, and BM-DOF threading in diff_diff/stacked_did.py:L500-L840 match the StackedDiD registry entry at docs/methodology/REGISTRY.md:L1367-L1374.

Code Quality

  • No unmitigated findings.

Performance

  • No unmitigated findings. The extra per-event Bell-McCaffrey DOF work is gated to the aggregate=="event_study" path, so the simple/overall surface only computes the overall contrast DOF. diff_diff/stacked_did.py:L549-L581

Maintainability

  • No unmitigated findings.

Tech Debt

  • Severity: P3 informational. Impact: vcov_type="conley" remains deferred on StackedDiD, but the limitation is explicitly documented in the methodology registry and tracked in TODO.md, so it is not blocking. docs/methodology/REGISTRY.md:L1371-L1374, TODO.md:L102-L104
  • Concrete fix: none required in this PR.

Security

  • No findings.

Documentation/Tests

  • No unmitigated findings by inspection. The new tests cover R parity, parameter interactions (population, sample_share, anticipation), rank-deficient BM-DOF handling, fail-closed inference, summary labeling, and set_params() revalidation. tests/test_methodology_stacked_did.py:L1-L669, tests/test_stacked_did.py:L927-L1717
  • I could not run tests/test_methodology_stacked_did.py or tests/test_stacked_did.py in this environment because numpy, pandas, scipy, and pytest are unavailable. I did verify that the touched Python files parse successfully with ast.parse.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 21, 2026
@igerber igerber merged commit d1543b7 into main May 21, 2026
33 of 34 checks passed
@igerber igerber deleted the stacked-did-vcov-type-phase1b-2 branch May 21, 2026 17:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

ready-for-ci Triggers CI test workflows

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant