SpilloverDiD: event_study=True per-event-time × ring decomposition (Wave C) #456
Conversation
Overall Assessment: The documented methodology choices in …
Code Quality: No material findings beyond the aggregation issues above.
Performance: No material findings in review scope.
Tech Debt: No separate deferrable finding. The two blocking issues above are correctness problems and should be fixed in code, not deferred.
Security: No findings.
…ave C)
Replaces the Wave B NotImplementedError gate at spillover.py:1430-1442 with
the full per-event-time × ring decomposition from Butts (2021) Section 5 /
Table 2. Emits per-event-time direct effects tau_k and per-(ring, event-time)
spillover effects delta_jk as att_dynamic: pd.DataFrame (indexed by k) and
MultiIndex spillover_effects (levels (ring_label, event_time)). A TwoStageDiD-
compatible event_study_effects: Dict[int, Dict] alias (mirroring
two_stage.py:1355-1389 schema with conf_int = (low, high) tuple) is also
emitted for consumption by plot_event_study and diagnostic_report.
Methodology: the implementation operationalizes Butts' single K_it symbol as
TWO event-time clocks — K_direct = t - effective_first_treat(i) for ever-
treated rows, and K_spill = t - earliest-in-range-cohort-onset(i) for spillover
rows (running min across activated cohorts; NaN for pre-trigger and far-away
rows). K_spill >= 0 structurally; negative-k spillover cells emit rectangularly
with coef = NaN, n_obs = 0.
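The two clocks above can be sketched as follows. This is a minimal illustration with hypothetical column names (`first_treat`, `trigger_onset`), not the library's internal layout:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the two event-time clocks; column names are
# illustrative, not the library's API.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "t":    [0, 1, 2, 0, 1, 2],
    "first_treat":   [1.0, 1.0, 1.0, np.nan, np.nan, np.nan],  # own onset (ever-treated)
    "trigger_onset": [np.nan] * 3 + [1.0] * 3,  # earliest in-range cohort onset
})
# K_direct: event time relative to the unit's own treatment onset
df["K_direct"] = df["t"] - df["first_treat"]
# K_spill: event time since the earliest activated cohort within d_bar;
# NaN for pre-trigger and far-away rows, so K_spill >= 0 where defined
k_spill = df["t"] - df["trigger_onset"]
df["K_spill"] = k_spill.where(k_spill >= 0)
```

Note that `K_spill` is masked below zero, matching the structural constraint described above: pre-trigger spillover cells only appear in the rectangular emission with `coef = NaN, n_obs = 0`.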
Reference period: ref_period = -1 - anticipation (TwoStageDiD parity at
two_stage.py:486). When horizon_max is set, ref_period must fall inside
[-horizon_max, +horizon_max] or fit raises ValueError — silent floor-shift to
-horizon_max would change identification (rejected per feedback_no_silent_failures).
Reference row uses coef = 0.0, se = 0.0, n_obs = 0, conf_int = (0.0, 0.0).
horizon_max semantics (divergence from TwoStageDiD): bins event-times outside
[-H, +H] into endpoint pools, no observations dropped. TwoStageDiD filters
those rows. Divergence intentional + cross-documented. horizon_max=None auto-
detects the bin set from observed K values.
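The endpoint-pooling semantic can be sketched as a clip rather than a filter. This is an assumed simplification of the shipped helper, not its actual code:

```python
import numpy as np

# Sketch of the endpoint-binning semantic: event times outside [-H, +H]
# are pooled into the endpoints rather than dropped, so no rows are lost.
def bin_event_times(k, horizon_max):
    k = np.asarray(k, dtype=float)
    if horizon_max is None:
        return k  # auto-detected bin set: use observed K values as-is
    return np.clip(k, -horizon_max, horizon_max)

binned = bin_event_times([-5, -2, -1, 0, 3, 7], horizon_max=2)
# TwoStageDiD would instead drop the -5, 3, and 7 rows entirely.
```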
Scalar att aggregation: sample-share-weighted average of post-treatment tau_k
(att = sum_{k>=0} w_k * tau_k with w_k = n_treated_at_k / total). SE from
linear-combination inference Var(att) = w' V_subset w on the post-treatment
block of the stage-2 vcov — no separate fit.
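The aggregation and its linear-combination SE reduce to a few array operations. The numbers below are made up for illustration; only the formulas mirror the description above:

```python
import numpy as np

# Sketch of the sample-share-weighted scalar att and its lincom SE
# (tau_k, counts, and vcov values are illustrative).
tau_k = np.array([0.8, 1.0, 1.2])          # post-treatment direct effects, k = 0, 1, 2
n_treated_at_k = np.array([50, 30, 20])    # treated sample at each event time
V_subset = np.diag([0.04, 0.05, 0.06])     # post-treatment block of stage-2 vcov

w = n_treated_at_k / n_treated_at_k.sum()  # w_k = n_treated_at_k / total
att = w @ tau_k                            # att = sum_{k>=0} w_k * tau_k
se = np.sqrt(w @ V_subset @ w)             # Var(att) = w' V_subset w
```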
Reduce-to-aggregate equivalence: under constant-tau DGP with horizon_max=None,
the lincom-weighted scalar att reproduces Wave B's aggregate tau_total bit-
identically. Note: horizon_max=0 does NOT reduce to Wave B (binning collapses
pre-treatment K values into k=0, making D^0 = D_i ever-treated indicator
rather than D_it).
Backward compatibility: event_study=False leaves all Wave C fields
(att_dynamic, event_study_effects, horizon_max, reference_period) as None
and reproduces Wave B SEs bit-identically.
Variance caveat: per-event-time SEs use solve_ols's standard variance
(HC1 / Conley / cluster) WITHOUT the Gardner GMM first-stage uncertainty
correction; planned Wave D follow-up closes this.
Tests: 30 new event-study test methods covering API, two-clock K helper,
horizon binning, design builder, reference period, reduce-to-aggregate,
identification MC (50 seeds, per-event-time tau_k recovery within 0.025),
placebo pre-trends (Type I rate <= 0.30 over 50 seeds at alpha=0.10),
singularity (rectangular schema), Conley integration (vcov shape +
np.diag >= 0), summary/to_dict/pickle round-trip, event_study_effects
schema parity with TwoStageDiD, lincom-att hand-computed, validation
(horizon_max < 0, ref_period < -horizon_max), and fit idempotence.
DGP factory generate_butts_staggered_dgp extended with tau_per_event_time
and delta_per_ring_per_event_time callable kwargs (backward-compatible —
both default to None, producing the Wave B scalar DGP bit-identically;
verified by tests/test_dgp_utils.py with pinned SHA-256 baselines).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
P1: recompute n_obs_per_col on the POST-finite_mask sample. Previously
_build_event_study_design counted rows on the pre-mask design; those stale
counts then populated att_dynamic / event_study_effects AND weighted the
scalar att. On warn-and-drop fits, reported metadata disagreed with the
actual stage-2 sample AND the point estimate could change. Fix: in fit(),
recompute event_study_meta["n_obs_per_col"] from X_2_fit after applying
finite_mask.

P1: fail-closed scalar att when a post-direct coef is NaN. Previously
np.nansum on a fixed weight vector silently zeroed dropped (rank-deficient)
post-treatment direct coefficients, changing the ATT point estimate. Fix:
detect any non-finite post-direct coef and return att=NaN with a UserWarning.
Library convention is no-silent-failures (feedback_no_silent_failures);
inspect att_dynamic for the per-event-time coefficients and re-aggregate
manually if needed.

P3: emit (ring, ref_period) spillover_effects rows. The pre-filter dropped
ref_period from k_grid entirely, so the rectangular emission missed the
(ring, ref_period) cell for each ring. Consumers iterating [-H, ..., +H]
would KeyError on the reference slice. Fix: emit (ring, ref_period) with
coef=0.0, se=0.0, n_obs=0 to mirror the direct-effect reference row.

Tests added:
- TestSpilloverDiDEventStudyFiniteMaskPath: warn-and-drop panel
  (baseline-treated units with no Omega_0 rows) with event_study=True.
  Verifies att_dynamic["n_obs"], event_study_effects[k]["n_obs"], AND the
  scalar att share weights all reflect the post-mask sample. A hand-computed
  lincom reproduces res.att at machine precision.
- TestSpilloverDiDEventStudyRankDeficientFailClosed: monkey-patches
  solve_ols to NaN out one post-direct coef. Asserts att=NaN and that the
  documented warning fires.
- TestSpilloverDiDEventStudyReferencePeriodSpilloverRows: asserts every
  (ring, ref_period) cell exists in spillover_effects with the 0-anchored
  schema.
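The fail-closed contract can be sketched in a few lines. The `aggregate_att` helper below is hypothetical; it only illustrates why `np.nansum` with fixed weights silently biases the estimate while the fail-closed path surfaces the problem:

```python
import warnings

import numpy as np

# Hypothetical sketch of the fail-closed aggregation contract: a NaN
# post-treatment coefficient must surface as att=NaN plus a warning,
# not be silently zeroed by np.nansum.
def aggregate_att(tau_k, w):
    tau_k = np.asarray(tau_k, dtype=float)
    if not np.all(np.isfinite(tau_k)):
        warnings.warn("non-finite post-treatment coefficient; returning att=NaN")
        return float("nan")
    return float(np.dot(w, tau_k))

# np.nansum treats the dropped coefficient as 0 and changes the estimate:
silent = np.nansum(np.array([0.8, np.nan]) * np.array([0.5, 0.5]))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    safe = aggregate_att([0.8, np.nan], [0.5, 0.5])  # NaN: fail-closed
```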
Strengthened test_per_event_time_tau_recovers_exact_y_with_zero_noise to
verify the closed form y = mu_i + lambda_t + tau_fn(k) holds at atol=1e-12
per row, instead of merely checking that k=0 rows exist.

CHANGELOG + REGISTRY updated to document the post-finite_mask contract, the
fail-closed scalar att invariant, and the reference-period spillover rows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed bdfa2b8 to dbe0b78
🔁 AI review rerun (requested by @igerber)
Overall Assessment: This re-review clears the prior blockers. The two earlier P1s are fixed: event-study counts/weights now recompute after …
P2: validate anticipation BEFORE the ref_period_check compatibility test.
Previously, when event_study=True and horizon_max is set, the code computed
`ref_period_check = -1 - self.anticipation` BEFORE validating anticipation,
so non-numeric values (None, "1") raised a raw TypeError instead of the
targeted ValueError ("anticipation must be a non-negative integer"). Fix:
move the anticipation type/value validator above both the horizon_max
validator and the ref_period check.
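The ordering contract can be sketched as a standalone helper. The `validate` function below is hypothetical (the shipped checks live in the estimator itself); it shows why the anticipation validator must run before any arithmetic that uses the value:

```python
# Hypothetical sketch of the validation ordering: type/value-check
# anticipation first, so bad inputs raise the targeted ValueError instead
# of a raw TypeError from `-1 - anticipation`.
def validate(anticipation, horizon_max, event_study=True):
    if (not isinstance(anticipation, int)) or isinstance(anticipation, bool) \
            or anticipation < 0:
        raise ValueError("anticipation must be a non-negative integer")
    if horizon_max is not None and horizon_max < 0:
        raise ValueError("horizon_max must be >= 0 or None")
    ref_period = -1 - anticipation  # safe: anticipation already validated
    if event_study and horizon_max is not None and ref_period < -horizon_max:
        raise ValueError("ref_period falls outside [-horizon_max, +horizon_max]")
    return ref_period
```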
P3: split (ring, k) into separate columns in summary() rendering. The prior
combined `f"{ring} k={k}"` label was truncated to 15 chars, making distinct
horizons within the same ring (e.g. "[50, 200] k=+0" vs "[50, 200] k=+1")
visually indistinguishable. New layout for MultiIndex spillover_effects:
separate `Ring` (15 chars) + `k` (5 chars) columns. Header width updated
to fit the new column. Non-MultiIndex aggregate path unchanged.
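The new layout can be sketched with format specifiers. Widths follow the description above; the values are made up and only the column separation is the point:

```python
# Illustrative sketch of separate fixed-width Ring / k columns, replacing
# the truncated combined "{ring} k={k}" label.
rows = [("[50, 200]", 0, 0.12), ("[50, 200]", 1, 0.08)]
header = f"{'Ring':<15} {'k':>5} {'coef':>10}"
lines = [header] + [f"{ring:<15} {k:>+5d} {coef:>10.4f}" for ring, k, coef in rows]
```

With the combined 15-char label, both rows would truncate to the same string; with separate columns the horizons stay distinguishable.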
P3: tighten TestSpilloverDiDEventStudyReduceToAggregate to match the
CHANGELOG's bit-identical claim. Previously the test used the default
noisy DGP with abs(diff) < 1e-3, but the changelog says "verified bit-
identical at machine precision". Fix: deterministic DGP via `error_sd=0.0`
+ atol=1e-10.
Regression tests:
- test_non_numeric_anticipation_raises_targeted_value_error: anticipation="1"
must raise ValueError, not TypeError.
- test_none_anticipation_raises_targeted_value_error: anticipation=None must
raise the targeted ValueError too.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings in the changed PR surface. The prior review items are addressed: …
P3 docstring polish: `SpilloverDiD.__init__`'s `horizon_max` parameter description previously said "Mirrors TwoStageDiD", which contradicts the shipped Wave C behavior. SpilloverDiD bins event-times outside `[-H, +H]` into endpoint pools (no row drop); TwoStageDiD filters those rows. Updated the docstring to spell out the binning semantic + cross-document the intentional divergence + reference the ValueError when ref_period falls outside the window. Matches the REGISTRY + api/spillover.rst + llms-full.txt narrative. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: The changed estimator surface remains directionally consistent with the cited source material: Butts explicitly studies DiD with spatial spillovers, including staggered timing, and Gardner's two-stage estimator fits untreated observations first with a second-stage correction for first-stage estimation. The code still documents the missing Gardner correction in docs/methodology/REGISTRY.md:L3021 and TODO.md:L132, so that caveat is mitigated rather than blocking for this PR.
P3: REGISTRY says Wave C ships per-event-time delta_jk recovery via
TestSpilloverDiDEventStudyIdentification, but the existing MC only checks
tau_k. Added test_per_ring_event_time_delta_jk_recovery: a 50-seed staggered
MC with a delta_per_ring_per_event_time profile, asserting
spillover_effects.loc[(ring, k), "coef"] recovers the per-event-time
delta_jk target within 0.025 absolute tolerance.

P3: CHANGELOG says event_study=False is bit-identical to Wave B "verified
by TestSpilloverDiDEventStudyBackwardCompat", but the existing test only
fits twice on the current code path (determinism, not pre-Wave-C parity).
Added test_event_study_false_matches_wave_b_golden, which pins att/se/
per-ring golden values captured against the Wave C event_study=False path
(which is unchanged from Wave B). Since the aggregate stage-2 design, fit,
and extraction logic are untouched in Wave C, these golden values ARE the
Wave B numerics; any future drift on this pin indicates an accidental
change to the aggregate path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: The core estimator logic still looks directionally consistent with the cited source material: Butts is explicitly about DiD with spatial spillovers under staggered timing, and Gardner's two-stage design estimates untreated-outcome structure first, then recovers treatment effects in a second stage. My blocking concern is not a hidden paper-faithfulness break; it is a new …
P1: reject horizon_max=0 under event_study=True. The previous docs said
H=0 was a "well-defined but semantically distinct" design, but every
event_study=True + horizon_max=0 + anticipation=0 fit hit the ref_period
guard at -1 and raised. Resolution: lock the contract by rejecting H=0
explicitly with a remediation message ("use event_study=False for the
aggregate Wave B spec; event-study mode requires horizon_max>=1 or
horizon_max=None"). Updated REGISTRY + CHANGELOG to match. Added
test_horizon_max_zero_with_event_study_raises regression.
P2: plot_event_study now honors SpilloverDiDResults.reference_period.
Wave C's rectangular event_study_effects emits multiple empty horizons
(n_obs == 0 on dropped post-direct cells + the reference row); the legacy
"first n_obs==0 row" reference detection could pick a non-reference
empty horizon as the reference. Fix in _extract_plot_data: prefer
results.reference_period when present (truthy attribute), fall back to
the legacy n_obs==0 heuristic otherwise. Backward-compatible for
estimators without the attribute (CallawaySantAnna, SunAbraham, etc.).
Regression test on a Wave C fit with horizon_max=4 (oversized → multiple
empty horizons) asserts the inferred reference is -1 not the first empty
horizon.
P3: soften "Wave B bit-identical" claim. CHANGELOG previously said
"reproduces Wave B SEs bit-identically (verified by ...)" implying a
pre-Wave-C checkout artifact; the goldens were actually captured on
the current (Wave C) event_study=False path. Updated to: "the aggregate
stage-2 design construction, fit, and extraction logic on this path
are byte-identical to Wave B; the test pins goldens captured on the
unchanged aggregate path so any future drift fails the regression."
Same softening in the test class docstring.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. The prior blocking concerns are addressed, and I did not find a new unmitigated P0/P1 issue in the changed estimator path. The new event-study surface remains consistent with the repo's documented Butts/Gardner interpretation, while the remaining non-literal-paper choices are explicitly documented in the methodology registry rather than being silent deviations.
Methodology: No unmitigated findings. The changed estimator path matches the repo's documented Butts Section 5 / Gardner two-stage reading, and the remaining deviations/limitations in this PR are explicitly labeled in docs/methodology/REGISTRY.md:L2997-L3021 rather than being silent methodology drift.
Code Quality: No findings.
Maintainability: No findings.
Security: No findings.
P2 perf: remove the duplicate cohort distance pass. Previously the
event-study path computed cohort-by-unit distances twice on staggered
panels: once in _compute_nearest_treated_distance_staggered for d_it
(running min), then again in _compute_event_time_per_row to recover the
per-row spillover-trigger onset. On large staggered panels this doubled the
dominant spatial work. Fix: thread d_bar into
_compute_nearest_treated_distance_staggered as an optional kwarg. When
supplied, the cohort loop now ALSO computes trigger_onset_per_unit (the
first cohort whose treated units fall within d_bar of unit i) and
broadcasts it to rows. The helper's return is now a 4-tuple
(d_it, row_unit, row_time, trigger_onset_or_None).
_compute_event_time_per_row accepts an optional
precomputed_trigger_onset_per_row that, when supplied (as fit() now does on
the staggered event-study path), skips the redundant cohort loop, and falls
back to inline computation for unit-test callers. Test callsites for
_compute_nearest_treated_distance_staggered were updated to handle the new
4-tuple via `d_it, row_unit, row_time, _trigger = ...`.

P3 docs: llms-full.txt and api/spillover.rst now explicitly state that
event_study=True requires horizon_max>=1 or None (horizon_max=0 is
rejected, with a redirect to event_study=False for the aggregate spec). The
previous wording described horizon_max=0 as a meaningful collapsed design,
which contradicted the new R5 rejection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
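The single-pass pattern can be sketched like this. `distances_and_trigger` is a hypothetical simplification of the helpers named above: the same cohort loop that maintains the running-min distance also records the first in-range cohort onset, avoiding a second pairwise pass:

```python
import numpy as np

# Illustrative single-pass sketch: compute the running-min distance d_it
# and the spillover trigger onset in one cohort loop.
def distances_and_trigger(unit_xy, cohorts, d_bar):
    n = len(unit_xy)
    d_it = np.full(n, np.inf)
    trigger = np.full(n, np.nan)
    for onset, treated_xy in sorted(cohorts.items()):  # ascending onset
        # distance from each unit to its nearest treated unit in this cohort
        d_cohort = np.min(
            np.linalg.norm(unit_xy[:, None, :] - treated_xy[None, :, :], axis=2),
            axis=1,
        )
        d_it = np.minimum(d_it, d_cohort)        # running min across cohorts
        newly = np.isnan(trigger) & (d_cohort <= d_bar)
        trigger[newly] = onset                   # first in-range cohort onset
    return d_it, trigger

d_it, trigger = distances_and_trigger(
    np.array([[0.0, 0.0], [10.0, 0.0]]),
    {2: np.array([[9.0, 0.0]]), 1: np.array([[0.0, 1.0]])},
    d_bar=2.0,
)
```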
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. The prior blocking items are addressed, and I did not find a new unmitigated P0/P1 in the changed estimator/math path. The remaining issues I found are secondary P2s outside the core methodology. Static review only; I could not execute the test suite here because the environment is missing …
P2 perf: extend the trigger_onset_per_row precomputation to the
non-staggered event-study branch. Previously only the staggered fit() call
populated `trigger_onset_per_row_cached`; the non-staggered branch left it
as None, so `_compute_event_time_per_row` fell back to its dense cohort
loop AGAIN, defeating the sparse cutoff in
`_compute_nearest_treated_distance_static` and adding avoidable cost on
large non-staggered panels. Fix: on the non-staggered branch, derive
trigger_onset_per_row directly from `unit_to_d` (already built from the
static distance result) and the single shared effective onset. In the
non-staggered case the trigger collapses to a constant: any unit within
`d_bar` triggers at `shared_effective_onset`; far-away units have a NaN
trigger. No second pairwise pass is needed.

P2 docs: remove the unwired `diagnostic_report.event_study_diagnostics`
consumability claim from CHANGELOG, llms-full.txt, the results.py
docstring, REGISTRY.md, and api/spillover.rst. SpilloverDiDResults is NOT
registered in DiagnosticReport's `_APPLICABILITY` / `_PT_METHOD` tables, so
`DiagnosticReport(spillover_result)` does not route to event-study
diagnostics. The `plot_event_study` integration IS wired and keeps its
claim. Added a TODO row tracking the deferred DiagnosticReport routing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
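The non-staggered shortcut amounts to a constant-valued map. The names below (`unit_to_d`, `shared_effective_onset`) follow the description above; the values are illustrative:

```python
import numpy as np

# Sketch of the non-staggered shortcut: with a single shared onset, the
# spillover trigger is constant for all in-range units, so no second
# pairwise pass is needed. unit_to_d maps unit -> distance to nearest
# treated unit (already computed by the static distance helper).
unit_to_d = {1: 0.0, 2: 30.0, 3: 500.0}
d_bar = 100.0
shared_effective_onset = 5
trigger_onset_per_unit = {
    u: (shared_effective_onset if d <= d_bar else np.nan)
    for u, d in unit_to_d.items()
}
```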
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. Static re-review only. I did not find a new unmitigated P0/P1 in the changed PR scope. I could not execute the test suite here because the environment is missing …
CI Pure Python Fallback (Linux py3.14) drifted 1 ULP from the macOS
Accelerate capture machine on `test_event_study_false_matches_wave_b_golden`:
expected -0.08620379515400438, got -0.08620379515400439. The 6 `==` checks
against _WAVE_B_GOLDEN_* are cross-machine pins, exactly the BLAS
reduction-order class that `feedback_assert_allclose_numerical_parity`
warns about.

Switched all 6 golden assertions to
`np.testing.assert_allclose(rtol=1e-14, atol=1e-14)`: tight enough to catch
real aggregate-path drift, loose enough to absorb cross-runner ULP
differences. The same-machine determinism check
`test_event_study_false_bit_identical_to_wave_b_fixture` keeps `==` (both
fits run on the same runner).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
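The tolerance choice can be verified directly: a 1-ULP difference sits far inside rtol=1e-14, while genuine numeric drift does not. The golden value below is the one quoted above; the drift factor is illustrative:

```python
import numpy as np

# A 1-ULP BLAS reduction-order difference passes the rtol=1e-14 pin,
# while real aggregate-path drift fails it.
golden = -0.08620379515400438
one_ulp_off = np.nextafter(golden, -1.0)   # the Linux py3.14 value
np.testing.assert_allclose(one_ulp_off, golden, rtol=1e-14, atol=1e-14)
drifted = golden * (1 + 1e-10)             # genuine numeric drift (illustrative)
```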
🔁 AI review rerun (requested by @igerber)
Overall Assessment: Looks good. Static re-review only. I did not find a new unmitigated P0/P1 in the changed PR scope. I could not execute the test suite here because the environment is missing …
…eased] CHANGELOG conflict (PR #457 BaconDecomposition R parity goldens)

# Conflicts:
#   CHANGELOG.md
🔁 AI review rerun (requested by @igerber)
Summary
- Replaces the Wave B `NotImplementedError` gate at `spillover.py:1430-1442` with the full per-event-time × ring decomposition from Butts (2021) Section 5 / Table 2.
- Emits per-event-time direct effects `tau_k` and per-(ring, event-time) spillover effects `delta_jk` as `att_dynamic: pd.DataFrame` (indexed by event-time `k`) plus MultiIndex `spillover_effects` (levels `(ring_label, event_time)`).
- `event_study_effects: Dict[int, Dict]` alias for `plot_event_study` / `diagnostic_report.event_study_diagnostics` consumption.
- Scalar `att` aggregation: sample-share-weighted average of post-treatment `tau_k` with SE from linear-combination inference on the post-treatment block of the stage-2 vcov.
- `generate_butts_staggered_dgp` extended with `tau_per_event_time` / `delta_per_ring_per_event_time` callable kwargs (backward-compatible; bit-identical to Wave B when both default to `None`, verified by pinned SHA-256 baselines).
- `event_study=False` leaves all Wave C fields as `None` and reproduces Wave B SEs bit-identically.

Methodology references (required if estimator / math changes)
- `horizon_max` semantics: bins event-times outside `[-H, +H]` into endpoint pools; TwoStageDiD filters those rows. Honors `feedback_no_silent_failures`; cross-documented in both estimators' docstrings.
- Reference period `-1 - anticipation` mirrors TwoStageDiD at `two_stage.py:486`. Reference row uses `coef = 0.0, se = 0.0, n_obs = 0, conf_int = (0.0, 0.0)` for TwoStageDiD parity.
- Scalar `att`: `att = sum_{k>=0} w_k * tau_k` with `w_k = n_treated_at_k / total`. SE from linear-combination inference `Var(att) = w' V_subset w` (no separate fit). CallawaySantAnna `aggregate_method="simple"` and TwoStageDiD's analogous aggregate path use the same share-weighting convention.
- `ref_period < -horizon_max` raises `ValueError`; a silent floor-shift would change identification.
- Per-event-time SEs use `solve_ols`'s standard variance WITHOUT the Gardner GMM first-stage uncertainty correction (a Wave D follow-up will close this).

Validation
- `tests/test_spillover.py` (+817 lines, 30 new test methods): event-study API, two-clock K helper, horizon binning, design builder, reference period, reduce-to-aggregate, identification MC (50 seeds, per-event-time tau_k recovery within 0.025), placebo pre-trends (Type I rate ≤ 0.30 over 50 seeds at alpha=0.10), singularity (rectangular schema), Conley integration (vcov shape + non-negative diagonal), summary / to_dict / pickle round-trip, event_study_effects schema parity, lincom-att hand-computed, validation (horizon_max < 0, ref_period < -horizon_max), fit idempotence.
- `tests/test_dgp_utils.py` (new, 148 lines): pinned SHA-256 bit-identity baselines for `generate_butts_staggered_dgp` / `generate_butts_nonstaggered_dgp`; callable-kwarg unit tests.

Security / privacy
Generated with Claude Code