SpilloverDiD: event_study=True per-event-time × ring decomposition (Wave C) #456
Conversation
Overall Assessment: The documented methodology choices in …
Code Quality: No material findings beyond the aggregation issues above.
Performance: No material findings in review scope.
Tech Debt: No separate deferrable finding. The two blocking issues above are correctness problems and should be fixed in code, not deferred.
Security: No findings.
…ave C)
Replaces the Wave B NotImplementedError gate at spillover.py:1430-1442 with
the full per-event-time × ring decomposition from Butts (2021) Section 5 /
Table 2. Emits per-event-time direct effects tau_k and per-(ring, event-time)
spillover effects delta_jk as att_dynamic: pd.DataFrame (indexed by k) and
MultiIndex spillover_effects (levels (ring_label, event_time)). A TwoStageDiD-
compatible event_study_effects: Dict[int, Dict] alias (mirroring
two_stage.py:1355-1389 schema with conf_int = (low, high) tuple) is also
emitted for consumption by plot_event_study and diagnostic_report.
Methodology: the implementation operationalizes Butts' single K_it symbol as
TWO event-time clocks — K_direct = t - effective_first_treat(i) for ever-
treated rows, and K_spill = t - earliest-in-range-cohort-onset(i) for spillover
rows (running min across activated cohorts; NaN for pre-trigger and far-away
rows). K_spill >= 0 structurally; negative-k spillover cells emit rectangularly
with coef = NaN, n_obs = 0.
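The two clocks above can be sketched as follows. This is a minimal illustration with hypothetical column names (`first_treat`, `trigger_onset`), not the library's internal layout:

```python
import numpy as np
import pandas as pd

# Hypothetical sketch of the two event-time clocks; column names are
# illustrative, not the library's API.
df = pd.DataFrame({
    "unit": [1, 1, 1, 2, 2, 2],
    "t":    [0, 1, 2, 0, 1, 2],
    "first_treat":   [1.0, 1.0, 1.0, np.nan, np.nan, np.nan],  # own onset (ever-treated)
    "trigger_onset": [np.nan] * 3 + [1.0] * 3,  # earliest in-range cohort onset
})
# K_direct: event time relative to the unit's own treatment onset
df["K_direct"] = df["t"] - df["first_treat"]
# K_spill: event time since the earliest activated cohort within d_bar;
# NaN for pre-trigger and far-away rows, so K_spill >= 0 where defined
k_spill = df["t"] - df["trigger_onset"]
df["K_spill"] = k_spill.where(k_spill >= 0)
```

Note that `K_spill` is masked below zero, matching the structural constraint described above: pre-trigger spillover cells only appear in the rectangular emission with `coef = NaN, n_obs = 0`.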
Reference period: ref_period = -1 - anticipation (TwoStageDiD parity at
two_stage.py:486). When horizon_max is set, ref_period must fall inside
[-horizon_max, +horizon_max] or fit raises ValueError — silent floor-shift to
-horizon_max would change identification (rejected per feedback_no_silent_failures).
Reference row uses coef = 0.0, se = 0.0, n_obs = 0, conf_int = (0.0, 0.0).
horizon_max semantics (divergence from TwoStageDiD): bins event-times outside
[-H, +H] into endpoint pools, no observations dropped. TwoStageDiD filters
those rows. Divergence intentional + cross-documented. horizon_max=None auto-
detects the bin set from observed K values.
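The endpoint-pooling semantic can be sketched as a clip rather than a filter. This is an assumed simplification of the shipped helper, not its actual code:

```python
import numpy as np

# Sketch of the endpoint-binning semantic: event times outside [-H, +H]
# are pooled into the endpoints rather than dropped, so no rows are lost.
def bin_event_times(k, horizon_max):
    k = np.asarray(k, dtype=float)
    if horizon_max is None:
        return k  # auto-detected bin set: use observed K values as-is
    return np.clip(k, -horizon_max, horizon_max)

binned = bin_event_times([-5, -2, -1, 0, 3, 7], horizon_max=2)
# TwoStageDiD would instead drop the -5, 3, and 7 rows entirely.
```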
Scalar att aggregation: sample-share-weighted average of post-treatment tau_k
(att = sum_{k>=0} w_k * tau_k with w_k = n_treated_at_k / total). SE from
linear-combination inference Var(att) = w' V_subset w on the post-treatment
block of the stage-2 vcov — no separate fit.
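The aggregation and its linear-combination SE reduce to a few array operations. The numbers below are made up for illustration; only the formulas mirror the description above:

```python
import numpy as np

# Sketch of the sample-share-weighted scalar att and its lincom SE
# (tau_k, counts, and vcov values are illustrative).
tau_k = np.array([0.8, 1.0, 1.2])          # post-treatment direct effects, k = 0, 1, 2
n_treated_at_k = np.array([50, 30, 20])    # treated sample at each event time
V_subset = np.diag([0.04, 0.05, 0.06])     # post-treatment block of stage-2 vcov

w = n_treated_at_k / n_treated_at_k.sum()  # w_k = n_treated_at_k / total
att = w @ tau_k                            # att = sum_{k>=0} w_k * tau_k
se = np.sqrt(w @ V_subset @ w)             # Var(att) = w' V_subset w
```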
Reduce-to-aggregate equivalence: under constant-tau DGP with horizon_max=None,
the lincom-weighted scalar att reproduces Wave B's aggregate tau_total bit-
identically. Note: horizon_max=0 does NOT reduce to Wave B (binning collapses
pre-treatment K values into k=0, making D^0 = D_i ever-treated indicator
rather than D_it).
Backward compatibility: event_study=False leaves all Wave C fields
(att_dynamic, event_study_effects, horizon_max, reference_period) as None
and reproduces Wave B SEs bit-identically.
Variance caveat: per-event-time SEs use solve_ols's standard variance
(HC1 / Conley / cluster) WITHOUT the Gardner GMM first-stage uncertainty
correction; planned Wave D follow-up closes this.
Tests: 30 new event-study test methods covering API, two-clock K helper,
horizon binning, design builder, reference period, reduce-to-aggregate,
identification MC (50 seeds, per-event-time tau_k recovery within 0.025),
placebo pre-trends (Type I rate <= 0.30 over 50 seeds at alpha=0.10),
singularity (rectangular schema), Conley integration (vcov shape +
np.diag >= 0), summary/to_dict/pickle round-trip, event_study_effects
schema parity with TwoStageDiD, lincom-att hand-computed, validation
(horizon_max < 0, ref_period < -horizon_max), and fit idempotence.
DGP factory generate_butts_staggered_dgp extended with tau_per_event_time
and delta_per_ring_per_event_time callable kwargs (backward-compatible —
both default to None, producing the Wave B scalar DGP bit-identically;
verified by tests/test_dgp_utils.py with pinned SHA-256 baselines).
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
P1: recompute n_obs_per_col on the POST-finite_mask sample. Previously
_build_event_study_design counted rows on the pre-mask design; those stale
counts then populated att_dynamic / event_study_effects AND weighted the
scalar att. On warn-and-drop fits, reported metadata disagreed with the
actual stage-2 sample AND the point estimate could change. Fix: in fit(),
recompute event_study_meta["n_obs_per_col"] from X_2_fit after applying
finite_mask.

P1: fail-closed scalar att when a post-direct coef is NaN. Previously
np.nansum on a fixed weight vector silently zeroed dropped (rank-deficient)
post-treatment direct coefficients, changing the ATT point estimate. Fix:
detect any non-finite post-direct coef and return att=NaN with a UserWarning.
Library convention is no-silent-failures (feedback_no_silent_failures);
inspect att_dynamic for the per-event-time coefficients and re-aggregate
manually if needed.

P3: emit (ring, ref_period) spillover_effects rows. The pre-filter dropped
ref_period from k_grid entirely, so the rectangular emission missed the
(ring, ref_period) cell for each ring. Consumers iterating [-H, ..., +H]
would KeyError on the reference slice. Fix: emit (ring, ref_period) with
coef=0.0, se=0.0, n_obs=0 to mirror the direct-effect reference row.

Tests added:
- TestSpilloverDiDEventStudyFiniteMaskPath: warn-and-drop panel
  (baseline-treated units with no Omega_0 rows) with event_study=True.
  Verifies att_dynamic["n_obs"], event_study_effects[k]["n_obs"], AND the
  scalar att share weights all reflect the post-mask sample. A hand-computed
  lincom reproduces res.att at machine precision.
- TestSpilloverDiDEventStudyRankDeficientFailClosed: monkey-patches
  solve_ols to NaN out one post-direct coef. Asserts att=NaN and that the
  documented warning fires.
- TestSpilloverDiDEventStudyReferencePeriodSpilloverRows: asserts every
  (ring, ref_period) cell exists in spillover_effects with the 0-anchored
  schema.
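The fail-closed contract can be sketched in a few lines. The `aggregate_att` helper below is hypothetical; it only illustrates why `np.nansum` with fixed weights silently biases the estimate while the fail-closed path surfaces the problem:

```python
import warnings

import numpy as np

# Hypothetical sketch of the fail-closed aggregation contract: a NaN
# post-treatment coefficient must surface as att=NaN plus a warning,
# not be silently zeroed by np.nansum.
def aggregate_att(tau_k, w):
    tau_k = np.asarray(tau_k, dtype=float)
    if not np.all(np.isfinite(tau_k)):
        warnings.warn("non-finite post-treatment coefficient; returning att=NaN")
        return float("nan")
    return float(np.dot(w, tau_k))

# np.nansum treats the dropped coefficient as 0 and changes the estimate:
silent = np.nansum(np.array([0.8, np.nan]) * np.array([0.5, 0.5]))
with warnings.catch_warnings():
    warnings.simplefilter("ignore")
    safe = aggregate_att([0.8, np.nan], [0.5, 0.5])  # NaN: fail-closed
```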
Strengthened test_per_event_time_tau_recovers_exact_y_with_zero_noise to
verify the closed form y = mu_i + lambda_t + tau_fn(k) holds at atol=1e-12
per row, instead of merely checking that k=0 rows exist.

CHANGELOG + REGISTRY updated to document the post-finite_mask contract, the
fail-closed scalar att invariant, and the reference-period spillover rows.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Force-pushed bdfa2b8 to dbe0b78
🔁 AI review rerun (requested by @igerber)
Overall Assessment: This re-review clears the prior blockers. The two earlier P1s are fixed: event-study counts/weights now recompute after …
P2: validate anticipation BEFORE the ref_period_check compatibility test.
Previously, when event_study=True and horizon_max is set, the code computed
`ref_period_check = -1 - self.anticipation` BEFORE validating anticipation,
so non-numeric values (None, "1") raised a raw TypeError instead of the
targeted ValueError ("anticipation must be a non-negative integer"). Fix:
move the anticipation type/value validator above both the horizon_max
validator and the ref_period check.
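The ordering contract can be sketched as a standalone helper. The `validate` function below is hypothetical (the shipped checks live in the estimator itself); it shows why the anticipation validator must run before any arithmetic that uses the value:

```python
# Hypothetical sketch of the validation ordering: type/value-check
# anticipation first, so bad inputs raise the targeted ValueError instead
# of a raw TypeError from `-1 - anticipation`.
def validate(anticipation, horizon_max, event_study=True):
    if (not isinstance(anticipation, int)) or isinstance(anticipation, bool) \
            or anticipation < 0:
        raise ValueError("anticipation must be a non-negative integer")
    if horizon_max is not None and horizon_max < 0:
        raise ValueError("horizon_max must be >= 0 or None")
    ref_period = -1 - anticipation  # safe: anticipation already validated
    if event_study and horizon_max is not None and ref_period < -horizon_max:
        raise ValueError("ref_period falls outside [-horizon_max, +horizon_max]")
    return ref_period
```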
P3: split (ring, k) into separate columns in summary() rendering. The prior
combined `f"{ring} k={k}"` label was truncated to 15 chars, making distinct
horizons within the same ring (e.g. "[50, 200] k=+0" vs "[50, 200] k=+1")
visually indistinguishable. New layout for MultiIndex spillover_effects:
separate `Ring` (15 chars) + `k` (5 chars) columns. Header width updated
to fit the new column. Non-MultiIndex aggregate path unchanged.
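The new layout can be sketched with format specifiers. Widths follow the description above; the values are made up and only the column separation is the point:

```python
# Illustrative sketch of separate fixed-width Ring / k columns, replacing
# the truncated combined "{ring} k={k}" label.
rows = [("[50, 200]", 0, 0.12), ("[50, 200]", 1, 0.08)]
header = f"{'Ring':<15} {'k':>5} {'coef':>10}"
lines = [header] + [f"{ring:<15} {k:>+5d} {coef:>10.4f}" for ring, k, coef in rows]
```

With the combined 15-char label, both rows would truncate to the same string; with separate columns the horizons stay distinguishable.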
P3: tighten TestSpilloverDiDEventStudyReduceToAggregate to match the
CHANGELOG's bit-identical claim. Previously the test used the default
noisy DGP with abs(diff) < 1e-3, but the changelog says "verified bit-
identical at machine precision". Fix: deterministic DGP via `error_sd=0.0`
+ atol=1e-10.
Regression tests:
- test_non_numeric_anticipation_raises_targeted_value_error: anticipation="1"
must raise ValueError, not TypeError.
- test_none_anticipation_raises_targeted_value_error: anticipation=None must
raise the targeted ValueError too.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. No unmitigated P0/P1 findings in the changed PR surface. The prior review items are addressed: …
P3 docstring polish: `SpilloverDiD.__init__`'s `horizon_max` parameter description previously said "Mirrors TwoStageDiD", which contradicts the shipped Wave C behavior. SpilloverDiD bins event-times outside `[-H, +H]` into endpoint pools (no row drop); TwoStageDiD filters those rows. Updated the docstring to spell out the binning semantic + cross-document the intentional divergence + reference the ValueError when ref_period falls outside the window. Matches the REGISTRY + api/spillover.rst + llms-full.txt narrative. Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
/ai-review
🔁 AI review rerun (requested by @igerber)
Overall Assessment: The changed estimator surface remains directionally consistent with the cited source material: Butts explicitly studies DiD with spatial spillovers, including staggered timing, and Gardner's two-stage estimator fits untreated observations first with a second-stage correction for first-stage estimation. The code still documents the missing Gardner correction in docs/methodology/REGISTRY.md:L3021 and TODO.md:L132, so that caveat is mitigated rather than blocking for this PR.
P3: REGISTRY says Wave C ships per-event-time delta_jk recovery via
TestSpilloverDiDEventStudyIdentification, but the existing MC only checks
tau_k. Added test_per_ring_event_time_delta_jk_recovery: a 50-seed staggered
MC with a delta_per_ring_per_event_time profile, asserting
spillover_effects.loc[(ring, k), "coef"] recovers the per-event-time
delta_jk target within 0.025 absolute tolerance.

P3: CHANGELOG says event_study=False is bit-identical to Wave B "verified
by TestSpilloverDiDEventStudyBackwardCompat", but the existing test only
fits twice on the current code path (determinism, not pre-Wave-C parity).
Added test_event_study_false_matches_wave_b_golden, which pins att/se/
per-ring golden values captured against the Wave C event_study=False path
(which is unchanged from Wave B). Since the aggregate stage-2 design, fit,
and extraction logic are untouched in Wave C, these golden values ARE the
Wave B numerics; any future drift on this pin indicates an accidental
change to the aggregate path.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: The core estimator logic still looks directionally consistent with the cited source material: Butts is explicitly about DiD with spatial spillovers under staggered timing, and Gardner's two-stage design estimates untreated-outcome structure first, then recovers treatment effects in a second stage. My blocking concern is not a hidden paper-faithfulness break; it is a new …
P1: reject horizon_max=0 under event_study=True. The previous docs said
H=0 was a "well-defined but semantically distinct" design, but every
event_study=True + horizon_max=0 + anticipation=0 fit hit the ref_period
guard at -1 and raised. Resolution: lock the contract by rejecting H=0
explicitly with a remediation message ("use event_study=False for the
aggregate Wave B spec; event-study mode requires horizon_max>=1 or
horizon_max=None"). Updated REGISTRY + CHANGELOG to match. Added
test_horizon_max_zero_with_event_study_raises regression.
P2: plot_event_study now honors SpilloverDiDResults.reference_period.
Wave C's rectangular event_study_effects emits multiple empty horizons
(n_obs == 0 on dropped post-direct cells + the reference row); the legacy
"first n_obs==0 row" reference detection could pick a non-reference
empty horizon as the reference. Fix in _extract_plot_data: prefer
results.reference_period when present (truthy attribute), fall back to
the legacy n_obs==0 heuristic otherwise. Backward-compatible for
estimators without the attribute (CallawaySantAnna, SunAbraham, etc.).
Regression test on a Wave C fit with horizon_max=4 (oversized → multiple
empty horizons) asserts the inferred reference is -1 not the first empty
horizon.
P3: soften "Wave B bit-identical" claim. CHANGELOG previously said
"reproduces Wave B SEs bit-identically (verified by ...)" implying a
pre-Wave-C checkout artifact; the goldens were actually captured on
the current (Wave C) event_study=False path. Updated to: "the aggregate
stage-2 design construction, fit, and extraction logic on this path
are byte-identical to Wave B; the test pins goldens captured on the
unchanged aggregate path so any future drift fails the regression."
Same softening in the test class docstring.
Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. The prior blocking concerns are addressed, and I did not find a new unmitigated P0/P1 issue in the changed estimator path. The new event-study surface remains consistent with the repo's documented Butts/Gardner interpretation, while the remaining non-literal-paper choices are explicitly documented in the methodology registry rather than being silent deviations.
Methodology: No unmitigated findings. The changed estimator path matches the repo's documented Butts Section 5 / Gardner two-stage reading, and the remaining deviations/limitations in this PR are explicitly labeled in docs/methodology/REGISTRY.md:L2997-L3021 rather than being silent methodology drift.
Code Quality: No findings.
Maintainability: No findings.
Security: No findings.
P2 perf: remove the duplicate cohort distance pass. Previously the
event-study path computed cohort-by-unit distances twice on staggered
panels: once in _compute_nearest_treated_distance_staggered for d_it
(running min), then again in _compute_event_time_per_row to recover the
per-row spillover-trigger onset. On large staggered panels this doubled the
dominant spatial work. Fix: thread d_bar into
_compute_nearest_treated_distance_staggered as an optional kwarg. When
supplied, the cohort loop now ALSO computes trigger_onset_per_unit (the
first cohort whose treated units fall within d_bar of unit i) and
broadcasts it to rows. The helper's return is now a 4-tuple
(d_it, row_unit, row_time, trigger_onset_or_None).
_compute_event_time_per_row accepts an optional
precomputed_trigger_onset_per_row that, when supplied (as fit() now does on
the staggered event-study path), skips the redundant cohort loop, and falls
back to inline computation for unit-test callers. Test callsites for
_compute_nearest_treated_distance_staggered were updated to handle the new
4-tuple via `d_it, row_unit, row_time, _trigger = ...`.

P3 docs: llms-full.txt and api/spillover.rst now explicitly state that
event_study=True requires horizon_max>=1 or None (horizon_max=0 is
rejected, with a redirect to event_study=False for the aggregate spec). The
previous wording described horizon_max=0 as a meaningful collapsed design,
which contradicted the new R5 rejection.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
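The single-pass pattern can be sketched like this. `distances_and_trigger` is a hypothetical simplification of the helpers named above: the same cohort loop that maintains the running-min distance also records the first in-range cohort onset, avoiding a second pairwise pass:

```python
import numpy as np

# Illustrative single-pass sketch: compute the running-min distance d_it
# and the spillover trigger onset in one cohort loop.
def distances_and_trigger(unit_xy, cohorts, d_bar):
    n = len(unit_xy)
    d_it = np.full(n, np.inf)
    trigger = np.full(n, np.nan)
    for onset, treated_xy in sorted(cohorts.items()):  # ascending onset
        # distance from each unit to its nearest treated unit in this cohort
        d_cohort = np.min(
            np.linalg.norm(unit_xy[:, None, :] - treated_xy[None, :, :], axis=2),
            axis=1,
        )
        d_it = np.minimum(d_it, d_cohort)        # running min across cohorts
        newly = np.isnan(trigger) & (d_cohort <= d_bar)
        trigger[newly] = onset                   # first in-range cohort onset
    return d_it, trigger

d_it, trigger = distances_and_trigger(
    np.array([[0.0, 0.0], [10.0, 0.0]]),
    {2: np.array([[9.0, 0.0]]), 1: np.array([[0.0, 1.0]])},
    d_bar=2.0,
)
```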
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. The prior blocking items are addressed, and I did not find a new unmitigated P0/P1 in the changed estimator/math path. The remaining issues I found are secondary P2s outside the core methodology. Static review only; I could not execute the test suite here because the environment is missing …
P2 perf: extend the trigger_onset_per_row precomputation to the
non-staggered event-study branch. Previously only the staggered fit() call
populated `trigger_onset_per_row_cached`; the non-staggered branch left it
as None, so `_compute_event_time_per_row` fell back to its dense cohort
loop AGAIN, defeating the sparse cutoff in
`_compute_nearest_treated_distance_static` and adding avoidable cost on
large non-staggered panels. Fix: on the non-staggered branch, derive
trigger_onset_per_row directly from `unit_to_d` (already built from the
static distance result) and the single shared effective onset. In the
non-staggered case the trigger collapses to a constant: any unit within
`d_bar` triggers at `shared_effective_onset`; far-away units have a NaN
trigger. No second pairwise pass is needed.

P2 docs: remove the unwired `diagnostic_report.event_study_diagnostics`
consumability claim from CHANGELOG, llms-full.txt, the results.py
docstring, REGISTRY.md, and api/spillover.rst. SpilloverDiDResults is NOT
registered in DiagnosticReport's `_APPLICABILITY` / `_PT_METHOD` tables, so
`DiagnosticReport(spillover_result)` does not route to event-study
diagnostics. The `plot_event_study` integration IS wired and keeps its
claim. Added a TODO row tracking the deferred DiagnosticReport routing.

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
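The non-staggered shortcut amounts to a constant-valued map. The names below (`unit_to_d`, `shared_effective_onset`) follow the description above; the values are illustrative:

```python
import numpy as np

# Sketch of the non-staggered shortcut: with a single shared onset, the
# spillover trigger is constant for all in-range units, so no second
# pairwise pass is needed. unit_to_d maps unit -> distance to nearest
# treated unit (already computed by the static distance helper).
unit_to_d = {1: 0.0, 2: 30.0, 3: 500.0}
d_bar = 100.0
shared_effective_onset = 5
trigger_onset_per_unit = {
    u: (shared_effective_onset if d <= d_bar else np.nan)
    for u, d in unit_to_d.items()
}
```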
🔁 AI review rerun (requested by @igerber)
Overall Assessment: ✅ Looks good. Static re-review only. I did not find a new unmitigated P0/P1 in the changed PR scope. I could not execute the test suite here because the environment is missing …
CI Pure Python Fallback (Linux py3.14) drifted 1 ULP from the macOS
Accelerate capture machine on `test_event_study_false_matches_wave_b_golden`:
expected -0.08620379515400438, got -0.08620379515400439. The 6 `==` checks
against _WAVE_B_GOLDEN_* are cross-machine pins, exactly the BLAS
reduction-order class that `feedback_assert_allclose_numerical_parity`
warns about.

Switched all 6 golden assertions to
`np.testing.assert_allclose(rtol=1e-14, atol=1e-14)`: tight enough to catch
real aggregate-path drift, loose enough to absorb cross-runner ULP
differences. The same-machine determinism check
`test_event_study_false_bit_identical_to_wave_b_fixture` keeps `==` (both
fits run on the same runner).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
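The tolerance choice can be verified directly: a 1-ULP difference sits far inside rtol=1e-14, while genuine numeric drift does not. The golden value below is the one quoted above; the drift factor is illustrative:

```python
import numpy as np

# A 1-ULP BLAS reduction-order difference passes the rtol=1e-14 pin,
# while real aggregate-path drift fails it.
golden = -0.08620379515400438
one_ulp_off = np.nextafter(golden, -1.0)   # the Linux py3.14 value
np.testing.assert_allclose(one_ulp_off, golden, rtol=1e-14, atol=1e-14)
drifted = golden * (1 + 1e-10)             # genuine numeric drift (illustrative)
```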
🔁 AI review rerun (requested by @igerber)
Overall Assessment: Looks good. Static re-review only. I did not find a new unmitigated P0/P1 in the changed PR scope. I could not execute the test suite here because the environment is missing …
…eased] CHANGELOG conflict (PR #457 BaconDecomposition R parity goldens)

# Conflicts:
#   CHANGELOG.md
🔁 AI review rerun (requested by @igerber)
Summary
- Replaces the Wave B `NotImplementedError` gate at `spillover.py:1430-1442` with the full per-event-time × ring decomposition from Butts (2021) Section 5 / Table 2.
- Emits per-event-time direct effects `tau_k` and per-(ring, event-time) spillover effects `delta_jk` as `att_dynamic: pd.DataFrame` (indexed by event-time `k`) plus MultiIndex `spillover_effects` (levels `(ring_label, event_time)`).
- `event_study_effects: Dict[int, Dict]` alias for `plot_event_study` / `diagnostic_report.event_study_diagnostics` consumption.
- Scalar `att` aggregation: sample-share-weighted average of post-treatment `tau_k` with SE from linear-combination inference on the post-treatment block of the stage-2 vcov.
- `generate_butts_staggered_dgp` extended with `tau_per_event_time` / `delta_per_ring_per_event_time` callable kwargs (backward-compatible; bit-identical to Wave B when both default to `None`, verified by pinned SHA-256 baselines).
- `event_study=False` leaves all Wave C fields as `None` and reproduces Wave B SEs bit-identically.

Methodology references (required if estimator / math changes)
- `horizon_max` semantics: bins event-times outside `[-H, +H]` into endpoint pools; TwoStageDiD filters those rows. Honors `feedback_no_silent_failures`; cross-documented in both estimators' docstrings.
- Reference period `-1 - anticipation` mirrors TwoStageDiD at `two_stage.py:486`. Reference row uses `coef = 0.0, se = 0.0, n_obs = 0, conf_int = (0.0, 0.0)` for TwoStageDiD parity.
- Scalar `att`: `att = sum_{k>=0} w_k * tau_k` with `w_k = n_treated_at_k / total`. SE from linear-combination inference `Var(att) = w' V_subset w` (no separate fit). CallawaySantAnna `aggregate_method="simple"` and TwoStageDiD's analogous aggregate path use the same share-weighting convention.
- `ref_period < -horizon_max` raises `ValueError`; a silent floor-shift would change identification.
- Per-event-time SEs use `solve_ols`'s standard variance WITHOUT the Gardner GMM first-stage uncertainty correction (a Wave D follow-up will close this).

Validation
- `tests/test_spillover.py` (+817 lines, 30 new test methods): event-study API, two-clock K helper, horizon binning, design builder, reference period, reduce-to-aggregate, identification MC (50 seeds, per-event-time tau_k recovery within 0.025), placebo pre-trends (Type I rate ≤ 0.30 over 50 seeds at alpha=0.10), singularity (rectangular schema), Conley integration (vcov shape + non-negative diagonal), summary / to_dict / pickle round-trip, event_study_effects schema parity, lincom-att hand-computed, validation (horizon_max < 0, ref_period < -horizon_max), fit idempotence.
- `tests/test_dgp_utils.py` (new, 148 lines): pinned SHA-256 bit-identity baselines for `generate_butts_staggered_dgp` / `generate_butts_nonstaggered_dgp`; callable-kwarg unit tests.

Security / privacy
Generated with Claude Code