diff --git a/CHANGELOG.md b/CHANGELOG.md index a2841373..252cb4b5 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -8,6 +8,7 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 ## [Unreleased] ### Added +- **StackedDiD `vcov_type` parameter (Phase 1b PR 2/8).** `StackedDiD(vcov_type=...)` now accepts `{"hc1","hc2_bm"}` (defaults to `"hc1"`, preserves prior behavior at machine precision — WLS-CR1 sandwich is algebraically invariant between the prior bake-Q-into-X pattern and the new `solve_ols(weights=composed_weights, vcov_type=...)` path, differing only by float64 multiplication ordering at ~2 ULPs at SE scale; pinned by `test_hc1_se_bit_equal_to_pre_pr_baseline` at `atol=1e-13`). `classical` and `hc2` rejected at `__init__` with cluster-incompatibility `ValueError` — StackedDiD clusters intrinsically at `'unit'` or `'unit_subexp'` (no `cluster=None` opt-out), so one-way families are not composable with the linalg validator's cluster_ids check. `hc2_bm` routes CR2 Bell-McCaffrey through the clubSandwich WLS-CR2 port (PR #475) — matches `clubSandwich::vcovCR(lm(weights=Q,...), cluster=~unit|unit_subexp, type="CR2") + coef_test()$df_Satt` at `atol=1e-10` on the new `benchmarks/data/stacked_did_golden.json` fixture (6 R-parity tests in `tests/test_methodology_stacked_did.py`). HC1 + cluster matches `clubSandwich::vcovCR(..., type="CR1S")` (Stata-style `G/(G-1) * (n-1)/(n-p)` correction; plain `type="CR1"` omits the `(n-1)/(n-p)` term and would diverge by ~1.4% on the test fixture). **Bell-McCaffrey Satterthwaite DOF is threaded into the user-facing aggregated inference for hc2_bm**: per-event-time `event_study_effects[h]['p_value']/['conf_int']` use the per-coefficient contrast DOF from `_compute_cr2_bm_contrast_dof`; `overall_p_value`/`overall_conf_int` use the post-period-average contrast DOF, matching R `Wald_test(test="HTZ")$df_denom` at atol=1e-10. Without this threading the small-sample inference would silently fall back to normal-theory (df=None) — mirrors the SunAbraham aggregated-inference pattern from PR #472 and addresses the R1 CI codex P0 caught at submission time. `vcov_type` propagated to `StackedDiDResults.vcov_type`. `SurveyDesign` combined with `vcov_type != "hc1"` raises `NotImplementedError`: survey TSL/replicate-refit overrides analytical sandwich. Reject order locked: fweight/aweight check at `stacked_did.py:309` fires before the survey + non-hc1 check at `stacked_did.py:~325` (pinned by `test_aweight_plus_hc2_bm_rejected_by_stacked_did_level_guard`). `conley` rejected with a deferral message. `weight_type ∈ {"aweight","fweight"}` + `hc2_bm` continues to raise per the linalg validator's pweight-only restriction in PR #475 (`vcov_type="hc1"` supports all three weight types via the existing StackedDiD Q-weight semantics reject). Second PR of the Phase 1b standalone-estimator threading initiative (6 PRs to follow: WooldridgeDiD-OLS, CallawaySantAnna, ImputationDiD, TripleDifference, TwoStageDiD, EfficientDiD). - **ContinuousDiD methodology-review-tracker promotion.** Tracker row flipped **In Progress** → **Complete** with full Verified Components / Test Coverage / Corrections Made / Deviations / Outstanding Concerns structure mirroring the HAD precedent (PR #473). REGISTRY `## ContinuousDiD` gains a formal Deviations block consolidating the boundary-knots deviation from R `contdid` v0.1.0 (`range(dose)` vs `range(dvals)` — library avoids extrapolation), the `bspline_derivative` derivative-failure `UserWarning` (Phase 2 axis-C #12), the `+inf` → `0` never-treated recoding warning, and the zero-`first_treat`+nonzero-`dose` force-zeroing warning (both axis-E silent-coercion fixes) into a single AI-review-recognized labeled surface. R cross-language coverage for ContinuousDiD runs at relative tolerance across two surfaces: (a) **scalar parity with raw R `cont_did` / `pte_default`** at 1% on overall ATT for all 6 benchmarks and on overall ACRT for benchmarks 4-5 (benchmark 6 is event-study, scalar `overall_att` only); (b) **harmonized boundary-knot-normalized curve parity** with R-side ATT(d) / ACRT(d) reconstructed under `Boundary.knots = range(treated_doses)` (matching the library) on benchmarks 1-3 via the benchmark harness — `_run_r_contdid` does the R-side rebuild at `tests/test_methodology_continuous_did.py:333-367`, and `_compare_with_r` orchestrates the Python-vs-R comparison at `:395-459` — max ATT(d) at 1% and max ACRT(d) at 2%. NOT bit-exact (`atol=1e-8`) like HAD — the boundary-knots deviation precludes algorithmic bit-equality on aggregated dose-response curves. Surface (a) is direct raw-package parity; surface (b) is reconstructed-basis parity because raw `contdid` curves use `range(dvals)`. No source code changes, no new tests, no new docstrings — consolidation only against the existing 15 methodology tests (`tests/test_methodology_continuous_did.py`), 80 unit tests (`tests/test_continuous_did.py`), and `docs/methodology/continuous-did.md` theory note. `METHODOLOGY_REVIEW.md` ContinuousDiD row promoted **In Progress** → **Complete**. - **`SpilloverDiD(vcov_type="conley", survey_design=...)` integration via stratified-Conley sandwich on PSU totals (Wave E.2).** Lifts the Wave E.1 `NotImplementedError` (`spillover.py:2201` upfront, `two_stage.py:217` helper-level) and adds spatial-HAC + design-based variance for the previously deferred composition. **Documented synthesis** of Conley (1999) spatial-HAC × Gerber (2026, arXiv:2605.04124) Proposition 1 Binder TSL (the Wave E.1 foundation) × Wave D Gardner GMM first-stage uncertainty correction (Butts 2021 §3.1 + Gardner 2022 §4) applied to SpilloverDiD's ring-indicator stage-2 design. No reference software combines all three ingredients on a two-stage influence function. **Mechanical composition (panel-aware):** preserves the library's existing `conley_lag_cutoff = 0` semantic at `diff_diff.conley._compute_conley_meat` ("within-period spatial only — exclude cross-period spatial pairs") by looping over periods. For each period `t`, SpilloverDiD's per-obs Hájek-weighted Wave D IF `psi_i` is aggregated to per-period PSU totals `S_psu_t[g] = sum_{i in PSU g, time t} psi_i` (via `np.add.at`); per-PSU spatial centroids are panel-constant (mean of per-observation `conley_coords` within each PSU, vectorized `np.add.at` sums / `np.bincount` counts); for each stratum the within-stratum sandwich is `M_h_t = (1 - f_h) * n_h/(n_h-1) * sum_{j,k in PSUs_h} K(d(centroid_j, centroid_k) / conley_cutoff_km) * (S_psu_t[j] - S_bar_h_t)(S_psu_t[k] - S_bar_h_t)'`, where K is the Bartlett kernel (SpilloverDiD currently exposes Bartlett only and hardcodes it; the survey helper accepts `"uniform"` too but exposing that on the SpilloverDiD constructor is a separate follow-up) and `d` is haversine / euclidean / callable per `ConleyMetric`. Cross-stratum kernel weights are exactly zero by sampling design (strata are independence partitions). Total meat is `sum_t sum_h M_h_t`. Cross-period spatial pairs are excluded by construction — the per-period loop matches the library's panel Conley contract exactly. **Reduction semantics (load-bearing for tests):** the orchestrator's panel-aware meat equals `sum_t` of per-period within-stratum stratified-Conley sandwiches on per-period PSU totals (pinned at `tests/test_spillover.py::TestSpilloverDiDWaveE2ConleySurveyDesign::test_b_panel_aware_per_period_sum_invariant`); single stratum (H = 1, FPC = inf) reduces to `sum_t` plain Conley sandwich on per-period PSU totals (NOT on time-collapsed totals). **Implementation:** new `_compute_stratified_conley_meat_from_psu_scores` helper in `diff_diff/survey.py` (parallel to existing `_compute_stratified_meat_from_psu_scores` 3-tuple `(meat, variance_computed, legitimate_zero_count)` contract; per-stratum loop replaces the inner `centered.T @ centered` with `_compute_conley_meat(scores=centered, coords=psu_coords_h, ...)` in cross-sectional mode); new dispatch wrapper `_compute_stratified_conley_meat` in `diff_diff/two_stage.py` (parallel to existing `_compute_binder_tsl_meat`, performs per-obs Psi → PSU aggregation + centroid derivation + dispatch to survey helper, intentionally drops `cluster_ids` at the dispatch boundary — see Restrictions). `_compute_gmm_corrected_meat` conley branch extended with `if resolved_survey is not None` routing to the new wrapper; the `resolved_survey is None` branch is bit-identical to Wave D. **Singleton-stratum `lonely_psu="adjust"` parity:** the survey helper mirrors the Binder helper's `continue` to skip the FPC scale on singleton strata (with `n_h = 1` the scale `n_h / (n_h - 1)` would divide by zero); the degenerate one-PSU kernel `K = [[K(0)]] = [[1.0]]` reduces to `centered.T @ centered`, matching Binder's singleton-adjust output. **Saturated `df_survey = 0` NaN-fail:** mirrors Wave E.1 (`_compute_stratified_conley_meat` returns NaN meat with `UserWarning` template "Wave E.2 stratified-Conley sandwich: df_survey = 0..." so callers can `pytest.warns(UserWarning, match="Wave E.2 stratified-Conley")`). **Public surface restrictions:** replicate-weight variance (BRR / Fay / JK1 / JKn / SDR) raises `NotImplementedError` (inherits Wave E.1 gate; per-replicate full refit is separate follow-up scope); `cluster= + survey_design.psu + vcov_type="conley"` coerces `cluster=` to PSU per Wave E.1's warn-and-use-PSU pattern (the Conley cluster product kernel becomes a no-op after PSU aggregation, so `cluster_ids` is intentionally not threaded into the inner Conley kernel call — every PSU is its own cluster post-aggregation, which would zero all cross-PSU pairs); LinearRegression-side `vcov_type="conley" + survey_design=` gate at `diff_diff/linalg.py:2853` remains (separate Bertanha-Imbens 2014 weighted-Conley "Phase 5" roadmap, not Wave E); DiagnosticReport routing for `SpilloverDiDResults(vcov_type="conley", survey_design=)` requires `_APPLICABILITY` / `_PT_METHOD` registration (separate Wave F PR). **Tests:** new `TestSpilloverDiDWaveE2ConleySurveyDesign` and `TestSpilloverDiDWaveE2ConleySurveyDesignEventStudy` classes in `tests/test_spillover.py` (bit-identical no-survey fallback; panel-aware per-period sum invariant on the orchestrator + helper composition; hand-computation methodology anchor; single-stratum ≡ plain Conley on PSU totals; cross-stratum independence as a unit test on the survey helper with interleaved cross-stratum centroids; Binder vs Conley singleton-adjust FPC skip parity; lonely-PSU sensitivity across three modes; FPC large ≡ no-FPC and FPC = n_h zeros stratum; saturated NaN-fail with `pytest.warns(match="Wave E.2 stratified-Conley")`; replicate-weight + non-pweight rejections; cluster warn-and-use-PSU; fit idempotency; `finite_mask` survey-array subsetting; no-PSU coverage — weights-only `SurveyDesign(weights=...)`, strata-only `SurveyDesign(weights=..., strata=...)`, and a per-period re-index unit invariant pinning that no cross-period spatial pairs leak into the meat on implicit-PSU layouts; event-study path on both `is_staggered=True`/`False` branches per `feedback_cohort_loop_trigger_cache_both_branches`; drift goldens at `rtol=1e-12 / atol=1e-14`). The pre-existing `tests/test_spillover.py::test_fit_conley_plus_survey_design_not_implemented` Wave E.1-era gate-assertion test is removed (replaced by the positive-path tests above). Wave E.1 entry's "Public surface restrictions" bullet updated to past-tense the conley+survey gate reference. - **`SpilloverDiD(vcov_type="conley", conley_lag_cutoff > 0, survey_design=...)` panel-block composition via spatial + serial Bartlett HAC (Wave E.2 follow-up).** Lifts the Wave E.2 upfront `NotImplementedError` at `spillover.py:2210` and extends the panel-aware stratified-Conley sandwich (cross-sectional `lag=0` shipped in Wave E.2) with a within-PSU serial Bartlett HAC over time (Newey-West 1987 form). **Documented synthesis** of Wave E.2's panel-aware Conley × Binder TSL × Wave D Gardner GMM composition with Newey-West (1987) serial Bartlett HAC, matching the no-survey panel-block decomposition at `diff_diff.conley._compute_conley_meat` (Conley 1999 + Newey-West 1987 separable form, NOT Driscoll-Kraay 2D-HAC). The composition is `meat = meat_spatial + meat_serial` with disjoint index sets: spatial is the shipped Wave E.2 per-period within-stratum sandwich on PSU totals; serial is the new within-PSU sum `meat_serial_h = FPC_h_panel * sum_{g in stratum h} sum_{|t-s| <= L, t != s, both periods present for PSU g} (1 - |t-s|/(L+1)) * S_centered_t[g] @ S_centered_s[g]'` where `S_centered_t[g] = S_psu_t[g] - S_bar_h(g)_t` is per-period within-stratum centered (Binder TSL form — matches the spatial helper's centering exactly), and `|t-s|` uses panel-wide dense time codes (mirrors `conley.py:940` R deviation that matches R `conleyreg::time_dist`). Serial Bartlett kernel is hardcoded regardless of `conley_kernel` (the user-selected kernel governs the spatial term only). **FPC convention (panel-wide per-stratum):** standalone Newey-West composition on stratified clusters — the serial sum is a PANEL-level construct, so the cluster set is the panel-wide PSU set in stratum h; FPC denominator uses `n_h_panel = |unique PSUs in stratum h across active sample|`, NOT per-period `n_h_t`. The spatial term keeps its existing per-period FPC unchanged. For balanced panels the two FPC denominators converge; the difference surfaces under unbalanced panels. Citation chain: Binder (1983) for the FPC factor form, Gerber (2026) Prop 1 for the Binder TSL composition with two-stage IF, Newey-West (1987) for the serial Bartlett kernel weights, Conley (1999) for the spatial kernel and panel-block decomposition (deliberately NOT by analogy to the Binder helper's cross-sectional per-stratum FPC convention). **Centering asymmetry vs no-survey reference:** the no-survey panel-block path at `conley.py:949-965` uses RAW scores for the serial term because it assumes `E[scores] = 0` under correct specification; the survey-weighted Binder TSL form centers explicitly (textbook stratified-cluster sandwich). Using raw scores in the survey case would inflate variance by twice the squared per-period stratum mean and would NOT reduce to the cross-sectional Wave E.2 form at lag=0. **Reduction semantics (load-bearing for tests):** `conley_lag_cutoff = 0` or `None` produces bit-identical ATT and scalar SE to shipped Wave E.2 (orchestrator skips the serial helper invocation; the spatial loop + saturation guard + new PSD/finite guard still run on the spatial-only meat). `assert_array_equal` regression pin at `test_a` covers user-visible ATT + scalar SE; `test_a2` mock-spy independently asserts the serial helper is NOT invoked at `lag=0`. The meat matrix itself is not exposed on `SpilloverDiDResults`, so full meat-matrix equality is implied (not asserted); `conley_time is None` or `T = 1` short-circuits the serial helper to zero meat (degenerate panel-block, not a saturation diagnostic); single-stratum H=1 with FPC=inf reduces to Newey-West Bartlett HAC on per-period within-stratum-CENTERED per-PSU score sequences (NOT raw scores — Binder TSL centering is retained at H=1; the panel-wide `G/(G-1)` survey factor replaces FPC); bandwidth → 0 with L > 0 reduces the spatial term to per-period within-stratum HC sandwich while leaving the serial term unchanged (separable form). **Singleton-stratum `lonely_psu="adjust"` panel-wide mean asymmetry:** for the serial helper, `_global_psu_mean` is the panel-wide mean of per-period PSU totals (averaged over all `(g, t)` with `present[g, t]`), NOT the per-period within-stratum mean used by the spatial helper. The `continue`-skip-FPC pattern matches the spatial helper at `survey.py:2007-2017` to avoid divide-by-zero on `n_h_panel = 1`. **Restrictions inherited:** replicate-weight variance still raises `NotImplementedError`; DiagnosticReport routing for the panel-block case is queued for the same Wave F follow-up as the cross-sectional case. **Implementation:** new sibling helper `_compute_stratified_serial_bartlett_meat` in `diff_diff/two_stage.py` (parallel to the Wave E.2 spatial orchestrator; ~200 LoC). Orchestrator `_compute_stratified_conley_meat` signature extended with `conley_lag_cutoff: Optional[int] = None`; spatial loop unchanged; serial helper called after spatial loop when `conley_lag_cutoff > 0`; saturation NaN-fail accounting merges both terms' `(variance_computed, legitimate_zero)` flags into the same template (`UserWarning` "Wave E.2 stratified-Conley sandwich: df_survey = 0..." covers both spatial-only and panel-block cases). Dispatch in `_compute_gmm_corrected_meat` conley branch threads `conley_lag_cutoff` through; the `cluster_ids` non-threading rationale (post-PSU-aggregation every PSU is its own cluster) still applies to the new serial branch. Spillover-side gate at `spillover.py:2210-2230` (Wave E.2-era `NotImplementedError` for `lag > 0 + survey`) deleted; gate rationale comment replaced with shipped-behaviour note. Stage-1 weighted FE solver, `finite_mask` survey-array subsetting, `df_survey` threading, bread weighting, and `SpilloverDiDResults` survey metadata are all inherited UNCHANGED — Psi construction is bit-identical regardless of vcov_type or lag. **Tests:** new `TestSpilloverDiDWaveE2FollowupConleySurveyLagCutoff` and `TestSpilloverDiDWaveE2FollowupConleySurveyLagCutoffEventStudy` classes in `tests/test_spillover.py`; existing `test_j0_panel_conley_lag_cutoff_rejected_under_survey` (Wave E.2-era gate-assertion) DELETED. diff --git a/TODO.md b/TODO.md index 6ca61bc0..a642394e 100644 --- a/TODO.md +++ b/TODO.md @@ -99,8 +99,9 @@ Deferred items from PR reviews that were not addressed before merge. | PreTrendsPower: CS/SA `anticipation=1` R-parity fixture. The PR-C R-parity goldens cover NIS power + γ_p MDV at `atol=1e-4` on four shifted-grid / regular / irregular / K=1 fixtures, but R `pretrends` has no anticipation parameter so the Python-side `_extract_pre_period_params` anticipation filter (`if t < _pre_cutoff` in `pretrends.py` lines 1138-1150 for CS; mirror in SA branch) is not R-parity-locked. Build a synthetic `CallawaySantAnnaResults` (or `SunAbrahamResults`) with `anticipation=1` and a t=-1 event-study entry that should be filtered before reaching `_compute_power_nis`, then assert the resulting γ_p matches R's `slope_for_power()` on the K=4 shifted-grid fixture. Existing PR-B MC-based tests (`TestPretrendsPropositions`) and full-VCV tests (`TestPretrendsCovarianceSource`) already cover the filter mechanically; this would close the loop against R. | `tests/test_methodology_pretrends.py::TestPretrendsParityR`, `benchmarks/R/generate_pretrends_golden.R` | PR-C follow-up | Low | -| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 7 standalone estimators that expose `cluster=` but not yet `vcov_type=`: `CallawaySantAnna`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham` (this row tracks the remaining 7). | multiple | Phase 1b | Medium | +| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the standalone estimators that expose `cluster=` but not yet `vcov_type=`: `CallawaySantAnna`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added the chain to DiD/MPD/TWFE; Phase 1b PR 1/8 added `SunAbraham`; Phase 1b PR 2/8 added `StackedDiD` (this row tracks the remaining 6). | multiple | Phase 1b | Medium | | Extend `SunAbraham` with `vcov_type="conley"` (Conley spatial-HAC) as a first-class feature: thread `conley_coords` / `conley_cutoff_km` / `conley_metric` / `conley_kernel` / `conley_time` / `conley_unit` / `conley_lag_cutoff` through `_fit_saturated_regression`. Phase 1b PR 1/8 deferred this; SA currently rejects `vcov_type="conley"` at `__init__` with a deferral message. | `diff_diff/sun_abraham.py` | follow-up | Medium | +| Extend `StackedDiD` with `vcov_type="conley"` (Conley spatial-HAC) — thread the six `conley_*` params through `solve_ols` at `stacked_did.py:419` (and the `_refit_stacked` closure at `:444`). Phase 1b PR 2/8 deferred this; StackedDiD currently rejects `vcov_type="conley"` at `__init__` with a deferral message. Same shape as the SunAbraham conley follow-up. | `diff_diff/stacked_did.py` | follow-up | Medium | | Harmonize SunAbraham's HC1 within-transform finite-sample correction with `fixest::sunab()`. SA's `solve_ols` applies `n / (n - k_dm)` (within-transform columns only); fixest applies `n / (n - k_total)` (counts absorbed FE). SE values differ by ~1-2% on typical panel sizes (documented in REGISTRY.md "Deviation from R"; pinned at `atol=5e-3` in `tests/test_methodology_sun_abraham.py`). Either thread `df_adjustment` into the vcov scaling or document as an intentional difference. | `diff_diff/sun_abraham.py`, `diff_diff/linalg.py::compute_robust_vcov` | follow-up | Low |