PreTrendsPower implementation audit (Roth 2022 — PR-B)#466
Conversation
…helper API
Implements Roth (2022) Section II.A-B NIS box probability as the new
primary `pretest_form='nis'` default. Wald noncentral-χ² form retained as
opt-in `pretest_form='wald'` for backwards-compat with shipped numerical
baselines AND as a paper-supported alternative (Wald acceptance region is a
convex ellipsoid, so Propositions 1+3+4 all apply).
Changes:
- `PreTrendsPower.__init__`: new `pretest_form: Literal['nis', 'wald']`
parameter (default 'nis'); validated to one of the two enum values;
threaded through `get_params()` / `set_params()`.
- New private helpers `_compute_power_nis` + `_compute_mdv_nis`:
* `_compute_power_nis` uses `scipy.stats.multivariate_normal.cdf` with
`lower_limit=` for the centered-box rejection probability under H1:
`δ_pre = M * weights`, `Y = β̂_pre - δ_pre ~ N(0, Σ_22)`,
`power = 1 - P(Y_t ∈ [-z σ_t - δ_t, z σ_t - δ_t] for all t)`.
Falls back to MC simulation (N=20000) when the analytical CDF returns
NaN on degenerate Σ.
* `_compute_mdv_nis` solves `power_nis(M) = target_power` via doubling
expansion + `optimize.brentq` bisection; non-convergence cap at
M_high=1000 returns `np.inf` (mirrors Wald path's existing 1000-cap).
- Renamed existing `_compute_power` → `_compute_power_wald` and
`_compute_mdv` → `_compute_mdv_wald`; the unsuffixed names are now
dispatchers on `self.pretest_form`. Wald math is byte-identical.
- `PreTrendsPowerResults` gains 3 new fields:
* `pretest_form: Literal['nis', 'wald'] = 'wald'` — default 'wald' for
backwards-compat with older serialized results.
* `nis_box_probability: float = np.nan` — NIS-specific acceptance
probability (always NaN for Wald fits, no ambiguity).
* `violation_weights: Optional[np.ndarray]` — fitted weights persisted
on the result, enabling `power_at()` to work for ALL violation types
on fresh fits.
- `fit()` populates all three new fields and dispatches.
- `power_curve()` inherits dispatch through `_compute_power`.
- `summary()` and `to_dict()` dispatch on `pretest_form` — NIS fits print
"Box probability:" instead of "Non-centrality parameter:".
- `PreTrendsPowerResults.power_at()` refactored: uses
`self.violation_weights` directly when populated, falls back to
reconstruction for old serialized results (with the PR-A
NotImplementedError guard retained only for custom-fit serialized
results with `violation_weights=None`).
- `compute_pretrends_power` and `compute_mdv` helper signatures extended
to accept `violation_weights` and `pretest_form`; helpers now forward
both to the class. Closes the helper/class API gap from PR-A R18.
Smoke-tested with K=2 and K=3 panels:
- NIS power at M=0 with K=3 ≈ 0.138 (matches 1 - (1-α)^K = 0.143 for
independent normals, with off-diagonal correlation pulling it down).
- Wald power at M=0 with K=3 = 0.05 (exact size under H0).
- NIS MDV(80%, K=3) = 0.59, Wald MDV(80%, K=3) = 0.71 (NIS is more
powerful here because the rectangular acceptance region is tighter
than the chi-squared ellipse along the linear-violation direction).
Pre-existing pyright type-stub warnings on `optimize.brentq` and
`stats.multivariate_normal.cdf` are not touched.
Plan ref: /Users/igerber/.claude/plans/stateless-prancing-iverson.md
Step 2 (NIS impl + dispatcher) + Step 5 (result-class field additions +
power_at refactor) + Step 6 (helper API extension).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The pre-PR-B default was implicitly Wald (noncentral-χ²); PR-B Step 2 flipped it to NIS (box probability). The vast majority of existing tests (64 of 66) assert form-invariant properties (positive, finite, monotone, hasattr, etc.) and pass under either default. Only 3 tests needed targeted fixes: - `TestPowerComputation::test_power_at_zero_equals_alpha`: pinned `pretest_form='wald'`. The size-at-null property "power(M=0) = alpha exactly" is a Wald-form property (noncentrality = 0 at H0 yields the chi-squared distribution evaluated at its critical value). Under NIS with K=3 independent normals, the joint rejection probability at H0 is 1 - (1 - alpha)^K ≈ 0.14, not 0.05. - `TestPreTrendsPowerResultsPowerAt::test_power_at_zero`: same pin for the same reason. - `TestPreTrendsPowerResults::test_power_at_raises_on_custom_violation_type`: inverted. The PR-A R18 silent-failure guard was lifted in PR-B Step 5 (violation_weights are now persisted on PreTrendsPowerResults, so the custom path works for fresh fits). Renamed to `test_power_at_works_for_custom_violation_type` and assert finite power in [0, 1]. Added a new companion test `test_power_at_raises_on_legacy_custom_result_without_weights` that simulates an old serialized result (violation_weights cleared to None) and confirms the backwards-compat NotImplementedError guard still fires for that case. Test count: 67 (was 66; net +1 from the legacy-guard companion test). All 67 pass. Adjacent suites (test_pretrends_event_study.py and the pretrends-tagged tests in test_diagnostic_report.py) also pass under the NIS default — 31 passed, 0 failed. This is much less test churn than the plan estimated (~101 bulk pins). The form-invariance of most existing assertions means the flip is substantially less disruptive than feared. Plan ref: Step 6 (test bulk pin convention; user-locked Decision 5 in plan mode). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…study_vcov
Adds full event-study covariance matrix on SunAbrahamResults, enabling
PreTrendsPower to consume Roth (2022) Σ_22 on the SA path instead of
falling back to diag(ses^2). Before PR-B, the SA adapter in
compute_pretrends_power was forced to diag because SunAbrahamResults
did not expose any event-study-level covariance surface; PR-A flagged
this as the SA branch of the diagonal-VCV deviation.
Construction
------------
After _compute_iw_effects() returns event_study_effects + cohort_weights,
we build the aggregation matrix W in fit() and compute
event_study_vcov = W @ vcov_cohort @ W.T
where W is the |event_times| × n_interactions sparse aggregation matrix:
event_study_vcov_index = sorted(cohort_weights.keys())
W = np.zeros((n_event_times, n_interactions))
for i, e in enumerate(event_study_vcov_index):
for g, w in cohort_weights[e].items():
if (g, e) in coef_index_map:
W[i, coef_index_map[(g, e)]] = w
This matches the existing per-event-time variance computation at
sun_abraham.py:_compute_iw_effects (which already does
weight_vec @ vcov_subset @ weight_vec per event time) but batched
across all event times so the off-diagonals Cov(β̂_{e_i}, β̂_{e_k})
are also produced.
Smoke-test verified diagonal[i, i] of event_study_vcov matches
event_study_effects[e]['se']^2 at atol=1e-10 across all event times.
Bootstrap / replicate clears
----------------------------
Mirrors the CS pattern at staggered.py:2032-2036. When bootstrap_results
is not None OR _uses_replicate_sa is True, event_study_vcov and
event_study_vcov_index are set to None before constructing the result.
This prevents PreTrendsPower from silently mixing analytical VCV with
bootstrap/replicate SE overrides downstream (which would produce
mis-scaled MDV/power output).
Regression
----------
- 39/39 tests/test_sun_abraham.py pass.
- New fields default to None on the dataclass, so existing
SunAbrahamResults consumers that don't read event_study_vcov see no
change.
Plan ref: Step 3 SA upstream surface extension (review CRITICAL #2
resolution with explicit W-matrix pseudo-code locked in plan mode).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Replaces the hard-coded ``vcov = np.diag(ses**2)`` fallback on both the CallawaySantAnnaResults and SunAbrahamResults branches of ``_extract_pre_period_params`` with a unified routing helper ``_extract_event_study_vcov_subblock`` that consumes the full event_study_vcov sub-block when available, falling back to diag otherwise. Helper logic ------------ - When ``results.event_study_vcov`` is not None AND ``results.event_study_vcov_index`` is not None, look up each filtered pre_period via ``.index()`` and extract the ``[np.ix_(indices, indices)]`` sub-block. - Defensive guard: if ``event_study_vcov_index`` is missing one of the pre-period labels, raise ValueError loudly rather than silently falling back to diag. - When the result type does not expose event_study_vcov, return ``np.diag(ses**2)`` (the legacy behavior preserved for bootstrap fits, replicate-weight survey fits, and any future result type). Impact on the three result types -------------------------------- - ``MultiPeriodDiDResults``: unchanged — already extracts a full sub-block via interaction_indices at lines 700-708. - ``CallawaySantAnnaResults``: non-bootstrap CS fits (event_study_vcov persisted at staggered_results.py:126-128) now consume the full Σ_22 instead of diag. Bootstrap CS fits (event_study_vcov cleared at staggered.py:2032-2036) keep falling through to diag. - ``SunAbrahamResults``: non-bootstrap SA fits (event_study_vcov built via W @ vcov_cohort @ W.T in the previous commit) now consume the full Σ_22 instead of diag. Bootstrap SA fits and replicate-weight survey fits (event_study_vcov cleared by the new PR-B Step 3 SA guard) keep falling through to diag. Regression ---------- - 67/67 tests/test_pretrends.py pass. - 27/27 tests/test_pretrends_event_study.py pass. - Total 94/94 across both suites. Plan ref: Step 3 CS+SA adapter routes (closes the Σ_22 fidelity gap documented in PR-A REGISTRY ## PreTrendsPower diagonal-VCV deviation Note for non-bootstrap CS + non-bootstrap SA paths). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…nit MDV
Threads actual pre-period relative-time labels through
``_get_violation_weights('linear')`` and ``_extract_pre_period_params``
so the reported MDV is in Roth's γ units on irregular and
anticipation-shifted grids. Closes the PR-A REGISTRY ## PreTrendsPower
"Note (deviation from paper — linear violation pattern)" deviation row
for the canonical fit() path.
Math
----
Pre-PR-B: `weights = [n_pre-1, ..., 1, 0] / ||·||_2` derived from `n_pre`
count alone (ignored relative-time labels). Under irregular grids like
{-5, -3, -1}, this treated the violation as if periods were {-3, -2, -1}.
After L2 normalization, the reported MDV = γ · ||t||_2, not γ — wrong
units.
PR-B: when `relative_times` is provided AND `violation_type='linear'`,
weights = |t| directly WITHOUT L2 normalization. Then δ_pre = M * |t| =
γ · t_signed under δ_t = γ · t, so M = γ exactly. Reported MDV is in
Roth's γ units (slope-per-period).
Verified:
- Regular grid [-3, -2, -1]: weights = [3, 2, 1]
- Irregular grid [-5, -3, -1]: weights = [5, 3, 1] (irregular spacing
reflected — previously would have been [2, 1, 0]/||·||_2)
- Backwards-compat: callers that bypass fit() and pass only n_pre keep
the legacy normalized [n_pre-1, ..., 0]/||·||_2 behavior (used by
~3 unit tests + any third-party direct-helper callers).
Changes
-------
- `_get_violation_weights(self, n_pre, relative_times=None)`: new
optional kwarg. Linear path with `relative_times not None` uses
`np.abs(relative_times)` directly + early-return (skip the
normalize-at-end block). All other paths (constant, last_period,
custom, linear-without-relative_times) unchanged — still L2-normalized.
- `_extract_pre_period_params` return type expanded from 4-tuple to
5-tuple: now returns `(effects, ses, vcov, n_pre, relative_times)`.
All three adapter branches (MultiPeriodDiD, CS, SA) populate
`relative_times = np.asarray(sorted_pre_periods, dtype=float)` from
their respective filtered pre-period list.
- `fit()` and `power_curve()` consume the new 5-tuple and thread
`relative_times` into `_get_violation_weights`.
End-to-end smoke test: SA fit with regular K=3 grid + NIS pretest
produces an MDV ~0.087 (Roth γ scale), confirming the unit conversion
is wired correctly.
Regression
----------
94/94 tests/test_pretrends.py + tests/test_pretrends_event_study.py.
The 3 tests pinned to pretest_form='wald' in the previous commit
still hit the wald path and retain their exact numerical baseline;
the wald path uses the legacy normalized weights internally (because
fit() now threads relative_times for both forms, but the wald
quadratic form is scale-invariant up to M's overall scale).
Plan ref: Step 4 (review CRITICAL #1 resolution: skip L2 normalization
for linear-with-relative_times, locked via plan-mode AskUserQuestion).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
… flip + TODO + CHANGELOG + llms.txt
Documents the four PR-A TODO rows that PR-B Steps 2-6 just resolved:
- Σ_22 fidelity on CS/SA adapters (full event_study_vcov sub-block routing)
- Helper API gap (compute_pretrends_power + compute_mdv accept
violation_weights + pretest_form)
- power_at(custom) silent-failure guard (PR-A R18 mitigation lifted on
fresh fits via the new persisted violation_weights field)
- Linear-units γ-scale (skip L2 norm for linear-with-relative_times)
Step 8 — REGISTRY.md ## PreTrendsPower:
- Wholesale replacement with NIS-framed entry.
- Explicit equation blocks for both NIS box probability (primary, Roth
2022 Section II.A-B) and Wald noncentral-χ² (paper-supported
alternative under Propositions 1+3+4).
- Three updated Notes:
* Wald-alternative paper-supported Note (NEW)
* Linear-convention Note (replaces the PR-A deviation Note; γ-unit
MDV with relative_times threaded through fit())
* Diagonal-VCV-fallback Note narrowed to bootstrap fits only (the
non-bootstrap deviation is resolved by PR-B Step 3 CS/SA routing).
- Backwards-compat addendum on power_at(custom) for legacy serialized
results (replaces the PR-A silent-failure-guard Note).
- Item-by-item Requirements checklist with PR-B-resolved checkboxes
and a single deferred-to-PR-C item (R parity).
- Removed the prior Wald-test headline equation block (now subsumed by
the explicit dual-form equation section).
Step 9 — METHODOLOGY_REVIEW.md flip:
- PreTrendsPower row status: **In Progress** → **Complete (R parity
pending)**.
- Last Review: 2026-05-18.
- Documentation-in-place + Verified Components (10 checkboxes) +
narrowed Outstanding-for-promotion to a single R-parity-fixture
bullet for PR-C.
Step 10 — TODO.md cleanup:
- Four of five PR-A PreTrendsPower rows removed (resolved in PR-B);
pointer comment in place of the removed block.
- R-package-pin row rewritten as a unified PR-C tracker: "PreTrendsPower
R parity goldens (PR-C)" — covers pinning the commit, running the
generator script, committing the JSON, activating
TestPretrendsParityR, and flipping the tracker to fully Complete.
Step 11 — CHANGELOG.md [Unreleased]:
- Added: 6 new PreTrendsPower bullets covering NIS impl, CS/SA Σ_22
routing + SA upstream surface, result-class field additions, helper
API extension, methodology test file forward-pointer, R generator
script forward-pointer.
- Changed: 2 new bullets covering default pretest_form flip
(implicit-Wald → explicit-NIS, with shipped Wald baselines preserved
via pretest_form='wald') and linear-violation γ-scale.
- Fixed: NEW section with 1 bullet documenting the PR-A R18 silent-
failure guard lift for power_at(custom) on fresh fits.
llms.txt (agent-facing catalog):
- PreTrendsPower one-line entry expanded to mention NIS as primary
default, Wald as alternative, γ-unit MDV, and Σ_22 routing.
Plan ref: Steps 8-12 (REGISTRY refresh + tracker flip + TODO cleanup +
CHANGELOG + agent-facing catalog), per the locked plan at
/Users/igerber/.claude/plans/stateless-prancing-iverson.md. Step 7
(methodology test file) and Step 12 (R generator script) ship in the
next commit.
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Roth (2022) Section II.A-B paper-equation-numbered methodology test
file. Mirrors `tests/test_methodology_bacon.py`'s structure: 8 classes,
28 tests collected (23 active + 3 R-parity skip-as-expected + 2
@pytest.mark.slow deselected by default).
Classes:
- `TestPretrendsHandCalculation` (8 tests): z_{1-α/2} = 1.96 at α=0.05;
NIS power at H0 with diag Σ matches `1 - (1-α)^K` (independent
normals); Wald power at H0 matches exactly α (chi² size); NIS power
matches Monte Carlo simulation at K=2 diag (atol=0.01); NIS power
matches MC at K=3 with ρ=0.3 equicorrelation; MDV(target=0.8) round-
trip — power(MDV) = 0.80; NIS power monotone in |M|; NIS MDV non-
convergence cap returns np.inf.
- `TestPretrendsPropositions` (2 @pytest.mark.slow tests): Proposition
1 (conditional mean) matches MC at atol=0.01; Proposition 4 (variance
reduction under convex B_NIS) — conditional Var ≤ unconditional Var.
- `TestPretrendsLinearGrid` (4 tests): regular grid [-3, -2, -1] →
weights = [3, 2, 1]; irregular grid [-5, -3, -1] → weights = [5, 3, 1];
no L2 normalization (||weights||_2 ≠ 1); backwards-compat fallback
produces the legacy [n-1, ..., 0] / ||·||_2 direction.
- `TestPretrendsCustomWeightPersistence` (2 tests): custom weights
persisted on PreTrendsPowerResults (L2-normalized); power_at(M) for
custom matches fresh fit(M=M).
- `TestPretrendsCovarianceSource` (3 tests): SunAbrahamResults
event_study_vcov populated on non-bootstrap fits; diagonal matches
per-event-time SEs at atol=1e-10; non-trivial off-diagonals (proves
full sub-VCV path, not silent diag fallback).
- `TestPretrendsHelperAPI` (3 tests): compute_pretrends_power and
compute_mdv accept violation_weights for custom + pretest_form
toggle.
- `TestPretrendsNISvsWald` (3 tests): default pretest_form is 'nis';
Wald path preserves pre-PR-B output shape; NIS and Wald produce
different power values under correlated Σ (constructed counter-
example with ρ=0.6).
- `TestPretrendsParityR` (3 tests, all skip when goldens missing):
stubs for PR-C R `pretrends` package parity at atol=1e-6. Skip
decorator checks for `benchmarks/data/r_pretrends_golden.json`.
26/26 collected tests pass (after deselecting the 2 @pytest.mark.slow
Proposition simulation tests per the default pytest addopts).
Plan ref: Step 7 (methodology test file).
🤖 Generated with [Claude Code](https://claude.com/claude-code)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…en.R R `pretrends` parity goldens generator script (PR-C deferred to land the JSON output). Mirrors the Bacon precedent (`generate_bacon_golden.R` ships in PR-B, JSON goldens deferred to PR-C following Bacon PR-C #457). Structure --------- Three fixtures matched to test_methodology_pretrends.py expectations: 1. `uniform_3_pre_periods_no_anticipation` — K=3 regular grid (t ∈ {-3, -2, -1}), never-treated control. Default-case parity baseline. 2. `irregular_pre_periods` — K=3 with relative_times = [-5, -3, -1]. Tests PR-B's γ-unit linear-pattern fix; pre-PR-B Python with normalized count-based weights would have silently reported MDV in non-γ units. R `slope_for_power()` always reports γ. 3. `anticipation_shifted` — K=4 with anticipation=1. Verifies the pre-period filtering logic in `_extract_pre_period_params`. Three-tier parity contract at atol=1e-6: 1. NIS box probability `P(β̂_pre ∈ B_NIS(Σ))` at fixed γ values on all 3 fixtures. 2. γ_p MDV (slope at target power 0.5 and 0.8) on regular and irregular grids. 3. γ-unit MDV invariance: Python's PR-B Step 4 "skip-L2-norm" path produces MDV in Roth's γ units exactly, matching R's `slope_for_power()` which also reports γ. PR-C TODO checklist (recorded at the bottom of the script for self-contained PR-C handoff): - Replace `<PR-C-PIN>` commit-hash placeholder with actual git SHA from https://github.com/jonathandroth/pretrends. - Replace the NA_real_ stubs in `extract_pretrends()` with actual `pretrends::pretrends()` / `slope_for_power()` calls (the package API surface is documented in the script header but not yet exercised — PR-C is when it gets installed and pinned). - Verify REGISTRY.md surface claims against the pinned revision. - Activate `tests/test_methodology_pretrends.py::TestPretrendsParityR` (currently skips via @pytest.mark.skipif when the JSON is missing). - Flip METHODOLOGY_REVIEW.md PreTrendsPower row to fully **Complete**. The script's R `pretrends` calls are stubbed in PR-B because the package is not installed on the audit machine; PR-C installs it, pins the audited commit, runs the script, captures the actual JSON output, and commits both the JSON and the updated R script with the real surface calls. Plan ref: Step 12 (R generator script + commit reference; goldens deferred to PR-C following the Bacon cadence). 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Codex Round 1 verdict was ⛔ Blocker — 2 unmitigated P0 defects in the new default `pretest_form='nis'` path plus a P1 contract bug on the MPD branch and a P2 code-duplication issue. All four addressed. P0 #1 — `_compute_mdv_nis` low-target-power boundary bug: - Pre-fix: when `target_power ≤ NIS-size` (e.g., α=0.05, K=3 → null_size ≈ 0.143, request target=0.10), the bracketing loop saw `power(0) >= target` immediately, `brentq(0, 1)` raised ValueError on the non-bracketing bounds, and the except-fallback silently returned `M_high=1.0` instead of 0.0. - Post-fix: explicit short-circuit `if power_minus_target(0) >= 0: return 0.0` BEFORE the doubling loop. Regression test added at `TestPretrendsHandCalculation::test_mdv_nis_returns_zero_when_target_below_null_size`. P0 #2 — non-finite scipy MVN CDF propagation: - Pre-fix: `_compute_power_nis` and `PreTrendsPowerResults.power_at` only fell back to MC simulation on `ValueError` / `LinAlgError` exceptions; if scipy returned NaN directly (Genz internal cancellation on degenerate Σ), the NaN propagated through `np.clip` and into the MDV solver — silently producing a wrong-but-finite MDV via the brentq fallback path. - Post-fix: extracted module-level helper `_compute_nis_acceptance_prob` that does the analytical scipy CDF call AND falls back to MC on EITHER exception OR non-finite output. Both call sites (`_compute_power_nis` and `PreTrendsPowerResults.power_at`) now use the helper — eliminates duplication AND fixes the NaN-propagation hole. Regression test monkey-patches `scipy.stats.multivariate_normal.cdf` to return NaN and asserts MC fallback engages (`test_nis_power_handles_non_finite_cdf_via_mc_fallback`). P1 — MultiPeriodDiD raw period IDs treated as Roth relative times: - Pre-fix: `_extract_pre_period_params` MPD branch passed `np.asarray(estimated_pre_periods, dtype=float)` directly into `_get_violation_weights('linear')`. For the common MPD case `pre_periods=[0, 1, 2, 3], reference_period=4`, this produced linear weights `[0, 1, 2, 3]` (raw period IDs) instead of Roth-style `[4, 3, 2, 1]` (|t - reference|). Non-numeric period labels (string IDs, calendar dates) would have failed outright. - Post-fix: derive relative_times from `results.reference_period`: `[float(p) - float(ref) for p in estimated_pre_periods]`. When `reference_period` is None or non-numeric, fall back to legacy count-based path (returns `relative_times=None`, `_get_violation_weights` uses the normalized `[n_pre-1, ..., 0]/||·||_2` direction). Type signature of `_extract_pre_period_params` widened to `Tuple[..., Optional[np.ndarray]]`. Regression test at `TestPretrendsLinearGrid::test_mpd_calendar_period_ids_derive_relative_times_from_reference` verifies `pre_periods=[0,1,2,3], reference_period=4` → weights `[4, 3, 2, 1]` (and exercises the derivation math directly). P2 — code duplication of NIS box-probability logic: - Resolved structurally by the module-level helper extraction (above). - The two call sites are now thin wrappers; future contract changes to the box probability happen in one place. Regression ---------- - 23 active methodology tests pass (3 R-parity stubs still skip). - 67/67 test_pretrends.py + 27/27 test_pretrends_event_study.py unchanged. - Total 120/120 across the three suites (+ 3 expected R-parity skips). Plan ref: HARD GATE Step 13 Round 1. 🤖 Generated with [Claude Code](https://claude.com/claude-code) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R2 codex review findings on the R1-cleaned NIS implementation; all four items addressed in this single commit. **P0: NIS MDV cap-vs-target ambiguity** (pretrends.py _compute_mdv_nis) The pre-fix doubling loop exited on the `M_high < 1000` cap and immediately returned ∞ even when power(M_high) >= target — silently producing a wrong MDV=∞ result on finite-root cases. Codex's concrete counterexample: vcov=[[50000]] with target_power=0.8 has a finite root between M=512 and M=1024 (power(512)≈0.36, power(1024)≈0.997). Pre-fix the cap fired at M_high=1024 and returned ∞; brentq could have bracketed. Post-fix: evaluate power_minus_target(M_high) explicitly after the loop exits. Return ∞ only when power at the capped endpoint is still below target_power. Finite-root cases at the boundary now pass through to brentq. The genuine-unreachable case (vcov=[[1e8]], target=0.99) still returns ∞ as before. **P1: scipy version pin** (pyproject.toml) `scipy.stats.multivariate_normal.cdf(..., lower_limit=...)` — used by the new `_compute_nis_acceptance_prob` for the rectangular box probability — requires the `lower_limit` parameter introduced in scipy 1.10. Bump from `scipy>=1.7.0` to `scipy>=1.10` with an explanatory comment referencing the release-notes link. Without this bump callers on older scipy would hit a TypeError at the first NIS power call. **P1: pretest_form not propagated to PreTrendsPowerCurve** `PreTrendsPower.power_curve()` constructs a PreTrendsPowerCurve dataclass but did NOT pass through the form used to compute the grid — so a NIS fit's `result.power_curve(...)` returned a curve indistinguishable from a Wald curve at the dataclass surface. Fix: add a `pretest_form: Literal['nis', 'wald'] = 'wald'` field to the dataclass (default 'wald' for backwards-compat with old serialized curves); populate it from `self.pretest_form` in `power_curve()`; surface it on `to_dataframe()` as a new "pretest_form" column so downstream tooling that ingests the curve can disambiguate NIS vs Wald output. **P2: MPD relative-times regression test was manual arithmetic** The R1-fix added `test_mpd_calendar_period_ids_derive_relative_times_from_reference` but the test only checked Python's subtraction operator, never invoking the production `_extract_pre_period_params` MPD branch. Replace with an end-to-end test that constructs a real `MultiPeriodDiDResults` and calls the helper directly; assert that the returned `relative_times` is `[-4, -3, -2, -1]` for `pre_periods=[0,1,2,3]` + `reference_period=4`, and that the downstream `_get_violation_weights` produces `[4, 3, 2, 1]`. Also add a companion test that confirms the non-numeric-reference branch falls back to `relative_times=None` (preserves legacy direction). **R2 P0 regression test: finite-root MDV at doubling endpoint** New `test_mdv_nis_finite_root_at_doubling_endpoint` reproduces codex's concrete counterexample (vcov=[[50000]], target_power=0.8) and asserts the post-fix returns a finite MDV in (512, 1024) AND spot-checks that the brentq root achieves the target power within 1e-3. Locks the cap-vs-target contract against any future regression in the MDV solver. **Pyright-stub annotations** Coerce `z_alpha = float(stats.norm.ppf(...))` at both NIS call sites so the `_compute_nis_acceptance_prob(M, weights, vcov, z_alpha)` argument matches the helper's `z_alpha: float` signature; coerce the `_compute_power_nis` return tuple elements to floats explicitly; add a targeted `# type: ignore[arg-type]` on the `cov=vcov` kwarg in `stats.multivariate_normal.cdf` (scipy stub bug — `cov` typed `int`). These do not affect runtime; they preserve the new code's type-cleanliness against IDE-level Pyright. Tests: 122/122 pass (3 R-parity stubs skip; 2 slow tests deselected). SA upstream regression: 39/39 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R3 codex review flagged a cross-surface drift after R2 cleared:
PR-B Step 3 routed CS / SA fits through the full Σ_22 sub-block at the
estimator layer, but ``DiagnosticReport._infer_cov_source`` kept the
pre-PR-B type-based inference. That returned
``"diag_fallback_available_full_vcov_unused"`` for any CS / SA fit with
populated ``event_study_vcov`` — and ``_apply_diag_fallback_downgrade``
then conservatively downgraded ``well_powered`` to ``moderately_powered``.
Net effect: correctly-computed full-VCV pre-trends power blocks were
silently being downgraded across the entire DR / BR rendering surface.
**P1 fix — provenance recorded at the estimator layer, consumed at the
report layer:**
- ``pretrends.py``:
- ``_extract_event_study_vcov_subblock`` now returns
``(vcov, source)`` where ``source`` is ``"full_pre_period_vcov"``
when the full sub-block was actually used or ``"diag_fallback"``
when ``event_study_vcov`` was missing / cleared.
- ``_extract_pre_period_params`` extended to a 6-tuple that includes
the ``covariance_source`` label. MPD branch returns
``"full_pre_period_vcov"`` when ``interaction_indices`` is populated,
``"diag_fallback"`` otherwise; CS / SA branches forward the label
from the sub-block helper.
- ``PreTrendsPowerResults`` gains a new ``covariance_source: str``
field (default ``"unknown"`` for backwards-compat with old
serialized results). ``fit()`` populates it from the extraction
path; ``power_curve()`` discards it because the curve dataclass
is independent of any one fit's provenance.
- ``diagnostic_report.py``:
- ``_check_pretrends_power`` and ``_format_precomputed_pretrends_power``
now prefer the persisted ``pp.covariance_source`` field and fall
back to ``_infer_cov_source(source_fit)`` only when the field is
missing or ``"unknown"`` (legacy results).
- ``_infer_cov_source`` updated: CS / SA + populated
``event_study_vcov`` now correctly returns
``"full_pre_period_vcov"`` (since PR-B routes through the full
matrix). The legacy ``"diag_fallback_available_full_vcov_unused"``
sentinel is no longer produced by any in-tree path.
- ``_apply_diag_fallback_downgrade`` docstring updated to note that
the rule is now effectively a no-op post-PR-B; the function is
retained for backwards-compat with legacy serialized results
carrying the old sentinel.
This is the architectural fix the R3 codex reviewer called out:
"stop inferring covariance provenance from result type alone — record
it on ``PreTrendsPowerResults`` when the covariance is built". The
result-type heuristic was a maintainability smell that drifted the
moment the estimator-layer routing changed.
**P3 fix — autosummary docs include new fields:**
- ``docs/api/_autosummary/diff_diff.PreTrendsPowerResults.rst`` adds
``pretest_form``, ``nis_box_probability``, ``violation_weights``,
``covariance_source`` to the attribute autosummary so the published
API page is complete after the default flip to NIS.
- ``docs/api/_autosummary/diff_diff.PreTrendsPowerCurve.rst`` adds
``pretest_form``.
- ``docs/api/pretrends.rst`` example code shows the NIS-default flow
and demonstrates how ``pretest_form='wald'`` opts back into the
pre-PR-B numerical output.
**P3 fix — REPORTING.md aligned with the new estimator-layer reality:**
- ``docs/methodology/REPORTING.md`` "Diagonal-covariance fallback for
staggered-estimator power" note rewritten to describe the PR-B
routing accurately. Non-bootstrap CS / SA fits now consume the full
``event_study_vcov`` sub-block; the PR-A conservative downgrade is
effectively dead post-PR-B (still defended for legacy serialized
results). Bootstrap and replicate-weight CS / SA, plus
ImputationDiD / Stacked / EfficientDiD / TwoStageDiD, still fall
through to diag because nothing better is available on those
result types yet.
**Tests:**
- Rewrote ``test_precomputed_pretrends_power_downgrades_when_full_vcov_unused``
as ``test_precomputed_pretrends_power_full_vcov_yields_no_downgrade``:
the same CS-shaped stub that used to be downgraded now correctly
resolves to ``covariance_source='full_pre_period_vcov'`` and keeps
the ``well_powered`` tier.
- New ``test_precomputed_pretrends_power_consumes_persisted_cov_source``
explicitly constructs a ``PreTrendsPowerResults`` with the new field
set and verifies the report adapter reads it directly (locks the
architectural fix).
- Updated the two MPD-branch tests + the SA sub-block test in
``test_methodology_pretrends.py`` to unpack the new 6-tuple and
assert the expected ``covariance_source`` label.
Tests: 583 pass across pretrends + DR + BR + SA + staggered.
4 skipped (R-parity stubs + 1 unrelated). 0 regressions.
CHANGELOG ``Fixed`` entry added under ``[Unreleased]``.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R4 codex review sharpened a backwards-compat hole left open by R3: **P1 — legacy missing-field fallback was upgrading silently** R3 changed ``DiagnosticReport._infer_cov_source`` so CS / SA fits with populated ``event_study_vcov`` returned ``"full_pre_period_vcov"`` on the (now-architected) assumption that the field's presence implies PR-B routing was used. That assumption is wrong for LEGACY ``PreTrendsPowerResults`` objects pre-PR-B: those results were computed from ``diag(ses**2)`` even though the source fit's ``event_study_vcov`` was populated (PR-A behavior). Without the persisted ``covariance_source`` label we cannot distinguish a pre-PR-B fit from a post-PR-B fit, so the legacy ambiguous case must keep the conservative downgrade. Fix: ``_infer_cov_source`` reinstates the ``"diag_fallback_available_full_vcov_unused"`` sentinel for legacy CS / SA results with populated ``event_study_vcov``. New PR-B fits set ``PreTrendsPowerResults.covariance_source`` directly and bypass this fallback entirely, so the sentinel only fires for legacy serialized results — preserving the PR-A downgrade contract for backwards-compat while keeping the new full-VCV no-downgrade contract intact for fresh fits. ``_apply_diag_fallback_downgrade`` docstring updated. Tests rewritten in ``tests/test_diagnostic_report.py``: - ``test_precomputed_pretrends_power_persisted_full_vcov_no_downgrade``: constructs a real ``PreTrendsPowerResults`` with the new field set to ``full_pre_period_vcov`` (the value PR-B records on fresh fits) and verifies no downgrade. - ``test_precomputed_pretrends_power_legacy_missing_field_still_downgraded`` (NEW): legacy ``_PPStub`` without the field exercises the fallback path; the report adapter MUST emit the conservative sentinel + downgrade to ``moderately_powered``. - ``test_precomputed_pretrends_power_consumes_persisted_cov_source`` (rewritten): the persisted ``full_pre_period_vcov`` label takes precedence over the legacy fallback (which would now produce the sentinel). Locks the architectural fix. **P3 — to_dict() / to_dataframe() drop violation_weights and covariance_source** The PR-B-added public fields are missing from ``PreTrendsPowerResults`` serialization surfaces. Any caller that round-trips through dict / dataframe loses the very provenance the reporting layer reads off the dataclass. Fix: include both fields in ``to_dict()`` (and therefore ``to_dataframe()`` via the existing single-source-of-truth path). **P3 — BR-level live regression for the new full-VCV no-downgrade path** The existing ``TestDiagFallbackDowngradeAppliedCentrally`` class only locked the downgrade fires on the diag fallback. The new full-VCV no-downgrade path was uncovered at the user-facing prose surface. New ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` exercises the live CS path: when ``pretrends.py`` records ``covariance_source='full_pre_period_vcov'`` on a real fit and the headline tier is ``well_powered``, ``BusinessReport.full_report()`` must NOT contain the moderately-informative phrasing. Tests: 400 across pretrends + DR + BR pass. 4 skipped (R-parity stubs + 1 fixture skip). SA + CS regression: 185 pass (no change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R5 codex review on the R4 cleanup. Verdict was ✅ "Looks good" but with three actionable items (1 P2 + 2 P3); polishing all three to land a clean verdict per the HARD GATE rule "informational items only". **P2 — `to_dict()` was not actually JSON-serializable** R3 added the docstring "suitable for serialization or downstream transport" on ``PreTrendsPowerResults.to_dict()`` but the body kept emitting ``violation_weights`` as a raw ``np.ndarray``, so ``json.dumps(result.to_dict())`` raised ``TypeError``. Fix: coerce ``violation_weights`` to ``list[float]`` (or ``None``) inside ``to_dict``; the ndarray remains available on the dataclass attribute directly for callers that need it. New regression in ``tests/test_methodology_pretrends.py``: ``test_to_dict_is_json_serializable`` asserts both the type contract and an end-to-end ``json.dumps`` round-trip on a real fit, with ``allow_nan=True`` (NIS box probability returns floats; Wald noncentrality returns floats; both can be finite or NaN). **P3 — BR live regression was checking the wrong surface** R4's ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` checked ``BusinessReport.full_report()`` for the absence of the moderately-informative phrasing. But ``business_report.py`` actually renders the well-powered phrasing on ``summary()`` (the in-sample prose surface), not ``full_report()`` — so the prior assertion silently passed on text that did not contain the well-powered prose in the first place. Fix: assert positively that ``summary()`` contains ``"well-powered"`` and lacks ``"moderately informative"``; retain the ``full_report()`` negative check as a secondary defensive assertion. **P3 — legacy MPD fallback was overstating provenance** R4's ``_infer_cov_source`` unconditionally returned ``"full_pre_period_vcov"`` for non-event-study types — which silently included ``MultiPeriodDiDResults`` fits where ``interaction_indices`` is ``None``. In that case ``pretrends.py:_extract_pre_period_params`` falls through to ``np.diag(ses**2)`` (a genuine fallback, not the "available but unused" sentinel), so the legacy-fallback label was wrong. Fix: ``_infer_cov_source`` special-cases ``MultiPeriodDiDResults``. When ``vcov`` and ``interaction_indices`` are both populated, returns ``"full_pre_period_vcov"``; otherwise returns ``"diag_fallback"``. The ``diag_fallback`` label does NOT trigger ``_apply_diag_fallback_downgrade`` (only the ``diag_fallback_available_full_vcov_unused`` sentinel does), so the tier is unchanged for the MPD legacy case — this fix is provenance accuracy, not tier behavior. New regression in ``tests/test_diagnostic_report.py``: ``test_precomputed_pretrends_power_legacy_mpd_without_interaction_indices_reports_diag`` constructs a legacy-shaped MPD stub without ``interaction_indices`` / ``vcov`` and asserts ``covariance_source == "diag_fallback"`` (was ``"full_pre_period_vcov"`` pre-fix) and that the tier stays at ``well_powered`` (no downgrade applies on ``diag_fallback``). Tests: 402 pass across pretrends + DR + BR. 4 skipped (R-parity stubs + 1 fixture skip). SA + CS regression: 185 pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
R6 codex verdict ✅ "Looks good" but with two P3 polish items. **P3 — `_infer_cov_source` docstring drifted from new MPD special-case** R5 added an explicit MPD branch to ``_infer_cov_source`` that returns ``"diag_fallback"`` when ``interaction_indices`` is absent, but the docstring's ``"full_pre_period_vcov"`` bullet still claimed all non-event-study types (including MPD) "always" expose full pre-period covariance. Fix: update the docstring so the ``"full_pre_period_vcov"`` bullet excludes MPD (with a forward pointer to the explicit MPD branch below), and the ``"diag_fallback"`` bullet enumerates the MPD-without- ``interaction_indices`` case. **P3 — BR no-downgrade live regression was conditionally bypassed** The R5-fixed ``test_full_vcov_path_no_downgrade_on_real_cs_fit`` gated the well-powered phrasing assertions on ``if block["tier"] == "well_powered"``, which silently skipped the key prose assertion if a future regression reintroduced the conservative downgrade (the test then passes trivially). Fix: pin the expected tier deterministically on the ``cs_fit`` fixture, which produces ``mdv/|att| ≈ 0.053`` (well under the ``0.25`` well_powered threshold) on ``seed=7`` + ``treatment_effect=1.5``. New assertions: - ``block["covariance_source"] == "full_pre_period_vcov"`` (asserted, not guarded) - ``block["mdv_share_of_att"] < 0.25`` (asserts the raw ratio is in the well_powered range so the no-downgrade assertion below is meaningful) - ``block["tier"] == "well_powered"`` (locks the no-downgrade contract — a regression reintroducing the downgrade would fail here, not silently bypass) The well-powered / moderately-informative prose contracts on ``summary()`` and ``full_report()`` are now also unconditionally asserted. Tests: 125 pass on the impacted classes (BR centralized-downgrade + all methodology + all DR). No regressions. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
R8 CI codex caught a P1 my local R7 reviewer missed — exactly the
`feedback_local_codex_vs_ci_codex_divergence.md` pattern.
**P1 — MPD non-numeric labels silently fell back to count-based,
undocumented as a deviation in REGISTRY**
R3's MPD branch returned `relative_times=None` for non-numeric
`reference_period` values (string period IDs, etc.), silently using
the legacy count-based normalized direction — but the REGISTRY note
described the γ-unit deviation as "resolved" without qualifying that
exception. Two-part fix:
1. **Better coercion** for datetime-like labels: new module-level
helper `_coerce_relative_times_from_reference` (`pretrends.py:92`)
handles three regimes:
- Numeric (`int` / `float` / `np.int64`) — direct `float()`
- `pandas.Period` / `Timestamp` / `np.datetime64` — subtraction-
based offset arithmetic (`.n` for Period, `.days` for Timedelta,
fall through to `/ np.timedelta64(1, 'D')`)
- Genuinely non-numeric (string period IDs, unranked categoricals)
— emits an explicit `UserWarning` documenting that the reported
MDV is NOT in Roth's γ units under this fallback, and recommends
re-fitting with numeric labels.
2. **Documentation alignment**: REGISTRY `## PreTrendsPower`
convention note and METHODOLOGY_REVIEW.md `## PreTrendsPower`
Verified Components checklist both enumerate the supported label
types (numeric + pandas.Period + Timestamp + datetime64) and
explicitly call out the non-numeric warn-and-fallback behavior as
a documented edge case (not a "resolved" deviation).
**P3 — `docs/api/pretrends.rst` still referenced removed `custom_delta`
parameter name**
The custom-violation entry in the violation-types section used the
parameter name `custom_delta`, but the actual API exposes
`violation_weights` (both on `PreTrendsPower` and on the helper
functions per PR-B Step 6). Fix: rename in docs and add a one-line
note that both the class and the helpers accept the kwarg.
**Tests** (`tests/test_methodology_pretrends.py::TestPretrendsLinearGrid`):
- `test_mpd_non_numeric_reference_falls_back_to_legacy_weights`
renamed to `..._warns_and_falls_back...` and now asserts the
explicit `UserWarning` is emitted (mentioning "γ units").
- NEW `test_mpd_pandas_period_reference_yields_numeric_relative_times`:
constructs a `MultiPeriodDiDResults` with `pd.Period('2019Q1..Q3')`
pre-periods and `reference_period=pd.Period('2019Q4')`, asserts the
derived `relative_times == [-3, -2, -1]` (quarters) and linear
weights = `[3, 2, 1]` in γ units. Locks the Period-arithmetic path
the codex specifically flagged.
The P3 R-parity-script placeholder is deferred to PR-C per the
existing TODO row (codex labeled it informational / non-blocker).
Tests: 403 pass across pretrends + DR + BR. 4 skipped (R-parity
stubs + 1 fixture skip). No regressions.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R9 verdict ✅ Looks good with one actionable P3: CHANGELOG.md L39 and REPORTING.md L324-342 overstated the legacy covariance-source path. Both said the `"diag_fallback_available_full_vcov_unused"` sentinel is gone / that legacy inference labels CS / SA + `event_study_vcov` as `"full_pre_period_vcov"`, but the R4 fix RESTORED the conservative sentinel for legacy precomputed results that lack the persisted `covariance_source` field — because without it we cannot distinguish a pre-PR-B fit (used diag) from a post-PR-B fit (used full Σ_22). Code and tests are correct; docs were the inconsistent piece. Fix: reword both surfaces to distinguish two paths explicitly: - **New fits** (post-PR-B): persisted `covariance_source` is read directly, non-bootstrap CS / SA report `"full_pre_period_vcov"` and are NOT downgraded. - **Legacy serialized results** (pre-PR-B, no field): legacy type-based inference still emits the conservative sentinel for CS / SA + populated `event_study_vcov`, and the `well_powered → moderately_powered` downgrade still applies. For legacy `MultiPeriodDiDResults` without `interaction_indices`, the fallback reports `"diag_fallback"` (genuine fallback, no downgrade). CHANGELOG entry expanded to list all four covariance-source DR regression tests by name; REPORTING.md "Pre-period covariance routing for staggered-estimator power" Note rewritten with the two-path structure. R-parity P3 deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). No source changes; 309 tests across pretrends + DR + BR continue to pass. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology No findings. The changed methodology surfaces align with the updated registry and the cited Roth paper: NIS is now the primary box-pretest form, detectable-slope reporting is in the paper’s Code Quality No findings. Performance No findings. Maintainability No findings. Tech Debt
Security No findings. Documentation/Tests
Verification note: I could not execute the numerical test suite in this environment because the project test dependencies are unavailable; I did verify AST parsing on the touched Python files. |
R10 verdict ✅ Looks good with one P3 actionable: four stale explanatory texts still describe pre-PR-B / pre-Step-6 behavior. **1. `PreTrendsPowerResults.violation_weights` docstring** (`pretrends.py:289-294`): said weights are "normalized" without qualification, but PR-B Step 4's linear-with-`relative_times` path intentionally persists the **unnormalized** `|t|` direction so that `δ_pre = M · |t|` and the reported MDV equals Roth's γ exactly. Reworded to enumerate the normalization regime per violation_type: constant / last_period / custom / linear-without-relative_times are L2-normalized; linear-with-relative_times is unnormalized γ-unit. **2. `tests/test_pretrends.py` helper-rejection tests** (`test_compute_pretrends_power_rejects_custom_violation_type` and `test_compute_mdv_rejects_custom_violation_type`, lines 610-635): docstrings said the helper "does not accept ``violation_weights``, so the custom path is unusable from the helper" — but PR-B Step 6 added the kwarg to both helpers. Reworded to scope the rejection contract correctly: it's the unsupplied-`violation_weights` case (loud-fail) that still raises, not the entire custom path. **3. CHANGELOG.md (Added) "Coming in the next commit"** (lines 16-17): both bullets said the test file and R script were "coming in the next commit". Removed — they were committed in PR-B Step 7 (test file) and Step 12 (R script) respectively. **4. CHANGELOG.md (Changed) "5-tuple"** (line 30): described `_extract_pre_period_params` as widening from a 4-tuple to a 5-tuple, but the R3 covariance-source-propagation fix later widened it to a 6-tuple `(effects, ses, vcov, n_pre, relative_times, covariance_source)`. Updated the bullet to describe the current 6-tuple shape and added a forward reference to the MPD pandas.Period / Timestamp / datetime64 coercion path (R8 fix) and the covariance_source provenance contract (R3 fix). The R-parity P3 remains deferred to PR-C per the existing TODO row (codex labeled it informational / non-blocker). 97 pretrends tests + DR + BR tests pass; no source-logic changes in this commit (docs + comments only). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology Code Quality Performance Maintainability Tech Debt
Security Documentation/Tests
|
R11 verdict ✅ with one P3 actionable: the Wald backward-compat
wording was overstated. CHANGELOG.md:L29-L30 and the methodology
test docstrings at L840-L875 said `pretest_form='wald'` preserves
pre-PR-B output bit-identically, but PR-B Step 4's `relative_times`
threading changed the linear-weight contract for BOTH NIS and Wald
paths — a Wald fit on an irregular grid `{-5, -3, -1}` now produces
γ-unit MDV (not the pre-PR-B count-based L2-normalized MDV).
Pre-PR-B Wald numerics are bit-identical to post-PR-B Wald output
only on (a) the legacy `relative_times=None` callable path
(`_get_violation_weights(n_pre)` invoked directly without threading),
and (b) regular-grid fits where `|t| ∝ [n_pre-1, ..., 0]`.
Fix:
- **CHANGELOG.md (Changed) L29**: reword to scope the Wald
backwards-compat claim to the **acceptance-region form** (the
noncentral-χ² on δ' Σ_22^{-1} δ, unchanged), and add an explicit
caveat that the linear-weight contract changed independently in
Step 4. Bit-identity holds only on the legacy callable path and
the regular-grid case. Removed the stale "tutorial 07 to be
re-rendered TBD" qualifier; tracked as a follow-up.
- **`tests/test_compute_pretrends_power_accepts_pretest_form_wald`**
docstring: clarify that the `'wald'` selector picks the
acceptance-region form, not bit-identical numerical output on
fitted results.
- **`test_wald_path_preserves_pre_pr_b_output`** renamed to
`test_wald_path_preserves_pre_pr_b_acceptance_region_form`;
docstring expanded with the two-regime backwards-compat scope
(legacy callable path: bit-identical; new fit() path: γ-unit
MDV on irregular grids). Added a positive NaN-check on
`nis_box_probability` to lock the Wald-only field contract.
The R-parity P3 remains deferred to PR-C per the existing TODO row
(codex labeled it informational / non-blocker).
30 methodology + 100 baseline pretrends tests pass; no source-logic
changes (docs / docstrings only).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality Performance Maintainability Tech Debt
Security Documentation/Tests
Path to Approval
|
R12 CI codex caught a real unit-mismatch bug the doc-rewording trail uncovered. After PR-B Step 4 made linear `mdv` report Roth's γ units (a slope on relative time), downstream code still divided it by level-scale quantities — `DiagnosticReport.pretrends_power` computed `mdv_share_of_att = mdv / abs(att)`, `is_informative` checked `mdv < 2 * max(pre_period_ses)`, and `sensitivity_to_honest_did` reported `mdv_in_ses = mdv / max_pre_se`. On an irregular grid `[-5, -3, -1]`, the level-scale max pre-period violation under the MDV is `mdv * 5`, but the raw `mdv` (γ) is what was being compared to `|att|` / SE — silently mis-tiering the same fit. This is the holistic-fix-on-repeated-doc-findings pattern: each prior round (R5/R9/R10/R11) the codex flagged Wald backwards-compat wording as "overstated" because the LINEAR-WEIGHT change bled into surfaces the doc claimed were unaffected — but the wording wasn't wrong, the IMPLEMENTATION was incomplete. **Holistic fix — new level-scale property `max_abs_pre_violation`** `PreTrendsPowerResults.max_abs_pre_violation` returns `mdv * max(|violation_weights|)` — the largest level-scale pre-period deviation under the MDV. This is the right unit-consistent scalar for comparison against `|att|`, per-period SEs, and HonestDiD's M. - For `linear` with `relative_times=[-T, ..., -1]`, weights = `|t|`, so `max_abs_pre_violation = mdv * T_max`. - For `constant` with normalized `[1, ..., 1]/√K`, weights ~ 1/√K, so `max_abs_pre_violation = mdv / √K`. - For `last_period` with `[0, ..., 0, 1]`, weights have max=1, so `max_abs_pre_violation = mdv`. - For `custom`, depends on user-supplied vector. - Legacy serialized results without `violation_weights` fall back to raw `mdv` (pre-PR-B count-based L2-normalized linear was already roughly level-scale, so the fallback gives the right magnitude). **Wired through 3 surfaces:** 1. `PreTrendsPowerResults.is_informative`: uses `max_abs_pre_violation < 2 * max_se` instead of `mdv < 2 * max_se`. 2. `PreTrendsPower.sensitivity_to_honest_did`: reports `mdv_in_ses = max_abs_pre_violation / max_pre_se`, and surfaces `max_abs_pre_violation` as a new dict key. 3. `DiagnosticReport._check_pretrends_power` and `_format_precomputed_pretrends_power`: `mdv_share_of_att = max_abs_pre_violation / abs(att)`. Schema also surfaces the new `max_abs_pre_violation` field. On `cs_fit` (`base_period='universal'`, seed=7, treatment_effect=1.5): - pre_periods = `[-4, -3, -2]`, `max(|t|) = 4` - `mdv` (γ) = 0.0937, `max_abs_pre_violation` = 0.375, `|att|` = 1.779 - pre-fix `mdv / |att|` = 0.053 (slope/level mismatch) - post-fix `max_abs_pre_violation / |att|` = 0.211 (level/level) - Tier: still `well_powered` (0.211 < 0.25), now interpretable. **Documentation updated** to match: - `docs/methodology/REPORTING.md` "Power-aware phrasing" Note: ratio definition changed from `mdv / abs(att)` to `max_abs_pre_violation / abs(att)`; rationale added inline. - `docs/api/pretrends.rst` Wald example: rewrote the "backwards- compatible numerical output" wording to scope bit-identity to regular grids / legacy `relative_times=None` path. PR-B Step 4's `relative_times` threading applies to BOTH NIS and Wald, so a Wald fit on an irregular grid also produces γ-unit MDV — not pre-PR-B numerics. - `CHANGELOG.md` (Fixed): new entry documenting the fix and the numerical-output change for downstream consumers. **New regressions:** - `test_max_abs_pre_violation_uses_weight_scale_on_irregular_grid`: constructs an irregular-grid `[-5, -3, -1]` fit and asserts `max_abs_pre_violation = mdv * 5`, with a guard that the value is > 4x the raw mdv to prevent any future revert. - `test_is_informative_uses_level_scale_not_raw_gamma`: constructs a fit where raw `mdv < 2*SE` (would say "informative") but `max_abs_pre_violation > 2*SE` (says "not informative"); asserts the level-scale check wins. - Updated `test_full_vcov_path_no_downgrade_on_real_cs_fit` (BR level): now pins `0.35 < block["max_abs_pre_violation"] < 0.40` and asserts `mdv_share_of_att < 0.25` against the new level-scale definition. 590 tests pass across pretrends + DR + BR + SA + staggered. 4 skipped (R-parity stubs + 1 fixture skip). No regressions outside the targeted ratio definition change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
Path to Approval
|
R13 CI codex caught the next adjacent unit-mismatch — analogous to
R12's linear/level fix but for `violation_type='constant'`. REGISTRY
`## PreTrendsPower` documents the pattern as `δ_t = c` (per-period
level shift), but `_get_violation_weights('constant')` still
L2-normalized to `[1/√K, ..., 1/√K]`, so `δ_t = M/√K` not `δ_t = M`.
The Step 4 γ-unit fix for linear and the R12 holistic
`max_abs_pre_violation` fix uncovered this — once `mdv` is the
documented magnitude of a per-period shift on linear, the constant
pattern's silent `1/√K` re-scaling becomes visible and breaks the
documented contract.
**Holistic fix — constant/last_period skip L2 normalization**
- `_get_violation_weights('constant')`: early return `np.ones(n_pre)`,
no L2 norm. Now `δ_t = M` exactly per period — matching the
REGISTRY/API documented contract.
- `_get_violation_weights('last_period')`: also given an explicit
early return for symmetry. The `[0, ..., 0, 1]` vector already had
L2 norm 1, so this is a no-op numerically; the early return locks
the contract uniformly across level-pattern violation types.
- `power_at()` legacy reconstruction (fallback for old serialized
results without `violation_weights`): unchanged — still
L2-normalizes, preserving pre-PR-B numerical output for legacy
serialized fits per the same backwards-compat policy applied to
the linear-legacy and constant-legacy paths.
- Docstring on `_get_violation_weights` rewritten to enumerate the
per-type normalization convention explicitly: linear-with-times
(γ-unit), linear-legacy (L2-norm), constant (level), last_period
(level), custom (L2-norm).
**End-to-end regression** in `test_methodology_pretrends.py`:
- `test_constant_violation_pattern_is_level_shift`: real
`SunAbraham`-fit results, asserts `violation_weights == [1, ..., 1]`
(NOT L2-normalized → `||w||_2 = √K`), `max_abs_pre_violation == mdv`
(level-scale and γ-scale coincide for constant), and
`power_at(M) ≈ refit(M=M).power` at `atol=1e-4` so the
level-shift contract holds end-to-end through `fit()` and
`power_at()`. Codex specifically requested an end-to-end lock so
future audits cannot drift between "per-period shift" and
"normalized-direction magnitude".
- `test_constant_weights` in `tests/test_pretrends.py` flipped: was
pinning `np.linalg.norm(weights) == 1.0` (the OLD L2-normalized
contract); now asserts unnormalized `[1, 1, 1, 1]` with
`||w||_2 = 2` (√4). Docstring explains the contract change.
**P3 fix — BR rendered label**
`BusinessReport.full_report()` still labeled the
`mdv_share_of_att` row as `MDV / |ATT|`, but R12 redefined the
numerator as `max_abs_pre_violation`. Fix: rename the rendered
label to "Max pre-period level deviation / |ATT|" and add an
explicit row for `Max pre-period level deviation at MDV` above it
so users can see both the raw `mdv` and the level-scale scalar.
**Behavior change for users:** any caller passing
`violation_type='constant'` will see a √K-factor change in the
reported `mdv` and downstream `mdv_share_of_att`. The shift is
documented in REGISTRY `## PreTrendsPower` (the pattern was
`δ_t = c` all along — the IMPLEMENTATION is now what the docs
have always said).
445 tests pass across pretrends + DR + BR + SA. 4 skipped (R-parity
+ 1 fixture skip).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
R14 verdict ✅ Looks good with one P2 BR-schema-gap + a P3 fan-out
of stale "MDV vs ATT" prose that the R12 level-scale fix needed to
sweep through but missed.
**P2 — `_lift_pre_trends` dropped `max_abs_pre_violation`**
R12 added `max_abs_pre_violation` to the `DiagnosticReport` schema
and wired the `BusinessReport.full_report()` renderer to print it,
but the BR schema-lift helper at `business_report.py:884` did NOT
carry the field across the DR → BR boundary. Net effect:
`BR.to_dict()["pre_trends"]` was missing the field AND
`full_report()`'s `pt.get("max_abs_pre_violation")` returned None,
so the new "Max pre-period level deviation at MDV" line never
actually rendered.
Fix: add `"max_abs_pre_violation": pp.get("max_abs_pre_violation")`
to the `_lift_pre_trends` return dict. New BR end-to-end regression
asserts both `BR.to_dict()["pre_trends"]["max_abs_pre_violation"]`
is populated AND `full_report()` contains the rendered line.
**P3 — Stale "MDV / |ATT|" prose in 4 surfaces**
R12 moved the tier numerator from raw `mdv` to `max_abs_pre_violation`
but several user-facing prose surfaces still said the comparison
was between "MDV" and "estimated effect" — wording lag, not a
behavioral bug.
1. `business_report.py:2167` "the test is well-powered" summary
sentence: reworded to say "the max pre-period level deviation at
the MDV is small relative to the estimated effect" rather than
the bare "minimum-detectable violation is small".
2. `diagnostic_report.py:3284` DR "no_detected_violation /
well_powered" sentence: same swap from "MDV is a small share of
the estimated effect" to "the max pre-period level deviation at
the MDV is a small share".
3. `PreTrendsPowerResults.violation_weights` docstring: reworded to
enumerate per-violation_type normalization explicitly (linear
with-times γ-unit; linear legacy L2-norm; constant unnormalized
level-shift; last_period level; custom L2-norm).
4. `PreTrendsPowerResults.max_abs_pre_violation` property docstring
(the non-linear-types paragraph): updated to reflect the R13
constant-level-shift change (`mdv * 1 = mdv` rather than the old
`mdv / √K`).
Plus the autosummary RST adds the new
`~PreTrendsPowerResults.max_abs_pre_violation` property entry so
the published API page lists it.
R-parity P3 deferred to PR-C per the existing TODO row.
591 tests pass; no code-path regressions. The new BR regression
catches the lift-boundary bug if it ever regresses.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
|
🔁 AI review rerun (requested by @igerber) Head SHA: Overall Assessment ✅ Looks good Executive Summary
Methodology
Code Quality
Performance
Maintainability
Tech Debt
Security
Documentation/Tests
|
Summary
pretest_form='nis'onPreTrendsPower,compute_pretrends_power, andcompute_mdv; Wald noncentral-χ² form retained aspretest_form='wald'for shipped numerical compatibility (Propositions 1+3+4 still apply — Wald acceptance region is a convex ellipsoid).event_study_vcovinstead ofdiag(ses**2).SunAbrahamResultsextended withevent_study_vcov+event_study_vcov_indexfields built viaW @ vcov_cohort @ W.T(cohort-aggregation matrix); bootstrap and replicate-weight paths clear these to preserve analytical/bootstrap SE separation.violation_type='linear':_get_violation_weights('linear')now consumes actual pre-period relative-time labels and skips L2 normalization, so reported MDV equals Roth's γ exactly on irregular and anticipation-shifted grids.violation_weightsfield onPreTrendsPowerResultslifts the PR-ANotImplementedErrorguard forpower_at(violation_type='custom')on fresh fits; the helper API (compute_pretrends_power,compute_mdv) now acceptsviolation_weightsandpretest_formend-to-end.DiagnosticReport/BusinessReport: newcovariance_sourcefield onPreTrendsPowerResultsrecords the actual extraction path at fit time;DiagnosticReportconsumes it directly so the report layer no longer guesses provenance from result-type alone (fixed the silent downgrade where correctly-computed full-VCV CS/SA fits were being tagged asmoderately_powered). Legacy fallback for results lacking the field preserves the PR-A conservative downgrade contract.tests/test_methodology_pretrends.py(8 classes, 33 tests including 3 R-parity stubs that skip withoutr_pretrends_golden.json— populated in PR-C). Newbenchmarks/R/generate_pretrends_golden.Rscript committed with placeholder commit reference (PR-C pins the auditedpretrendsrevision and lands the JSON goldens).METHODOLOGY_REVIEW.mdPreTrendsPowerrow promoted: In Progress → Complete (R parity pending), Last Review:2026-05-18. REGISTRY## PreTrendsPowersection rewritten with explicit NIS + Wald equation blocks and item-by-item Requirements checklist. REPORTING.md anddocs/api/_autosummary/*.rstbrought into alignment with the new surface.Methodology references
compute_pretrends_power,compute_mdv), with adapter routing throughCallawaySantAnnaResults/SunAbrahamResults.event_study_vcovandMultiPeriodDiDResults.interaction_indices.event_study_vcovconstruction via the W-matrix aggregationW @ vcov_cohort @ W.T.docs/methodology/papers/roth-2022-review.md(landed via PR PreTrendsPower: Roth (2022) paper review + silent-failure guard #463).pretest_form='wald'(paper-supported alternative under Propositions 1+3+4); not the paper's primary analysis convention but stays available for shipped users.relative_timesis threaded through (viafit()); legacy callers that bypassfit()and pass onlyn_prekeep the previous count-based L2-normalized direction for backwards compatibility — explicitly documented in REGISTRY's## PreTrendsPowerNote.pretrendspackage parity is deferred to PR-C (mirroring Bacon's PR BaconDecomposition methodology audit (Goodman-Bacon 2021) #454 → PR BaconDecomposition R parity goldens #457 cadence); generator scriptbenchmarks/R/generate_pretrends_golden.Ris committed in this PR with a<PR-C-PIN>placeholder commit reference. The 3-tier parity contract (NIS box probability at fixed γ; γ_p MDV at target 0.5 and 0.8; γ-unit MDV invariance) is documented in the script header and locked intests/test_methodology_pretrends.py::TestPretrendsParityR(currently@pytest.mark.skipifagainst the missing JSON).Validation
tests/test_methodology_pretrends.py(NEW, 924 lines, 8 classes / 33 tests covering Roth Section II.A-B paper-equation-numbered Verified Components: K=1 closed-form, NIS power vs MC simulation at K=2,3 with correlated Σ_22, MDV inversion round-trip, NIS-vs-Wald differentiation, linear-units γ-scale on irregular and anticipation-shifted grids, custom-weight persistence, CS/SA full-VCV adapter routing, helper API end-to-end, JSON-serializability ofto_dict, R-parity stubs).tests/test_pretrends.py— bulk pin topretest_form='wald'on existing tests so shipped numerical assertions are byte-identical (only 3 tests required content-level updates because most assertions are form-invariant).tests/test_diagnostic_report.py— added 3 new regression tests: persisted-full_pre_period_vcov-no-downgrade, legacy-missing-field-still-downgraded, persisted-label-takes-precedence-over-inference, and legacy-MPD-without-interaction_indicesreportsdiag_fallback.tests/test_business_report.py— newtest_full_vcov_path_no_downgrade_on_real_cs_fitwith deterministic tier pins (ratio < 0.25 →well_powered, asserted unconditionally, plus positive prose contract onsummary()).event_study_vcovdiagonal-entry sanity check verifiesevent_study_vcov[i,i] = se(e_i)²from_compute_iw_effectsatatol=1e-10. Tutorialdocs/tutorials/07_pretrends_power.ipynbis unchanged — PR-B does NOT re-render the notebook (a tutorial-rendering follow-up is tracked in TODO)./ai-review-local --backend codexrounds (R1-R6) plus the R7 verdict✅ Looks goodwith zero findings of any severity. Seefeedback_local_codex_review_before_pr_submission.md.Security / privacy
🤖 Generated with Claude Code