dCDH heterogeneity wrap-up: post-drop rank df + negative-baseline path test#452
Conversation
Two follow-ups closing the residual Low TODOs from PR #449:

1. **#64 — Rank-deficient df threading.** `_compute_heterogeneity_test`'s non-survey OLS path now computes `df = n_obs - rank(design)` via `_detect_rank_deficiency` (the same helper `solve_ols` calls internally), matching R's `df.residual = n - rank(design)` post-drop convention. For full-rank designs `rank == n_params`, so behavior is bit-identical to the pre-PR `n_obs - n_params` path — all 4 forward-horizon parity tests still pass at the same `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` / `INFERENCE_RTOL=1e-4` tolerances. For near-rank-deficient designs that `solve_ols` retains rather than NaN-fills (e.g., cohort collinearity at high horizons), the post-drop rank is strictly lower than `n_params`, so the post-PR `df` is strictly larger, matching R's `lm()` convention. Locked by `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo::test_heterogeneity_df_uses_post_drop_rank`, which pins `safe_inference(... df=n_obs - rank)` against the stored `t_stat` / `p_value` / `conf_int` at `atol=1e-12`.

2. **#62 — Negative-baseline path regression test.** New `TestByPathNonBinary::test_negative_baseline_path_supported` exercises switchers with `D_{g,1} = -1` and asserts that `path_effects` correctly contains the negative-baseline tuple keys `(-1, 0, 0, 0)` and `(-1, 1, 1, 1)`. The existing `test_negative_integer_D_supported` only covered paths with negative values in non-baseline positions (e.g., `(0, -1, -1, -1)`), which does not trigger R's documented `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key matching is correct under any baseline value; this test pins that contract. No R-parity fixture is added because R is the buggy side on this regime — the deviation is already documented in the REGISTRY non-binary treatment Note.

TODO.md drops the two corresponding Low rows (#419 negative-baseline test gap + the rank-deficient df threading follow-up). The REGISTRY heterogeneity caveat is updated to drop the "rank-deficient gap" prose and replace it with the positive "uses post-drop rank, matching R's lm() convention" claim. CHANGELOG `[Unreleased]` extended with both items. Test count: 314 -> 316.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
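The post-drop df convention described above can be sketched in a few lines of numpy. This is a minimal illustration of the `n_obs - rank(design)` rule, not the package's `_detect_rank_deficiency` implementation; the `post_drop_df` helper name is hypothetical:

```python
import numpy as np

def post_drop_df(X: np.ndarray) -> int:
    """df = n_obs - rank(design): R's lm() alias-drops collinear
    columns, so residual df counts only the retained parameters."""
    n_obs = X.shape[0]
    return n_obs - np.linalg.matrix_rank(X)

rng = np.random.default_rng(0)

# Full-rank design: rank == n_params, so the post-drop df equals the
# pre-drop n_obs - n_params (here 10 - 2 = 8).
X_full = np.column_stack([np.ones(10), rng.normal(size=10)])
assert post_drop_df(X_full) == 8

# Rank-deficient design: a duplicated column leaves rank at 2 even
# though n_params is 3, so the post-drop df (8) exceeds pre-drop (7).
X_def = np.column_stack([X_full, X_full[:, 1]])
assert post_drop_df(X_def) == 8
```

Pre-drop `n_obs - n_params` would report 7 for `X_def`; the post-drop convention matches what R's `lm()` reports after alias-dropping the duplicate column.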
Three findings from the pre-push codex audit on the wrap-up bundle:

1. **P1 (Methodology).** The new `df = n_obs - rank(design)` claim only partially held: the small-sample short-circuit in `_compute_heterogeneity_test` still used the pre-drop `n_obs <= n_params`, so the boundary case `n_obs == n_params > rank(design)` would NaN-fill in Python while R's `lm()` would alias-drop and fit with `df = n_obs - rank`. Replaced with an `n_obs <= rank` short-circuit computed via `_detect_rank_deficiency` BEFORE the OLS call. Also added an X_het-aliased check after `solve_ols`: if the X_het column itself is the alias-dropped one (rare; pivoted QR keeps columns with larger norm), the inference tuple is NaN-filled, matching R's `lm()` returning NA for aliased coefficients.

2. **P2 (Tests).** The initial `test_heterogeneity_df_uses_post_drop_rank` built a panel where all switchers shared a single cohort, which collapsed the cohort dummies to zero columns and left a full-rank `[intercept, X_het]` regression. The test only re-verified the unchanged full-rank `df = n_obs - 2` path; it would have passed even if the post-drop-rank wiring never fired. Rewrote it with a panel of 5 switchers across 4 unique cohorts where X_het = 1 on the F_g=3 reference cohort and 0 elsewhere, producing the exact collinearity `X_het = intercept - sum(non-reference cohort dummies)`. The design has 5 columns and rank 4: pre-PR would short-circuit; post-PR fits with df=1. The test now also pins that `safe_inference(beta, se, df=n_obs - n_params)` would produce a DIFFERENT p_value, catching any reversion to pre-drop `n_params`. Added a sibling `test_heterogeneity_underidentified_nan_fills` pinning that the `n_obs <= rank` guard still NaN-fills the degenerate case (4 obs, rank 4, df=0).

3. **P3 (Docs).** The REGISTRY heterogeneity Note still said `df = n_obs - n_params` immediately before describing post-drop-rank behavior — internally inconsistent at the exact point users verify deviations. Rewrote it to say `df = n_obs - rank(design)` consistently and to mention the matching `n_obs <= rank` guard.

317 tests pass (+1 new). Forward-horizon parity scenarios (20-23) remain bit-identical at the same tolerances.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
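The P2 collinearity construction can be sketched directly. The cohort labels below are hypothetical stand-ins for the test's panel, chosen only to reproduce the `X_het = intercept - sum(non-reference cohort dummies)` identity:

```python
import numpy as np

# 5 switchers across 4 unique cohorts; cohort 2 plays the role of the
# reference cohort (labels are illustrative, not the test's actual data).
cohort = np.array([0, 1, 2, 3, 2])

intercept = np.ones(5)
# Dummies for the 3 non-reference cohorts only.
dummies = np.column_stack([(cohort == c).astype(float) for c in (0, 1, 3)])
# X_het = 1 on the reference cohort, 0 elsewhere -> exactly
# intercept - sum(non-reference cohort dummies).
x_het = (cohort == 2).astype(float)
assert np.allclose(x_het, intercept - dummies.sum(axis=1))

X = np.column_stack([intercept, dummies, x_het])
n_obs, n_params = X.shape            # (5, 5)
rank = np.linalg.matrix_rank(X)      # 4: one column is aliased

# A pre-drop guard (n_obs <= n_params) would short-circuit here; the
# post-drop convention fits with df = n_obs - rank = 1.
assert (n_obs, n_params, rank) == (5, 5, 4)
assert n_obs - rank == 1
```

This is exactly the regime where the pre-drop and post-drop guards disagree, which is what makes the rewritten test discriminating.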
Two P3 informational findings from R2:

1. TODO.md's Tier B backlog still listed the dCDH heterogeneity df-threading and by-path placebo predict_het items as open, but PR #449 closed both. Replaced the two stale bullets with a single bullet for the remaining survey + backward-horizon allocator derivation (the one Medium follow-up explicitly tracked in the wrap-up commit).

2. Three test-prose comments still said `df = n_obs - n_params` while the implementation and REGISTRY now use `df = n_obs - rank(design)`. Updated each comment to the post-drop-rank wording; full-rank designs continue to have `rank == n_params`, so the SE-derivation invariants under test are unchanged. Comment-only drift; no behavior change.

317 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Overall Assessment: ✅ Looks good

Verification note: I could not execute the test suite in this environment because the local runtime does not have …
Single CI codex R1 P3 finding: the non-binary `by_path` Note in REGISTRY still said the negative-baseline regression (`(-1, 0, 0, 0)`) was deferred, but this PR adds `test_negative_baseline_path_supported`, which covers exactly that regime. Updated the Note to list both negative-D regression tests (non-baseline and baseline positions) as shipped Python-only coverage. No code/test change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
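The distinction between the two regimes can be sketched in Python terms: tuple keys are baseline-safe, while a first-character parse (the analogue of R's `substr(path, 1, 1)`) is not. The `path_effects` layout and the `|`-separated serialization below are hypothetical, for illustration only:

```python
# Hypothetical path-effect mapping keyed by treatment-path tuples.
path_effects = {
    (-1, 0, 0, 0): 0.12,    # negative baseline D_{g,1} = -1
    (-1, 1, 1, 1): 0.34,
    (0, -1, -1, -1): 0.56,  # negative values only in non-baseline slots
}

# Tuple-key matching recovers the true baseline for every path.
assert all(path[0] == base
           for path, base in zip(path_effects, (-1, -1, 0)))

# A substr(path, 1, 1)-style parse takes only the first character of a
# serialized path, so a "-1" baseline is truncated to "-".
serialized = "-1|0|0|0"
assert serialized[0] == "-"                  # mis-read: "-" not "-1"
assert int(serialized.split("|")[0]) == -1   # correct: parse the full token
```

The same truncation failure hits multi-character positive baselines (D >= 10), which is why tuple keys sidestep both regimes at once.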
🔁 AI review rerun (requested by @igerber). Head SHA: …

Overall Assessment: ✅ Looks good

- Methodology: no findings. The affected methods are …
- Code Quality: no findings. The changed inference path still uses …
- Performance: no findings. The extra rank-detection pass is an added QR per horizon, but it is on the same small design that …
- Maintainability: no findings. Reusing …
- Tech Debt: no findings. The PR properly retires the previously tracked low-priority follow-ups for negative-baseline path coverage and pre-drop heterogeneity df handling, while leaving the still-underived survey + backward-horizon heterogeneity limitation tracked in …
- Security: no findings.
- Documentation/Tests: …

Verification note: …
Single CI codex R2 P3 finding: the non-survey heterogeneity parity path uses OLS (only the survey path uses WLS), but two recently-added comments described its df source as "WLS df". Updated:

- REGISTRY heterogeneity Note: "df from the WLS regression" -> "df from the OLS regression".
- Parity test inline comment: "WLS df" -> "OLS df".

Wording drift only; no behavior or test contract change. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
🔁 AI review rerun (requested by @igerber). Head SHA: …

Overall Assessment: ✅ Looks good

- Methodology: no other methodology findings. The cited paper's Lemma 7 is the heterogeneity-testing result, and the REGISTRY still documents Python as a partial implementation that does not attempt the joint F-test. On the changed path itself, the CRAN reference implementation builds the heterogeneity CI critical value from …
- Code Quality / Performance / Maintainability / Tech Debt / Security / Documentation/Tests: …

Verification note: static inspection only. Test execution was not possible in this sandbox because the Python environment is missing …
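The `qt(0.975, df.residual(model))` critical-value convention cited above maps to `scipy.stats.t.ppf` in Python. A minimal sketch of why the df source matters for the heterogeneity CI (the `conf_int` helper is hypothetical, not the package's API):

```python
from scipy.stats import t

def conf_int(beta: float, se: float, df: int, level: float = 0.95):
    """Two-sided CI built from the t critical value at the residual df,
    mirroring R's qt(0.975, df.residual(model)) at level=0.95."""
    crit = t.ppf(0.5 + level / 2, df)
    return beta - crit * se, beta + crit * se

# With df=1 (e.g., 5 obs, rank-4 design) the critical value is ~12.71,
# far wider than the normal approximation's 1.96 — so getting the df
# from post-drop rank rather than n_params changes the interval.
lo, hi = conf_int(beta=0.4, se=0.1, df=1)
assert abs(t.ppf(0.975, 1) - 12.7062) < 1e-3
assert lo < 0 < hi   # df=1 interval spans zero here
```

This is only a sketch of the df convention; the package's actual inference tuple construction (`safe_inference`) is described in the PR comments above.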
Summary
Two follow-ups closing the residual Low TODOs from PR #449:
1. **#64 — Rank-deficient df threading** (issue: "Prepare v2.0.2 release: SE accuracy improvements").
   `_compute_heterogeneity_test`'s non-survey OLS path now uses `df = n_obs - rank(design)` (matching R's `lm()` / `df.residual` post-drop convention) AND uses `n_obs <= rank` as the small-sample short-circuit (instead of pre-drop `n_obs <= n_params`). For full-rank designs `rank == n_params` and behavior is bit-identical to the pre-PR path — all 4 forward-horizon R-parity scenarios remain at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` / `INFERENCE_RTOL=1e-4`. For near-rank-deficient designs that `solve_ols` retains rather than NaN-fills (e.g., cohort collinearity at high horizons), the post-drop rank is strictly lower than `n_params`, so the post-PR `df` is larger AND the post-PR short-circuit doesn't NaN-fill boundary cases R would fit. Locked by two new tests: `test_heterogeneity_df_uses_post_drop_rank` (5 obs, rank 4, fits with df=1; pre-PR would have short-circuited) and `test_heterogeneity_underidentified_nan_fills` (4 obs, rank 4, df=0 → NaN-fill).

2. **#62 — Negative-baseline path regression test** (issue: "Update CHANGELOG for v2.0.0 and v2.0.1 releases").
   New `TestByPathNonBinary::test_negative_baseline_path_supported` exercises switchers with `D_{g,1} = -1` and asserts that `path_effects` correctly contains negative-baseline tuple keys (`(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). The existing `test_negative_integer_D_supported` only covered paths with negative values in non-baseline positions; this closes the test-coverage gap from PR #419 ("Broaden dCDH by_path R-parser caveat to cover negative integers (re-audit follow-up to #401)") that we couldn't trigger from inside that PR (R's `substr(path, 1, 1)` baseline-extraction bug regime). Python's tuple-key matching is correct under any baseline value; no R-parity fixture is added because R is the buggy side on this regime — the deviation is documented in the REGISTRY non-binary treatment Note.

Methodology references (required if estimator / math changes)

- `ChaisemartinDHaultfoeuille.predict_het` non-survey OLS inference (Web Appendix Section 1.5 / Lemma 7); `by_path` path-tuple encoding (negative-baseline support).
- DIDmultiplegtDYN 2.3.3 `did_multiplegt_main` (the `t_stat <- qt(0.975, df.residual(model))` site) for the df convention.
- R's `substr(path, 1, 1)` baseline-extraction bug for multi-character baselines (D ≥ 10 OR negative D). Python's tuple-key matching is correct on both regimes; no R-parity fixture is added because R is the buggy side. Deviation documented in the REGISTRY non-binary treatment Note (pre-PR; this PR only adds a Python-side regression test pinning the negative-baseline case).

Validation

- `test_negative_baseline_path_supported` — pins that the `(-1, 0, 0, 0)` and `(-1, 1, 1, 1)` tuple keys appear in `path_effects` for switchers with `D_{g,1} = -1`.
- `test_heterogeneity_df_uses_post_drop_rank` — pins that the post-drop rank short-circuit fires correctly on a constructed `n_obs == n_params > rank` design.
- `test_heterogeneity_underidentified_nan_fills` — pins that the `n_obs <= rank` guard still NaN-fills genuine degenerate cases.
- Test-prose comments updated from `df = n_obs - n_params` to `df = n_obs - rank(design)` (full-rank designs have `rank == n_params`).
- Parity fixtures in `benchmarks/data/dcdh_dynr_golden_values.json` remain pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` / `INFERENCE_RTOL=1e-4`, verifying the post-drop rank change is bit-identical for `rank == n_params`.

Security / privacy
🤖 Generated with Claude Code