
dCDH heterogeneity wrap-up: post-drop rank df + negative-baseline path test#452

Merged
igerber merged 5 commits into main from dcdh-heterogeneity-wrapup on May 16, 2026

Conversation

@igerber
Owner

@igerber igerber commented May 16, 2026

Summary

Two follow-ups closing the residual Low TODOs from PR #449:

  • Prepare v2.0.2 release: SE accuracy improvements #64 — Rank-deficient df threading. _compute_heterogeneity_test's non-survey OLS path now uses df = n_obs - rank(design) (matching R's lm() / df.residual post-drop convention) AND uses n_obs <= rank as the small-sample short-circuit (instead of pre-drop n_obs <= n_params). For full-rank designs rank == n_params and behavior is bit-identical to the pre-PR path — all 4 forward-horizon R-parity scenarios remain at BETA_RTOL=1e-6 / SE_RTOL=1e-5 / INFERENCE_RTOL=1e-4. For near-rank-deficient designs that solve_ols retains rather than NaN-fills (e.g., cohort-collinearity at high horizons), the post-drop rank is strictly lower than n_params so the post-PR df is larger AND the post-PR short-circuit doesn't NaN-fill boundary cases R would fit. Locked by two new tests: test_heterogeneity_df_uses_post_drop_rank (5 obs, rank 4, fits with df=1; pre-PR would have short-circuited) and test_heterogeneity_underidentified_nan_fills (4 obs, rank 4, df=0 → NaN-fill).

  • Update CHANGELOG for v2.0.0 and v2.0.1 releases #62 — Negative-baseline path regression test. New TestByPathNonBinary::test_negative_baseline_path_supported exercises switchers with D_{g,1} = -1 and asserts that path_effects correctly contains negative-baseline tuple keys ((-1, 0, 0, 0), (-1, 1, 1, 1)). The existing test_negative_integer_D_supported only covered paths with negative values in non-baseline positions; this closes the test-coverage gap from PR Broaden dCDH by_path R-parser caveat to cover negative integers (re-audit follow-up to #401) #419 that we couldn't trigger from inside that PR (R's substr(path, 1, 1) baseline-extraction bug regime). Python's tuple-key matching is correct under any baseline value; no R-parity fixture is added because R is the buggy side on this regime — the deviation is documented in the REGISTRY non-binary treatment Note.
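The df convention change in the first bullet can be sketched in plain numpy. Note that `_compute_heterogeneity_test`, `solve_ols`, and `_detect_rank_deficiency` are this library's internals; the standalone sketch below substitutes `numpy.linalg.matrix_rank` for the QR-based rank detector, so it illustrates the convention rather than the shipped code path:

```python
import numpy as np

def heterogeneity_df(design: np.ndarray) -> float:
    """Residual df using the post-drop rank convention (R's df.residual)."""
    n_obs, n_params = design.shape
    # Stand-in for the library's QR-based _detect_rank_deficiency helper.
    rank = int(np.linalg.matrix_rank(design))
    if n_obs <= rank:           # post-PR short-circuit (was: n_obs <= n_params)
        return float("nan")     # genuinely underidentified -> NaN-fill inference
    return n_obs - rank         # == n_obs - n_params when the design is full rank

# Full-rank design: rank == n_params, so df matches the pre-PR convention.
rng = np.random.default_rng(0)
X_full = rng.normal(size=(10, 3))
assert heterogeneity_df(X_full) == 7

# Rank-deficient design: 5 obs, 5 columns, rank 4 -> df = 1. The pre-PR
# n_obs <= n_params guard would have short-circuited this to NaN.
basis = np.eye(5)[:, :3]
X = np.column_stack([np.ones(5), basis, np.ones(5) - basis.sum(axis=1)])
assert np.linalg.matrix_rank(X) == 4
assert heterogeneity_df(X) == 1
```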

Methodology references (required if estimator / math changes)

  • Method name(s): ChaisemartinDHaultfoeuille.predict_het non-survey OLS inference (Web Appendix Section 1.5 / Lemma 7); by_path path-tuple encoding (negative-baseline support).
  • Paper / source link(s): de Chaisemartin & D'Haultfœuille (NBER WP 29873) Web Appendix Section 1.5 (Lemma 7); R DIDmultiplegtDYN 2.3.3 did_multiplegt_main (t_stat <- qt(0.975, df.residual(model)) site) for the df convention.
  • Any intentional deviations from the source (and why):
    • R has a documented substr(path, 1, 1) baseline-extraction bug for multi-character baselines (D ≥ 10 OR negative D). Python's tuple-key matching is correct on both regimes; no R-parity fixture is added because R is the buggy side. Deviation documented in the REGISTRY non-binary treatment Note (pre-PR; this PR only adds a Python-side regression test pinning the negative-baseline case).
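The string-vs-tuple contrast behind this deviation can be illustrated schematically. This is not the actual R or library code; it assumes (for illustration only) that the R side stringifies the path before extracting the baseline with a one-character substr:

```python
# Schematic contrast: substr-style baseline extraction vs. tuple-key matching.
def baseline_from_string(path_str: str) -> str:
    # Mimics R's substr(path, 1, 1): only correct for single-character baselines.
    return path_str[0]

def baseline_from_tuple(path: tuple) -> int:
    # Tuple-key matching: correct for any integer baseline, including -1 or 10.
    return path[0]

path = (-1, 0, 0, 0)
assert baseline_from_tuple(path) == -1            # correct on the bug regime

path_str = "".join(str(d) for d in path)          # "-1000" (illustrative encoding)
assert baseline_from_string(path_str) == "-"      # one-character extraction garbles it
```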

Validation

  • Tests added/updated:
    • New (test-only Update CHANGELOG for v2.0.0 and v2.0.1 releases #62): test_negative_baseline_path_supported — pins that the (-1, 0, 0, 0) and (-1, 1, 1, 1) tuple keys appear in path_effects for switchers with D_{g,1} = -1.
    • New (Prepare v2.0.2 release: SE accuracy improvements #64): test_heterogeneity_df_uses_post_drop_rank — pins that the post-drop rank short-circuit fires correctly on a constructed n_obs == n_params > rank design.
    • New (Prepare v2.0.2 release: SE accuracy improvements #64): test_heterogeneity_underidentified_nan_fills — pins that the n_obs <= rank guard still NaN-fills genuine degenerate cases.
    • Updated: 3 test-prose comments updated from df = n_obs - n_params to df = n_obs - rank(design) (full-rank designs have rank == n_params).
    • 317 dCDH tests pass (was 314 pre-PR; +3 new). All 4 forward-horizon R-parity scenarios bit-identical.
  • Backtest / simulation / notebook evidence: full-rank R-parity scenarios (20-23) in benchmarks/data/dcdh_dynr_golden_values.json remain pinned at BETA_RTOL=1e-6 / SE_RTOL=1e-5 / INFERENCE_RTOL=1e-4, verifying the post-drop rank change is bit-identical for rank == n_params.

Security / privacy

  • Confirm no secrets/PII in this PR: yes (canonical secret-pattern scan returned no hits).

🤖 Generated with Claude Code

igerber and others added 3 commits May 16, 2026 07:20
…h test

Two follow-ups closing the residual Low TODOs from PR #449:

1. #64 — Rank-deficient df threading. `_compute_heterogeneity_test`'s
   non-survey OLS path now computes `df = n_obs - rank(design)` via
   `_detect_rank_deficiency` (the same helper `solve_ols` calls
   internally), matching R's `df.residual = n - rank(design)`
   post-drop convention. For full-rank designs `rank == n_params` and
   behavior is bit-identical to the pre-PR `n_obs - n_params` path —
   all 4 forward-horizon parity tests still pass at the same
   `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` / `INFERENCE_RTOL=1e-4`
   tolerances. For near-rank-deficient designs that `solve_ols`
   retains rather than NaN-fills (e.g., cohort-collinearity at high
   horizons), the post-drop rank is strictly lower than `n_params`,
   so the post-PR `df` is strictly larger, matching R's `lm()`
   convention. Locked by
   `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo::test_heterogeneity_df_uses_post_drop_rank`
   which pins `safe_inference(... df=n_obs - rank)` against the
   stored `t_stat` / `p_value` / `conf_int` at `atol=1e-12`.

2. #62 — Negative-baseline path regression test. New
   `TestByPathNonBinary::test_negative_baseline_path_supported`
   exercises switchers with `D_{g,1} = -1` and asserts that
   `path_effects` correctly contains negative-baseline tuple keys
   (`(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). The existing
   `test_negative_integer_D_supported` only covered paths with
   negative values in non-baseline positions (e.g.,
   `(0, -1, -1, -1)`), which does not trigger R's documented
   `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key
   matching is correct under any baseline value; this test pins the
   contract. No R-parity fixture is added because R is the buggy
   side on this regime — the deviation is already documented in the
   REGISTRY non-binary treatment Note.
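The collinear design that the rewritten df test constructs (described under the audit findings below as "X_het = intercept - sum(non-reference cohort dummies)") can be reproduced schematically in a few lines. Cohort labels and the reference-cohort choice here are illustrative, not the test's literal fixture:

```python
import numpy as np

# 5 switchers across 4 unique cohorts; cohort 3 is the reference (no dummy).
cohorts = np.array([3, 3, 4, 5, 6])
dummies = np.column_stack([(cohorts == c).astype(float) for c in (4, 5, 6)])
intercept = np.ones(5)

# X_het = 1 on the reference cohort, 0 elsewhere, which equals exactly
# intercept - sum(non-reference cohort dummies): the targeted collinearity.
x_het = (cohorts == 3).astype(float)
assert np.allclose(x_het, intercept - dummies.sum(axis=1))

design = np.column_stack([intercept, x_het, dummies])
assert design.shape == (5, 5)                       # n_obs == n_params == 5
assert np.linalg.matrix_rank(design) == 4           # post-drop df = 5 - 4 = 1
```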

TODO.md: drops the two corresponding Low rows (#419 negative-baseline
test gap + follow-up rank-deficient df threading). REGISTRY heterogeneity
caveat updated to drop the "rank-deficient gap" prose and replace with
the positive "uses post-drop rank, matching R's lm() convention" claim.
CHANGELOG `[Unreleased]` extended with both items.

Test count: 314 -> 316.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three findings from the pre-push codex audit on the wrap-up bundle:

1. P1 (Methodology). The new `df = n_obs - rank(design)` claim only
   partially held: the small-sample short-circuit at
   `_compute_heterogeneity_test` still used pre-drop `n_obs <= n_params`,
   so the boundary case `n_obs == n_params > rank(design)` would NaN-
   fill in Python while R's `lm()` would alias-drop and fit with
   `df = n_obs - rank`. Replaced with `n_obs <= rank` short-circuit
   computed via `_detect_rank_deficiency` BEFORE the OLS call. Also
   added an X_het-aliased check after `solve_ols`: if the X_het
   column itself is the alias-dropped one (rare; pivoted QR keeps
   columns with larger norm), NaN-fill the inference tuple (matches
   R's lm() returning NA for aliased coefficients).

2. P2 (Tests). My initial test `test_heterogeneity_df_uses_post_drop_rank`
   built a panel where all switchers shared a single cohort, which
   collapsed the cohort dummies to zero columns and left a full-rank
   `[intercept, X_het]` regression. The test only re-verified the
   unchanged full-rank `df = n_obs - 2` path; it would have passed
   even if the post-drop-rank wiring never fired. Rewrote with a
   panel of 5 switchers across 4 unique cohorts where X_het = 1 on
   the F_g=3 reference cohort and 0 elsewhere, producing the exact
   collinearity `X_het = intercept - sum(non-reference cohort
   dummies)`. Design has 5 columns, rank 4. Pre-PR would short-
   circuit; post-PR fits with df=1. Test now also pins that
   `safe_inference(beta, se, df=n_obs - n_params)` would produce a
   DIFFERENT p_value, catching reversion to pre-drop n_params.
   Added a sibling `test_heterogeneity_underidentified_nan_fills`
   pinning that the `n_obs <= rank` guard still NaN-fills the
   degenerate case (4 obs, rank 4, df=0).

3. P3 (Docs). REGISTRY heterogeneity Note still said
   `df = n_obs - n_params` immediately before describing post-drop-
   rank behavior — internally inconsistent at the exact point users
   verify deviations. Rewrote to say `df = n_obs - rank(design)`
   consistently and mention the matching `n_obs <= rank` guard.
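The P1 boundary case above can be made concrete with a minimal sketch of the two guards. This shows guard logic only, with `numpy.linalg.matrix_rank` standing in for the library's `_detect_rank_deficiency` helper:

```python
import numpy as np

# Minimal n_obs == n_params > rank boundary case: three columns where the
# third is exactly intercept - x_het, so the design has 3 obs, 3 params, rank 2.
intercept = np.ones(3)
x_het = np.array([1.0, 0.0, 0.0])
dummy = intercept - x_het                        # exact collinearity
design = np.column_stack([intercept, x_het, dummy])

n_obs, n_params = design.shape
rank = int(np.linalg.matrix_rank(design))        # stand-in for _detect_rank_deficiency

pre_drop_guard = n_obs <= n_params   # True:  pre-fix Python would NaN-fill here
post_drop_guard = n_obs <= rank      # False: post-fix fits, like R's alias-dropping lm()
assert (pre_drop_guard, post_drop_guard) == (True, False)
assert n_obs - rank == 1             # the df R's df.residual would report
```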

317 tests pass (+1 new). Forward-horizon parity scenarios (20-23)
remain bit-identical at the same tolerances.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Two P3 informational findings from R2:

1. TODO.md Tier B backlog still listed the dCDH heterogeneity df-
   threading and by-path placebo predict_het items as open, but PR
   #449 closed both. Replaced the two stale bullets with a single
   bullet for the remaining survey + backward-horizon allocator
   derivation (the one Medium follow-up explicitly tracked in the
   wrap-up commit).

2. Three test-prose comments still said `df = n_obs - n_params`
   while the implementation and REGISTRY now use
   `df = n_obs - rank(design)`. Updated each comment to the
   post-drop rank wording; full-rank designs continue to have
   `rank == n_params` so the SE-derivation invariants under test
   are unchanged.

Comment-only drift; no behavior change. 317 tests still pass.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings.
  • The methodology-affecting change is ChaisemartinDHaultfoeuille.predict_het non-survey heterogeneity inference; the new df = n_obs - rank(design) logic and n_obs <= rank guard align with the project’s documented R-parity convention and the underlying solve_ols alias-drop behavior at diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261 and diff_diff/linalg.py:L736-L790, diff_diff/linalg.py:L1001-L1064.
  • The new rank-deficiency tests cover both the retained-rank boundary case and the truly underidentified case at tests/test_chaisemartin_dhaultfoeuille.py:L11529-L11684.
  • The by_path negative-baseline change is test-only and consistent with the already documented R deviation for multi-character baselines at tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378.
  • One P3 documentation inconsistency remains: the Methodology Registry still says the negative-baseline regression is deferred even though this PR adds it and removes the TODO.

Methodology

  • No findings. Affected methods: ChaisemartinDHaultfoeuille.predict_het non-survey heterogeneity inference and by_path negative-baseline tuple handling. The estimator change at diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261 is consistent with the Registry’s stated post-drop-rank convention and the library’s R-style rank-deficiency handling in solve_ols; the added coefs[1] NaN gate is also consistent with the existing NaN-on-aliased-coefficient contract.

Code Quality

  • No findings. The new OLS-path guard uses safe_inference(...) rather than inline inference math, and the NaN handling stays all-or-nothing for the heterogeneity surface.

Performance

  • No findings. The extra _detect_rank_deficiency call adds one QR pass per heterogeneity horizon, but that is proportionate to the small design sizes described in the surrounding comment and does not look material for this code path.

Maintainability

  • No findings. The implementation reuses the same rank detector as solve_ols, which reduces methodology drift versus hand-rolled df logic.

Tech Debt

  • No findings. The two TODO rows addressed by this PR were removed from TODO.md:L77-L80, and I did not find a new untracked correctness debt in the changed estimator path.

Security

  • No findings.

Documentation/Tests

  • P3: docs/methodology/REGISTRY.md:L643-L643 still says that a dedicated negative-baseline regression is deferred, but this PR adds test_negative_baseline_path_supported at tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378 and describes the gap as closed in CHANGELOG.md:L16-L16. Impact: future reviewers and users consulting the Registry will be told the (-1, 0, 0, 0) regime is still untested when it is now covered. Concrete fix: update the non-binary by_path note in REGISTRY.md to replace the “deferred to a follow-up” sentence with the new Python-only regression coverage statement.

Verification note: I could not execute the test suite in this environment because the local runtime does not have pytest or package dependencies such as numpy installed, so this assessment is from static review only.

…e deferral

Single CI codex R1 P3 finding: the non-binary `by_path` Note in
REGISTRY still said the negative-baseline regression (`(-1, 0, 0, 0)`)
was deferred, but this PR adds `test_negative_baseline_path_supported`
that covers exactly that regime. Updated the Note to list both
negative-D regression tests (non-baseline and baseline positions) as
shipped Python-only coverage.

No code/test change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: ca602b79deaa0df697885a41d1c599601afe37f7


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings.
  • Re-review item addressed: the prior stale REGISTRY/TODO language around negative-baseline by_path coverage is now resolved; the PR adds the Python-side regression and removes the deferred TODO (tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378, docs/methodology/REGISTRY.md:L643, TODO.md:L77-L80).
  • The methodology-affecting change in predict_het is consistent with the library’s existing QR alias-drop machinery: _compute_heterogeneity_test now uses n_obs <= rank for the small-sample guard and df = n_obs - rank(design) for non-survey inference, matching the same post-drop rank convention used by solve_ols (diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261, diff_diff/linalg.py:L94-L155, diff_diff/linalg.py:L736-L790, diff_diff/linalg.py:L1001-L1064).
  • The added heterogeneity tests cover both the retained-rank boundary case and the genuinely underidentified case (tests/test_chaisemartin_dhaultfoeuille.py:L11529-L11684).
  • One P3 documentation wording issue remains: a newly updated note/comment calls this non-survey parity path “WLS df” even though the shipped estimator path is OLS.

Methodology

No findings. The affected methods are ChaisemartinDHaultfoeuille.predict_het non-survey heterogeneity inference and by_path negative-baseline tuple handling. The estimator change at diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261 is aligned with the shared post-drop-rank logic in solve_ols, and the negative-baseline behavior remains a documented Python-vs-R deviation rather than an undocumented mismatch.

Code Quality

No findings. The changed inference path still uses safe_inference and preserves the all-or-nothing NaN contract when the heterogeneity coefficient is unidentified (diff_diff/chaisemartin_dhaultfoeuille.py:L5239-L5261).

Performance

No findings. The extra rank-detection pass is an added QR per horizon, but it is on the same small design that solve_ols already analyzes and does not look material for this code path.

Maintainability

No findings. Reusing _detect_rank_deficiency here reduces drift between the heterogeneity surface and the library’s central rank-deficiency handling.

Tech Debt

No findings. The PR properly retires the previously tracked low-priority follow-ups for negative-baseline path coverage and pre-drop heterogeneity df handling, while leaving the still-underived survey + backward-horizon heterogeneity limitation tracked in TODO.md.

Security

No findings.

Documentation/Tests

  • Severity P3 — docs/methodology/REGISTRY.md:L643 and tests/test_chaisemartin_dhaultfoeuille_parity.py:L1412-L1416 describe the non-survey predict_het parity path as using “WLS df,” but the implementation and surrounding methodology text describe this surface as OLS (diff_diff/chaisemartin_dhaultfoeuille.py:L5231-L5261, docs/methodology/REGISTRY.md:L637). Impact: low-risk wording drift that could mislead future reviewers about which regression surface is being parity-pinned. Concrete fix: change the new “WLS df” wording to “OLS df” in the REGISTRY paragraph and the parity-test comment.

Verification note: pytest is not installed in this environment, so this review is based on static inspection only.

Single CI codex R2 P3 finding: the non-survey heterogeneity parity
path uses OLS (only the survey path uses WLS), but two recently-
added comments described its df source as "WLS df". Updated:
- REGISTRY heterogeneity Note: "df from the WLS regression" -> "df
  from the OLS regression".
- Parity test inline comment: "WLS df" -> "OLS df".

Wording drift only; no behavior or test contract change.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions

🔁 AI review rerun (requested by @igerber)

Head SHA: 07791f359025946c3bd7cb5483d428af9e019bbd


Overall Assessment

✅ Looks good

Executive Summary

  • No unmitigated P0/P1 findings.
  • Re-review item addressed: the prior OLS/WLS wording drift is fixed in the REGISTRY and parity comment (docs/methodology/REGISTRY.md:L643, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1412-L1419).
  • The predict_het change now uses post-drop rank for both the small-sample guard and inference df, aligned with the shared QR alias-drop solver and covered by new boundary-case regressions (diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261, tests/test_chaisemartin_dhaultfoeuille.py:L11529-L11684).
  • The negative-baseline by_path follow-up is handled as a documented Python-vs-R deviation, with the previously missing regression test now added (tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378, docs/methodology/REGISTRY.md:L643).
  • The two low-priority TODO items this PR targeted are appropriately retired, while the separate survey + backward-horizon allocator gap remains tracked (TODO.md:L80, TODO.md:L187).

Methodology

  • Severity P3 — Documented R deviation: negative-baseline by_path paths. Impact: informational only; the R helper derives baseline_XX with substr(path, 1, 1), so paths like (-1, 0, 0, 0) are a known R bug regime, not a Python defect. This PR correctly keeps the behavior documented in REGISTRY and adds Python-side regression coverage instead of forcing parity. Concrete fix: none. docs/methodology/REGISTRY.md:L643, tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378. (rdrr.io)

No other methodology findings. The cited paper’s Lemma 7 is the heterogeneity-testing result, and the REGISTRY still documents Python as a partial implementation that does not attempt the joint F-test. On the changed path itself, the CRAN reference implementation builds the heterogeneity CI critical value from qt(..., df.residual(model)); R’s lm/summary.lm docs define residual d.f. relative to the fitted model’s rank/non-aliased coefficient count, so the PR’s switch from pre-drop n_obs - n_params to post-drop n_obs - rank(design) is consistent with the source material and not an undocumented deviation (diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5261, diff_diff/linalg.py:L94-L155, diff_diff/linalg.py:L736-L790, diff_diff/linalg.py:L930-L1064, docs/methodology/REGISTRY.md:L637-L643). (elearning.unito.it)

Code Quality

  • No findings. The new non-survey path still uses safe_inference, and the added coefs[1] alias check preserves all-or-nothing NaN inference when the heterogeneity coefficient itself is dropped (diff_diff/chaisemartin_dhaultfoeuille.py:L5239-L5261).

Performance

  • No findings. The extra _detect_rank_deficiency call adds one QR per horizon, but it runs on the same small per-horizon design matrix already fed to solve_ols, and nothing in the diff suggests a material regression (diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5211).

Maintainability

  • No findings. Reusing _detect_rank_deficiency keeps the guard and df calculation aligned with the shared linear-algebra behavior instead of introducing a second rank convention (diff_diff/chaisemartin_dhaultfoeuille.py:L5193-L5218, diff_diff/linalg.py:L94-L155).

Tech Debt

  • Severity P3 — Deferred nearby work remains properly tracked. Impact: informational only; this PR retires the two low-priority follow-ups it targeted, and the remaining survey + backward-horizon predict_het allocator derivation is still explicitly tracked in TODO.md, so there is no untracked correctness debt introduced here. Concrete fix: none in this PR. TODO.md:L80, TODO.md:L187.

Security

  • No findings. The diff is limited to estimator logic, docs, tests, changelog, and TODO tracking.

Documentation/Tests

  • No findings. The prior re-review wording issue is fixed, and the new tests cover both sides of the rank-deficiency boundary (n_obs == n_params > rank vs. n_obs <= rank) plus the previously missing negative-baseline by_path tuple case (tests/test_chaisemartin_dhaultfoeuille.py:L11529-L11684, tests/test_chaisemartin_dhaultfoeuille.py:L8315-L8378, tests/test_chaisemartin_dhaultfoeuille_parity.py:L1412-L1419).

Verification note: static inspection only. Test execution was not possible in this sandbox because the Python environment is missing numpy/pytest.

@igerber igerber added the ready-for-ci Triggers CI test workflows label May 16, 2026
@igerber igerber merged commit b926a10 into main May 16, 2026
33 of 34 checks passed
@igerber igerber deleted the dcdh-heterogeneity-wrapup branch May 16, 2026 13:34