diff --git a/CHANGELOG.md b/CHANGELOG.md index 4e17a329..4ce1a45e 100644 --- a/CHANGELOG.md +++ b/CHANGELOG.md @@ -11,7 +11,9 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0 - **`ChaisemartinDHaultfoeuille.predict_het` × `placebo`: R-parity on both global and per-path surfaces.** R-verified — `did_multiplegt_dyn(predict_het, placebo)` emits heterogeneity OLS results on backward (placebo) horizons via R's `DIDmultiplegtDYN:::did_multiplegt_main` placebo block (`effect = matrix(-i, ...)` rbind site); the same block runs per-by_level under `did_multiplegt_dyn(by_path, predict_het, placebo)`, so both global `res$results$predict_het` and per-by_level `res$by_level_i$results$predict_het` slots emit backward rows. R's predict_het syntax with `placebo > 0` requires the `c(-1)` sentinel in the horizon vector to trigger "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions" — passing positive-only horizons errors with "specified numbers in predict_het that exceed the number of placebos". Python mirrors via `_compute_heterogeneity_test(..., placebo=L_max)` (set automatically from `self.placebo` at both global and per-path call sites in `fit()`) — the function iterates forward (1..L_max) and backward (-1..-L_max) horizons in a single loop with an explicit `out_idx < 0` eligibility guard for backward horizons whose `F_g` is too small (would otherwise silently misread `N_mat` via numpy negative indexing). `results.heterogeneity_effects` uses negative-int keys for backward horizons; `path_heterogeneity_effects` does the same per path. Placebo rows in `to_dataframe(level="by_path")` have non-NaN `het_*` columns when `placebo=True` and `heterogeneity=` are both set. 
**Survey gate (warn + skip):** `survey_design + placebo + heterogeneity` emits a `UserWarning` at fit-time and falls back to forward-horizon-only heterogeneity on both surfaces — the Binder TSL cell-period allocator's REGISTRY justification is tied to **post-period** attribution; backward-horizon attribution puts ψ_g mass on a pre-period cell, a separate library-extension claim that needs its own derivation. Forward-horizon `predict_het + survey_design` continues to work unchanged on both global and per-path surfaces. The function-level `_compute_heterogeneity_test` keeps a per-iteration `NotImplementedError` backstop for direct callers that bypass fit(). Pre-period allocator derivation deferred to a follow-up methodology PR (tracked in TODO.md). R parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneityWithPlacebo` (scenario 23, `multi_path_reversible_predict_het_with_placebo_global`, `placebo=2, effects=3, no by_path`) and `::TestDCDHDynRParityByPathHeterogeneityWithPlacebo` (scenario 22, same DGP plus `by_path=3`); pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` for `beta` / `se` / `t_stat` / `n_obs` and `INFERENCE_RTOL=1e-4` for `p_value` / `conf_int` across 3 paths × (3 forward + 2 placebo) = 15 horizons + 1 global × 5 horizons. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo` (placebo het column population, survey-gate warn+skip behavior, forward+survey anti-regression, `out_idx<0` eligibility guard, single-path telescope `path_heterogeneity_effects[(only_path,)] == heterogeneity_effects` bit-exactly, summary rendering, direct-call `NotImplementedError` backstop). Closes TODO #422. 
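To illustrate why the `out_idx < 0` eligibility guard matters for backward horizons: NumPy treats a negative index as counting from the end of the axis, so an out-of-panel backward lookup returns a value from the wrong period instead of raising. The sketch below is a minimal illustration only. The helper name `backward_cell_count`, the arguments `n_row` / `f_g` / `lag`, and the index formula `out_idx = f_g - 1 - lag` are assumptions made for this example and may not match the indexing inside `_compute_heterogeneity_test`.

```python
import numpy as np


def backward_cell_count(n_row, f_g, lag):
    """Return the cell count `lag` periods before the reference period.

    Hypothetical helper: `f_g` is a 0-based first-switch column, the
    reference period is column `f_g - 1`, and the backward outcome column is
    assumed to be `f_g - 1 - lag`. Returns None when that column falls
    before the start of the panel.
    """
    out_idx = f_g - 1 - lag
    if out_idx < 0:
        # Without this guard, n_row[out_idx] would wrap around and read a
        # column counted from the END of the row.
        return None
    return int(n_row[out_idx])


n_row = np.array([5, 6, 7, 8, 9])

# Unguarded lookup for f_g=1, lag=2: index -2 silently reads the
# second-to-last column rather than failing.
unguarded = n_row[1 - 1 - 2]

# Guarded lookup for the same group reports "not eligible" instead.
guarded = backward_cell_count(n_row, f_g=1, lag=2)

# A group that switches late enough has a valid backward column.
eligible = backward_cell_count(n_row, f_g=3, lag=1)
```

The design point is that the unguarded read yields a plausible-looking number, so only an explicit eligibility check can distinguish a valid backward horizon from a wrapped one.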
### Changed -- **`ChaisemartinDHaultfoeuille.predict_het` inference: t-distribution df threading (closes TODO pilot-412).** `_compute_heterogeneity_test` now passes `df = n_obs - n_params` to `safe_inference` on the non-survey OLS path, matching R `did_multiplegt_dyn(predict_het=...)`'s t-distribution inference (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). Pre-PR Python used `df=None` (normal Z critical), producing 0.1-2% rtol gaps on `p_value` and `conf_int` vs R. Parity tolerance tightened on the existing forward-horizon scenarios (`multi_path_reversible_predict_het`, `multi_path_reversible_by_path_predict_het`) from "unpinned" to `INFERENCE_RTOL=1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5`. **Rank-deficient caveat:** `n_params = design.shape[1]` is the pre-drop column count; under near-rank-deficient designs that `solve_ols` retains rather than NaN-out, the actual rank may be lower than `n_params` (R's `df.residual` uses post-drop rank). Fully rank-deficient designs are NaN-filled by the existing short-circuit; the gap only affects near-rank-deficient edge cases (tracked as a Low TODO follow-up). The Z-vs-t REGISTRY deviation note is replaced with an "R parity (post-2026-05-15 df threading)" positive-claim note. +- **`ChaisemartinDHaultfoeuille.predict_het` inference: t-distribution df threading (closes TODO pilot-412).** `_compute_heterogeneity_test` now passes `df = n_obs - rank(design)` to `safe_inference` on the non-survey OLS path, matching R `did_multiplegt_dyn(predict_het=...)`'s t-distribution inference (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). Pre-PR Python used `df=None` (normal Z critical), producing 0.1-2% rtol gaps on `p_value` and `conf_int` vs R. 
Parity tolerance tightened on the existing forward-horizon scenarios (`multi_path_reversible_predict_het`, `multi_path_reversible_by_path_predict_het`) from "unpinned" to `INFERENCE_RTOL=1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5`. **Post-drop rank (post-2026-05-16 wrap-up):** the df denominator uses the post-drop numerical rank via `_detect_rank_deficiency`, which `solve_ols` already calls internally. For full-rank designs `rank == n_params` and behavior is bit-identical to the pre-PR `n_obs - n_params` path; for near-rank-deficient designs that `solve_ols` retains rather than NaN-out (e.g., cohort-collinearity at high horizons), the post-drop rank is strictly lower and the post-PR `df` is larger, matching R's `lm()` convention. The Z-vs-t REGISTRY deviation note is replaced with an "R parity (post-2026-05-15 df threading)" positive-claim note. + +- **`ChaisemartinDHaultfoeuille.by_path` negative-baseline path regression coverage.** New `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_baseline_path_supported` exercises switchers with `D_{g,1} = -1` and asserts that `path_effects` correctly contains negative-baseline tuple keys (e.g., `(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`). This closes the test-coverage gap from PR #419: the existing `test_negative_integer_D_supported` only covered paths with negative values in non-baseline positions (e.g., `(0, -1, -1, -1)`), which does not trigger R's documented `substr(path, 1, 1)` baseline-extraction bug. Python's tuple-key matching is correct under any baseline value; this test pins the contract. No R-parity fixture is added because R is the buggy side on this regime — the deviation is documented in the REGISTRY non-binary treatment Note. ## [3.3.3] - 2026-05-15 diff --git a/TODO.md b/TODO.md index 42ddd3a5..13e39cdc 100644 --- a/TODO.md +++ b/TODO.md @@ -77,9 +77,7 @@ Deferred items from PR reviews that were not addressed before merge. 
| dCDH: Phase 1 per-period placebo DID_M^pl has NaN SE (no IF derivation for the per-period aggregation path). Multi-horizon placebos (L_max >= 1) have valid SE. | `chaisemartin_dhaultfoeuille.py` | #294 | Low | | dCDH: Survey cell-period allocator's post-period attribution is a library convention, not derived from the observation-level survey linearization. MC coverage is empirically close to nominal on the test DGP; a formal derivation (or a covariance-aware two-cell alternative) is deferred. Documented in REGISTRY.md survey IF expansion Note. | `chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | #408 | Medium | | dCDH: Parity test SE/CI assertions only cover pure-direction scenarios; mixed-direction SE comparison is structurally apples-to-oranges (cell-count vs obs-count weighting). | `test_chaisemartin_dhaultfoeuille_parity.py` | #294 | Low | -| dCDH by_path: negative-baseline path regression (e.g. `(-1, 0, 0, 0)`) is not yet exercised. The existing negative-D test (`test_negative_integer_D_supported`) only covers paths with negative values in non-baseline positions like `(0, -1, -1, -1)`, which does not trigger the R `substr(path, 1, 1)` bug regime (the bug needs a multi-character baseline). Add a switcher fixture with `D_{g,1} = -1` and assert the resulting path tuple key. | `tests/test_chaisemartin_dhaultfoeuille.py` | #419 | Low | | dCDH by_path: survey-aware backward-horizon (`placebo + predict_het + survey_design`) raises `NotImplementedError` because the Binder TSL cell-period allocator's REGISTRY justification is tied to post-period attribution. Backward horizons would put ψ_g mass on a pre-period cell. Deriving the pre-period cell allocator (or adding a covariance-aware two-cell alternative) is deferred to a follow-up methodology PR. 
| `diff_diff/chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md` | follow-up | Medium | -| dCDH heterogeneity: rank-deficient designs use `df = n_obs - n_params` (pre-drop column count) in the t-distribution inference. R's `lm(predict_het=...)` uses `df.residual = n - rank(design)` post-drop. Fully rank-deficient designs are NaN-filled by the rank-deficient short-circuit at `_compute_heterogeneity_test:5141-5150`, so the gap only affects near-rank-deficient designs where `solve_ols` retains the design. Thread actual rank from `solve_ols` to close the gap. | `diff_diff/chaisemartin_dhaultfoeuille.py` | follow-up | Low | | CallawaySantAnna: consider materializing NaN entries for non-estimable (g,t) cells in group_time_effects dict (currently omitted with consolidated warning); would require updating downstream consumers (event study, balance_e, aggregation) | `staggered.py` | #256 | Low | | ImputationDiD dense `(A0'A0).toarray()` scales O((U+T+K)^2), OOM risk on large panels | `imputation.py` | #141 | Medium (deferred — only triggers when sparse solver fails) | | Multi-absorb weighted demeaning needs iterative alternating projections for N > 1 absorbed FE with survey weights; unweighted multi-absorb also uses single-pass (pre-existing, exact only for balanced panels) | `estimators.py` | #218 | Medium | @@ -186,8 +184,7 @@ Ordered paydown view across the tables above. 
Tier A → D is by effort × risk, - WooldridgeDiD: optional `weights="cohort_share"` on `aggregate()` (`wooldridge_results.py`) - HAD survey-design API consolidation: drop deprecated `survey=`/`weights=` kwargs (`had.py`, `had_pretests.py`; gated on next minor bump) - Survey-design resolution / collapse helper extraction across `continuous_did.py`, `efficient_did.py`, `stacked_did.py` -- dCDH heterogeneity df threading: t-distribution at heterogeneity surface (or formalize the tolerance constant) (`chaisemartin_dhaultfoeuille.py`) -- dCDH by_path placebo `predict_het` parity vs R `did_multiplegt_dyn(..., by_path, predict_het)` (`chaisemartin_dhaultfoeuille.py`, `chaisemartin_dhaultfoeuille_results.py`) +- dCDH survey + backward-horizon `predict_het` allocator derivation: lift the warn-and-skip fallback at `_compute_heterogeneity_test` once the pre-period Binder TSL cell-period allocator is derived (currently the gate emits a `UserWarning` and falls back to forward-horizon-only heterogeneity under `survey_design + placebo + heterogeneity`) (`chaisemartin_dhaultfoeuille.py`, `docs/methodology/REGISTRY.md`) - Rust local-method solver path unification to `solve_wls_svd` + bootstrap-weight RNG parity audit (`rust/src/trop.rs`, `rust/src/bootstrap.rs`) - AI review CI workflow-contract pin test expansion (`tests/test_openai_review.py`) - In-site Sphinx render of `REPORTING.md` and `REGISTRY.md` (`docs/conf.py` + `:doc:` link migration) diff --git a/diff_diff/chaisemartin_dhaultfoeuille.py b/diff_diff/chaisemartin_dhaultfoeuille.py index a1485efd..cea7c91d 100644 --- a/diff_diff/chaisemartin_dhaultfoeuille.py +++ b/diff_diff/chaisemartin_dhaultfoeuille.py @@ -5190,9 +5190,32 @@ def _compute_heterogeneity_test( else: design = np.hstack([intercept, x_arr]) - # Guard: need more observations than parameters + # Compute post-drop numerical rank for both the small-sample + # guard and the t-distribution df. 
`_detect_rank_deficiency` is + # the same helper `solve_ols` calls internally; calling it + # explicitly here lets the guard use post-drop rank (matching + # R's `df.residual = n_obs - rank(design)` convention from + # `DIDmultiplegtDYN:::did_multiplegt_main` `qt(0.975, + # df.residual(model))`) instead of pre-drop column count, + # which would incorrectly short-circuit cases where solve_ols's + # R-style alias drop leaves `n_obs > rank > 0` (e.g., cohort- + # dummy collinearity at high horizons). For full-rank designs + # `rank == n_params` and behavior is bit-identical to the + # pre-PR `n_obs - n_params` path. The extra O(nk^2) cost is + # negligible at heterogeneity scale (k = intercept + X_het + + # cohort dummies, typically 5-30 columns; n = path-switcher + # count, typically 30-300 groups). + from diff_diff.linalg import _detect_rank_deficiency + n_params = design.shape[1] - if n_obs <= n_params: + rank, _dropped, _pivot = _detect_rank_deficiency(design) + + # Guard: need MORE observations than rank for a well-defined + # residual df. When `n_obs <= rank`, the regression has zero + # residual df (perfect fit or under-identified) and inference + # is undefined. This is the rank-based replacement for the + # pre-PR `n_obs <= n_params` short-circuit. + if n_obs <= rank: results[l_h] = { "beta": float("nan"), "se": float("nan"), @@ -5203,27 +5226,39 @@ def _compute_heterogeneity_test( } continue + df_ols = int(n_obs) - int(rank) + if not use_survey: - # Plain OLS path: standard inference per Lemma 7. df is the - # pre-drop column count (n_obs - n_params); matches R's - # did_multiplegt_dyn(predict_het=...) which uses the - # t-distribution with df = n - k from the OLS regression - # (DIDmultiplegtDYN:::did_multiplegt_main `t_stat <- qt(0.975, - # df.residual(model))` site). Under near-rank-deficient - # designs that solve_ols retains rather than NaN-out, n_params - # may exceed actual rank; see TODO row for the deferred - # rank-tracking follow-up. 
+ # Plain OLS path: standard inference per Lemma 7. coefs, _residuals, vcov = solve_ols( design, dep_arr, return_vcov=True, rank_deficient_action=rank_deficient_action, ) + # Under rank-deficient designs solve_ols R-style-drops one + # or more columns and NaN-fills their coefs. If `X_het` + # (column index 1) is the dropped column, the heterogeneity + # coefficient is unidentified — NaN-fill the inference + # tuple (matches R's lm() returning NA for aliased + # coefficients). Other columns being dropped is fine: the + # X_het coefficient and its (1,1) vcov entry remain + # identified, df_ols already reflects the post-drop rank. + if not np.isfinite(coefs[1]): + results[l_h] = { + "beta": float("nan"), + "se": float("nan"), + "t_stat": float("nan"), + "p_value": float("nan"), + "conf_int": (float("nan"), float("nan")), + "n_obs": n_obs, + } + continue beta_het = float(coefs[1]) se_het = float("nan") if vcov is not None and np.isfinite(vcov[1, 1]) and vcov[1, 1] > 0: se_het = float(np.sqrt(vcov[1, 1])) - t_stat, p_val, ci = safe_inference(beta_het, se_het, alpha=alpha, df=n_obs - n_params) + t_stat, p_val, ci = safe_inference(beta_het, se_het, alpha=alpha, df=df_ols) else: # Survey-aware path: WLS with per-group weights + TSL IF variance. W_elig = W_g_all[eligible] diff --git a/docs/methodology/REGISTRY.md b/docs/methodology/REGISTRY.md index 099b8059..2f04566d 100644 --- a/docs/methodology/REGISTRY.md +++ b/docs/methodology/REGISTRY.md @@ -640,7 +640,7 @@ The guard is fired by `_survey_se_from_group_if` (analytical and replicate) and - **Note (Phase 3 Design-2 switch-in/switch-out):** Convenience wrapper for Web Appendix Section 1.6 (Assumption 16). Identifies groups with exactly 2 treatment changes (join then leave), reports switch-in and switch-out mean effects. This is a descriptive summary, not a full re-estimation with specialized control pools as described in the paper. 
**Always uses raw (unadjusted) outcomes** regardless of active `controls`, `trends_linear`, or `trends_nonparam` options - those adjustments apply to the main estimator surface but not to the Design-2 descriptive block. For full adjusted Design-2 estimation with proper control pools, the paper recommends "running the command on a restricted subsample and using `trends_nonparam` for the entry-timing grouping." Activated via `design2=True` in `fit()`, requires `drop_larger_lower=False` to retain 2-switch groups. -- **Note (Phase 3 `by_path` per-path event-study disaggregation):** Per-path disaggregation of the multi-horizon event study, mirroring R `did_multiplegt_dyn(..., by_path=k)`. Activated via `ChaisemartinDHaultfoeuille(by_path=k, drop_larger_lower=False)` where `k` is a positive integer (top-k most common observed paths by switcher-group frequency). **Window convention:** the path tuple for a switcher group `g` is `(D_{g, F_g-1}, D_{g, F_g}, ..., D_{g, F_g-1+L_max})` — length `L_max + 1`, matching R's window `[F_{g-1}, F_{g-1+l}]`. **Ranking:** paths are ranked by descending frequency; ties are broken lexicographically on the path tuple for deterministic ordering, so every selected path has a unique `frequency_rank`. If `by_path` exceeds the number of observed paths, all observed paths are returned with a `UserWarning`. **Per-path SE convention (joiners/leavers precedent):** the per-path influence function follows the joiners-only / leavers-only IF construction at `chaisemartin_dhaultfoeuille.py:5495-5504`: the switcher-side contribution `+S_g * (Y_{g,out} - Y_{g,ref})` is zeroed for groups whose observed trajectory is NOT the selected path; control contributions and the full cohort structure `(D_{g,1}, F_g, S_g)` are unchanged. 
After applying the singleton-baseline eligible mask and cohort-recentering with the original cohort IDs, the plug-in SE uses the path-specific divisor `N_l_path` (count of path switchers eligible at horizon `l`) — same pattern as `joiners_se` using `joiner_total`. This gives the **within-path mean** estimand `DID_{path,l}` as the within-path average of `DID_{g,l}`. **Degenerate-cohort behavior per path:** when a path's centered IF at some horizon is identically zero (every variance-eligible path switcher forms its own `(D_{g,1}, F_g, S_g)` cohort, or the path has a single contributing group), SE / t_stat / p_value / conf_int are NaN-consistent and a `UserWarning` is emitted scoped to `(path, horizon)`. This mirrors the overall-path degenerate-cohort surface and is common for rare paths with few contributing groups. **Empty-state contract:** `results.path_effects` distinguishes "not requested" (`None`) from "requested but empty" (`{}` — all switchers have windows outside the panel or unobserved cells). The empty-dict case emits a `UserWarning` at fit-time and renders as an explicit "no observed paths" notice in `summary()`; `to_dataframe(level="by_path")` returns an empty DataFrame with the canonical column set (mirrors the `linear_trends` pattern when `trends_linear=True` but no horizons survive). **Requirements:** `drop_larger_lower=False` (multi-switch groups are the object of interest; default `True` filters them out) and `L_max >= 1` (path window depends on the horizon). **Scope:** combinations with `design2` and `honest_did` remain gated behind explicit `NotImplementedError` (deferred to follow-up wave PRs); `heterogeneity` is supported per-path — see the **Per-path heterogeneity testing** paragraph below. `n_bootstrap > 0` is now supported — see the **Bootstrap SE** paragraph below. 
`survey_design` is supported under analytical Binder TSL and replicate-weight bootstrap — see the **Per-path survey-design SE** paragraph below; multiplier bootstrap (`n_bootstrap > 0`) under `survey_design + by_path/paths_of_interest` remains gated. `placebo=True` is now supported per-path — see the **Per-path placebos** paragraph below. **TWFE diagnostic** remains a sample-level summary (not computed per path) in this release. Results are exposed on `results.path_effects` as `Dict[Tuple[int, ...], Dict[str, Any]]` with nested `horizons` dicts per horizon `l`, and on `results.to_dataframe(level="by_path")` as a long-format table with columns `[path, frequency_rank, n_groups, horizon, effect, se, t_stat, p_value, conf_int_lower, conf_int_upper, n_obs, cband_lower, cband_upper, cumulated_effect, cumulated_se, het_beta, het_se, het_t_stat, het_p_value, het_conf_int_lower, het_conf_int_upper]` (the `cband_*` columns are added by the joint sup-t Note below, populated for positive-horizon rows of paths with a finite sup-t crit and NaN otherwise; the `cumulated_*` columns are added by the per-path linear-trends Note below, populated for positive-horizon rows when `trends_linear=True` is set and NaN otherwise). Gated tests live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathGates` / `::TestByPathBehavior` / `::TestByPathEdgeCases`. **R-parity** against `DIDmultiplegtDYN 2.3.3` is confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath` via two scenarios: `mixed_single_switch_by_path` (2 paths, `by_path=2`) and `multi_path_reversible_by_path` (4 paths, `by_path=3`; path-assignment deterministic on `F_g` so each `(D_{g,1}, F_g, S_g)` cohort contains switchers from a single path). Per-path point estimates and per-path switcher counts match R exactly; per-path SE matches within the Phase 2 multi-horizon SE envelope (observed rtol ≤ 10.2% on the 2-path mixed scenario, ≤ 4.2% on the 4-path cohort-clean scenario). 
**Deviation from R (cross-path cohort-sharing SE):** our analytical SE is the marginal variance of the path-contribution estimator cohort-centered on the *full-panel* cohort structure (joiners/leavers precedent — non-path switchers contribute to cohort means via their zeroed switcher row). R's `did_multiplegt_dyn(..., by_path=k)` re-runs the estimator per path, so cohort means are computed over the path's own switchers only. When a cohort `(D_{g,1}, F_g, S_g)` spans multiple observed paths, Python and R SE diverge materially (our empirical probes with random post-window toggling saw rtol > 100%); when every cohort is single-path (scenario 13 by design, scenario 14 by construction), the two approaches coincide up to the documented Phase 2 envelope. Practitioners with cohort structures that mix paths should interpret the per-path SE as a within-full-panel marginal variance, not a per-path conditional variance. **Bootstrap SE:** when `n_bootstrap > 0` is set, the top-k paths are enumerated once on the observed data (R-faithful: matches `did_multiplegt_dyn(..., by_path=k, bootstrap=B)`'s path-stability convention — verified empirically against DIDmultiplegtDYN 2.3.3) and the multiplier bootstrap (`bootstrap_weights ∈ {"rademacher", "mammen", "webb"}`) runs per `(path, horizon)` target via the shared `_bootstrap_one_target` / `compute_effect_bootstrap_stats` helpers. Point estimates are unchanged from the analytical path. Bootstrap SE replaces the analytical SE in `path_effects[path]["horizons"][l]["se"]`, and `p_value` / `conf_int` are taken as the **bootstrap percentile** statistics, matching the Round-10 library convention for overall / joiners / leavers / multi-horizon bootstrap (see the `Note (bootstrap inference surface)` elsewhere in this file and the pinned regression `test_bootstrap_p_value_and_ci_propagated_to_top_level`). `t_stat` is SE-derived via `safe_inference` per the anti-pattern rule. Interpretation: inference is *conditional on the observed path set*. 
**SE inherits the analytical cross-path cohort-sharing deviation:** the bootstrap input is the exact same full-panel cohort-centered path IF that the analytical path computes (`_collect_path_bootstrap_inputs` reuses the same enumeration / cohort IDs / IF construction), so the bootstrap SE is a Monte Carlo analog of the analytical SE — it inherits the same cross-path cohort-sharing deviation from R's per-path re-run convention documented above. On single-path-cohort panels (scenarios 13 and 14 of the R-parity fixture, and any DGP where `(D_{g,1}, F_g, S_g)` cohorts never span multiple observed paths), bootstrap SE tracks analytical SE up to Monte Carlo noise and both coincide with R up to the Phase 2 envelope. On cross-path cohort panels, bootstrap SE inherits the >100% rtol divergence from R that analytical already has. **Deviation from R (CI method):** R's per-path CI is normal-theory around the bootstrap SE (half-width ≈ `1.96·se`); ours is the bootstrap percentile CI, intentionally diverging from R to keep the dCDH inference surface internally consistent across all bootstrap targets. Practitioners who want *unconditional* inference capturing path-selection uncertainty need a pairs-bootstrap (deferred — no R precedent). Positive regressions live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathBootstrap` (gated `@pytest.mark.slow`): point-estimate invariance, finite positive SE on non-degenerate panels, SE-within-30%-rtol of analytical on cohort-clean fixtures, degenerate-cohort NaN propagation, Rademacher/Mammen/Webb parity, seed reproducibility, and percentile-vs-normal-theory CI pinning. 
**Per-path placebos:** when `placebo=True` (and `L_max >= 1`) is combined with `by_path=k`, per-path backward-horizon placebos `DID^{pl}_{path, l}` for `l = 1..L_max` are computed using the same joiners/leavers IF precedent applied to `_compute_per_group_if_placebo_horizon` (with the new `switcher_subset_mask` parameter): switcher contributions are zeroed for groups not in the path; the control pool and the variance-eligible cohort structure `(D_{g,1}, F_g, S_g)` are unchanged. Plug-in SE uses the path-specific divisor `N^{pl}_{l, path}` (count of path switchers eligible at backward lag `l`). Surfaced on `results.path_placebo_event_study[path][-l]` with the same `{effect, se, t_stat, p_value, conf_int, n_obs}` shape as `placebo_event_study` (negative-int inner keys parallel the existing per-path event-study positive-int keys, so a unified forward+backward view is well-formed). **Inherits the cross-path cohort-sharing SE deviation from R** documented above for `path_effects` (same convention applied backward); tracks R within numerical tolerance on single-path-cohort panels and diverges on cohort-mixed panels. Multiplier bootstrap (when `n_bootstrap > 0`) runs per `(path, lag)` target via the same `_bootstrap_one_target` dispatch used for the per-path event-study, with the canonical NaN-on-invalid contract. The bootstrap SE is a Monte Carlo analog of the analytical placebo SE — same per-path centered IF input — and inherits the same deviation. Surfaced through `summary()` (negative-keyed rows rendered alongside positive-keyed event-study rows under each path block) and `to_dataframe(level="by_path")` (`horizon` column takes negative ints for placebo rows). **Empty-state contract:** `results.path_placebo_event_study` mirrors `path_effects` — `None` when `by_path + placebo` was not requested, `{}` when requested but no observed path has a complete window within the panel (same regime that returns `{}` for `path_effects`, with the same fit-time `UserWarning`). 
R-parity is confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo` on the `multi_path_reversible_by_path_placebo` scenario; positive analytical + bootstrap invariants live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo` (with the gated `::TestByPathPlacebo::TestBootstrap` subclass). **Per-path covariate residualization (DID^X):** when `controls=[...]` is set with `by_path=k`, the per-baseline OLS residualization (Web Appendix Section 1.2) runs once on the first-differenced outcome BEFORE path enumeration. All four downstream surfaces — analytical per-path SE, bootstrap SE, per-path placebos, and per-path joint sup-t bands — consume the residualized `Y_mat` automatically (Frisch-Waugh-Lovell). Per-period effects remain unadjusted, consistent with the existing `controls` + per-period DID contract (per-period DID does not support residualization). Failed-stratum baselines (rank-deficient X) zero out `N_mat` for affected groups, which the path enumeration treats as ineligible per its existing convention. **Deviation from R on multi-baseline switcher panels (point estimates):** R `did_multiplegt_dyn(..., by_path, controls)` re-runs the per-baseline residualization on each path's restricted subsample (`R/R/did_multiplegt_dyn.R` lines 401-405: rows of the path's switchers OR rows where `yet_to_switch=1 AND baseline matches the path's baseline`). The first-stage residualization sample R uses for path B equals: pre-switch rows of all switchers with matching baseline + all rows of never-switchers with matching baseline — bit-identical to our global first-stage sample under single-baseline switcher panels (every switcher shares the same `D_{g,1}`, regardless of how `F_g` or path identity varies across switchers). 
Per-path point estimates therefore coincide with R on those panels up to the existing **DID^X first-stage cell-weighting deviation** documented above in `Note (Phase 3 DID^X covariate adjustment)` (Python's first-stage OLS uses equal cell weights — one observation per `(g, t)` cell, consistent with the library's cell-aggregated input convention; R weights by `N_gt`). On panels with one observation per `(g, t)` cell (the common case after the cell-aggregation step in `fit()`), Python matches R bit-exactly: the `multi_path_reversible_by_path_controls` parity fixture has 4 paths with switcher `F_g` values spanning [0..6] under `D_{g,1}=0` and Python matches R to rtol ~1e-11. On multi-baseline switcher panels (some switchers have `D_{g,1}=0`, others have `D_{g,1}=1`) R's per-path subset drops switchers whose baseline differs from the path's baseline, so the per-baseline regression coefficients diverge per path under R and point estimates can diverge between Python and R — a `UserWarning` is emitted at fit-time when this configuration is detected so practitioners do not silently consume estimates that disagree with R. The warning filters to switcher groups only; never-switchers (never-treated + always-treated controls) at multiple baseline values do NOT trigger the warning because they don't affect R's per-path subset construction. **Inherits the cross-path cohort-sharing SE deviation from R** documented above for `path_effects` — bootstrap SE, placebo SE, and sup-t crit are Monte Carlo / joint-distribution analogs of the same residualized analytical IF and carry the same deviation. 
R-parity is confirmed against `did_multiplegt_dyn(..., by_path=3, controls="X1")` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathControls` on the `multi_path_reversible_by_path_controls` scenario (single-baseline DGP, exact point-estimate match measured rtol ~1e-11); cross-surface inheritance and the multi-baseline warning are regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathControls` (analytical + bootstrap + placebo + sup-t + `to_dataframe(level="by_path")` cband columns + multi-baseline `UserWarning`). **Per-path linear-trends DID^{fd}:** when `trends_linear=True` is set with `by_path=k`, the first-differencing transform at `chaisemartin_dhaultfoeuille.py:1599-1630` runs once globally BEFORE path enumeration (replaces `Y_mat` with `Z_mat = Y_t - Y_{t-1}` and shrinks the time axis by one), so per-path raw second-differences `DID^{fd}_{path, l}` surface on `path_effects[path]["horizons"][l]` automatically. Per-path cumulated level effects `delta_{path, l} = sum_{l'=1..l} DID^{fd}_{path, l'}` (the quantity R returns under `did_multiplegt_dyn(..., by_path, trends_lin)` per the existing parity test pivot at `tests/test_chaisemartin_dhaultfoeuille_parity.py:403-409`) surface on the new `results.path_cumulated_event_study[path][l]` field — a per-group running sum of `DID^{fd}_{g, l'}` averaged over the path's switchers eligible at horizon `l`, mirroring the global `linear_trends_effects` cumulation logic at `chaisemartin_dhaultfoeuille.py:3340-3398`. SE on the cumulated layer is the conservative upper bound (sum of per-horizon component SEs from `path_effects[path]["horizons"][l]["se"]`, NaN-consistent: any non-finite component yields a NaN cumulated SE). **Post-bootstrap recomputation:** the cumulated layer is built AFTER the bootstrap propagation block at `chaisemartin_dhaultfoeuille.py:3034-3081` so it reads the FINAL post-bootstrap per-horizon SEs (mirrors the global `linear_trends_effects` placement). 
When `n_bootstrap > 0`, cumulated SE / t / p / CI are derived from bootstrap per-horizon SEs; when bootstrap produces non-finite SE (e.g., `n_bootstrap=1` degenerate distribution), the cumulated layer's full inference tuple is NaN per the library-wide NaN-on-invalid bootstrap contract. `to_dataframe(level="by_path")` exposes `cumulated_effect` and `cumulated_se` columns (always present, NaN-when-None — mirrors the `cband_*` always-present convention from PR #374). `summary()` renders a `Cumulated Level Effects (DID^{fd}, trends_linear)` sub-section under each per-path block. **Path enumeration uses the post-first-differenced `N_mat_fd`**: switchers with `F_g==2` fail the window-eligibility check and are dropped from path enumeration entirely (the existing global `F_g >= 3` warning at line 1620 surfaces the issue), so a path whose switchers all have `F_g < 3` is silently absent from `path_effects` rather than present-with-NaN. **F_g=3 boundary-case divergence (`by_path + trends_linear`):** `F_g=3` switchers have exactly 2 pre-switch periods, which after first-differencing and the `time==1` filter leaves only 1 valid pre-window Z value. R's per-path full-pipeline call handles this single-pre-period regime differently from Python's global-then-disaggregate architecture, producing 30%+ relative divergence on point estimates for paths whose switchers include `F_g=3` (empirically observed on the parity fixture's earlier `F_g=3` variant). A separate `UserWarning` fires at fit-time when the panel includes any `F_g=3` switcher AND `by_path + trends_linear` is set, mirroring the `F_g < 3` exclusion warning. The shipped parity fixture (`single_baseline_multi_path_by_path_trends_lin`) restricts to `F_g >= 4` exclusively to avoid this regime; per-path R parity is asserted only there. 
**Placebo under `trends_linear` returns RAW per-horizon values** (no per-path placebo cumulation surface) — verified empirically against the existing `joiners_only_trends_lin` parity fixture: R's per-path Placebo_l matches Python's `path_placebo_event_study[path][-l]` (raw) bit-exactly under non-`by_path` trends_lin. **Deviation from R on multi-baseline switcher panels (point estimates):** R `did_multiplegt_dyn(..., by_path, trends_lin)` re-runs the full pipeline (including first-differencing) on each path's restricted subsample, so it operates on different switcher samples per path when switchers have different baseline values `D_{g,1}`. Python first-differences once globally before path enumeration. On single-baseline switcher panels the two architectures coincide; on multi-baseline switcher panels per-path point estimates can diverge — a `UserWarning` is emitted at fit-time when this configuration is detected so practitioners do not silently consume estimates that disagree with R (mirroring the analogous `by_path + controls` warning). Per-path R parity is confirmed against `did_multiplegt_dyn(..., by_path=3, trends_lin=TRUE, placebo=1)` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathTrendsLinear` on the `single_baseline_multi_path_by_path_trends_lin` scenario (single-baseline + cohort-single-path + `F_g >= 4` DGP designed to eliminate the multi-baseline divergence, the cross-path cohort-sharing deviation, and the F_g=3 boundary case under R's per-path full-pipeline call). Per-path cumulated point estimates match R bit-exactly (rtol ~1e-9) on event horizons under those conditions; cumulated SE_RTOL is widened to `0.20` (vs `0.12` used for non-cumulated by_path parity) because the conservative upper-bound SE compounds the cross-path cohort-sharing deviation under summation. 
**Placebo parity is intentionally skipped for `trends_linear`**: R's per-path placebo computation re-runs on the path-restricted subsample with different control eligibility than Python's global-then-disaggregate architecture surfaces, producing a sign-and-magnitude divergence on paths whose switchers have minimal pre-window depth (e.g., `F_g=4` switchers). Placebo under `by_path + trends_linear` is exercised via internal regression in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathTrendsLinear` (finite values, bootstrap inheritance) but not pinned to R bit-by-bit. Cross-surface invariants (analytical + bootstrap + placebo + sup-t + `path_cumulated_event_study` + `to_dataframe` columns + `summary()` rendering) are regression-tested at `TestByPathTrendsLinear`. **Per-path state-set trends:** when `trends_nonparam="state_col"` is set with `by_path=k`, the set membership column is validated and stored once globally as `set_ids_arr` (time-invariance, NaN rejection, partition-coarseness checks unchanged from the non-by_path path). The `set_ids` parameter is threaded through the four per-path IF helpers (`_compute_path_effects`, `_compute_path_placebos`, `_collect_path_bootstrap_inputs`, `_collect_path_placebo_bootstrap_inputs`) so per-path analytical SE, bootstrap, placebos, and sup-t bands all consume the set-restricted control pool automatically. R does NOT first-difference and does NOT cumulate under `trends_nonparam` (unlike `trends_lin`); per-horizon `Effect_l` is a normal DID with set-restricted controls. Per-path R parity is confirmed against `did_multiplegt_dyn(..., by_path=3, trends_nonparam="state", placebo=1)` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathTrendsNonparam` on the `multi_path_reversible_by_path_trends_nonparam` scenario; per-path point estimates AND placebos match R bit-exactly (rtol ~1e-9), per-path SE matches within the Phase 2 envelope (~13% rtol observed). 
Cross-surface invariants are regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathTrendsNonparam`. **Per-path non-binary treatment:** integer-coded discrete treatment (D in Z, e.g. ordinal {0, 1, 2}) is supported under `by_path=k` and `paths_of_interest`. Path tuples become integer-state tuples (`(0, 2, 2, 2)`) keyed bit-for-bit against R's comma-separated path strings (`"0,2,2,2"`) for D in {0..9}. Continuous D (e.g. `1.5`) raises `ValueError` at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). **Deviation from R for multi-character baseline states (D >= 10 or negative D):** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)` (extracted 2026-05-03 via `Rscript -e 'cat(paste(deparse(DIDmultiplegtDYN:::did_multiplegt_by_path), collapse="\n"))'`), capturing only the first character of the comma-separated path string. For multi-character baselines this drops the rest of the value: for `path = "12,12,..."` it captures `"1"` instead of `"12"`; for `path = "-1,-1,..."` it captures `"-"` instead of `"-1"`. R's per-path control-pool subset is mis-allocated in both regimes. Python's tuple-key matching is correct in both — the per-path point estimates we compute are correct, R's per-path subset for the same path is buggy. The shipped R-parity scenarios stay in `D in {0, 1, 2}` to avoid the R bug; R-parity is asserted on that set at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary` via the `multi_path_reversible_by_path_non_binary` scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4). 
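The first-character truncation described above can be reproduced in a few lines. This is a minimal sketch, not the library's code: `first_char_baseline` imitates the effect of R's `substr(path, 1, 1)`, `tuple_baseline` stands in for tuple-key matching, and all three helper names are hypothetical.

```python
def to_path_string(path: tuple) -> str:
    """Render an integer-state path tuple as a comma-separated string, e.g. "0,2,2,2"."""
    return ",".join(str(d) for d in path)


def first_char_baseline(path_str: str) -> str:
    """Imitate substr(path, 1, 1): keep only the first character of the string."""
    return path_str[:1]


def tuple_baseline(path: tuple) -> int:
    """Read the baseline as the first element of the tuple (no string round-trip)."""
    return path[0]


examples = [(0, 2, 2, 2), (12, 12, 12, 12), (-1, -1, -1, -1)]

# Map each path to (first-character baseline, tuple baseline).
report = {
    path: (first_char_baseline(to_path_string(path)), tuple_baseline(path))
    for path in examples
}
```

For single-digit nonnegative baselines the two readings agree. For `12` and `-1` the first-character reading yields `"1"` and `"-"`, which is the mis-allocation regime the shipped parity scenarios avoid.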
The string-encoding compatibility extends to all single-digit nonnegative D (`{0..9}`) since each value renders as a single character, but no R-parity scenario currently exercises D outside `{0, 1, 2}` — per-path point estimates match R bit-exactly (rtol ~1e-9 events; rtol+atol envelope for placebo near-zero values), SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). Negative-integer treatment-state support (paths containing negative D values in non-baseline positions, e.g. `(0, -1, -1, -1)`) is regression-tested in Python only at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_integer_D_supported`; a dedicated regression for a negative-baseline path (e.g. `(-1, 0, 0, 0)`, the exact regime that would trigger R's `substr` bug) is deferred to a follow-up. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`. **Per-path survey-design SE** (analytical Binder TSL + replicate-weight bootstrap): under `by_path` / `paths_of_interest` + `survey_design`, the per-path per-horizon SE routes through `_survey_se_from_group_if` using the cell-period allocator. The per-path influence function `U_pp_l_path` is the per-period IF with non-path switcher-side contributions skipped — control contributions remain unchanged, matching the joiners/leavers IF convention from the **Per-path SE convention** paragraph above (the `switcher_subset_mask` zeroes the switcher row of the per-group IF, which trivially zeroes the corresponding row of the per-cell IF, preserving the row-sum identity `U_pp.sum(axis=1) == U`). The IF is cohort-recentered via `_cohort_recenter_per_period` and expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally route through the cell allocator (Class A contract, PR #323). 
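The observation-level expansion `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})` can be illustrated with plain dictionaries. This is a hedged sketch under stated assumptions, not the `_survey_se_from_group_if` implementation: the function name `expand_if_to_observations` and the `(g, t, w)` observation layout are hypothetical, and cells are assumed to have strictly positive total weight.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

Cell = Tuple[int, int]  # (group index g, period index t)


def expand_if_to_observations(
    u_pp: Dict[Cell, float],
    obs: List[Tuple[int, int, float]],
) -> List[float]:
    """Spread each per-cell IF value over the cell's observations by weight share.

    obs holds (g, t, w) triples; W_{g,t} is the sum of w within the cell.
    Assumes every cell has strictly positive total weight.
    """
    cell_weight: Dict[Cell, float] = defaultdict(float)
    for g, t, w in obs:
        cell_weight[(g, t)] += w
    return [u_pp[(g, t)] * (w / cell_weight[(g, t)]) for g, t, w in obs]


# Cell (0, 1) has two observations with weights 1 and 3; cell (1, 0) has one.
u_pp = {(0, 1): 2.0, (1, 0): -1.0}
obs = [(0, 1, 1.0), (0, 1, 3.0), (1, 0, 2.0)]
psi = expand_if_to_observations(u_pp, obs)
```

Because the weight shares within a cell sum to one, the `psi_i` values in each cell add back up to that cell's `U_pp` entry, so the expansion does not change the cell-level totals.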
Multiplier bootstrap (`n_bootstrap > 0`) under `survey_design + by_path/paths_of_interest` raises `NotImplementedError` at fit-time — the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred to a future wave; the global non-by_path TSL multiplier bootstrap is unaffected and continues to ship. **Path-enumeration ranking is unweighted** under `survey_design`: top-k selection uses group cardinality (`path_to_count[p]` = number of groups), not population-weight mass — survey weights do not affect which paths are selected as "top-k". A weighted-ranking variant (sum of survey weights per path) is deferred until concrete demand. **`df_survey` propagation:** under replicate weights, every per-path per-horizon fit contributes an `n_valid` count to the shared `_replicate_n_valid_list` accumulator and the final `_effective_df_survey = min(...) - 1` reflects all per-path replicate fits. A post-call `_refresh_path_inference` helper re-runs `safe_inference` on every populated entry so `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` all use the same final df after per-path appends complete. **Lonely-PSU policy is sample-wide, not per-path** — the `lonely_psu` policy (`remove`/`certainty`/`adjust`) operates on the full design-level PSU/strata structure, not on path-restricted subsamples. **Telescope invariant:** on a single-path panel where every switcher follows the same trajectory and `eligible_groups` matches between by_path and non-by_path, per-path SE equals the global non-by_path survey SE bit-exactly — pinned at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignTelescope::test_telescope_analytical_TSL`. **Deviation from R:** none — R `did_multiplegt_dyn` does not support survey weighting, so this is a Python-only methodology extension (no R parity available; no R parity test class). 
Regression test anchor: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical` covering analytical SE, replicate-weight SE, the `n_bootstrap` gate, the global anti-regression, per-path placebos, `trends_linear` composition, and unobserved-path warnings under survey. **Per-path heterogeneity testing** (analytical OLS / WLS + survey-aware Binder TSL + replicate-weight): under `by_path` / `paths_of_interest` + `heterogeneity=""`, the per-path per-horizon coefficient `beta_X^path_l` is computed by re-running `_compute_heterogeneity_test` on the path-restricted switcher subsample. The path filter (`path_groups: Optional[Set[int]]`) restricts eligibility to switchers ON path `p` inside the inner regression; the variance machinery (HC1-robust OLS vcov for non-survey via `solve_ols(..., return_vcov=True)` (`vcov_type="hc1"` default), WLS-on-pweights with cell-period IF allocator for analytical Binder TSL, group-level allocator for Rao-Wu replicate) is unchanged from the global heterogeneity path. **Cohort dummies absorb baseline by construction** — the cohort key `(D_{g,1}, F_g, S_g)` includes baseline, so multi-baseline switcher panels do not produce R-divergence (unlike `controls` / `trends_linear`); no parallel `UserWarning` is emitted. **R parity:** matches `did_multiplegt_dyn(..., by_path, predict_het)` per-by_level on the `multi_path_reversible_by_path_predict_het` scenario for `beta`, `se`, `t_stat`, and `n_obs` (`BETA_RTOL = 1e-6` on `beta`, `SE_RTOL = 1e-5` on `se` / `t_stat`; the SE tolerance is one decade looser than `BETA_RTOL` to absorb the small OLS denominator-and-cohort-recentering numerical drift observed on this fixture; `n_obs` matches exactly). Inherits the same tolerances as the new global `multi_path_reversible_predict_het` scenario (`TestDCDHDynRParityHeterogeneity`) since the per-path R call is `did_multiplegt_main(..., predict_het=...)` per path-restricted subsample with no additional numerical loss. 
**R parity (heterogeneity inference, post-2026-05-15 df threading):** Python now passes `df = n_obs - n_params` to `safe_inference` on the non-survey OLS path at `chaisemartin_dhaultfoeuille.py`'s `_compute_heterogeneity_test`, matching R's t-distribution with df from the WLS regression (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). Parity tolerance is `INFERENCE_RTOL = 1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue to use `BETA_RTOL = 1e-6` / `SE_RTOL = 1e-5`. The `t_stat = beta / se` field is distribution-invariant. **Rank-deficient caveat:** `n_params = design.shape[1]` is the pre-drop column count; under near-rank-deficient designs that `solve_ols` retains rather than NaN-out, the actual rank may be lower than `n_params` (R's `df.residual` uses post-drop rank). Fully rank-deficient designs are NaN-filled by the rank-deficient short-circuit at `_compute_heterogeneity_test:5141-5150`, so the gap only affects edge cases. Tracked as a Low TODO follow-up (rank-from-`solve_ols` threading). R's `dont_drop_larger_lower=TRUE` is set in both fixture scenarios to match the Python `drop_larger_lower=False` requirement. **Survey composition:** inherits from the **Per-path survey-design SE** paragraph above — analytical Binder TSL routes through `_survey_se_from_group_if`'s cell-period allocator on the post-period of the transition; replicate-weights route through the group-level allocator. Multiplier bootstrap (`n_bootstrap > 0`) under `by_path + heterogeneity + survey_design` inherits the existing per-path multiplier-bootstrap-survey gate. 
**`df_survey` propagation:** every per-(path, horizon) replicate-weight fit appends `n_valid` to the shared `_replicate_n_valid_list` accumulator; per-path heterogeneity inference is refreshed with the FINAL `_effective_df_survey(...)` in the R2 P1b refresh block (separate dedicated loop because the schema shape is `{path: {l: {...}}}` rather than `{path: {"horizons": {l: {...}}}}`). **Result schema:** `results.path_heterogeneity_effects: Dict[Tuple[int, ...], Dict[int, Dict[str, Any]]]` keyed `{path: {l: {beta, se, t_stat, p_value, conf_int, n_obs}}}`. Empty-state contract mirrors `path_effects`: `None` when not requested, `{}` when requested but no path has eligible switchers. **DataFrame integration:** `to_dataframe(level="by_path")` adds always-present `het_*` columns (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`, `het_conf_int_lower`, `het_conf_int_upper`), populated for positive-horizon rows when `heterogeneity` is set and NaN otherwise (mirrors the `cband_*` and `cumulated_*` always-present convention). **Per-path placebo heterogeneity (`placebo + predict_het + by_path`, post-2026-05-15):** R-verified — `did_multiplegt_dyn(by_path, predict_het, placebo)` emits per-path heterogeneity OLS results on backward (placebo) horizons via R's per-by_level dispatcher (`DIDmultiplegtDYN:::did_multiplegt_main` placebo block at the `effect = matrix(-i, ...)` rbind site). R's predict_het syntax: passing `predict_het = list("X", c(-1))` with `placebo > 0` triggers "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions"; forward rows have positive `effect` values, placebo rows negative. 
Python mirrors via `_compute_heterogeneity_test(..., placebo=L_max)` (set when `self.placebo` is truthy) — the function iterates forward (1..L_max) and backward (-1..-L_max) horizons in a single loop with an explicit `out_idx < 0` eligibility guard for backward horizons whose `F_g` is too small (would otherwise silently misread `N_mat` via numpy negative indexing). Placebo rows in `to_dataframe(level="by_path")` have non-NaN `het_*` columns when `placebo=True` and `heterogeneity=` are both set; `path_heterogeneity_effects` uses negative-int keys for backward horizons, mirroring the existing `path_placebo_event_study` convention. **Survey gate (warn + skip):** `survey_design + placebo + heterogeneity` emits a `UserWarning` at fit-time and falls back to forward-horizon-only heterogeneity (codex R1 P1 #1: the eager raise broke the previously-supported forward-horizon survey + predict_het path under the default `placebo=True`) — the Binder TSL cell-period allocator's justification (Survey IF expansion Note above) is tied to **post-period** attribution (`out_idx = first_switch_idx[g] - 1 + l_h` with `l_h > 0`); backward-horizon attribution puts ψ_g mass on a pre-period cell, which is a separate library-extension claim that needs its own derivation. Forward-horizon `predict_het + survey_design` continues to work unchanged on both global and per-path surfaces. The function-level `_compute_heterogeneity_test` keeps a per-iteration backstop that raises `NotImplementedError` if a direct caller bypasses fit() and passes `survey + placebo > 0` (regression-tested at `test_compute_heterogeneity_test_direct_call_raises_on_backward_survey`). Pre-period allocator derivation is deferred to a follow-up methodology PR. 
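The negative-indexing hazard behind the `out_idx < 0` guard can be shown on a plain list, which wraps negative indices the same way a NumPy array does. This is an illustrative sketch only. It assumes the forward formula `out_idx = first_switch_idx - 1 + l_h` quoted above also applies with a signed `l_h` on backward horizons, and the helper name `guarded_outcome` is hypothetical.

```python
from typing import Optional, Sequence


def guarded_outcome(row: Sequence[float], first_switch_idx: int, l_h: int) -> Optional[float]:
    """Return the outcome at the horizon's column, or None when it is out of range.

    ASSUMPTION: the column index is first_switch_idx - 1 + l_h for both signs of l_h.
    """
    out_idx = first_switch_idx - 1 + l_h
    if out_idx < 0 or out_idx >= len(row):
        return None  # ineligible: the horizon falls outside the panel
    return row[out_idx]


row = [10.0, 20.0, 30.0, 40.0]
first_switch_idx = 1  # group switches at the second period (0-based column 1)

# Backward horizon l_h = -1 gives out_idx = -1.
unguarded = row[first_switch_idx - 1 + (-1)]  # wraps around to the LAST period
guarded = guarded_outcome(row, first_switch_idx, -1)
forward = guarded_outcome(row, first_switch_idx, 2)
```

Without the guard, the lookup returns the last period's value instead of failing, which is the silent misread the eligibility check prevents.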
R parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathHeterogeneityWithPlacebo` on the `multi_path_reversible_predict_het_with_placebo` fixture (scenario 22, `placebo=2, effects=3, by_path=3, predict_het=list("het_x", c(-1))`) AND `::TestDCDHDynRParityHeterogeneityWithPlacebo` on the global anchor (`multi_path_reversible_predict_het_with_placebo_global`, scenario 23, same DGP without by_path) — both surfaces emit forward + backward heterogeneity rows in matching parity. Pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` for `beta` / `se` / `t_stat` / `n_obs`; `INFERENCE_RTOL=1e-4` for `p_value` / `conf_int`. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo`. Regression test anchors: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity` (gate dispatch, behavior, telescope-to-global on single-path panel, zero-signal anti-regression, multi-baseline UserWarning anti-regression, DataFrame integration, edge cases) + `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneity` (global anchor, FIRST `predict_het` parity baseline) + `::TestDCDHDynRParityByPathHeterogeneity` (per-path). +- **Note (Phase 3 `by_path` per-path event-study disaggregation):** Per-path disaggregation of the multi-horizon event study, mirroring R `did_multiplegt_dyn(..., by_path=k)`. Activated via `ChaisemartinDHaultfoeuille(by_path=k, drop_larger_lower=False)` where `k` is a positive integer (top-k most common observed paths by switcher-group frequency). **Window convention:** the path tuple for a switcher group `g` is `(D_{g, F_g-1}, D_{g, F_g}, ..., D_{g, F_g-1+L_max})` — length `L_max + 1`, matching R's window `[F_g - 1, F_g - 1 + l]`. **Ranking:** paths are ranked by descending frequency; ties are broken lexicographically on the path tuple for deterministic ordering, so every selected path has a unique `frequency_rank`. 
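The window and ranking conventions above can be sketched as follows. This is an illustration under stated assumptions, not the `_enumerate_treatment_paths` implementation: the helper names `window_path` and `rank_paths` are hypothetical, `f_idx` is taken to be the 0-based column of the first-switch period, and "lexicographic" is assumed to mean ascending tuple order.

```python
from collections import Counter
from typing import List, Optional, Sequence, Tuple

PathTuple = Tuple[int, ...]


def window_path(d_row: Sequence[int], f_idx: int, l_max: int) -> Optional[PathTuple]:
    """Return (D_{F_g-1}, ..., D_{F_g-1+L_max}), or None if the window leaves the panel."""
    start, stop = f_idx - 1, f_idx + l_max  # slice of length l_max + 1
    if start < 0 or stop > len(d_row):
        return None
    return tuple(d_row[start:stop])


def rank_paths(paths: Sequence[PathTuple], k: int) -> List[Tuple[PathTuple, int, int]]:
    """Rank by descending frequency, ties broken by ascending tuple order; keep the top k."""
    counts = Counter(paths)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(path, n, rank) for rank, (path, n) in enumerate(ordered[:k], start=1)]


# Toy panel: group -> (treatment row, 0-based first-switch column).
panel = {
    "g1": ([0, 1, 1, 1], 1),
    "g2": ([0, 0, 1, 1], 2),
    "g3": ([0, 1, 0, 0], 1),
    "g4": ([0, 0, 0, 1], 3),  # window runs past the panel, so it is dropped
    "g5": ([0, 1, 2, 2], 1),
}
observed = [p for p in (window_path(d, f, 2) for d, f in panel.values()) if p is not None]
top2 = rank_paths(observed, k=2)
```

Slicing `ordered[:k]` also covers the case where `k` exceeds the number of observed paths: every observed path is returned. The sketch omits the accompanying warning.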
If `by_path` exceeds the number of observed paths, all observed paths are returned with a `UserWarning`. **Per-path SE convention (joiners/leavers precedent):** the per-path influence function follows the joiners-only / leavers-only IF construction at `chaisemartin_dhaultfoeuille.py:5495-5504`: the switcher-side contribution `+S_g * (Y_{g,out} - Y_{g,ref})` is zeroed for groups whose observed trajectory is NOT the selected path; control contributions and the full cohort structure `(D_{g,1}, F_g, S_g)` are unchanged. After applying the singleton-baseline eligible mask and cohort-recentering with the original cohort IDs, the plug-in SE uses the path-specific divisor `N_l_path` (count of path switchers eligible at horizon `l`) — same pattern as `joiners_se` using `joiner_total`. This gives the **within-path mean** estimand `DID_{path,l}` as the within-path average of `DID_{g,l}`. **Degenerate-cohort behavior per path:** when a path's centered IF at some horizon is identically zero (every variance-eligible path switcher forms its own `(D_{g,1}, F_g, S_g)` cohort, or the path has a single contributing group), SE / t_stat / p_value / conf_int are NaN-consistent and a `UserWarning` is emitted scoped to `(path, horizon)`. This mirrors the overall-path degenerate-cohort surface and is common for rare paths with few contributing groups. **Empty-state contract:** `results.path_effects` distinguishes "not requested" (`None`) from "requested but empty" (`{}` — all switchers have windows outside the panel or unobserved cells). The empty-dict case emits a `UserWarning` at fit-time and renders as an explicit "no observed paths" notice in `summary()`; `to_dataframe(level="by_path")` returns an empty DataFrame with the canonical column set (mirrors the `linear_trends` pattern when `trends_linear=True` but no horizons survive). 
**Requirements:** `drop_larger_lower=False` (multi-switch groups are the object of interest; default `True` filters them out) and `L_max >= 1` (path window depends on the horizon). **Scope:** combinations with `design2` and `honest_did` remain gated behind explicit `NotImplementedError` (deferred to follow-up wave PRs); `heterogeneity` is supported per-path — see the **Per-path heterogeneity testing** paragraph below. `n_bootstrap > 0` is now supported — see the **Bootstrap SE** paragraph below. `survey_design` is supported under analytical Binder TSL and replicate-weight bootstrap — see the **Per-path survey-design SE** paragraph below; multiplier bootstrap (`n_bootstrap > 0`) under `survey_design + by_path/paths_of_interest` remains gated. `placebo=True` is now supported per-path — see the **Per-path placebos** paragraph below. **TWFE diagnostic** remains a sample-level summary (not computed per path) in this release. Results are exposed on `results.path_effects` as `Dict[Tuple[int, ...], Dict[str, Any]]` with nested `horizons` dicts per horizon `l`, and on `results.to_dataframe(level="by_path")` as a long-format table with columns `[path, frequency_rank, n_groups, horizon, effect, se, t_stat, p_value, conf_int_lower, conf_int_upper, n_obs, cband_lower, cband_upper, cumulated_effect, cumulated_se, het_beta, het_se, het_t_stat, het_p_value, het_conf_int_lower, het_conf_int_upper]` (the `cband_*` columns are added by the joint sup-t Note below, populated for positive-horizon rows of paths with a finite sup-t crit and NaN otherwise; the `cumulated_*` columns are added by the per-path linear-trends Note below, populated for positive-horizon rows when `trends_linear=True` is set and NaN otherwise). Gated tests live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathGates` / `::TestByPathBehavior` / `::TestByPathEdgeCases`. 
**R-parity** against `DIDmultiplegtDYN 2.3.3` is confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPath` via two scenarios: `mixed_single_switch_by_path` (2 paths, `by_path=2`) and `multi_path_reversible_by_path` (4 paths, `by_path=3`; path-assignment deterministic on `F_g` so each `(D_{g,1}, F_g, S_g)` cohort contains switchers from a single path). Per-path point estimates and per-path switcher counts match R exactly; per-path SE matches within the Phase 2 multi-horizon SE envelope (observed rtol ≤ 10.2% on the 2-path mixed scenario, ≤ 4.2% on the 4-path cohort-clean scenario). **Deviation from R (cross-path cohort-sharing SE):** our analytical SE is the marginal variance of the path-contribution estimator cohort-centered on the *full-panel* cohort structure (joiners/leavers precedent — non-path switchers contribute to cohort means via their zeroed switcher row). R's `did_multiplegt_dyn(..., by_path=k)` re-runs the estimator per path, so cohort means are computed over the path's own switchers only. When a cohort `(D_{g,1}, F_g, S_g)` spans multiple observed paths, Python and R SE diverge materially (our empirical probes with random post-window toggling saw rtol > 100%); when every cohort is single-path (scenario 13 by design, scenario 14 by construction), the two approaches coincide up to the documented Phase 2 envelope. Practitioners with cohort structures that mix paths should interpret the per-path SE as a within-full-panel marginal variance, not a per-path conditional variance. 
**Bootstrap SE:** when `n_bootstrap > 0` is set, the top-k paths are enumerated once on the observed data (R-faithful: matches `did_multiplegt_dyn(..., by_path=k, bootstrap=B)`'s path-stability convention — verified empirically against DIDmultiplegtDYN 2.3.3) and the multiplier bootstrap (`bootstrap_weights ∈ {"rademacher", "mammen", "webb"}`) runs per `(path, horizon)` target via the shared `_bootstrap_one_target` / `compute_effect_bootstrap_stats` helpers. Point estimates are unchanged from the analytical path. Bootstrap SE replaces the analytical SE in `path_effects[path]["horizons"][l]["se"]`, and `p_value` / `conf_int` are taken as the **bootstrap percentile** statistics, matching the Round-10 library convention for overall / joiners / leavers / multi-horizon bootstrap (see the `Note (bootstrap inference surface)` elsewhere in this file and the pinned regression `test_bootstrap_p_value_and_ci_propagated_to_top_level`). `t_stat` is SE-derived via `safe_inference` per the anti-pattern rule. Interpretation: inference is *conditional on the observed path set*. **SE inherits the analytical cross-path cohort-sharing deviation:** the bootstrap input is the exact same full-panel cohort-centered path IF that the analytical path computes (`_collect_path_bootstrap_inputs` reuses the same enumeration / cohort IDs / IF construction), so the bootstrap SE is a Monte Carlo analog of the analytical SE — it inherits the same cross-path cohort-sharing deviation from R's per-path re-run convention documented above. On single-path-cohort panels (scenarios 13 and 14 of the R-parity fixture, and any DGP where `(D_{g,1}, F_g, S_g)` cohorts never span multiple observed paths), bootstrap SE tracks analytical SE up to Monte Carlo noise and both coincide with R up to the Phase 2 envelope. On cross-path cohort panels, bootstrap SE inherits the >100% rtol divergence from R that analytical already has. 
**Deviation from R (CI method):** R's per-path CI is normal-theory around the bootstrap SE (half-width ≈ `1.96·se`); ours is the bootstrap percentile CI, intentionally diverging from R to keep the dCDH inference surface internally consistent across all bootstrap targets. Practitioners who want *unconditional* inference capturing path-selection uncertainty need a pairs-bootstrap (deferred — no R precedent). Positive regressions live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathBootstrap` (gated `@pytest.mark.slow`): point-estimate invariance, finite positive SE on non-degenerate panels, SE-within-30%-rtol of analytical on cohort-clean fixtures, degenerate-cohort NaN propagation, Rademacher/Mammen/Webb parity, seed reproducibility, and percentile-vs-normal-theory CI pinning. **Per-path placebos:** when `placebo=True` (and `L_max >= 1`) is combined with `by_path=k`, per-path backward-horizon placebos `DID^{pl}_{path, l}` for `l = 1..L_max` are computed using the same joiners/leavers IF precedent applied to `_compute_per_group_if_placebo_horizon` (with the new `switcher_subset_mask` parameter): switcher contributions are zeroed for groups not in the path; the control pool and the variance-eligible cohort structure `(D_{g,1}, F_g, S_g)` are unchanged. Plug-in SE uses the path-specific divisor `N^{pl}_{l, path}` (count of path switchers eligible at backward lag `l`). Surfaced on `results.path_placebo_event_study[path][-l]` with the same `{effect, se, t_stat, p_value, conf_int, n_obs}` shape as `placebo_event_study` (negative-int inner keys parallel the existing per-path event-study positive-int keys, so a unified forward+backward view is well-formed). **Inherits the cross-path cohort-sharing SE deviation from R** documented above for `path_effects` (same convention applied backward); tracks R within numerical tolerance on single-path-cohort panels and diverges on cohort-mixed panels. 
Multiplier bootstrap (when `n_bootstrap > 0`) runs per `(path, lag)` target via the same `_bootstrap_one_target` dispatch used for the per-path event-study, with the canonical NaN-on-invalid contract. The bootstrap SE is a Monte Carlo analog of the analytical placebo SE — same per-path centered IF input — and inherits the same deviation. Surfaced through `summary()` (negative-keyed rows rendered alongside positive-keyed event-study rows under each path block) and `to_dataframe(level="by_path")` (`horizon` column takes negative ints for placebo rows). **Empty-state contract:** `results.path_placebo_event_study` mirrors `path_effects` — `None` when `by_path + placebo` was not requested, `{}` when requested but no observed path has a complete window within the panel (same regime that returns `{}` for `path_effects`, with the same fit-time `UserWarning`). R-parity is confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathPlacebo` on the `multi_path_reversible_by_path_placebo` scenario; positive analytical + bootstrap invariants live in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPlacebo` (with the gated `::TestByPathPlacebo::TestBootstrap` subclass). **Per-path covariate residualization (DID^X):** when `controls=[...]` is set with `by_path=k`, the per-baseline OLS residualization (Web Appendix Section 1.2) runs once on the first-differenced outcome BEFORE path enumeration. All four downstream surfaces — analytical per-path SE, bootstrap SE, per-path placebos, and per-path joint sup-t bands — consume the residualized `Y_mat` automatically (Frisch-Waugh-Lovell). Per-period effects remain unadjusted, consistent with the existing `controls` + per-period DID contract (per-period DID does not support residualization). Failed-stratum baselines (rank-deficient X) zero out `N_mat` for affected groups, which the path enumeration treats as ineligible per its existing convention. 
**Deviation from R on multi-baseline switcher panels (point estimates):** R `did_multiplegt_dyn(..., by_path, controls)` re-runs the per-baseline residualization on each path's restricted subsample (`R/R/did_multiplegt_dyn.R` lines 401-405: rows of the path's switchers OR rows where `yet_to_switch=1 AND baseline matches the path's baseline`). The first-stage residualization sample R uses for path B equals: pre-switch rows of all switchers with matching baseline + all rows of never-switchers with matching baseline — bit-identical to our global first-stage sample under single-baseline switcher panels (every switcher shares the same `D_{g,1}`, regardless of how `F_g` or path identity varies across switchers). Per-path point estimates therefore coincide with R on those panels up to the existing **DID^X first-stage cell-weighting deviation** documented above in `Note (Phase 3 DID^X covariate adjustment)` (Python's first-stage OLS uses equal cell weights — one observation per `(g, t)` cell, consistent with the library's cell-aggregated input convention; R weights by `N_gt`). On panels with one observation per `(g, t)` cell (the common case after the cell-aggregation step in `fit()`), Python matches R bit-exactly: the `multi_path_reversible_by_path_controls` parity fixture has 4 paths with switcher `F_g` values spanning [0..6] under `D_{g,1}=0` and Python matches R to rtol ~1e-11. On multi-baseline switcher panels (some switchers have `D_{g,1}=0`, others have `D_{g,1}=1`) R's per-path subset drops switchers whose baseline differs from the path's baseline, so the per-baseline regression coefficients diverge per path under R and point estimates can diverge between Python and R — a `UserWarning` is emitted at fit-time when this configuration is detected so practitioners do not silently consume estimates that disagree with R. 
The warning filters to switcher groups only; never-switchers (never-treated + always-treated controls) at multiple baseline values do NOT trigger the warning because they don't affect R's per-path subset construction. **Inherits the cross-path cohort-sharing SE deviation from R** documented above for `path_effects` — bootstrap SE, placebo SE, and sup-t crit are Monte Carlo / joint-distribution analogs of the same residualized analytical IF and carry the same deviation. R-parity is confirmed against `did_multiplegt_dyn(..., by_path=3, controls="X1")` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathControls` on the `multi_path_reversible_by_path_controls` scenario (single-baseline DGP, exact point-estimate match measured rtol ~1e-11); cross-surface inheritance and the multi-baseline warning are regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathControls` (analytical + bootstrap + placebo + sup-t + `to_dataframe(level="by_path")` cband columns + multi-baseline `UserWarning`). **Per-path linear-trends DID^{fd}:** when `trends_linear=True` is set with `by_path=k`, the first-differencing transform at `chaisemartin_dhaultfoeuille.py:1599-1630` runs once globally BEFORE path enumeration (replaces `Y_mat` with `Z_mat = Y_t - Y_{t-1}` and shrinks the time axis by one), so per-path raw second-differences `DID^{fd}_{path, l}` surface on `path_effects[path]["horizons"][l]` automatically. 
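The global first-differencing step can be illustrated with a minimal sketch. This is not the library's implementation: `first_difference` is a hypothetical helper, and the `(groups, periods)` layout of `Y_mat` is an assumption made for the example. It only shows that replacing `Y_mat` with `Z_mat = Y_t - Y_{t-1}` shortens the time axis by one period.

```python
import numpy as np


def first_difference(Y_mat: np.ndarray) -> np.ndarray:
    """Hypothetical helper: Z[:, t] = Y[:, t + 1] - Y[:, t] for each group row."""
    return np.diff(Y_mat, axis=1)


# Two groups observed over four periods (assumed layout: rows = groups).
Y_mat = np.array(
    [
        [1.0, 3.0, 6.0, 10.0],
        [2.0, 2.0, 5.0, 5.0],
    ]
)
Z_mat = first_difference(Y_mat)

# One period is lost to differencing: 4 periods in, 3 periods out.
print(Y_mat.shape, "->", Z_mat.shape)
```

Because the transform runs once before path enumeration, every per-path surface downstream sees the same differenced matrix.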
Per-path cumulated level effects `delta_{path, l} = sum_{l'=1..l} DID^{fd}_{path, l'}` (the quantity R returns under `did_multiplegt_dyn(..., by_path, trends_lin)` per the existing parity test pivot at `tests/test_chaisemartin_dhaultfoeuille_parity.py:403-409`) surface on the new `results.path_cumulated_event_study[path][l]` field — a per-group running sum of `DID^{fd}_{g, l'}` averaged over the path's switchers eligible at horizon `l`, mirroring the global `linear_trends_effects` cumulation logic at `chaisemartin_dhaultfoeuille.py:3340-3398`. SE on the cumulated layer is the conservative upper bound (sum of per-horizon component SEs from `path_effects[path]["horizons"][l]["se"]`, NaN-consistent: any non-finite component yields a NaN cumulated SE). **Post-bootstrap recomputation:** the cumulated layer is built AFTER the bootstrap propagation block at `chaisemartin_dhaultfoeuille.py:3034-3081` so it reads the FINAL post-bootstrap per-horizon SEs (mirrors the global `linear_trends_effects` placement). When `n_bootstrap > 0`, cumulated SE / t / p / CI are derived from bootstrap per-horizon SEs; when bootstrap produces non-finite SE (e.g., `n_bootstrap=1` degenerate distribution), the cumulated layer's full inference tuple is NaN per the library-wide NaN-on-invalid bootstrap contract. `to_dataframe(level="by_path")` exposes `cumulated_effect` and `cumulated_se` columns (always present, NaN-when-None — mirrors the `cband_*` always-present convention from PR #374). `summary()` renders a `Cumulated Level Effects (DID^{fd}, trends_linear)` sub-section under each per-path block. **Path enumeration uses the post-first-differenced `N_mat_fd`**: switchers with `F_g==2` fail the window-eligibility check and are dropped from path enumeration entirely (the existing global `F_g >= 3` warning at line 1620 surfaces the issue), so a path whose switchers all have `F_g < 3` is silently absent from `path_effects` rather than present-with-NaN. 
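The conservative cumulated SE described above can be sketched as a running sum of per-horizon SEs. The sum is an upper bound because the standard deviation of a sum never exceeds the sum of the standard deviations. The helper name `cumulated_se_upper_bound` is hypothetical and the code is an illustration under that reading of the text, not the library's routine.

```python
import numpy as np


def cumulated_se_upper_bound(per_horizon_se):
    """Running sum of per-horizon SEs, NaN once any component is non-finite.

    Hypothetical helper: entry l-1 holds the bound for horizons 1..l.
    """
    se = np.asarray(per_horizon_se, dtype=float)
    finite = np.isfinite(se)
    # Sum only finite components so +/-inf cannot leak through as inf.
    running = np.cumsum(np.where(finite, se, 0.0))
    # A non-finite component poisons its own horizon and every later one.
    poisoned = np.cumsum(~finite) > 0
    return np.where(poisoned, np.nan, running)


clean = cumulated_se_upper_bound([0.1, 0.2, 0.3])
broken = cumulated_se_upper_bound([0.1, np.nan, 0.3])
```

The explicit mask matters for infinite inputs: a plain `np.cumsum` would carry `inf` forward instead of yielding NaN.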
**F_g=3 boundary-case divergence (`by_path + trends_linear`):** `F_g=3` switchers have exactly 2 pre-switch periods, which after first-differencing and the `time==1` filter leaves only 1 valid pre-window Z value. R's per-path full-pipeline call handles this single-pre-period regime differently from Python's global-then-disaggregate architecture, producing 30%+ relative divergence on point estimates for paths whose switchers include `F_g=3` (empirically observed on the parity fixture's earlier `F_g=3` variant). A separate `UserWarning` fires at fit-time when the panel includes any `F_g=3` switcher AND `by_path + trends_linear` is set, mirroring the `F_g < 3` exclusion warning. The shipped parity fixture (`single_baseline_multi_path_by_path_trends_lin`) restricts to `F_g >= 4` exclusively to avoid this regime; per-path R parity is asserted only there. **Placebo under `trends_linear` returns RAW per-horizon values** (no per-path placebo cumulation surface) — verified empirically against the existing `joiners_only_trends_lin` parity fixture: R's per-path Placebo_l matches Python's `path_placebo_event_study[path][-l]` (raw) bit-exactly under non-`by_path` trends_lin. **Deviation from R on multi-baseline switcher panels (point estimates):** R `did_multiplegt_dyn(..., by_path, trends_lin)` re-runs the full pipeline (including first-differencing) on each path's restricted subsample, so it operates on different switcher samples per path when switchers have different baseline values `D_{g,1}`. Python first-differences once globally before path enumeration. On single-baseline switcher panels the two architectures coincide; on multi-baseline switcher panels per-path point estimates can diverge — a `UserWarning` is emitted at fit-time when this configuration is detected so practitioners do not silently consume estimates that disagree with R (mirroring the analogous `by_path + controls` warning). 
Per-path R parity is confirmed against `did_multiplegt_dyn(..., by_path=3, trends_lin=TRUE, placebo=1)` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathTrendsLinear` on the `single_baseline_multi_path_by_path_trends_lin` scenario (single-baseline + cohort-single-path + `F_g >= 4` DGP designed to eliminate the multi-baseline divergence, the cross-path cohort-sharing deviation, and the F_g=3 boundary case under R's per-path full-pipeline call). Per-path cumulated point estimates match R bit-exactly (rtol ~1e-9) on event horizons under those conditions; cumulated SE_RTOL is widened to `0.20` (vs `0.12` used for non-cumulated by_path parity) because the conservative upper-bound SE compounds the cross-path cohort-sharing deviation under summation. **Placebo parity is intentionally skipped for `trends_linear`**: R's per-path placebo computation re-runs on the path-restricted subsample with different control eligibility than Python's global-then-disaggregate architecture surfaces, producing a sign-and-magnitude divergence on paths whose switchers have minimal pre-window depth (e.g., `F_g=4` switchers). Placebo under `by_path + trends_linear` is exercised via internal regression in `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathTrendsLinear` (finite values, bootstrap inheritance) but not pinned to R bit-by-bit. Cross-surface invariants (analytical + bootstrap + placebo + sup-t + `path_cumulated_event_study` + `to_dataframe` columns + `summary()` rendering) are regression-tested at `TestByPathTrendsLinear`. **Per-path state-set trends:** when `trends_nonparam="state_col"` is set with `by_path=k`, the set membership column is validated and stored once globally as `set_ids_arr` (time-invariance, NaN rejection, partition-coarseness checks unchanged from the non-by_path path). 
The `set_ids` parameter is threaded through the four per-path IF helpers (`_compute_path_effects`, `_compute_path_placebos`, `_collect_path_bootstrap_inputs`, `_collect_path_placebo_bootstrap_inputs`) so per-path analytical SE, bootstrap, placebos, and sup-t bands all consume the set-restricted control pool automatically. R does NOT first-difference and does NOT cumulate under `trends_nonparam` (unlike `trends_lin`); per-horizon `Effect_l` is a normal DID with set-restricted controls. Per-path R parity is confirmed against `did_multiplegt_dyn(..., by_path=3, trends_nonparam="state", placebo=1)` at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathTrendsNonparam` on the `multi_path_reversible_by_path_trends_nonparam` scenario; per-path point estimates AND placebos match R bit-exactly (rtol ~1e-9), per-path SE matches within the Phase 2 envelope (~13% rtol observed). Cross-surface invariants are regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathTrendsNonparam`. **Per-path non-binary treatment:** integer-coded discrete treatment (D in Z, e.g. ordinal {0, 1, 2}) is supported under `by_path=k` and `paths_of_interest`. Path tuples become integer-state tuples (`(0, 2, 2, 2)`) keyed bit-for-bit against R's comma-separated path strings (`"0,2,2,2"`) for D in {0..9}. Continuous D (e.g. `1.5`) raises `ValueError` at fit-time per the no-silent-failures contract — the existing `int(round(float(v)))` cast in `_enumerate_treatment_paths` is now defensive (no-op for integer-coded D). **Deviation from R for multi-character baseline states (D >= 10 or negative D):** R's `did_multiplegt_by_path` derives the per-path baseline via `path_index$baseline_XX <- substr(path_index$path, 1, 1)` (extracted 2026-05-03 via `Rscript -e 'cat(paste(deparse(DIDmultiplegtDYN:::did_multiplegt_by_path), collapse="\n"))'`), capturing only the first character of the comma-separated path string. 
For multi-character baselines this drops the rest of the value: for `path = "12,12,..."` it captures `"1"` instead of `"12"`; for `path = "-1,-1,..."` it captures `"-"` instead of `"-1"`. R's per-path control-pool subset is mis-allocated in both regimes. Python's tuple-key matching is correct in both: the per-path point estimates we compute are correct, whereas R's per-path subset for the same path is buggy. The shipped R-parity scenarios stay in `D in {0, 1, 2}` to avoid the R bug; R-parity is asserted on that set at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathNonBinary` via the `multi_path_reversible_by_path_non_binary` scenario (78 switchers, 3 paths, single-baseline custom DGP, F_g >= 4). The string-encoding compatibility extends to all single-digit nonnegative D (`{0..9}`) since each value renders as a single character, but no R-parity scenario currently exercises D outside `{0, 1, 2}`. On the shipped scenario, per-path point estimates match R bit-exactly (rtol ~1e-9 on event horizons; rtol+atol envelope for near-zero placebo values), and SE inherits the documented cross-path cohort-sharing deviation (~5% rtol observed; SE_RTOL=0.15 envelope). Negative-integer treatment-state support is regression-tested in Python only (no R parity, because R is the buggy side on multi-character baselines) at two sites: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary::test_negative_integer_D_supported` covers paths with negative values in non-baseline positions (e.g. `(0, -1, -1, -1)`), and `::test_negative_baseline_path_supported` covers paths starting with a negative baseline `D_{g,1} = -1` (e.g. `(-1, 0, 0, 0)`, `(-1, 1, 1, 1)`), the exact regime that triggers R's `substr` bug. Cross-surface invariants are regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathNonBinary`.
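The first-character truncation is easy to reproduce outside R. The sketch below mimics `substr(path, 1, 1)` with Python string slicing and compares it with reading the baseline from a tuple key. All three helper names are hypothetical and exist only for this illustration.

```python
def path_to_r_string(path):
    """Render a path tuple as a comma-separated string, e.g. (0, 2, 2, 2) -> "0,2,2,2"."""
    return ",".join(str(d) for d in path)


def first_char_baseline(path_string):
    """Mimic substr(path, 1, 1): keep only the first character of the string."""
    return path_string[:1]


def tuple_baseline(path):
    """Read the baseline state directly from the tuple key."""
    return path[0]


for path in [(0, 2, 2, 2), (12, 12, 12, 12), (-1, 0, 0, 0)]:
    s = path_to_r_string(path)
    print(s, "| first char:", repr(first_char_baseline(s)),
          "| tuple baseline:", tuple_baseline(path))
```

For single-digit nonnegative states the two readings agree; for `12` and `-1` the first-character reading yields `"1"` and `"-"`.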
**Per-path survey-design SE** (analytical Binder TSL + replicate-weight bootstrap): under `by_path` / `paths_of_interest` + `survey_design`, the per-path per-horizon SE routes through `_survey_se_from_group_if` using the cell-period allocator. The per-path influence function `U_pp_l_path` is the per-period IF with non-path switcher-side contributions skipped — control contributions remain unchanged, matching the joiners/leavers IF convention from the **Per-path SE convention** paragraph above (the `switcher_subset_mask` zeroes the switcher row of the per-group IF, which trivially zeroes the corresponding row of the per-cell IF, preserving the row-sum identity `U_pp.sum(axis=1) == U`). The IF is cohort-recentered via `_cohort_recenter_per_period` and expanded to observations as `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})`. Replicate-weight designs unconditionally route through the cell allocator (Class A contract, PR #323). Multiplier bootstrap (`n_bootstrap > 0`) under `survey_design + by_path/paths_of_interest` raises `NotImplementedError` at fit-time — the survey-aware perturbation pivot for path-restricted IFs is methodologically underived and deferred to a future wave; the global non-by_path TSL multiplier bootstrap is unaffected and continues to ship. **Path-enumeration ranking is unweighted** under `survey_design`: top-k selection uses group cardinality (`path_to_count[p]` = number of groups), not population-weight mass — survey weights do not affect which paths are selected as "top-k". A weighted-ranking variant (sum of survey weights per path) is deferred until concrete demand. **`df_survey` propagation:** under replicate weights, every per-path per-horizon fit contributes an `n_valid` count to the shared `_replicate_n_valid_list` accumulator and the final `_effective_df_survey = min(...) - 1` reflects all per-path replicate fits. 
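The observation-level expansion `psi_i = U_pp[g_i, t_i] · (w_i / W_{g_i, t_i})` can be sketched with plain NumPy. This is an illustration, not the library's allocator: `expand_if_to_observations` is a hypothetical name, weights are assumed strictly positive, and `W` is assumed to be the within-cell sum of observation weights.

```python
import numpy as np


def expand_if_to_observations(U_pp, g_idx, t_idx, w):
    """Spread each cell's IF value over its observations in proportion to weight.

    Hypothetical helper; assumes w > 0 so every observed cell has W > 0.
    """
    W = np.zeros_like(U_pp, dtype=float)
    # Accumulate total weight per (group, period) cell, handling repeated cells.
    np.add.at(W, (g_idx, t_idx), w)
    return U_pp[g_idx, t_idx] * (w / W[g_idx, t_idx])


U_pp = np.array([[1.0, -2.0], [0.5, 0.0]])  # per-cell IF, groups x periods
g_idx = np.array([0, 0, 0, 1])              # group of each observation
t_idx = np.array([0, 0, 1, 0])              # period of each observation
w = np.array([1.0, 3.0, 2.0, 5.0])          # observation weights

psi = expand_if_to_observations(U_pp, g_idx, t_idx, w)
```

Within each observed cell the `psi_i` values sum back to `U_pp[g, t]`, so the expansion redistributes the cell-level mass without changing its total.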
A post-call `_refresh_path_inference` helper re-runs `safe_inference` on every populated entry so `multi_horizon_inference`, `placebo_horizon_inference`, `path_effects`, and `path_placebos` all use the same final df after per-path appends complete. **Lonely-PSU policy is sample-wide, not per-path** — the `lonely_psu` policy (`remove`/`certainty`/`adjust`) operates on the full design-level PSU/strata structure, not on path-restricted subsamples. **Telescope invariant:** on a single-path panel where every switcher follows the same trajectory and `eligible_groups` matches between by_path and non-by_path, per-path SE equals the global non-by_path survey SE bit-exactly — pinned at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignTelescope::test_telescope_analytical_TSL`. **Deviation from R:** none — R `did_multiplegt_dyn` does not support survey weighting, so this is a Python-only methodology extension (no R parity available; no R parity test class). Regression test anchor: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathSurveyDesignAnalytical` covering analytical SE, replicate-weight SE, the `n_bootstrap` gate, the global anti-regression, per-path placebos, `trends_linear` composition, and unobserved-path warnings under survey. **Per-path heterogeneity testing** (analytical OLS / WLS + survey-aware Binder TSL + replicate-weight): under `by_path` / `paths_of_interest` + `heterogeneity=""`, the per-path per-horizon coefficient `beta_X^path_l` is computed by re-running `_compute_heterogeneity_test` on the path-restricted switcher subsample. 
The path filter (`path_groups: Optional[Set[int]]`) restricts eligibility to switchers ON path `p` inside the inner regression; the variance machinery (HC1-robust OLS vcov for non-survey via `solve_ols(..., return_vcov=True)` (`vcov_type="hc1"` default), WLS-on-pweights with cell-period IF allocator for analytical Binder TSL, group-level allocator for Rao-Wu replicate) is unchanged from the global heterogeneity path. **Cohort dummies absorb baseline by construction** — the cohort key `(D_{g,1}, F_g, S_g)` includes baseline, so multi-baseline switcher panels do not produce R-divergence (unlike `controls` / `trends_linear`); no parallel `UserWarning` is emitted. **R parity:** matches `did_multiplegt_dyn(..., by_path, predict_het)` per-by_level on the `multi_path_reversible_by_path_predict_het` scenario for `beta`, `se`, `t_stat`, and `n_obs` (`BETA_RTOL = 1e-6` on `beta`, `SE_RTOL = 1e-5` on `se` / `t_stat`; the SE tolerance is one decade looser than `BETA_RTOL` to absorb the small OLS denominator-and-cohort-recentering numerical drift observed on this fixture; `n_obs` matches exactly). Inherits the same tolerances as the new global `multi_path_reversible_predict_het` scenario (`TestDCDHDynRParityHeterogeneity`) since the per-path R call is `did_multiplegt_main(..., predict_het=...)` per path-restricted subsample with no additional numerical loss. **R parity (heterogeneity inference, post-2026-05-15 df threading):** Python now passes `df = n_obs - rank(design)` to `safe_inference` on the non-survey OLS path at `chaisemartin_dhaultfoeuille.py`'s `_compute_heterogeneity_test`, matching R's t-distribution with df from the OLS regression (`DIDmultiplegtDYN:::did_multiplegt_main` `t_stat <- qt(0.975, df.residual(model))` site). 
The numerical rank is computed via `_detect_rank_deficiency` (the same helper `solve_ols` calls internally); the small-sample short-circuit also uses `n_obs <= rank` rather than the pre-PR pre-drop `n_obs <= n_params`, so boundary cases where alias dropping leaves `n_obs > rank > 0` fit correctly instead of NaN-filling. Parity tolerance is `INFERENCE_RTOL = 1e-4` on `p_value` and `conf_int`; `beta` / `se` / `t_stat` continue to use `BETA_RTOL = 1e-6` / `SE_RTOL = 1e-5`. The `t_stat = beta / se` field is distribution-invariant. **Rank-deficient designs:** ``df = n_obs - rank(design)`` uses the post-drop numerical rank via the same ``_detect_rank_deficiency`` helper that ``solve_ols`` calls internally. For full-rank designs (``rank == n_params``) behavior is bit-identical to the pre-PR ``n_obs - n_params`` path; for near-rank-deficient designs that ``solve_ols`` retains rather than NaN-out (e.g., cohort-collinearity at high horizons), the post-drop rank is strictly lower and the post-PR ``df`` is strictly larger, matching R's ``lm()`` convention. Fully rank-deficient designs continue to NaN-fill via the rank-deficient short-circuit at ``_compute_heterogeneity_test``. R's `dont_drop_larger_lower=TRUE` is set in both fixture scenarios to match the Python `drop_larger_lower=False` requirement. **Survey composition:** inherits from the **Per-path survey-design SE** paragraph above — analytical Binder TSL routes through `_survey_se_from_group_if`'s cell-period allocator on the post-period of the transition; replicate-weights route through the group-level allocator. Multiplier bootstrap (`n_bootstrap > 0`) under `by_path + heterogeneity + survey_design` inherits the existing per-path multiplier-bootstrap-survey gate. 
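The difference between `n_obs - n_params` and `n_obs - rank(design)` shows up on the five-switcher design used by the regression test in this PR. The sketch below rebuilds that design and uses `np.linalg.matrix_rank` as a stand-in for the library's `_detect_rank_deficiency` helper; that substitution is an assumption of the example.

```python
import numpy as np

# Five switchers in four cohorts; X_het equals 1 exactly on the reference cohort.
first_switch = np.array([3, 3, 4, 5, 6])
X_het = np.array([1.0, 1.0, 0.0, 0.0, 0.0])

# Cohort dummies with the first cohort (F=3) dropped as the reference level.
cohorts = np.unique(first_switch)
dummies = (first_switch[:, None] == cohorts[None, 1:]).astype(float)

# Columns: intercept, X_het, F=4 dummy, F=5 dummy, F=6 dummy.
design = np.column_stack([np.ones(len(X_het)), X_het, dummies])
n_obs, n_params = design.shape

# Stand-in for the library's rank detection (assumption of this sketch).
rank = int(np.linalg.matrix_rank(design))
df = n_obs - rank

old_guard_fires = n_obs <= n_params  # pre-PR small-sample short-circuit
new_guard_fires = n_obs <= rank      # post-PR small-sample short-circuit
print(n_obs, n_params, rank, df, old_guard_fires, new_guard_fires)
```

Here `X_het` equals the intercept minus the three cohort dummies, so one column is redundant: the old guard would NaN-fill, while the rank-based guard lets the regression fit with one residual degree of freedom.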
**`df_survey` propagation:** every per-(path, horizon) replicate-weight fit appends `n_valid` to the shared `_replicate_n_valid_list` accumulator; per-path heterogeneity inference is refreshed with the FINAL `_effective_df_survey(...)` in the R2 P1b refresh block (separate dedicated loop because the schema shape is `{path: {l: {...}}}` rather than `{path: {"horizons": {l: {...}}}}`). **Result schema:** `results.path_heterogeneity_effects: Dict[Tuple[int, ...], Dict[int, Dict[str, Any]]]` keyed `{path: {l: {beta, se, t_stat, p_value, conf_int, n_obs}}}`. Empty-state contract mirrors `path_effects`: `None` when not requested, `{}` when requested but no path has eligible switchers. **DataFrame integration:** `to_dataframe(level="by_path")` adds always-present `het_*` columns (`het_beta`, `het_se`, `het_t_stat`, `het_p_value`, `het_conf_int_lower`, `het_conf_int_upper`), populated for positive-horizon rows when `heterogeneity` is set and NaN otherwise (mirrors the `cband_*` and `cumulated_*` always-present convention). **Per-path placebo heterogeneity (`placebo + predict_het + by_path`, post-2026-05-15):** R-verified — `did_multiplegt_dyn(by_path, predict_het, placebo)` emits per-path heterogeneity OLS results on backward (placebo) horizons via R's per-by_level dispatcher (`DIDmultiplegtDYN:::did_multiplegt_main` placebo block at the `effect = matrix(-i, ...)` rbind site). R's predict_het syntax: passing `predict_het = list("X", c(-1))` with `placebo > 0` triggers "compute heterogeneity for ALL forward (1..effects) AND ALL placebo (1..placebo) positions"; forward rows have positive `effect` values, placebo rows negative. 
Python mirrors via `_compute_heterogeneity_test(..., placebo=L_max)` (set when `self.placebo` is truthy) — the function iterates forward (1..L_max) and backward (-1..-L_max) horizons in a single loop with an explicit `out_idx < 0` eligibility guard for backward horizons whose `F_g` is too small (would otherwise silently misread `N_mat` via numpy negative indexing). Placebo rows in `to_dataframe(level="by_path")` have non-NaN `het_*` columns when `placebo=True` and `heterogeneity=` are both set; `path_heterogeneity_effects` uses negative-int keys for backward horizons, mirroring the existing `path_placebo_event_study` convention. **Survey gate (warn + skip):** `survey_design + placebo + heterogeneity` emits a `UserWarning` at fit-time and falls back to forward-horizon-only heterogeneity (codex R1 P1 #1: the eager raise broke the previously-supported forward-horizon survey + predict_het path under the default `placebo=True`) — the Binder TSL cell-period allocator's justification (Survey IF expansion Note above) is tied to **post-period** attribution (`out_idx = first_switch_idx[g] - 1 + l_h` with `l_h > 0`); backward-horizon attribution puts ψ_g mass on a pre-period cell, which is a separate library-extension claim that needs its own derivation. Forward-horizon `predict_het + survey_design` continues to work unchanged on both global and per-path surfaces. The function-level `_compute_heterogeneity_test` keeps a per-iteration backstop that raises `NotImplementedError` if a direct caller bypasses fit() and passes `survey + placebo > 0` (regression-tested at `test_compute_heterogeneity_test_direct_call_raises_on_backward_survey`). Pre-period allocator derivation is deferred to a follow-up methodology PR. 
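The reason for the explicit `out_idx < 0` guard is that NumPy treats a negative index as counting from the end of the array rather than raising. The sketch below is hypothetical: `eligible_count` is not a library function, and applying the forward-horizon formula `first_switch_idx - 1 + l_h` to negative `l_h` is an assumption made to show the failure mode.

```python
import numpy as np


def eligible_count(N_row, first_switch_idx, l_h):
    """Return the cell count at the target period, or None if out of range.

    Hypothetical helper; the index formula for l_h < 0 is an assumption.
    """
    out_idx = first_switch_idx - 1 + l_h
    # Without this guard, a negative out_idx would wrap to the end of the row.
    if out_idx < 0 or out_idx >= N_row.shape[0]:
        return None
    return float(N_row[out_idx])


N_row = np.array([0.0, 1.0, 1.0, 7.0])

# An early switcher with a backward horizon lands before the panel start.
unguarded = N_row[1 - 1 + (-1)]            # silently reads the LAST period
guarded = eligible_count(N_row, 1, -1)     # flagged as ineligible instead
forward = eligible_count(N_row, 2, 1)      # ordinary in-range forward horizon
```

The unguarded read returns a valid-looking count from the wrong period, which is why the ineligible case has to be detected before indexing.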
R parity confirmed at `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityByPathHeterogeneityWithPlacebo` on the `multi_path_reversible_predict_het_with_placebo` fixture (scenario 22, `placebo=2, effects=3, by_path=3, predict_het=list("het_x", c(-1))`) AND `::TestDCDHDynRParityHeterogeneityWithPlacebo` on the global anchor (`multi_path_reversible_predict_het_with_placebo_global`, scenario 23, same DGP without by_path) — both surfaces emit forward + backward heterogeneity rows in matching parity. Pinned at `BETA_RTOL=1e-6` / `SE_RTOL=1e-5` for `beta` / `se` / `t_stat` / `n_obs`; `INFERENCE_RTOL=1e-4` for `p_value` / `conf_int`. Cross-surface invariants regression-tested at `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathPredictHetPlacebo`. Regression test anchors: `tests/test_chaisemartin_dhaultfoeuille.py::TestByPathHeterogeneity` (gate dispatch, behavior, telescope-to-global on single-path panel, zero-signal anti-regression, multi-baseline UserWarning anti-regression, DataFrame integration, edge cases) + `tests/test_chaisemartin_dhaultfoeuille_parity.py::TestDCDHDynRParityHeterogeneity` (global anchor, FIRST `predict_het` parity baseline) + `::TestDCDHDynRParityByPathHeterogeneity` (per-path). **Per-path user-specified path selection (`paths_of_interest`):** Python-only API extension — R's `did_multiplegt_dyn(..., by_path=k)` only accepts a positive int (top-k automatic ranking) or `-1` (all observed paths) and provides no list-based selection. Activated via `ChaisemartinDHaultfoeuille(paths_of_interest=[(0, 1, 1, 1), (0, 1, 0, 0)], drop_larger_lower=False)` as an alternative to `by_path=k`; the two are **mutually exclusive** (setting both raises `ValueError` at `__init__` and `set_params` time). Each path tuple must have length `L_max + 1`; the type / element / non-empty / length-uniformity checks fire at `__init__`, the length-vs-L_max check fires at fit-time. 
`bool` and `np.bool_` are explicitly rejected; `np.integer` is accepted and canonicalized to Python `int` for tuple-key consistency. Duplicates emit a `UserWarning` and are deduplicated; paths not observed in the panel emit a `UserWarning` and are omitted from `path_effects`. Paths appear in `results.path_effects` in the user-specified order, modulo deduplication and unobserved-path filtering. Composes with non-binary D and all downstream `by_path` surfaces (bootstrap, per-path placebos, per-path joint sup-t bands, `controls`, `trends_linear`, `trends_nonparam`) — mechanical filter on observed paths, no methodology change. Behavior + cross-feature regressions live at `tests/test_chaisemartin_dhaultfoeuille.py::TestPathsOfInterest`. diff --git a/tests/test_chaisemartin_dhaultfoeuille.py b/tests/test_chaisemartin_dhaultfoeuille.py index ae7776d1..7aab700a 100644 --- a/tests/test_chaisemartin_dhaultfoeuille.py +++ b/tests/test_chaisemartin_dhaultfoeuille.py @@ -2870,11 +2870,12 @@ def test_heterogeneity_multi_horizon(self): def test_heterogeneity_inference_local_invariants(self): """Local SE-derivation invariants for non-survey heterogeneity inference. Post-2026-05-15 df threading: Python passes - ``df = n_obs - n_params`` to ``safe_inference`` (matching R's - t-distribution); R-parity is pinned in + ``df = n_obs - rank(design)`` to ``safe_inference`` (matching + R's t-distribution); for full-rank designs ``rank == n_params``. + R-parity is pinned in ``tests/test_chaisemartin_dhaultfoeuille_parity.py``. This local test verifies the SE-derived fields are wired correctly - without requiring back-derivation of ``n_params``: + without requiring back-derivation of ``rank``: ``t_stat = beta / se``; ``conf_int`` symmetric around ``beta`` with positive half-width; ``p_value`` in ``[0, 1]``. 
Without these checks a regression isolated to the inference @@ -8311,6 +8312,71 @@ def test_negative_integer_D_supported(self): assert (0, -1, -1, -1) in path_keys assert (0, 1, 1, 1) in path_keys + def test_negative_baseline_path_supported(self): + """Negative-baseline switchers (D_{g,1} = -1) produce correct path tuples. + + Closes TODO #419 test-coverage gap. The existing + ``test_negative_integer_D_supported`` covers paths with negative + values in non-baseline positions (e.g. ``(0, -1, -1, -1)``), which + does NOT trigger R's ``substr(path, 1, 1)`` bug regime — R's + per-by_path dispatcher captures only the first character of the + comma-separated path string, so ``"-1,0,0,0"`` collapses to + ``"-"`` baseline rather than ``"-1"``. Python's tuple-key matching + is correct under any baseline value; this test pins the + negative-baseline contract with switchers that start at + ``D_{g,1} = -1`` and transition to ``0``. Per the REGISTRY note, + Python here is correct AND known to diverge from R's per-path + subset construction for the same data — no R-parity fixture is + added because R is the buggy side. 
+ """ + rng = np.random.default_rng(46) + rows = [] + n_periods = 8 + # 30 switchers with D_{g,1} = -1, transitioning to 0 at F_g=4 + # path = (-1, 0, 0, 0) (length L_max+1 = 4 with L_max=3) + for g in range(30): + for t in range(n_periods): + d = 0 if t >= 3 else -1 + rows.append({"group": g, "period": t, "treatment": d}) + # 30 switchers with D_{g,1} = -1, transitioning to 1 at F_g=4 + # path = (-1, 1, 1, 1) + for g in range(30, 60): + for t in range(n_periods): + d = 1 if t >= 3 else -1 + rows.append({"group": g, "period": t, "treatment": d}) + # 20 always-at-(-1) controls (D == -1 throughout — same baseline + # as the switchers, never-treated relative to the change) + for g in range(60, 80): + for t in range(n_periods): + rows.append({"group": g, "period": t, "treatment": -1}) + df = pd.DataFrame(rows) + df["outcome"] = ( + 10.0 + + df["group"].values * 0.1 + + 0.1 * df["period"].values + + 2.0 * df["treatment"].values + + rng.normal(0, 0.5, size=len(df)) + ) + est = ChaisemartinDHaultfoeuille( + drop_larger_lower=False, by_path=2, twfe_diagnostic=False, seed=42 + ) + with warnings.catch_warnings(): + warnings.simplefilter("ignore", UserWarning) + res = est.fit( + df, outcome="outcome", group="group", time="period", + treatment="treatment", L_max=3, + ) + assert res.path_effects is not None + path_keys = set(res.path_effects.keys()) + # Both negative-baseline paths must appear with full negative + # baseline preserved in the tuple key. 
+ assert (-1, 0, 0, 0) in path_keys, ( + f"Expected (-1, 0, 0, 0) in path keys; got {sorted(path_keys)}" + ) + assert (-1, 1, 1, 1) in path_keys, ( + f"Expected (-1, 1, 1, 1) in path keys; got {sorted(path_keys)}" + ) + def test_path_effects_present_under_non_binary(self): """path_effects populated; tuple keys are non-binary.""" df = _by_path_data_with_non_binary_treatment() @@ -10289,12 +10355,13 @@ def test_per_path_heterogeneity_finite_under_known_signal(self): def test_per_path_heterogeneity_inference_local_invariants(self): """Local SE-derivation invariants for non-survey per-path heterogeneity inference. Post-2026-05-15 df threading: Python - passes ``df = n_obs - n_params`` to ``safe_inference``; R-parity - is pinned in + passes ``df = n_obs - rank(design)`` to ``safe_inference`` + (full-rank designs have ``rank == n_params``); R-parity is + pinned in ``tests/test_chaisemartin_dhaultfoeuille_parity.py:: TestDCDHDynRParityByPathHeterogeneity``. Verifies SE-derivation wiring (``t_stat = beta/se``, symmetric ``conf_int`` around beta, - ``p_value`` in ``[0, 1]``) without back-deriving ``n_params``. + ``p_value`` in ``[0, 1]``) without back-deriving ``rank``. Mirrors ``TestHeterogeneityTesting::test_heterogeneity_inference_local_invariants``. """ @@ -11458,3 +11525,160 @@ def test_summary_renders_placebo_het_rows(self): s = res.summary() assert isinstance(s, str) assert len(s) > 0 + + def test_heterogeneity_df_uses_post_drop_rank(self): + """Heterogeneity inference uses df = n_obs - rank(design). + + Pre-PR (#449) Python used ``df = n_obs - n_params`` AND a + small-sample short-circuit at ``n_obs <= n_params``. For the + boundary case ``n_obs == n_params > rank(design)`` (e.g., + cohort-dummy collinearity at high horizons), R's + ``did_multiplegt_dyn`` / ``lm()`` alias-drops the redundant + column and fits with ``df = n_obs - rank``; pre-PR Python + short-circuited and NaN-filled. 
Post-PR uses ``n_obs <= rank`` + as the small-sample guard AND ``df = n_obs - rank``. + + Test construction: 5 switchers with first_switch_idx in + ``{3, 3, 4, 5, 6}`` (4 unique cohorts), ``X_het = [1, 1, 0, + 0, 0]``. X_het is exactly 1 on the F_g=3 cohort (which is + sorted first and dropped as reference) and 0 on the other 3 + cohorts. The design matrix + ``[intercept, X_het, F=4 dummy, F=5 dummy, F=6 dummy]`` has + 5 columns but ``X_het = intercept - (F=4 + F=5 + F=6)``, so + rank = 4. ``n_obs = 5``, ``n_params = 5``, ``rank = 4``. + Pre-PR: short-circuit fires (``5 <= 5``) → NaN-fill. Post-PR: + ``n_obs > rank`` → fit with ``df = 1``. + + The X_het column itself is identifiable (one of the cohort + dummies gets alias-dropped, not X_het) because pivoted QR + orders columns by norm and ``||X_het|| = sqrt(2)`` exceeds + the cohort dummies' unit norm. + """ + from diff_diff.chaisemartin_dhaultfoeuille import ( + _compute_heterogeneity_test, + ) + from diff_diff.utils import safe_inference + + n_periods = 8 + # 5 switchers, F_g in {3, 3, 4, 5, 6} — 4 unique cohort keys. + # baselines all 0, switch_direction all +1. + first_switch = np.array([3, 3, 4, 5, 6], dtype=int) + n_groups = 5 + baselines = np.zeros(n_groups, dtype=float) + switch_direction = np.array([1.0, 1.0, 1.0, 1.0, 1.0]) + T_g = np.full(n_groups, n_periods - 1, dtype=int) + # X_het = 1 for the F_g=3 cohort (reference), 0 for others. + # This makes X_het exactly equal to + # intercept - (sum of non-reference cohort dummies). 
+ X_het = np.array([1.0, 1.0, 0.0, 0.0, 0.0]) + + rng = np.random.RandomState(202) + Y_mat = rng.normal(0, 1, size=(n_groups, n_periods)) + # Add het signal at post-period so beta != 0 + for g in range(n_groups): + f = first_switch[g] + Y_mat[g, f] += 5.0 + 3.0 * X_het[g] + N_mat = np.ones((n_groups, n_periods)) + + result = _compute_heterogeneity_test( + Y_mat=Y_mat, + N_mat=N_mat, + baselines=baselines, + first_switch_idx=first_switch, + switch_direction=switch_direction, + T_g=T_g, + X_het=X_het, + L_max=1, + ) + assert 1 in result + h = result[1] + # POST-PR: regression fits despite n_obs == n_params (= 5), + # because rank == 4 < n_params. Pre-PR would have short- + # circuited at the `n_obs <= n_params` guard and returned NaN. + assert np.isfinite(h["beta"]), ( + f"beta should be finite under post-drop-rank guard " + f"(n_obs=5, n_params=5, rank=4). Pre-PR would NaN-fill. " + f"Entry: {h}" + ) + assert np.isfinite(h["se"]), f"se non-finite: {h}" + n_obs = int(h["n_obs"]) + assert n_obs == 5, f"expected n_obs=5, got {n_obs}" + # df = n_obs - rank = 5 - 4 = 1. safe_inference at df=1 + # reproduces stored t/p/CI bit-exactly. + expected_t, expected_p, expected_ci = safe_inference( + h["beta"], h["se"], df=1 + ) + np.testing.assert_allclose( + h["t_stat"], expected_t, atol=1e-12, rtol=1e-12, + err_msg="t_stat does not match safe_inference(df=1)", + ) + np.testing.assert_allclose( + h["p_value"], expected_p, atol=1e-12, rtol=1e-12, + err_msg="p_value does not match safe_inference(df=1)", + ) + np.testing.assert_allclose( + h["conf_int"], expected_ci, atol=1e-12, rtol=1e-12, + err_msg="conf_int does not match safe_inference(df=1)", + ) + # safe_inference(df=n_obs - n_params=0) would produce different + # p_value/conf_int. Pin the asymmetry so a regression that + # reverts to pre-drop n_params is caught here. 
+ wrong_t, wrong_p, wrong_ci = safe_inference( + h["beta"], h["se"], df=n_obs - 5 + ) + if np.isfinite(wrong_p): + # At df=0 safe_inference NaN-fills, so wrong_p is not finite + # and this branch is skipped in the current construction. The + # guard is kept so that, if wrong_p ever becomes finite, the + # stored p_value is pinned as NOT equal to the pre-drop result. + assert not np.isclose(h["p_value"], wrong_p, atol=1e-10), ( + "stored p_value matches pre-drop n_params df; " + "rank-threading may have reverted" + ) + + def test_heterogeneity_underidentified_nan_fills(self): + """Genuinely under-identified case (n_obs <= rank) NaN-fills. + + Guards against accidentally removing the small-sample short- + circuit entirely. Construction: 4 switchers, each its own + cohort. Design = [intercept, X_het, 3 cohort dummies] = 5 + columns. With X_het non-collinear, rank = min(4, 5) = 4 = + n_obs. Post-PR's `n_obs <= rank` guard fires (4 <= 4) and + NaN-fills. + """ + from diff_diff.chaisemartin_dhaultfoeuille import ( + _compute_heterogeneity_test, + ) + + n_periods = 8 + first_switch = np.array([3, 4, 5, 6], dtype=int) + n_groups = 4 + baselines = np.zeros(n_groups, dtype=float) + switch_direction = np.array([1.0, 1.0, 1.0, 1.0]) + T_g = np.full(n_groups, n_periods - 1, dtype=int) + # X_het with both 0s and 1s, not collinear with cohort dummies + X_het = np.array([1.0, 0.0, 1.0, 0.0]) + + rng = np.random.RandomState(203) + Y_mat = rng.normal(0, 1, size=(n_groups, n_periods)) + N_mat = np.ones((n_groups, n_periods)) + + result = _compute_heterogeneity_test( + Y_mat=Y_mat, + N_mat=N_mat, + baselines=baselines, + first_switch_idx=first_switch, + switch_direction=switch_direction, + T_g=T_g, + X_het=X_het, + L_max=1, + ) + assert 1 in result + h = result[1] + assert np.isnan(h["beta"]), ( + f"beta should be NaN when n_obs <= rank; got {h}" + ) + assert np.isnan(h["se"]) + assert np.isnan(h["t_stat"]) + assert np.isnan(h["p_value"]) + assert h["n_obs"] == 4 diff --git a/tests/test_chaisemartin_dhaultfoeuille_parity.py 
b/tests/test_chaisemartin_dhaultfoeuille_parity.py index ee8b42b2..604941a6 100644 --- a/tests/test_chaisemartin_dhaultfoeuille_parity.py +++ b/tests/test_chaisemartin_dhaultfoeuille_parity.py @@ -1410,11 +1410,13 @@ def test_parity_multi_path_reversible_predict_het(self, golden_values): f"h={h} n_obs: py={py_h['n_obs']} vs r={r_h['n_obs']}" ) # `p_value` and `conf_int` parity (post-2026-05-15 df threading). - # `_compute_heterogeneity_test` now passes `df = n_obs - n_params` - # to `safe_inference`, matching R's t-distribution with WLS df. - # Pinned at INFERENCE_RTOL = 1e-4 because Wald-test critical - # values come from `scipy.stats.t.ppf` and `t.sf` which are - # implementation-aligned with R's `qt`/`pt` to ~6 sig figs. + # `_compute_heterogeneity_test` now passes + # `df = n_obs - rank(design)` to `safe_inference`, matching + # R's t-distribution with OLS df (full-rank designs have + # `rank == n_params`). Pinned at INFERENCE_RTOL = 1e-4 + # because Wald-test critical values come from + # `scipy.stats.t.ppf` and `t.sf` which are implementation- + # aligned with R's `qt`/`pt` to ~6 sig figs. assert py_h["p_value"] == pytest.approx( r_h["p_value"], rel=self.INFERENCE_RTOL ), f"h={h} p_value: py={py_h['p_value']:.6e} vs r={r_h['p_value']:.6e}"