8 changes: 2 additions & 6 deletions TODO.md
@@ -96,13 +96,12 @@ Deferred items from PR reviews that were not addressed before merge.
| Replicate weight tests use Fay-like BRR perturbations (0.5/1.5), not true half-sample BRR. Add true BRR regressions per estimator family. Existing `test_survey_phase6.py` covers true BRR at the helper level. | `tests/test_replicate_weight_expansion.py` | #253 | Low |
| WooldridgeDiD: QMLE sandwich uses `aweight` cluster-robust adjustment `(G/(G-1))*(n-1)/(n-k)` vs Stata's `G/(G-1)` only. Conservative (inflates SEs). Add `qmle` weight type if Stata golden values confirm material difference. | `wooldridge.py`, `linalg.py` | #216 | Medium |
| WooldridgeDiD: aggregation weights use cell-level n_{g,t} counts. Paper (W2025 Eqs. 7.2-7.4) defines cohort-share weights. Add optional `weights="cohort_share"` parameter to `aggregate()`. | `wooldridge_results.py` | #216 | Medium |
| WooldridgeDiD: canonical link requirement (W2023 Prop 3.1) not enforced — no warning if user applies wrong method to outcome type. Estimator is consistent regardless, but equivalence with imputation breaks. | `wooldridge.py` | #216 | Low |
| WooldridgeDiD: optional *efficiency hint* (NOT a canonical-link violation per W2023 Prop 3.1) when method/outcome pairing is sub-optimal — e.g., `method="ols"` on binary data is consistent under QMLE, but `method="logit"` is typically more efficient. The original framing in this row as a "canonical link requirement" tied to Prop 3.1 was incorrect: Wooldridge (2023) Table 1 lists Gaussian/OLS for "any response" and logistic-Bernoulli for "binary OR fractional". A useful hint exists (efficiency), but should not be framed as a methodology violation. See PR #453 R1 review for the corrected reading. | `wooldridge.py` | #216 | Low |
| WooldridgeDiD: Stata `jwdid` golden value tests — add R/Stata reference script and `TestReferenceValues` class. | `tests/test_wooldridge.py` | #216 | Medium |
| Thread `vcov_type` (classical / hc1 / hc2 / hc2_bm) through the 8 standalone estimators that expose `cluster=`: `CallawaySantAnna`, `SunAbraham`, `ImputationDiD`, `TwoStageDiD`, `TripleDifference`, `StackedDiD`, `WooldridgeDiD`, `EfficientDiD`. Phase 1a added `vcov_type` to the `DifferenceInDifferences` inheritance chain only. | multiple | Phase 1a | Medium |
| Weighted one-way Bell-McCaffrey (`vcov_type="hc2_bm"` + `weights`, no cluster) currently raises `NotImplementedError`. `_compute_bm_dof_from_contrasts` builds its hat matrix from the unscaled design via `X (X'WX)^{-1} X' W`, but `solve_ols` solves the WLS problem by transforming to `X* = sqrt(w) X`, so the correct symmetric idempotent residual-maker is `M* = I - sqrt(W) X (X'WX)^{-1} X' sqrt(W)`. Rederive the Satterthwaite `(tr G)^2 / tr(G^2)` ratio on the transformed design and add weighted parity tests before lifting the guard. | `linalg.py::_compute_bm_dof_from_contrasts`, `linalg.py::_validate_vcov_args` | Phase 1a | Medium |
| HC2 / HC2 + Bell-McCaffrey on absorbed-FE fits currently raises `NotImplementedError` in three places: `TwoWayFixedEffects` unconditionally; `DifferenceInDifferences(absorb=..., vcov_type in {"hc2","hc2_bm"})`; `MultiPeriodDiD(absorb=..., vcov_type in {"hc2","hc2_bm"})`. Within-transformation preserves coefficients and residuals under FWL but not the hat matrix, so the reduced-design `h_ii` is not the diagonal of the full FE projection and CR2's block adjustment `A_g = (I - H_gg)^{-1/2}` is likewise wrong on absorbed cluster blocks. Lifting the guard needs HC2/CR2-BM computed from the full absorbed projection (unit/time FE dummies reconstructed internally, or a FE-aware hat-matrix formulation) and a parity harness against a full-dummy OLS run or R `fixest`/`clubSandwich`. HC1/CR1 are unaffected by this because they have no leverage term. | `twfe.py::fit`, `estimators.py::DifferenceInDifferences.fit`, `estimators.py::MultiPeriodDiD.fit` | Phase 1a | Medium |
| Weighted CR2 Bell-McCaffrey cluster-robust (`vcov_type="hc2_bm"` + `cluster_ids` + `weights`) currently raises `NotImplementedError`. Weighted hat matrix and residual rebalancing need threading per clubSandwich WLS handling. | `linalg.py::_compute_cr2_bm` | Phase 1a | Medium |
| `honest_did.py:1907` `np.linalg.solve(A_sys, b_sys) / except LinAlgError: continue` is a silent basis-rejection in the vertex-enumeration loop that is algorithmically intentional (try the next basis). Consider surfacing a count of rejected bases as a diagnostic when ARP enumeration exhausts, so users see when the vertex search was heavily constrained. Not a silent failure in the sense of the Phase 2 audit (the algorithm is supposed to skip), but the diagnostic would help debug borderline cases. | `honest_did.py` | #334 | Low |
| Unify Rust local-method `estimate_model` solver path to `solve_wls_svd` (the same SVD helper used by the global-method since PR #348) for sub-1e-14 bootstrap SE parity. Current local-method bootstrap parity test (`tests/test_rust_backend.py::TestTROPRustEdgeCaseParity::test_bootstrap_seed_reproducibility_local`) passes at `atol=1e-5` — the residual ~1e-7 gap is roundoff between Rust's `estimate_model` matrix factorization and numpy's `lstsq`, which accumulates differently across per-replicate bootstrap fits. Main-fit ATT parity is regime-dependent (`atol=1e-14` for `lambda_nn=inf`, `atol=1e-10` for finite `lambda_nn` — see `test_local_method_main_fit_parity`); the bootstrap gap is a same-solver-path roundoff concern and not a user-visible correctness bug. | `rust/src/trop.rs::estimate_model`, `rust/src/linalg.rs::solve_wls_svd` | follow-up | Low |
| Rust multiplier-bootstrap weight RNG (`generate_bootstrap_weights_batch` in `rust/src/bootstrap.rs:9-10, 57-75`) uses `Xoshiro256PlusPlus::seed_from_u64(seed + i)` per row for Rademacher/Mammen/Webb generation. If any Python caller (SDID / efficient-DiD multiplier bootstrap) has a numpy-canonical equivalent, the two backends likely diverge under the same seed. Audit Python callers (`diff_diff/sdid.py`, `diff_diff/efficient_did_bootstrap.py`, `diff_diff/bootstrap_utils.py::generate_bootstrap_weights_batch_numpy`) for parity-test gaps. Same fix shape as TROP RNG parity (PR #354): pre-generate weights in Python via numpy and pass them to Rust through PyO3. | `rust/src/bootstrap.rs`, `diff_diff/bootstrap_utils.py` | follow-up | Medium |
| `bias_corrected_local_linear`: extend golden parity to `kernel="triangular"` and `kernel="uniform"` (currently epa-only; all three kernels share `kernel_W` and the `lprobust` math, so parity is expected but not separately asserted). | `benchmarks/R/generate_nprobust_lprobust_golden.R`, `tests/test_bias_corrected_lprobust.py` | Phase 1c | Low |
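The weighted one-way Bell-McCaffrey row above turns on a linear-algebra fact: the projection built from the unscaled design, `X (X'WX)^{-1} X' W`, is idempotent but not symmetric, so its complement is not a valid symmetric idempotent residual-maker for the Satterthwaite trace ratio; the transformed-design form `sqrt(W) X (X'WX)^{-1} X' sqrt(W)` is both. A minimal numpy sketch verifying this (illustrative only — not the `linalg.py` implementation; all variable names are hypothetical):

```python
import numpy as np

rng = np.random.default_rng(0)
n, k = 40, 3
X = rng.normal(size=(n, k))       # design matrix
w = rng.uniform(0.5, 2.0, size=n) # positive analytic weights

W = np.diag(w)
XtWX_inv = np.linalg.inv(X.T @ W @ X)

# Unscaled form: X (X'WX)^{-1} X' W — idempotent but NOT symmetric.
P_raw = X @ XtWX_inv @ X.T @ W

# Transformed form on X* = sqrt(W) X: sqrt(W) X (X'WX)^{-1} X' sqrt(W)
# — symmetric AND idempotent, so M* = I - P* is a valid residual-maker.
sw = np.sqrt(w)
Xs = X * sw[:, None]
P_star = Xs @ XtWX_inv @ Xs.T
M_star = np.eye(n) - P_star

assert not np.allclose(P_raw, P_raw.T)      # asymmetric
assert np.allclose(P_raw @ P_raw, P_raw)    # still idempotent
assert np.allclose(M_star, M_star.T)        # symmetric
assert np.allclose(M_star @ M_star, M_star) # idempotent residual-maker
```

The Satterthwaite `(tr G)^2 / tr(G^2)` ratio in the TODO row would then be derived from contrast-specific quadratic forms in `M_star`, not in the unscaled projection.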
@@ -160,7 +159,6 @@ Deferred items from PR reviews that were not addressed before merge.
| CS R helpers hard-code `xformla = ~ 1`; no covariate-adjusted R benchmark for IRLS path | `tests/test_methodology_callaway.py` | #202 | Low |
| Doc-snippet smoke tests only cover `.rst` files; `.txt` AI guides outside CI validation | `tests/test_doc_snippets.py` | #239 | Low |
| Add CI validation for `docs/doc-deps.yaml` integrity (stale paths, unmapped source files) | `docs/doc-deps.yaml` | #269 | Low |
| HonestDiD `test_m0_short_circuit` uses wall-clock `elapsed < 0.5s` as a proxy for "short-circuit path taken" instead of calling the full optimizer. Replace with a direct correctness signal (mock/spy the optimizer or check a state flag) so the test doesn't depend on CI timing. Not flaky today at 500ms, but load-bearing correctness on a timing proxy is brittle. | `tests/test_methodology_honest_did.py:246` | — | Low |
| SyntheticDiD: rename internal `placebo_effects` variable to `variance_effects` (or `resampled_effects`). Misleading name across the placebo/bootstrap/jackknife dispatch paths — holds three different contents depending on variance method. Low-risk refactor; user-facing field rename should preserve `placebo_effects` as a deprecated alias for one release. | `synthetic_did.py`, `results.py` | follow-up | Medium |
| AI review CI: pin workflow contract via test (uses `openai/codex-action@v1`, passes `prompt-file`, reads `steps.run_codex.outputs.final-message`, preserves diff-exclude paths and comment markers). Currently only the wrapper-tag and closing-tag-escape strings are asserted. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #416 | Low |
| `TestWorkflowDoesNotExecutePRHeadCode` (CodeQL #14 dismissal guard) does not model: `bash <script>` / `sh <script>` / `./<script>` / `source <script>` / `. <script>` direct shell-script execution; multi-line `python3 -c` bodies (line-by-line shlex can't reassemble across newlines — the workflow's 5 sanitizer bodies are exempt by invisibility); shell-variable-expansion indirection (`SCRIPT="$X"; python3 "$SCRIPT"`); `eval`; `find -exec`; `xargs -I {}`. Each represents a path by which PR-head bytes COULD execute without the test failing. The guard catches accidental regressions of common forms (16 tests covering pip/npm/cargo/maturin/etc. installs, python file exec, bash -c indirection with compound flags, env-var prefixes, line continuations, subshells/brace groups, single-line python -c, write-overwrites of allowlisted /tmp paths). Closing the residuals would require multi-line shell parsing with command-substitution awareness + script-execution allowlists — significant work for diminishing return given the dismissal's primary defense is the documented threat model on the alert and in `.github/workflows/ai_pr_review.yml` comment block. | `tests/test_openai_review.py`, `.github/workflows/ai_pr_review.yml` | #436 | Low |
@@ -174,12 +172,10 @@ Ordered paydown view across the tables above. Tier A → D is by effort × risk,

#### Tier A — Quick wins (≤1 day, ≤3 CI rounds expected)

- HonestDiD `test_m0_short_circuit`: replace wall-clock `elapsed < 0.5s` proxy with a state flag (`tests/test_methodology_honest_did.py:246`)
- EfficientDiD `control_group="last_cohort"` REGISTRY-vs-code alignment with `anticipation>0` (`efficient_did.py`, one design decision)
- TripleDifference: add `generate_ddd_panel_data` for panel DDD power analysis (`prep_dgp.py`, `power.py`)
- TROP: extract shared data-setup helper between `fit()` and `_fit_global()` (~150 LoC dedup; `trop.py`, `trop_global.py`, `trop_local.py`)
- WooldridgeDiD: emit warning when canonical-link is violated (`wooldridge.py`; W2023 Prop 3.1)
- HonestDiD vertex-rejection diagnostic counter on ARP enumeration exhaust (`honest_did.py:1907`)
- WooldridgeDiD: optional efficiency hint when method/outcome pairing is sub-optimal (NOT a canonical-link violation per W2023 Prop 3.1 — see Methodology/Correctness row for the corrected framing)

(SyntheticDiD `placebo_effects` → `variance_effects` rename moved to Tier B — the user-facing field rename + one-release deprecation alias is too large for ≤1 day / ≤3 CI rounds.)

37 changes: 36 additions & 1 deletion diff_diff/honest_did.py
@@ -18,6 +18,7 @@
"""

from dataclasses import dataclass, field
import warnings
from typing import Any, Dict, List, Literal, Optional, Tuple, Union

import numpy as np
@@ -1890,11 +1891,15 @@ def _enumerate_vertices(
if n_eq > n_moments:
return []

vertices = []
vertices: List[np.ndarray] = []
n_total = 0
n_linalg_error = 0
n_infeasible = 0

# Each vertex has exactly n_eq non-zero (basic) variables
for basis_idx in itertools.combinations(range(n_moments), n_eq):
basis_idx = list(basis_idx)
n_total += 1

# Build the system for basic variables
# gamma[basis_idx]' @ X_tilde[basis_idx, :] = 0
@@ -1913,13 +1918,43 @@
try:
gamma_basic = np.linalg.solve(A_sys, b_sys)
except np.linalg.LinAlgError:
n_linalg_error += 1
continue

# Check feasibility: gamma >= 0
if np.all(gamma_basic >= -1e-10):
gamma = np.zeros(n_moments)
gamma[basis_idx] = np.maximum(gamma_basic, 0)
vertices.append(gamma)
else:
n_infeasible += 1

# Diagnostic warnings — surface vertex-search pathologies that would
# otherwise hide behind `_compute_arp_test` returning False
# (conservative non-rejection).
if n_total > 0 and len(vertices) == 0:
warnings.warn(
f"ARP vertex enumeration exhausted without feasible vertices: "
f"tried {n_total} bases, {n_linalg_error} rejected for "
f"LinAlgError, {n_infeasible} infeasible (negative basic "
f"variables). The caller (_compute_arp_test) will return False "
f"(conservative non-rejection). This may indicate near-singular "
f"nuisance constraints (X_tilde) or a degenerate "
f"moment-inequality system.",
RuntimeWarning,
stacklevel=3,
)
elif n_total > 0 and n_linalg_error / n_total >= 0.5:
warnings.warn(
f"ARP vertex enumeration heavily constrained: "
f"{n_linalg_error} of {n_total} bases ({100 * n_linalg_error / n_total:.0f}%) "
f"rejected for LinAlgError, {n_infeasible} infeasible. "
f"{len(vertices)} feasible vertex(es) recovered. Results may be "
f"numerically fragile; consider regularizing the moment-inequality "
f"system or reviewing the nuisance constraints (X_tilde).",
RuntimeWarning,
stacklevel=3,
)

return vertices

107 changes: 86 additions & 21 deletions tests/test_methodology_honest_did.py
@@ -5,7 +5,8 @@
equations, known analytical cases, and expected mathematical properties.
"""

import os

import warnings

import numpy as np
import pytest
@@ -245,34 +246,37 @@ def test_optimal_flci_is_finite_and_valid(self):
assert ci_lb_opt <= lb, "CI lower should be <= identified set lower"
assert ci_ub_opt >= ub, "CI upper should be >= identified set upper"

@pytest.mark.skipif(
os.environ.get("CI") == "true",
reason="wall-clock timing is flaky on shared CI runners; short-circuit "
"correctness signal will be replaced with a mock/spy per TODO.md "
"(see PR #330 follow-up note)",
)
def test_m0_short_circuit(self):
"""M=0 should use standard CI without optimization.

Uses wall-clock elapsed time as a proxy for "short-circuit path
taken" — fast path is ``<0.5s``, slow optimization would be ``>>
0.5s``. Skipped on CI because neighbor-VM contention on shared
runners can push even the short-circuit path past the threshold.
Run locally to validate the fast-path invariant; the TODO.md entry
added by PR #330 tracks replacing this with a mock/spy so the
correctness signal becomes CI-safe.
"""M=0 takes the bias=0 fast path and never invokes the LP solver.

``_compute_worst_case_bias`` returns ``0.0`` immediately when ``M=0``
(diff_diff/honest_did.py:1650), so ``scipy.optimize.linprog`` is
never reached. Patching the LP solver and asserting ``call_count
== 0`` is a direct correctness signal — CI-safe (no wall-clock
dependency) and faster than the prior timing-based proxy.
"""
from unittest.mock import patch

beta_pre = np.array([0.3, 0.2, 0.1])
beta_post = np.array([2.0])
sigma = np.eye(4) * 0.01
l_vec = np.array([1.0])

import time
t0 = time.time()
_compute_optimal_flci(beta_pre, beta_post, sigma, l_vec, 3, 1, M=0.0)
elapsed = time.time() - t0
with patch("diff_diff.honest_did.optimize.linprog") as mock_linprog:
ci_lb, ci_ub = _compute_optimal_flci(
beta_pre, beta_post, sigma, l_vec, 3, 1, M=0.0
)

assert elapsed < 0.5, f"M=0 should be fast, took {elapsed:.2f}s"
assert mock_linprog.call_count == 0, (
f"M=0 must skip the LP solver (fast path at "
f"_compute_worst_case_bias:1650); got "
f"{mock_linprog.call_count} linprog call(s)."
)
# End-to-end correctness: M=0 CI is still well-defined.
assert np.isfinite(ci_lb) and np.isfinite(ci_ub), (
f"M=0 CI must be finite; got [{ci_lb}, {ci_ub}]"
)
assert ci_lb <= ci_ub, f"M=0 CI must be ordered; got [{ci_lb}, {ci_ub}]"

def test_smoothness_flci_with_survey_df(self):
"""Survey df should widen the smoothness FLCI (folded t vs folded normal)."""
@@ -493,3 +497,64 @@ def test_breakdown_monotonicity(self):
# The optimal FLCI is efficient, so need large M for a weak effect.
r_large = honest.fit(results, M=20.0)
assert r_large.ci_lb <= 0 <= r_large.ci_ub, "Should lose significance at large M"


class TestARPVertexEnumeration:
"""Diagnostic warnings on `_enumerate_vertices` vertex-search pathologies."""

def test_enumerate_vertices_warns_on_exhausted_search(self):
"""All-LinAlgError path: fully-zero nuisance column makes A_sys
singular on every basis, so the enumeration exhausts without
feasible vertices and the user should see a RuntimeWarning rather
than a silent empty-list return."""
from diff_diff.honest_did import _enumerate_vertices

# 4 moments, 1 nuisance column (all zeros) → A_sys singular on every basis
X_tilde = np.zeros((4, 1))
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 1.0])
with pytest.warns(RuntimeWarning, match="exhausted"):
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=4)
assert vertices == []

def test_enumerate_vertices_warns_on_heavy_rejection(self):
"""Mixed-basis path: 5 moments, 1 nuisance column. C(5, 2) = 10
bases. By design, 6 bases hit LinAlgError (the singular pairs
among indices {0,1,2,3} that share aligned nuisance/sigma values)
and 4 bases produce feasible vertices (the (i, 4) pairs that pair
a positive-X_tilde row with the unique negative-X_tilde row at
index 4). 60% rejection rate trips the `heavily constrained`
branch specifically, not the exhausted branch."""
from diff_diff.honest_did import _enumerate_vertices

X_tilde = np.array([[1.0], [1.0], [1.0], [2.0], [-1.0]])
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 2.0, 1.0])
with pytest.warns(RuntimeWarning, match="heavily constrained"):
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=5)
assert len(vertices) >= 1, (
f"Heavy-rejection construction must still produce some feasible "
f"vertices (otherwise the exhausted branch fires); got "
f"{len(vertices)} vertices."
)

def test_enumerate_vertices_quiet_on_healthy_enumeration(self):
"""Well-conditioned X_tilde: most bases solve cleanly and feasible
vertices are recovered. No RuntimeWarning should fire."""
from diff_diff.honest_did import _enumerate_vertices

rng = np.random.default_rng(0)
# 4 moments, 1 nuisance — small and well-conditioned
X_tilde = rng.normal(size=(4, 1))
sigma_tilde_diag = np.array([1.0, 1.0, 1.0, 1.0])
with warnings.catch_warnings(record=True) as caught:
warnings.simplefilter("always", RuntimeWarning)
vertices = _enumerate_vertices(X_tilde, sigma_tilde_diag, n_moments=4)
diag_warnings = [
w for w in caught
if "exhausted" in str(w.message) or "heavily constrained" in str(w.message)
]
assert not diag_warnings, (
f"Healthy enumeration must not emit ARP diagnostics; got "
f"{[str(w.message) for w in diag_warnings]}"
)
# Sanity: we expect some feasible vertices on a well-conditioned input
assert isinstance(vertices, list)