Fix MDN cache key collision from sum-of-hashes#179

Merged
MaxGhenis merged 1 commit into main from fix/mdn-cache-key on Apr 17, 2026


Conversation

@MaxGhenis
Contributor

Summary

Fixes finding #5. _generate_data_hash used pd.util.hash_pandas_object(X).sum() to key the disk cache. Summing per-row hashes is (1) commutative, so any row permutation hashes identically, and (2) weak enough that cross-dataset collisions are trivial to construct. A collision would load a stale MDN from disk that was trained on a different dataset, a silent correctness bug.
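The order-invariance is easy to demonstrate in isolation (the frame below is illustrative, not from the codebase):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
shuffled = df.iloc[[2, 0, 1]]  # rows permuted; index labels travel with rows

# Per-row hashes are the same multiset in a different order, so the
# commutative sum collides even though row order differs.
assert (pd.util.hash_pandas_object(df).sum()
        == pd.util.hash_pandas_object(shuffled).sum())
```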

Change

Replace sum-of-hashes with SHA-256 over the raw bytes of pd.util.hash_pandas_object(X, index=True).values — an order-sensitive content digest.
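A minimal standalone sketch of this scheme (the real helper lives in the MDN module; the name and signature here are illustrative):

```python
import hashlib

import pandas as pd


def data_digest(X: pd.DataFrame) -> str:
    """Order-sensitive content digest of a DataFrame (illustrative sketch)."""
    # One uint64 hash per row; the array preserves row order.
    row_hashes = pd.util.hash_pandas_object(X, index=True).values
    # SHA-256 over the raw bytes reacts to any change in content or order.
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()
```

Because the bytes are fed to SHA-256 in row order, permuting rows changes the digest even when the multiset of per-row hashes is unchanged.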

Test plan

New tests/test_models/test_mdn_cache_key.py covers:

  • identical content -> identical digest
  • mutated value -> different digest
  • row permutation -> different digest (regression test for #5)
  • 50 random datasets -> 50 distinct digests
  • _generate_cache_key integrates the new hash

Tests are gated with pytest.importorskip so they run only where torch and pytorch_tabular are installed.
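The gating pattern looks roughly like this at the top of the test module (the torch calls are shown as comments since those dependencies are optional; the demonstration uses a module that always exists):

```python
import pytest

# importorskip returns the imported module when the import succeeds,
# and skips the calling test module otherwise.
mathmod = pytest.importorskip("math")

# In the PR's test file the same call guards the heavy dependencies:
#   torch = pytest.importorskip("torch")
#   pytorch_tabular = pytest.importorskip("pytorch_tabular")
```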

_generate_data_hash previously computed pd.util.hash_pandas_object(X).sum()
and hash_pandas_object(y).sum() to key the disk cache, which (1) loses
row ordering because sum is commutative — any row permutation hashes
identically — and (2) makes cross-dataset collisions trivial to
construct (matching shape/columns and a matching hash sum is enough).
Consequence: a cache lookup could silently load a stale MDN trained on
a different dataset of the same shape (silent correctness bug).

Replace the sum-of-hashes with an order-sensitive SHA-256 over the raw
bytes of pd.util.hash_pandas_object(X, index=True).values. The final
truncation to 16 hex chars (64 bits) is collision-resistant for any
realistic cache size.
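That claim can be sanity-checked with a birthday-bound estimate (illustrative arithmetic, not code from the PR):

```python
import math


def collision_prob(n: int, bits: int = 64) -> float:
    """Approximate probability of any collision among n uniform b-bit digests."""
    # Standard birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)).
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2**bits))


# Even a million cached datasets collide with probability about 2.7e-8.
print(collision_prob(1_000_000))
```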

Tests (in new tests/test_models/test_mdn_cache_key.py):
- same content -> same hash
- value mutation -> different hash
- row permutation -> different hash (regression for #5)
- distinct datasets with same shape -> different hash
- 50 random datasets -> 50 distinct hashes
- _generate_cache_key integrates the data hash
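The distinctness test can be sketched as follows (digest_of is a hypothetical standalone version of the fixed scheme, not the module's actual helper):

```python
import hashlib

import numpy as np
import pandas as pd


def digest_of(X: pd.DataFrame) -> str:
    # Hypothetical standalone version of the fixed hashing scheme,
    # truncated to the 16-char fingerprint described above.
    row_hashes = pd.util.hash_pandas_object(X, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]


rng = np.random.default_rng(0)
digests = {digest_of(pd.DataFrame(rng.normal(size=(20, 3)))) for _ in range(50)}
assert len(digests) == 50  # all 50 random datasets hash distinctly
```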

Tests gated with pytest.importorskip so they only run when torch /
pytorch_tabular are installed (mdn.py's top-level torch import is not
optional).
@vercel

vercel bot commented Apr 17, 2026

The latest updates on your projects.

Project: microimpute-dashboard · Deployment: Ready · Updated (UTC): Apr 17, 2026 0:43am

Contributor Author

@MaxGhenis MaxGhenis left a comment


MDN cache key correctness verified:

  • Replaced pd.util.hash_pandas_object(X).sum() (commutative sum-of-hashes, order-invariant, collision-prone) with hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values.tobytes()) — order-sensitive content hash.
  • Truncation only in the final 16-char fingerprint (64 bits, collision-resistant for any realistic cache size).

Test coverage is thorough: stability (same data → same hash), value-change detection, row permutation → different hash (the exact sum-of-hashes bug), 50 random datasets → 50 distinct hashes, and _generate_cache_key integration of the data hash. Tests are gated with pytest.importorskip for torch and pytorch_tabular so non-torch environments skip cleanly.

CI all green. Mergeable. LGTM.

@MaxGhenis MaxGhenis merged commit 5668e9f into main Apr 17, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the fix/mdn-cache-key branch April 17, 2026 16:11
