Fix MDN cache key collision from sum-of-hashes#179

Merged
MaxGhenis merged 1 commit into main from fix/mdn-cache-key on Apr 17, 2026


Conversation

@MaxGhenis
Contributor

Summary

Fixes finding #5. _generate_data_hash used pd.util.hash_pandas_object(X).sum() to key the disk cache. Summing per-row hashes is (1) commutative, so any row permutation hashes identically, and (2) weak enough that cross-dataset collisions are trivial to construct. A collision would load a stale MDN from disk that was trained on a different dataset, a silent correctness bug.
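The order-invariance is easy to demonstrate in isolation (the frame below is illustrative, not from the codebase):

```python
import pandas as pd

df = pd.DataFrame({"a": [1, 2, 3], "b": [4.0, 5.0, 6.0]})
shuffled = df.iloc[[2, 0, 1]]  # rows permuted; index labels travel with rows

# Per-row hashes are the same multiset in a different order, so the
# commutative sum collides even though row order differs.
assert (pd.util.hash_pandas_object(df).sum()
        == pd.util.hash_pandas_object(shuffled).sum())
```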

Change

Replace sum-of-hashes with SHA-256 over the raw bytes of pd.util.hash_pandas_object(X, index=True).values — an order-sensitive content digest.
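A minimal standalone sketch of this scheme (the real helper lives in the MDN module; the name and signature here are illustrative):

```python
import hashlib

import pandas as pd


def data_digest(X: pd.DataFrame) -> str:
    """Order-sensitive content digest of a DataFrame (illustrative sketch)."""
    # One uint64 hash per row; the array preserves row order.
    row_hashes = pd.util.hash_pandas_object(X, index=True).values
    # SHA-256 over the raw bytes reacts to any change in content or order.
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()
```

Because the bytes are fed to SHA-256 in row order, permuting rows changes the digest even when the multiset of per-row hashes is unchanged.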

Test plan

New tests/test_models/test_mdn_cache_key.py covers:

  • identical content -> identical digest
  • mutated value -> different digest
  • row permutation -> different digest (regression test for #5)
  • 50 random datasets -> 50 distinct digests
  • _generate_cache_key integrates the new hash

Tests are gated with pytest.importorskip so they run only where torch and pytorch_tabular are installed.
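The gating pattern looks roughly like this at the top of the test module (the torch calls are shown as comments since those dependencies are optional; the demonstration uses a module that always exists):

```python
import pytest

# importorskip returns the imported module when the import succeeds,
# and skips the calling test module otherwise.
mathmod = pytest.importorskip("math")

# In the PR's test file the same call guards the heavy dependencies:
#   torch = pytest.importorskip("torch")
#   pytorch_tabular = pytest.importorskip("pytorch_tabular")
```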

_generate_data_hash previously computed pd.util.hash_pandas_object(X).sum()
and hash_pandas_object(y).sum() to key the disk cache, which (1) loses
row ordering because sum is commutative — any row permutation hashes
identically — and (2) makes cross-dataset collisions trivial to
construct (matching shape/columns and a matching hash sum is enough).
Consequence: a cache lookup could silently load a stale MDN trained on
a different dataset of the same shape (silent correctness bug).

Replace the sum-of-hashes with an order-sensitive SHA-256 over the raw
bytes of pd.util.hash_pandas_object(X, index=True).values. The final
truncation to 16 hex chars (64 bits) is collision-resistant for any
realistic cache size.
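That claim can be sanity-checked with a birthday-bound estimate (illustrative arithmetic, not code from the PR):

```python
import math


def collision_prob(n: int, bits: int = 64) -> float:
    """Approximate probability of any collision among n uniform b-bit digests."""
    # Standard birthday approximation: 1 - exp(-n(n-1) / 2^(bits+1)).
    return 1.0 - math.exp(-n * (n - 1) / (2.0 * 2**bits))


# Even a million cached datasets collide with probability about 2.7e-8.
print(collision_prob(1_000_000))
```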

Tests (in new tests/test_models/test_mdn_cache_key.py):
- same content -> same hash
- value mutation -> different hash
- row permutation -> different hash (regression for #5)
- distinct datasets with same shape -> different hash
- 50 random datasets -> 50 distinct hashes
- _generate_cache_key integrates the data hash
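The distinctness test can be sketched as follows (digest_of is a hypothetical standalone version of the fixed scheme, not the module's actual helper):

```python
import hashlib

import numpy as np
import pandas as pd


def digest_of(X: pd.DataFrame) -> str:
    # Hypothetical standalone version of the fixed hashing scheme,
    # truncated to the 16-char fingerprint described above.
    row_hashes = pd.util.hash_pandas_object(X, index=True).values
    return hashlib.sha256(row_hashes.tobytes()).hexdigest()[:16]


rng = np.random.default_rng(0)
digests = {digest_of(pd.DataFrame(rng.normal(size=(20, 3)))) for _ in range(50)}
assert len(digests) == 50  # all 50 random datasets hash distinctly
```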

Tests gated with pytest.importorskip so they only run when torch /
pytorch_tabular are installed (mdn.py's top-level torch import is not
optional).
@vercel

vercel bot commented Apr 17, 2026

The latest updates on your projects.

Project: microimpute-dashboard · Deployment: Ready · Updated (UTC): Apr 17, 2026 0:43am

Contributor Author

@MaxGhenis MaxGhenis left a comment


MDN cache key correctness verified:

  • Replaced pd.util.hash_pandas_object(X).sum() (commutative sum-of-hashes, order-invariant, collision-prone) with hashlib.sha256(pd.util.hash_pandas_object(X, index=True).values.tobytes()) — order-sensitive content hash.
  • Truncation only in the final 16-char fingerprint (64 bits, collision-resistant for any realistic cache size).

Test coverage is thorough: stability (same data → same hash), value-change detection, row permutation → different hash (the exact sum-of-hashes bug), 50 random datasets → 50 distinct hashes, and _generate_cache_key integration of the data hash. Tests are gated with pytest.importorskip for torch and pytorch_tabular so non-torch environments skip cleanly.

CI all green. Mergeable. LGTM.

@MaxGhenis MaxGhenis merged commit 5668e9f into main Apr 17, 2026
7 checks passed
@MaxGhenis MaxGhenis deleted the fix/mdn-cache-key branch April 17, 2026 16:11
