Fix MDN cache key collision from sum-of-hashes#179
Merged
Conversation
_generate_data_hash previously computed pd.util.hash_pandas_object(X).sum() and hash_pandas_object(y).sum() to key the disk cache, which (1) loses row ordering because sum is commutative — any row permutation hashes identically — and (2) makes cross-dataset collisions trivial to construct (matching shape/columns and a matching hash sum is enough). Consequence: a cache lookup could silently load a stale MDN trained on a different dataset of the same shape (silent correctness bug). Replace the sum-of-hashes with an order-sensitive SHA-256 over the raw bytes of pd.util.hash_pandas_object(X, index=True).values. The final truncation to 16 hex chars (64 bits) is collision-resistant for any realistic cache size. Tests (in new tests/test_models/test_mdn_cache_key.py): - same content -> same hash - value mutation -> different hash - row permutation -> different hash (regression for #5) - distinct datasets with same shape -> different hash - 50 random datasets -> 50 distinct hashes - _generate_cache_key integrates the data hash Tests gated with pytest.importorskip so they only run when torch / pytorch_tabular are installed (mdn.py's top-level torch import is not optional).
|
The latest updates on your projects. Learn more about Vercel for GitHub.
|
MaxGhenis
commented
Apr 17, 2026
Contributor
Author
MaxGhenis
left a comment
There was a problem hiding this comment.
MDN cache key correctness verified:
- Replaced
pd.util.hash_pandas_object(X).sum()(commutative sum-of-hashes, order-invariant, collision-prone) withhashlib.sha256(pd.util.hash_pandas_object(X, index=True).values.tobytes())— order-sensitive content hash. - Truncation only in the final 16-char fingerprint (64 bits, collision-resistant for any realistic cache size).
Test coverage is thorough: stability (same data → same hash), value-change detection, row-permutation → different hash (the exact sum-of-hashes bug), 50 random datasets → 50 distinct hashes, and _generate_cache_key integrates the data hash. Tests gated on pytest.importorskip('torch', 'pytorch_tabular') so non-torch environments skip cleanly.
CI all green. Mergeable. LGTM.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
Fixes finding #5.
_generate_data_hashusedpd.util.hash_pandas_object(X).sum()to key the disk cache. Summing per-row hashes (1) is commutative, so any row permutation hashes identically, and (2) makes cross-dataset collisions trivial. A collision would silently load a stale MDN from disk for a different dataset — a silent correctness bug.Change
Replace sum-of-hashes with SHA-256 over the raw bytes of
pd.util.hash_pandas_object(X, index=True).values— an order-sensitive content digest.Test plan
New
tests/test_models/test_mdn_cache_key.pycovers:_generate_cache_keyintegrates the new hashTests gated with
pytest.importorskipso they run only where torch/pytorch_tabular are installed.