feat: behavior hashing for cache key invalidation#196

Open
ptomecek wants to merge 5 commits into main from pit/behavior-hashing
Conversation

@ptomecek
Collaborator

Summary

Adds compute_behavior_token() — a deterministic SHA-256 fingerprint of a class's method bytecode. When callable logic changes, cache keys automatically invalidate without requiring config changes.

This is PR 1 of 3 splitting the tokenization work from #195:

  1. This PR: Behavior hashing (standalone, additive)
  2. Replace dask tokenization with native implementation
  3. Model-level token caching and optimizations

What's included

compute_behavior_token(cls)

  • Hashes co_code + co_consts (minus docstrings) for each method
  • Walks MRO with override semantics (subclass overrides parent)
  • Automatically unwraps decorator chains (@Flow.call, functools.wraps, etc.) via inspect.unwrap
  • Supports __ccflow_tokenizer_deps__ for declaring extra standalone function dependencies
  • Dependencies sorted by qualname (order-insensitive)
  • Result cached per-class in __ccflow_tokenizer_cache__ (renamed from __behavior_token_cache__ in a later commit of this PR; not inherited by subclasses)
  • Returns None for classes with no hashable methods

cache_key() integration

  • Includes behavior token for the underlying model class
  • Includes behavior tokens for non-transparent evaluators in the chain
  • Only included when not None (backward-compatible — no change for classes without methods)
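
The backward-compatibility point above can be illustrated with a small sketch. The `combine_tokens` helper and its payload layout are assumptions made for illustration; in particular, returning the data token unchanged when no behavior token exists is one way to get "no change for classes without methods", not necessarily what `cache_key()` does.

```python
import hashlib
import json
from typing import Optional, Sequence


def combine_tokens(data_token: str, behavior_tokens: Sequence[Optional[str]]) -> str:
    """Hypothetical combiner: fold in behavior tokens only when present."""
    present = [t for t in behavior_tokens if t is not None]
    if not present:
        # No behavior information: leave the key exactly as it was before
        # behavior hashing existed, so previously stored entries still match.
        return data_token
    payload = json.dumps({"data": data_token, "behavior": present}, sort_keys=True)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


old_style_key = combine_tokens("abc123", [None])       # unchanged data token
new_style_key = combine_tokens("abc123", ["deadbeef"])  # depends on the code hash
```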

What's NOT included (future PRs)

  • No changes to dask dependency or normalize_token
  • No model_token property on BaseModel
  • No data tokenization changes

Tests

26 new tests covering:

  • Core behavior token computation
  • Method collection and sorting
  • __ccflow_tokenizer_deps__ ordering and changes
  • @Flow.call decorator unwrapping
  • MRO/inherited method handling
  • Subclass cache independence
  • cache_key() integration with CallableModel

All 672 existing tests pass (2 skipped).

@github-actions
Contributor

github-actions Bot commented Apr 17, 2026

Test Results

690 tests  +40   688 ✅ +40   1m 43s ⏱️ ±0s
  1 suites ± 0     2 💤 ± 0 
  1 files   ± 0     0 ❌ ± 0 

Results for commit 66059da. ± Comparison against base commit c291194.

@codecov

codecov Bot commented Apr 17, 2026

Codecov Report

❌ Patch coverage is 81.09091% with 104 lines in your changes missing coverage. Please review.
✅ Project coverage is 95.20%. Comparing base (a16f19b) to head (66059da).
⚠️ Report is 4 commits behind head on main.

Files with missing lines              Patch %   Lines
ccflow/tests/utils/test_tokenize.py   80.19%    80 Missing ⚠️
ccflow/utils/tokenize.py              81.81%    14 Missing and 10 partials ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main     #196      +/-   ##
==========================================
- Coverage   95.98%   95.20%   -0.79%     
==========================================
  Files         140      141       +1     
  Lines        9797    10362     +565     
  Branches      568      601      +33     
==========================================
+ Hits         9404     9865     +461     
- Misses        275      369      +94     
- Partials      118      128      +10     

@ptomecek ptomecek force-pushed the pit/behavior-hashing branch from fa37bfb to c9fe30e Compare April 23, 2026 18:48
ptomecek and others added 4 commits April 23, 2026 17:47
Add compute_behavior_token() which produces a SHA-256 fingerprint of a
class's method bytecode. Decorator chains (@Flow.call, etc.) are
automatically unwrapped via inspect.unwrap so the hash reflects the
user's implementation, not the wrapper.

Key design:
- Walks MRO with override semantics (subclass overrides parent)
- Supports __ccflow_tokenizer_deps__ for extra standalone functions
- Dependencies sorted by qualname (order-insensitive)
- Cached per-class in __behavior_token_cache__ (not inherited)
- Returns None for classes with no hashable methods

Integration: cache_key() now includes behavior tokens for the model
and any non-transparent evaluators, so code changes invalidate the
cache without requiring a config change.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
Add compute_data_token() as the single wrapper around dask tokenization
and refactor cache_key() to combine precomputed data and behavior tokens
instead of mutating one nested payload dict.

This makes cache_key() mostly orchestration:
- flatten the evaluation context chain
- collect data/behavior tokens for the underlying model
- collect data/behavior tokens for non-transparent evaluators
- combine those tokens into one final cache key
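
The four orchestration steps above can be sketched as follows. The `Node` dataclass is a hypothetical stand-in for whatever ccflow uses to represent a model wrapped by evaluators; its fields, the `flatten` helper, and the way tokens are joined are assumptions for illustration only.

```python
import hashlib
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class Node:
    """Hypothetical chain link: an evaluator wrapping `inner`, or the model itself."""
    name: str
    data_token: str
    behavior_token: Optional[str] = None
    transparent: bool = False       # only meaningful for evaluators
    inner: Optional["Node"] = None  # None marks the underlying model


def flatten(outer: Node) -> Tuple[Node, List[Node]]:
    """Walk inward, returning the underlying model and the evaluators around it."""
    evaluators: List[Node] = []
    node = outer
    while node.inner is not None:
        evaluators.append(node)
        node = node.inner
    return node, evaluators


def sketch_cache_key(outer: Node) -> str:
    model, evaluators = flatten(outer)
    parts = [model.data_token, model.behavior_token]
    for ev in evaluators:
        if ev.transparent:
            continue  # transparent evaluators do not affect the result
        parts += [ev.data_token, ev.behavior_token]
    h = hashlib.sha256()
    for part in parts:
        if part is not None:
            h.update(part.encode("utf-8"))
            h.update(b"\0")  # separator so adjacent tokens cannot run together
    return h.hexdigest()


model = Node("model", data_token="d-model", behavior_token="b-model")
logged = Node("logger", "d-log", "b-log", transparent=True, inner=model)
retried = Node("retry", "d-retry", "b-retry", transparent=False, inner=model)
```

In this sketch, wrapping the model in a transparent evaluator leaves the key untouched, while a non-transparent one contributes both of its tokens.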

Also adds tests for compute_data_token() and opaque evaluator behavior.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
Update behavior hashing so function defaults, keyword-only defaults,
and closure cell contents contribute to compute_behavior_token(). This
closes a cache-key correctness gap where semantic changes could leave
behavior tokens unchanged.
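
To see why bytecode alone leaves this gap, the sketch below hashes the extra function state separately. `hash_function_state` is a hypothetical helper, not the PR's code, and it relies on `repr()` of defaults and closure contents, which is only stable for values whose `repr` does not embed a memory address.

```python
import hashlib
from types import FunctionType


def hash_function_state(func: FunctionType) -> str:
    """Hash bytecode plus defaults, keyword-only defaults, and closure cells."""
    h = hashlib.sha256()
    h.update(func.__code__.co_code)
    h.update(repr(func.__defaults__).encode("utf-8"))
    kwdefaults = func.__kwdefaults__ or {}
    h.update(repr(sorted(kwdefaults.items())).encode("utf-8"))
    for cell in func.__closure__ or ():
        try:
            h.update(repr(cell.cell_contents).encode("utf-8"))
        except ValueError:
            h.update(b"<empty cell>")  # cell exists but was never filled
    return h.hexdigest()


def add_small(x, n=1):
    return x + n


def add_large(x, n=2):  # same bytecode as add_small, different default
    return x + n


def make_scaler(factor):
    def scale(x, *, offset=0):
        return x * factor + offset  # `factor` lives in a closure cell
    return scale


same_bytecode = add_small.__code__.co_code == add_large.__code__.co_code
```

Here `add_small` and `add_large` compile to identical bytecode, so only the default value distinguishes them; likewise `make_scaler(2)` and `make_scaler(3)` differ only in a closure cell.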

Also merge __ccflow_tokenizer_deps__ across the full MRO instead of
first-definition-wins, with deterministic deduping so subclasses can add
deps without dropping inherited ones.
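
A rough sketch of merging across the MRO with deduping. The attribute name `__ccflow_tokenizer_deps__` comes from this PR; that it holds a list or tuple of functions, the `collect_deps` helper, and the sort key are assumptions for illustration.

```python
from typing import Callable, List


def collect_deps(cls: type) -> List[Callable]:
    """Gather deps declared anywhere in the MRO, deduplicated and sorted."""
    seen = set()
    deps: List[Callable] = []
    for klass in cls.__mro__:
        # vars() reads only what this class declares, not what it inherits,
        # so each declaration is visited exactly once.
        for dep in vars(klass).get("__ccflow_tokenizer_deps__", ()):
            if id(dep) not in seen:
                seen.add(id(dep))
                deps.append(dep)
    # Sorting makes the result independent of declaration order.
    return sorted(deps, key=lambda d: (d.__module__, d.__qualname__))


def helper_a():
    return "a"


def helper_b():
    return "b"


class Base:
    __ccflow_tokenizer_deps__ = [helper_b]


class Child(Base):
    # Adds helper_a; helper_b is still picked up from Base.
    __ccflow_tokenizer_deps__ = [helper_a]


child_deps = collect_deps(Child)
```

With first-definition-wins, `Child` would see only `helper_a`; merging keeps `helper_b` from `Base` as well.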

Add regression tests for defaults, kwdefaults, closures, inherited deps,
and a cache_key integration check for helper default changes.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
Add compute_cache_token() alongside compute_data_token() and
compute_behavior_token(), refactor cache_key() to delegate to it, and
rename the cached class attribute to __ccflow_tokenizer_cache__ so it
matches __ccflow_tokenizer_deps__.

This commit also keeps class support in __ccflow_tokenizer_deps__,
including recursive class-dependency detection, and adds regression
coverage for combined cache tokens and cache-key integration.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek force-pushed the pit/behavior-hashing branch from c9fe30e to d1e6924 Compare April 23, 2026 21:47
@ptomecek ptomecek marked this pull request as draft April 23, 2026 21:49
Add a private SHA-256 helper in ccflow.utils.tokenize so the hash
algorithm is defined in one place, rename the tokenize tests to
ccflow/tests/utils/test_tokenize.py to match the module name, and
document how MemoryCacheEvaluator cache keys are built.

The docs now describe how data tokens, behavior tokens, transparent
vs non-transparent evaluators, and __ccflow_tokenizer_deps__ all feed
into compute_cache_token().

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Signed-off-by: Pascal Tomecek <pascal.tomecek@cubistsystematic.com>
@ptomecek ptomecek marked this pull request as ready for review April 23, 2026 21:56
2 participants