feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783
Failing-tests commit: exercises the keyed-cache substrate that doesn't exist yet on main. The implementation lands in the next commit.
Asserts:
- chain hash is deterministic
- split_chain_by_scope partitions quick-command ops from cleaning ops
- widget init populates the cache and pointer traits
- a quick_command_args flip does not move the raw/clean pointers (cache hit for those scopes; only the filt scope's chain has new bytes)
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Three pointer traits (raw_sd_key, clean_sd_key, filt_sd_key) carry an opaque hash of the op chain that produced each scope's SD. A new ``summary_stats_cache: Dict`` traitlet maps those hashes to the parquet-b64 SD blob. The frontend reads <scope>_sd_key and looks the entry up in the cache; Python is the sole writer of cache keys, so there's no canonical-JSON contract to drift across the boundary.
The key invariant: a state change that doesn't move a scope's chain must not produce a new SD computation. Concretely, flipping ``quick_command_args`` keeps the raw and clean keys constant (the existing entries are cache hits), and only the filt scope sees a new key and recomputes.
The observer reuses ``self.summary_sd`` for the filt scope (it's already computed by ``_summary_sd``), so this lands without adding a second pass through the analysis pipeline for the scope that the old flow was already covering. Raw and clean scopes do new SD compute, but only on cache miss and only when their hash differs from filt's, so a widget with no cleaning method and no filter does exactly one SD compute total, the same as before.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package published
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427
or with uv:
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427
MCP server for Claude Code
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190778427" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
📖 Docs preview | 🎨 Storybook preview
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7ee861edb8
    if self.processed_df is None:
        return
    chains = split_chain_by_scope(self.operations)
    keys = {scope: hash_chain(chain) for scope, chain in chains.items()}
Invalidate SD cache when source dataframe changes
The cache key is derived only from operations, so changing raw_df or sample_method with the same op chain reuses the old key and skips recomputation, leaving summary_stats_cache entries stale for the new dataset. In this flow, summary_sd is recomputed for the new data, but the if keys[scope] in new_cache guard prevents updating the cached blob, so consumers of raw_sd_key/clean_sd_key/filt_sd_key can read stats from the previous dataframe.
    if keys[scope] in new_cache:
        continue
Include analysis class set in SD cache identity
When analysis_klasses changes, summary stats semantics change even if the op chain is identical, but the key still hashes only the chain; existing entries are treated as cache hits and never rewritten. That means adding/removing analyses can leave summary_stats_cache serving blobs computed with the old analysis set, despite observing analysis_klasses changes.
File an issue about this, but it's out of scope for now.
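Both review comments point at the same gap: the key covers only the op chain. As a rough sketch of one direction a follow-up issue could take (not something this PR implements), the key could also fold in the analysis set and some identity for the source data. Everything here is hypothetical: `toy_cache_key`, the `data_fingerprint` string, and the analysis names are made up for illustration, and how to fingerprint a dataframe cheaply is left open.

```python
import hashlib
import json


def toy_cache_key(chain, analysis_names, data_fingerprint):
    """Hypothetical composite key: op chain + analysis set + data identity."""
    payload = json.dumps(
        {
            "chain": chain,
            # Assumption: analysis order is irrelevant, so sort for stability.
            "analyses": sorted(analysis_names),
            "data": data_fingerprint,
        },
        sort_keys=True,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


chain = [["dropna", "col_a"]]  # made-up op format
base = toy_cache_key(chain, ["Histogram", "TypingStats"], "df-v1")

# Same inputs in a different analysis order: same key.
same_order_free = toy_cache_key(chain, ["TypingStats", "Histogram"], "df-v1")
# New dataframe or new analysis set: new key, so a stale blob is never a hit.
new_data = toy_cache_key(chain, ["Histogram", "TypingStats"], "df-v2")
new_analyses = toy_cache_key(chain, ["Histogram"], "df-v1")
```

With a key like this, the existing "skip if key already in cache" guard would stay correct, because any change to data or analyses moves the key.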
Summary
A keyed summary-stats cache: three pointer traits
(raw_sd_key / clean_sd_key / filt_sd_key) carry an opaque hash of the op
chain that produced each scope's SD, and a new
summary_stats_cache: Dict traitlet maps those hashes to parquet-b64 SD
blobs. The frontend reads <scope>_sd_key and looks the entry up in the
cache; Python is the sole writer of cache keys, so there's no
canonical-JSON contract to drift across the boundary.
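A minimal sketch of the pointer-plus-cache shape described above, in plain Python rather than traitlets. `toy_hash_chain`, the op-chain format, and the placeholder blobs are assumptions for illustration; the PR's real `hash_chain` and blob encoding are not shown in this text.

```python
import hashlib
import json


def toy_hash_chain(chain):
    """Hypothetical stand-in for hash_chain: deterministic digest of an op chain."""
    payload = json.dumps(chain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


# Made-up op chains per scope.
chains = {
    "raw": [],
    "clean": [["dropna", "col_a"]],
    "filt": [["dropna", "col_a"], ["search", "foo"]],
}

# Writer side (Python): the only place keys are ever computed.
pointers = {f"{scope}_sd_key": toy_hash_chain(chain) for scope, chain in chains.items()}
summary_stats_cache = {
    key: f"<blob for {name}>" for name, key in pointers.items()
}


def lookup_sd(scope, pointers, cache):
    """Reader side: follow the pointer; never rebuild the key."""
    return cache[pointers[f"{scope}_sd_key"]]


raw_blob = lookup_sd("raw", pointers, summary_stats_cache)
```

Because the reader treats the key as opaque, the hashing scheme can change on the Python side without any matching change on the frontend.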
This is the substrate that makes the whole "stop recomputing SDs you
already have" line of work tractable. It sits on top of whatever encoding
sd_to_parquet_b64 produces (today's row format, or #782's wide-typed
format when that lands; they compose).
The invariant that matters
A state change that doesn't move a scope's chain must not produce a
new SD computation for that scope. Concretely, flipping
quick_command_args keeps the raw and clean keys constant (those
entries are cache hits), and only the filt scope sees a new key.
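An illustrative sketch of this invariant, not the PR's actual test. It assumes a hypothetical op format in which quick-command ops carry a `"quick_command": True` marker, and uses `toy_hash_chain` and `toy_split_chain_by_scope` as stand-ins for the real `hash_chain` and `split_chain_by_scope`, whose implementations are not shown here.

```python
import hashlib
import json


def toy_hash_chain(chain):
    """Hypothetical stand-in: deterministic digest of an op chain."""
    payload = json.dumps(chain, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


def toy_split_chain_by_scope(operations):
    """Hypothetical stand-in: raw sees no ops, clean sees cleaning ops, filt sees all."""
    cleaning = [op for op in operations if not op.get("quick_command")]
    return {"raw": [], "clean": cleaning, "filt": list(operations)}


def keys_for(operations):
    chains = toy_split_chain_by_scope(operations)
    return {scope: toy_hash_chain(chain) for scope, chain in chains.items()}


cleaning_ops = [{"op": "dropna", "col": "a"}]
before = keys_for(cleaning_ops + [{"op": "search", "arg": "foo", "quick_command": True}])
after = keys_for(cleaning_ops + [{"op": "search", "arg": "bar", "quick_command": True}])

# Flipping the quick-command argument moves only the filt key.
raw_unchanged = before["raw"] == after["raw"]
clean_unchanged = before["clean"] == after["clean"]
filt_changed = before["filt"] != after["filt"]
```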
The test for this invariant lives in tests/unit/dataflow/sd_cache_test.py.
Why this lands cleanly with no perf regression
The naive shape ("compute SD for all three scopes from scratch every
state change") would have tripled the analysis-pipeline work on
xorq backends (the test_widget_construction_query_count regression
test catches that).
The observer instead reuses self.summary_sd for the filt scope (it's
already computed by _summary_sd), so this lands without adding a
second pass through the analysis pipeline for the scope the old flow
was already covering. Raw and clean do fresh SD compute, but only on
cache miss and only when their hash differs from filt's. A widget with
no cleaning method and no filter does exactly one SD compute total,
the same as before this PR. The query count test stays green.
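The compute-avoidance logic described above can be sketched roughly as follows. This is a simplified model, not the observer's actual code: `fill_cache`, the `keys` dict shape, and the `compute_sd` callback are hypothetical names chosen for the example.

```python
def fill_cache(keys, cache, filt_sd, compute_sd):
    """Return a new cache dict that covers all three scope keys.

    keys: {"raw": ..., "clean": ..., "filt": ...} chain hashes
    filt_sd: the SD the existing pipeline already produced for the filt scope
    compute_sd: callable(scope) -> SD; the expensive path
    """
    new_cache = dict(cache)
    # filt scope: reuse the already-computed SD, no second pipeline pass.
    if keys["filt"] not in new_cache:
        new_cache[keys["filt"]] = filt_sd
    for scope in ("raw", "clean"):
        if keys[scope] in new_cache:
            continue  # cache hit, which includes "same hash as filt"
        new_cache[keys[scope]] = compute_sd(scope)
    return new_cache


calls = []


def counting_compute(scope):
    calls.append(scope)
    return f"sd-for-{scope}"


# No cleaning method, no filter: all three chains hash alike.
same_keys = {"raw": "k0", "clean": "k0", "filt": "k0"}
cache = fill_cache(same_keys, {}, "sd-from-pipeline", counting_compute)
calls_when_identical = list(calls)

# A filter flips: filt gets a new key, raw/clean are hits in the old cache.
flipped = {"raw": "k0", "clean": "k0", "filt": "k1"}
cache = fill_cache(flipped, cache, "sd-from-pipeline-2", counting_compute)
calls_after_flip = list(calls)

# Cold cache with three distinct chains: raw and clean each compute once.
calls.clear()
fill_cache({"raw": "a", "clean": "b", "filt": "c"}, {}, "sd-c", counting_compute)
calls_when_distinct = list(calls)
```

In the first two scenarios the only SD work is the one pass the pipeline already does for the filt scope, which is the "no perf regression" claim in miniature.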
Why this isn't built on top of any other open PR
There are related in-flight PRs around scoped summary stats (#778 and
the smorgasbord stack) and SD encoding (#782). This branches cleanly
from main:
filtered_* keys / merged_sd machinery in feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd #778. The keyed cache is the lower-level
primitive that addresses the same need at a different layer; the two
can ship in either order and reconcile at merge time.
Test plan
- tests/unit/dataflow/sd_cache_test.py for empty static artifact in worktree)
- test_widget_construction_query_count stays ≤6 queries
- ruff check + paddy-format clean
🤖 Generated with Claude Code