feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) by paddymul · Pull Request #783 · buckaroo-data/buckaroo

paddymul · 2026-05-20T21:22:54Z

Summary

A keyed summary-stats cache: three pointer traits (raw_sd_key /
clean_sd_key / filt_sd_key) carry an opaque hash of the op chain that
produced each scope's SD, and a new summary_stats_cache: Dict traitlet
maps those hashes to parquet-b64 SD blobs. The frontend reads
<scope>_sd_key and looks the entry up in the cache; Python is the sole
writer of cache keys, so there's no canonical-JSON contract to drift
across the boundary.
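
A minimal sketch of the pointer-then-lookup read path described above. The `state` dict, the key strings, and `lookup_sd` are hypothetical stand-ins for the widget traits, and the blobs are placeholder bytes rather than real parquet; the sketch also assumes that scopes with identical chains share one cache entry.

```python
import base64
from typing import Optional

# Hypothetical stand-in for the widget state: three pointer keys plus the cache.
state = {
    "raw_sd_key": "k_raw",
    "clean_sd_key": "k_raw",  # assumption: no cleaning ops, so clean shares raw's entry
    "filt_sd_key": "k_filt",
    "summary_stats_cache": {
        # Placeholder payloads; the real values are parquet-b64 SD blobs.
        "k_raw": base64.b64encode(b"raw-sd-bytes").decode("ascii"),
        "k_filt": base64.b64encode(b"filt-sd-bytes").decode("ascii"),
    },
}


def lookup_sd(state: dict, scope: str) -> Optional[bytes]:
    """Resolve a scope's pointer, then fetch and decode its cached blob."""
    key = state[f"{scope}_sd_key"]
    blob = state["summary_stats_cache"].get(key)
    return None if blob is None else base64.b64decode(blob)
```

The reader never constructs a key itself; it only follows the pointer it was handed, which is why the hash can stay opaque.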

This is the substrate that makes the whole "stop recomputing SDs you
already have" line of work tractable. It sits on top of whatever
encoding sd_to_parquet_b64 produces (today's row format, or #782's
wide-typed format when that lands; the two compose).

The invariant that matters

A state change that doesn't move a scope's chain must not produce a
new SD computation for that scope. Concretely, flipping
quick_command_args keeps the raw and clean keys constant — those
entries are cache hits — and only the filt scope sees a new key.

That's the test (tests/unit/dataflow/sd_cache_test.py):

raw_before  == raw_after     # raw pointer unchanged
clean_before == clean_after  # clean pointer unchanged
filt_before != filt_after    # filt pointer moved
len(cache_after) == len(cache_before) + 1   # one new entry, not three
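
The invariant can be illustrated with toy versions of the two helpers. This is a sketch under stated assumptions, not the code in this PR: `toy_hash_chain` and `toy_split_chain_by_scope` are hypothetical, ops are assumed to be dicts with a `kind` field, and JSON plus SHA-256 is just one convenient way to get a deterministic digest on the Python side.

```python
import hashlib
import json


def toy_hash_chain(chain):
    """Deterministic digest of an op chain (computed on the Python side only)."""
    payload = json.dumps(chain, sort_keys=True).encode("utf-8")
    return hashlib.sha256(payload).hexdigest()[:16]


def toy_split_chain_by_scope(ops):
    """Assumed split: raw = no ops, clean = cleaning ops, filt = every op."""
    cleaning = [op for op in ops if op["kind"] == "clean"]
    return {"raw": [], "clean": cleaning, "filt": list(ops)}


def keys_for(ops):
    """Map each scope to the hash of its chain."""
    chains = toy_split_chain_by_scope(ops)
    return {scope: toy_hash_chain(chain) for scope, chain in chains.items()}


clean_op = {"kind": "clean", "name": "dropna"}
# Flip only the quick-command argument between the two states.
before = keys_for([clean_op, {"kind": "quick", "name": "search", "arg": "a"}])
after = keys_for([clean_op, {"kind": "quick", "name": "search", "arg": "b"}])
```

Only `before["filt"]` and `after["filt"]` differ, because the quick-command op is part of the filt chain alone.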

Why this lands cleanly with no perf regression

The naive shape — "compute SD for all three scopes from scratch every
state change" — would have tripled the analysis-pipeline work on
xorq backends (the test_widget_construction_query_count regression
test catches that).

The observer instead reuses self.summary_sd for the filt scope (it's
already computed by _summary_sd), so this lands without adding a
second pass through the analysis pipeline for the scope the old flow
was already covering. The raw and clean scopes do compute a fresh SD,
but only on a cache miss and only when their hash differs from filt's.
A widget with no cleaning method and no filter does exactly one SD
compute total, the same as before this PR. The query-count test stays
green.
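
The compute-avoidance logic above can be sketched as follows. `refresh_cache` and the `compute_sd` callback are hypothetical names, and strings stand in for SD blobs; the counter tracks only the computes added on top of the one the existing flow already performs for filt.

```python
def refresh_cache(keys, cache, filt_sd, compute_sd):
    """Return (new_cache, extra_computes) for a dict of scope -> key.

    filt_sd is the SD the existing flow already produced; compute_sd(scope)
    is a hypothetical callback that runs one fresh SD computation.
    """
    new_cache = dict(cache)
    extra_computes = 0
    # Seed the filt entry from the SD that already exists: no second pass.
    if keys["filt"] not in new_cache:
        new_cache[keys["filt"]] = filt_sd
    for scope in ("raw", "clean"):
        if keys[scope] in new_cache:
            continue  # cache hit, or same key as filt: nothing to compute
        new_cache[keys[scope]] = compute_sd(scope)
        extra_computes += 1
    return new_cache, extra_computes


def fake_compute(scope):
    return f"{scope}-sd"


# No cleaning and no filter: all three scopes share one key.
same = {"raw": "k0", "clean": "k0", "filt": "k0"}
cache_same, n_same = refresh_cache(same, {}, "filt-sd", fake_compute)

# Distinct chains on a cold cache, then a quick-command flip.
distinct = {"raw": "k0", "clean": "k1", "filt": "k2"}
cache_cold, n_cold = refresh_cache(distinct, {}, "filt-sd", fake_compute)
flipped = {"raw": "k0", "clean": "k1", "filt": "k3"}
cache_flip, n_flip = refresh_cache(flipped, cache_cold, "filt-sd-2", fake_compute)
```

In the shared-key case the loop adds no computes, so the total stays at the single compute the old flow already did.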

Why this isn't built on top of any other open PR

There are related in-flight PRs around scoped summary stats (#778 and
the smorgasbord stack) and SD encoding (#782). This branches cleanly
from main:

It does not modify any file that #782 (feat(sd-encoding): wide-typed parquet layout, native scalars, no JSON round-trip) touches.
It does not depend on, or duplicate, the filtered_* keys /
merged_sd machinery in #778 (feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd). The keyed cache is the lower-level
primitive that addresses the same need at a different layer; the two
can ship in either order and reconcile at merge time.

Test plan

4/4 new tests in tests/unit/dataflow/sd_cache_test.py
Full Python suite (1015 passed, 1 unrelated MCP test deselected
for empty static artifact in worktree)
test_widget_construction_query_count stays ≤6 queries
ruff check + paddy-format clean
CI green

🤖 Generated with Claude Code

Failing-tests commit: exercises the keyed-cache substrate that doesn't exist yet on main. The implementation lands in the next commit.

Asserts: chain hash is deterministic, split_chain_by_scope partitions quick-command ops from cleaning ops, widget init populates the cache and pointer traits, and a quick_command_args flip does not move the raw/clean pointers (cache hit for those scopes; only the filt scope's chain has new bytes).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three pointer traits (raw_sd_key, clean_sd_key, filt_sd_key) carry an opaque hash of the op chain that produced each scope's SD. A new ``summary_stats_cache: Dict`` traitlet maps those hashes to the parquet-b64 SD blob. The frontend reads <scope>_sd_key and looks the entry up in the cache; Python is the sole writer of cache keys, so there's no canonical-JSON contract to drift across the boundary.

The key invariant: a state change that doesn't move a scope's chain must not produce a new SD computation. Concretely, flipping ``quick_command_args`` keeps the raw and clean keys constant (the existing entries are cache hits), and only the filt scope sees a new key and recomputes.

The observer reuses ``self.summary_sd`` for the filt scope (it's already computed by ``_summary_sd``), so this lands without adding a second pass through the analysis pipeline for the scope that the old flow was already covering. Raw and clean scopes do new SD compute, but only on cache miss and only when their hash differs from filt's, so a widget with no cleaning method and no filter does exactly one SD compute total, the same as before.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T21:25:42Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190778427

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190778427" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 7ee861edb8

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

Open a pull request for review
Mark a draft as ready
Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

chatgpt-codex-connector · 2026-05-20T21:27:19Z

+        if self.processed_df is None:
+            return
+        chains = split_chain_by_scope(self.operations)
+        keys = {scope: hash_chain(chain) for scope, chain in chains.items()}


Invalidate SD cache when source dataframe changes

The cache key is derived only from operations, so changing raw_df or sample_method with the same op chain reuses the old key and skips recomputation, leaving summary_stats_cache entries stale for the new dataset. In this flow, summary_sd is recomputed for the new data, but the if keys[scope] in new_cache guard prevents updating the cached blob, so consumers of raw_sd_key/clean_sd_key/filt_sd_key can read stats from the previous dataframe.
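
The failure mode in this comment can be reproduced in a few lines. This is a hedged sketch: `update_cache` is a hypothetical function that only mirrors the `if keys[scope] in new_cache` guard quoted in the review, and strings stand in for SD blobs.

```python
def update_cache(cache, key, fresh_blob):
    """Mirror of the guard in the diff: an existing key is never rewritten."""
    new_cache = dict(cache)
    if key in new_cache:
        return new_cache
    new_cache[key] = fresh_blob
    return new_cache


ops_key = "hash-of-ops"  # depends on the op chain only, not on the data

# First dataframe populates the entry.
cache = update_cache({}, ops_key, "stats-for-df-A")
# raw_df is swapped but the op chain, and therefore the key, is unchanged.
cache = update_cache(cache, ops_key, "stats-for-df-B")
```

After the second call the entry still holds the blob for the first dataframe, which is the stale read the review describes.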


chatgpt-codex-connector · 2026-05-20T21:27:19Z

+            if keys[scope] in new_cache:
+                continue


Include analysis class set in SD cache identity

When analysis_klasses changes, summary stats semantics change even if the op chain is identical, but the key still hashes only the chain; existing entries are treated as cache hits and never rewritten. That means adding/removing analyses can leave summary_stats_cache serving blobs computed with the old analysis set, despite observing analysis_klasses changes.
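
One possible shape for a wider cache identity, sketched here only to make the suggestion concrete; it is not part of this PR. `composite_key`, the `analysis_names` list, and the `data_token` argument are all hypothetical, and how a token for the source dataframe would be derived is left open.

```python
import hashlib
import json


def composite_key(chain, analysis_names, data_token):
    """Hash the op chain together with the analysis set and a data token."""
    payload = json.dumps(
        {
            "chain": chain,
            # Sorted so that registration order does not change the key.
            "analyses": sorted(analysis_names),
            "data": data_token,
        },
        sort_keys=True,
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:16]


base = composite_key([], ["Histogram", "TypingStats"], "df-A")
reordered = composite_key([], ["TypingStats", "Histogram"], "df-A")
fewer_analyses = composite_key([], ["TypingStats"], "df-A")
other_data = composite_key([], ["Histogram", "TypingStats"], "df-B")
```

With a key like this, changing the analysis set or the source data moves the pointer, so the existing "skip if the key is present" guard would no longer serve an outdated blob.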


file an issue about this, but it's out of scope for now

paddymul and others added 2 commits May 20, 2026 17:21

paddymul temporarily deployed to testpypi May 20, 2026 21:24 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

paddymul added this pull request to the merge queue May 20, 2026

Merged via the queue into main with commit 5d14fab May 20, 2026
27 checks passed

paddymul merged 2 commits into main from feat/sd-keyed-cache
