feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd by paddymul · Pull Request #785 · buckaroo-data/buckaroo

paddymul · 2026-05-20T22:21:18Z

Summary

Wires #783's keyed SD cache into _merged_sd so the dataflow emits the single prefixed-key all_stats shape that #777's ?key JS already consumes. This is the reconciliation of the two competing scoped-SD designs (per the discussion that led #778 to be closed in favor of going through #783).

Bare keys (mean, histogram_bins, …) come from the raw scope SD in the cache.
filtered_* keys are layered on top from the filt scope SD when filt_sd_key != raw_sd_key (search filter active).
Frontend rendering is unchanged from feat(pinned-rows): ?key prefix marks an optional pinned row #777 — the merged merged_sd flows through the existing df_data_dict.all_stats path. The cache + pointer traits stay un-synced; they're Python-internal memoization.
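
The layering described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not buckaroo's actual `_merged_sd`: the function name `merge_scoped_sd`, the plain-dict inputs, and the explicit `filter_active` flag are all hypothetical.

```python
from typing import Any, Dict

ColumnStats = Dict[str, Any]
SummaryDict = Dict[str, ColumnStats]


def merge_scoped_sd(raw_sd: SummaryDict, filt_sd: SummaryDict,
                    filter_active: bool) -> SummaryDict:
    """Hypothetical sketch: bare keys from the raw scope, filtered_* from the filt scope."""
    # Start from a copy of the raw-scope stats so bare keys always mean "pre-filter".
    merged: SummaryDict = {col: dict(stats) for col, stats in raw_sd.items()}
    if filter_active:
        # Layer the filt-scope stats on top under a filtered_ prefix.
        for col, stats in filt_sd.items():
            target = merged.setdefault(col, {})
            for key, value in stats.items():
                target[f"filtered_{key}"] = value
    return merged


# Illustrative data only: a 5-row raw dataset and a 3-row filtered view.
raw = {"a": {"mean": 3.0, "length": 5}}
filt = {"a": {"mean": 2.0, "length": 3}}
merged_no_filter = merge_scoped_sd(raw, filt, filter_active=False)
merged_with_filter = merge_scoped_sd(raw, filt, filter_active=True)
```

With no filter the result carries only bare keys; with a filter, each column gains `filtered_mean`, `filtered_length`, and so on alongside the unchanged bare keys.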

Breaking change

merged_sd[col]["mean"] now refers to the pre-filter (raw) value, not the post-everything view. Post-filter values are available as filtered_mean etc. when a filter is active.
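
For callers that previously read the bare key expecting the post-filter value, one possible migration is a small lookup helper. The helper name `post_filter_stat` is hypothetical and not part of buckaroo. Note that the fallback returns the raw value, which matches the old view only when nothing else (such as cleaning) changes the frame.

```python
from typing import Any, Dict, Optional


def post_filter_stat(col_stats: Dict[str, Any], key: str) -> Optional[Any]:
    """Hypothetical helper: prefer filtered_<key>, fall back to the bare (raw) key."""
    filtered_key = f"filtered_{key}"
    if filtered_key in col_stats:
        return col_stats[filtered_key]
    return col_stats.get(key)


# Illustrative column entries, with and without an active filter.
no_filter_stats = {"mean": 3.0}
filter_stats = {"mean": 3.0, "filtered_mean": 2.0}
```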

What it corrects in #783

Codex P1 — raw_df invalidation. _scope_cache_key now folds id(sampled_df) and post_processing_method into the hash, so a raw_df swap (or sample-method flip, or post-processing change) with an unchanged chain no longer reuses a stale entry. New test_raw_df_change_invalidates_scoped_sd pins this.
Post-processing reflection in raw scope. _compute_scope_df now applies the active post-processing method to the raw/clean base df, so when post-processing replaces the frame entirely (e.g. hide_post → SENTINEL_DF), the raw-scope bare keys carry the new df's column metadata. Restores the test_hide_column_config_post_processing / test_add_analysis invariants.
Cache stores dicts, not parquet-b64. The cache is no longer synced — the frontend consumes only merged_sd — so the parquet-b64 conversion is dead weight here. _merged_sd reads cached dicts directly.
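
The P1 fix can be illustrated with a self-contained sketch. The real `hash_chain` lives in buckaroo and its implementation is not shown in this PR page, so `hash_chain_sketch` below is a stand-in written for this example; only the idea of folding `id(sampled_df)` and the post-processing method into an `extra` string comes from the PR.

```python
import hashlib
import json
from typing import Any, List, Optional


def hash_chain_sketch(chain: List[Any], extra: str = "") -> str:
    """Stand-in for buckaroo's hash_chain (assumed behavior: hash the chain plus extra)."""
    payload = json.dumps(chain, sort_keys=True, default=str) + "|" + extra
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def scope_cache_key(chain: List[Any], sampled_df: Optional[Any],
                    post_processing_method: Optional[str]) -> str:
    # Fold the dataframe identity and post-processing method into the key so that
    # an unchanged chain over a swapped dataframe does not hit a stale entry.
    sampled_id = id(sampled_df) if sampled_df is not None else 0
    pp = post_processing_method or ""
    return hash_chain_sketch(chain, extra=f"{sampled_id}|{pp}")


# Two distinct live objects stand in for two different dataframes.
df_a, df_b = object(), object()
chain = [{"op": "dropna"}]
key_a = scope_cache_key(chain, df_a, None)
key_b = scope_cache_key(chain, df_b, None)
key_a_pp = scope_cache_key(chain, df_a, "hide_post")
```

One design caveat: in CPython, `id()` is unique only among objects that are alive at the same time, so an identity-based key assumes the previous dataframe is still referenced, or that the cache is cleared when it is released.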

Deferred (separate issue)

Codex P2 — analysis_klasses not in cache identity. Same fold-into-extra-arg pattern as P1; filed as a follow-up issue.
cleaned_* scope (third scope) in merged_sd. The cache already computes the clean scope SD; wiring it into merged_sd as cleaned_* keys is straightforward but deferred to keep this PR focused on the raw + filt invariants. Same ?key mechanism handles it on the frontend.

Test plan

4/4 new tests in tests/unit/dataflow/scoped_summary_stats_test.py covering no-cleaning/no-filter baseline, filter activates filtered_* keys, bare length reflects raw dataset, raw_df swap invalidates cache.
Full Python suite: 954 passed (one MCP test deselected — the known empty-static-artifact-in-worktree issue called out in feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783's PR body, unrelated to this change).
CI green

Replaces

This branch supersedes the closed #778. The old branch was based on a parallel design (5-tuple handle_ops_and_clean returning both filtered and unfiltered dfs); the reconciled shape here reuses #783's per-scope cache as the substrate instead.

🤖 Generated with Claude Code

…d_sd

Four tests covering the dataflow-level shape of the scope-merged ``merged_sd`` that sits on top of #783's keyed-SD cache:

- no cleaning, no filter → only bare-key raw scope (passes today)
- filter active → bare-key raw + ``filtered_*`` scope (fails today)
- bare ``length`` reflects raw 5-row dataset, not post-filter 3-row view (fails today; deliberate breaking change)
- raw_df swap surfaces the new dataset's stats (passes today; will fail and need codex's P1 fix once the cache becomes the bare-key source)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Wires #783's keyed SD cache into ``_merged_sd`` so the dataflow emits the single prefixed-key ``all_stats`` shape that #777's ``?key`` JS already consumes:

- Bare keys (``mean``, ``histogram_bins``, …) come from the raw scope SD in the cache.
- ``filtered_*`` keys are layered on top from the filt scope SD when ``filt_sd_key != raw_sd_key`` (i.e. a search filter is active).

Frontend rendering is unchanged from #777: the merged ``merged_sd`` flows through the existing ``df_data_dict.all_stats`` path. The cache and pointer traits stay un-synced; they're Python-internal memoization, not a wire-format primitive.

Deliberate breaking change: ``merged_sd[col]["mean"]`` now refers to the pre-filter (raw) value, not the post-everything view. Post-filter values are available as ``filtered_mean`` etc. when a filter is active.

Three substantive corrections to the cache shape from #783:

1. **Codex P1: raw_df invalidation.** ``_scope_cache_key`` now folds ``id(sampled_df)`` and ``post_processing_method`` into the hash, so a ``raw_df`` swap (or sample-method flip, or post-processing change) with an unchanged chain no longer reuses a stale entry. The new ``test_raw_df_change_invalidates_scoped_sd`` test pins this.
2. **Post-processing reflection in raw scope.** ``_compute_scope_df`` now applies the active post-processing method to the raw/clean base df, so when post-processing replaces the frame entirely (e.g. ``hide_post`` → ``SENTINEL_DF``), the raw-scope bare keys carry the new df's column metadata. Restores the ``test_hide_column_config_post_processing`` and ``test_add_analysis`` invariants from ``customizable_dataflow_test.py``.
3. **Cache stores dicts, not parquet-b64.** The cache is no longer synced (the frontend consumes only ``merged_sd``), so the parquet-b64 conversion is dead weight here. ``_merged_sd`` reads the cached dicts directly. (The parquet-b64 wire form continues to apply at the ``df_data_dict.all_stats`` boundary.)

Codex P2 (``analysis_klasses`` not in cache identity) is deferred and filed as a follow-up issue. The same pattern (fold into the ``extra`` arg of ``hash_chain``) will apply.

Test plan:

- 4/4 new tests in ``tests/unit/dataflow/scoped_summary_stats_test.py``
- Full Python suite: 954 passed, 0 failures (one MCP test was deselected: the known empty-static-artifact-in-worktree issue called out in #783's PR body, unrelated to this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T22:23:09Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26198305638" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62cf525b21

chatgpt-codex-connector · 2026-05-20T22:24:40Z

+        """
+        sampled_id = id(self.sampled_df) if self.sampled_df is not None else 0
+        pp = self.post_processing_method or ''
+        return hash_chain(chain, extra=f"{sampled_id}|{pp}")


Include analysis class set in scope cache key

When analysis_klasses changes (for example via add_analysis) but the op chain, sampled dataframe identity, and post-processing method stay the same, _scope_cache_key returns the same hash and _populate_sd_cache treats all scopes as cache hits. Because _merged_sd now reads from summary_stats_cache rather than summary_sd, the merged stats can stay stale and omit newly added analysis outputs until some unrelated state change forces a new key. This breaks correctness for dynamic analysis updates in a single widget session.
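
One way the suggested fix could look is to extend the `extra` string with an identifier for the analysis class set. This is only a sketch: the helper name `cache_extra` and the stand-in classes are hypothetical, and sorting assumes that class order does not affect the computed stats (drop the `sorted` call if it does).

```python
from typing import Iterable, Optional


def cache_extra(sampled_id: int, pp: Optional[str],
                analysis_klasses: Iterable[type]) -> str:
    """Hypothetical helper: build the extra string including the analysis class set."""
    # Qualified names identify each class; sorted so that order does not matter.
    klass_names = sorted(f"{k.__module__}.{k.__qualname__}" for k in analysis_klasses)
    return f"{sampled_id}|{pp or ''}|{','.join(klass_names)}"


# Stand-in analysis classes, for illustration only.
class TypingStats:
    pass


class Histogram:
    pass


before = cache_extra(1, None, [TypingStats])
after = cache_extra(1, None, [TypingStats, Histogram])
reordered = cache_extra(1, None, [Histogram, TypingStats])
```

A name-based identifier does not distinguish a class that was redefined under the same name, which may matter in interactive development sessions; an identity-based component would catch that case.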


Pre-asserts the bug: with cleaning_method='default' and no search filter, the current filter_active gate (filt_sd_key != raw_sd_key) fires because the filt chain differs from the empty raw chain. Result is nine spurious filtered_* keys (filtered_mean, filtered_length, ...) carrying cleaning-affected stats. Semantically those keys mean "search filter is on", not "cleaning is on". Expected to fail on CI; fix in the next commit reroutes the gate to chains['filt'] != chains['clean']. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The previous gate fired whenever filt_sd_key != raw_sd_key, which is true any time cleaning ops exist (the filt chain is the clean chain plus quick-commands; the raw chain is always empty). Result: with cleaning_method='default' and no search filter, ~9 cleaning-affected stats per column were exported under filtered_* keys, when those keys semantically mean "search filter is active". Re-derive the gate from the chains directly: chains['filt'] != chains['clean'] is true iff at least one quick-command op is present — exactly the "search filter applied on top of cleaning" condition. The deferred cleaned_* scope (PR #785 body) will give cleaning-affected stats their own keys; until then, omitting them from the merged_sd is strictly better than mislabelling them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
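
The difference between the two gates can be shown with a small sketch. The op tuples and the function names below are made up for illustration; only the two comparisons (`filt != raw` versus `chains['filt'] != chains['clean']`) come from the commit message.

```python
from typing import Any, Dict, List

Chain = List[Any]


def filter_active_old(chains: Dict[str, Chain]) -> bool:
    # Previous gate: true whenever any op exists, including cleaning-only chains.
    return chains["filt"] != chains["raw"]


def filter_active_new(chains: Dict[str, Chain]) -> bool:
    # Rerouted gate: true only when ops were added on top of the clean chain.
    return chains["filt"] != chains["clean"]


# Hypothetical op encodings; the raw chain is always empty.
clean_ops: Chain = [("fillna", "a")]
cleaning_only = {"raw": [], "clean": clean_ops, "filt": list(clean_ops)}
with_search = {"raw": [], "clean": clean_ops, "filt": clean_ops + [("search", "foo")]}
```

With `cleaning_only`, the old gate fires and the new one does not; with `with_search`, both fire.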

…ation

Adds four tests pinning the deferred items on #785:

- ``test_cleaned_keys_appear_when_cleaning_active``: clean scope SD must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when cleaning ops are active.
- ``test_cleaning_only_does_not_emit_filtered_keys``: the broken ``filter_active`` gate currently mislabels cleaning-affected stats as ``filtered_*`` when no search filter is active. The right gate is on chain-shape diff (``filt != clean``).
- ``test_filter_and_clean_both_emit_correctly``: both scopes layered without cross-talk; ``cleaned_null_count`` reflects the clean df, ``filtered_null_count`` reflects the search-nulled df.
- ``test_analysis_klasses_change_invalidates_scoped_sd``: Codex P2 pin. A swap of ``analysis_klasses`` must invalidate the per-scope SD cache so the new stat klass's keys surface in ``merged_sd``.

All four are expected to fail on this commit; fixes follow.

Captures the punted work that travels with this PR so reviewers and future maintainers see what's intentionally not in scope and how each item slots in cleanly later.

- plans/0785-post-processing-known-issues.md: pp × scope edge-case coverage punted; mechanism (per-scope pp application + cache key including post_processing_method) is structurally sound.
- plans/0785-cleaning-scope-known-issues.md: cleaned_* keys in merged_sd deferred; coupled with the filter_active gate bug (xfail'd test pins it). Right gate is chain-shape diff between filt and clean, which only makes sense once cleaned_* is layered.
- plans/0785-codex-p2-analysis-klasses.md: analysis_klasses not in cache key (Codex P2); one-line fold-into-extra fix, deferred because the bug fires only in dev mutation workflows.
- plans/0785-xorq-cache-delegation.md: for xorq scopes, lean on xorq's expression cache instead of duplicating in summary_stats_cache. Refinement, not correctness.

No code changes; branch notes only. Each file ends with a "How this slots in cleanly" / "What un-punting looks like" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds an optional ``?filtered_histogram`` pinned row to polars's default styling. The ``?`` prefix (#777) means JS hides the row when no column has the ``filtered_histogram`` key in ``merged_sd``, so a no-filter widget keeps a one-line ``[dtype, histogram]`` header. When the user runs a search, ``filtered_histogram`` gets layered onto each column's merged_sd entry (#785) and the row renders.

Polars-only on purpose:

- xorq skips computing ``filtered_histogram`` entirely (this PR's ``skip_stats_by_scope``), so the optional row never appears.
- Pandas keeps the original ``[dtype, histogram]`` default.

New ``pinned_filtered_histogram()`` helper in ``styling_helpers``, mirroring the existing ``pinned_histogram()``. New ``PolarsMainStyling(DefaultMainStyling)`` in ``polars_buckaroo`` overrides ``pinned_rows`` and replaces ``DefaultMainStyling`` in ``local_analysis_klasses``.

Config-only: no behavior tests, just a new default in the styling pipeline. The optional-pinned-row plumbing (#777) is already exercised by ``gridUtils.test.ts``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
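
The optional-row rule described above (hide a ``?``-prefixed row unless at least one column carries that key) is implemented in the frontend JS. The Python sketch below only mirrors the rule for illustration; the function name `visible_pinned_rows` and the flat list-of-strings config are assumptions, not buckaroo's actual pinned-row format.

```python
from typing import Any, Dict, List


def visible_pinned_rows(pinned_keys: List[str],
                        merged_sd: Dict[str, Dict[str, Any]]) -> List[str]:
    """Hypothetical mirror of the ?key rule: drop optional rows with no backing stat."""
    visible: List[str] = []
    for key in pinned_keys:
        if key.startswith("?"):
            bare = key[1:]
            # Optional row: show it only if some column has this stat key.
            if any(bare in stats for stats in merged_sd.values()):
                visible.append(bare)
        else:
            # Required row: always shown.
            visible.append(key)
    return visible


pinned = ["dtype", "histogram", "?filtered_histogram"]
sd_no_filter = {"a": {"dtype": "int64", "histogram": []}}
sd_filtered = {"a": {"dtype": "int64", "histogram": [], "filtered_histogram": []}}
rows_no_filter = visible_pinned_rows(pinned, sd_no_filter)
rows_filtered = visible_pinned_rows(pinned, sd_filtered)
```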

paddymul and others added 2 commits May 20, 2026 18:10

This was referenced May 20, 2026

feat(scoped-sd): scoped summary stats (raw + filtered) in merged_sd #778

Closed

sd-cache: include analysis_klasses in cache identity (codex P2 from #783) #786

Open

paddymul temporarily deployed to testpypi May 20, 2026 22:22 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 20, 2026

paddymul mentioned this pull request May 20, 2026

spike(rows-first): two-message state_change protocol behind env gate #787

Closed

paddymul temporarily deployed to testpypi May 20, 2026 22:52 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi May 20, 2026 23:02 — with GitHub Actions Inactive

paddymul mentioned this pull request May 21, 2026

feat(scoped-sd): wire cleaned_* into merged_sd + Codex P2 fix #789

Merged

paddymul temporarily deployed to testpypi May 21, 2026 00:38 — with GitHub Actions Inactive

paddymul added this pull request to the merge queue May 21, 2026

Merged via the queue into main with commit 7a75b5b May 21, 2026
27 checks passed

paddymul mentioned this pull request May 21, 2026

buckaroo_state_change requests are not superseded; rapid typing wedges the server #794

Open

paddymul mentioned this pull request May 21, 2026

feat(polars): default pinned_rows include ?filtered_histogram #830

Merged

paddymul merged 5 commits into main from feat/scoped-sd-merged-prefix-v2
