feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd#785
Conversation
…d_sd Four tests covering the dataflow-level shape of the scope-merged ``merged_sd`` that sits on top of #783's keyed-SD cache: - no cleaning, no filter → only bare-key raw scope (passes today) - filter active → bare-key raw + ``filtered_*`` scope (fails today) - bare ``length`` reflects raw 5-row dataset, not post-filter 3-row view (fails today — deliberate breaking change) - raw_df swap surfaces the new dataset's stats (passes today; will fail and need codex's P1 fix once the cache becomes the bare-key source) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires #783's keyed SD cache into ``_merged_sd`` so the dataflow emits the single prefixed-key ``all_stats`` shape that #777's ``?key`` JS already consumes: - Bare keys (``mean``, ``histogram_bins``, …) come from the raw scope SD in the cache. - ``filtered_*`` keys are layered on top from the filt scope SD when ``filt_sd_key != raw_sd_key`` (i.e. a search filter is active). Frontend rendering is unchanged from #777 — the merged ``merged_sd`` flows through the existing ``df_data_dict.all_stats`` path. The cache and pointer traits stay un-synced; they're Python-internal memoization, not a wire-format primitive. Deliberate breaking change: ``merged_sd[col]["mean"]`` now refers to the pre-filter (raw) value, not the post-everything view. Post-filter values are available as ``filtered_mean`` etc. when a filter is active. Three substantive corrections to the cache shape from #783: 1. **Codex P1 — raw_df invalidation.** ``_scope_cache_key`` now folds ``id(sampled_df)`` and ``post_processing_method`` into the hash, so a ``raw_df`` swap (or sample-method flip, or post-processing change) with an unchanged chain no longer reuses a stale entry. The new ``test_raw_df_change_invalidates_scoped_sd`` test pins this. 2. **Post-processing reflection in raw scope.** ``_compute_scope_df`` now applies the active post-processing method to the raw/clean base df, so when post-processing replaces the frame entirely (e.g. ``hide_post`` → ``SENTINEL_DF``), the raw-scope bare keys carry the new df's column metadata. Restores the ``test_hide_column_config_post_processing`` and ``test_add_analysis`` invariants from ``customizable_dataflow_test.py``. 3. **Cache stores dicts, not parquet-b64.** The cache is no longer synced — the frontend consumes only ``merged_sd`` — so the parquet-b64 conversion is dead weight here. ``_merged_sd`` reads the cached dicts directly. (The parquet-b64 wire form continues to apply at the ``df_data_dict.all_stats`` boundary.) Codex P2 (``analysis_klasses`` not in cache identity) is deferred — filed as a follow-up issue. The same pattern (fold into the ``extra`` arg of ``hash_chain``) will apply. Test plan: - 4/4 new tests in ``tests/unit/dataflow/scoped_summary_stats_test.py`` - Full Python suite: 954 passed, 0 failures (one MCP test was deselected — the known empty-static-artifact-in-worktree issue called out in #783's PR body, unrelated to this change). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26198305638" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table📖 Docs preview🎨 Storybook preview |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 62cf525b21
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| """ | ||
| sampled_id = id(self.sampled_df) if self.sampled_df is not None else 0 | ||
| pp = self.post_processing_method or '' | ||
| return hash_chain(chain, extra=f"{sampled_id}|{pp}") |
There was a problem hiding this comment.
Include analysis class set in scope cache key
When analysis_klasses changes (for example via add_analysis) but the op chain, sampled dataframe identity, and post-processing method stay the same, _scope_cache_key returns the same hash and _populate_sd_cache treats all scopes as cache hits. Because _merged_sd now reads from summary_stats_cache rather than summary_sd, the merged stats can stay stale and omit newly added analysis outputs until some unrelated state change forces a new key. This breaks correctness for dynamic analysis updates in a single widget session.
Useful? React with 👍 / 👎.
Pre-asserts the bug: with cleaning_method='default' and no search filter, the current filter_active gate (filt_sd_key != raw_sd_key) fires because the filt chain differs from the empty raw chain. Result is nine spurious filtered_* keys (filtered_mean, filtered_length, ...) carrying cleaning-affected stats. Semantically those keys mean "search filter is on", not "cleaning is on". Expected to fail on CI; fix in the next commit reroutes the gate to chains['filt'] != chains['clean']. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous gate fired whenever filt_sd_key != raw_sd_key, which is true any time cleaning ops exist (the filt chain is the clean chain plus quick-commands; the raw chain is always empty). Result: with cleaning_method='default' and no search filter, ~9 cleaning-affected stats per column were exported under filtered_* keys, when those keys semantically mean "search filter is active". Re-derive the gate from the chains directly: chains['filt'] != chains['clean'] is true iff at least one quick-command op is present — exactly the "search filter applied on top of cleaning" condition. The deferred cleaned_* scope (PR #785 body) will give cleaning-affected stats their own keys; until then, omitting them from the merged_sd is strictly better than mislabelling them. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation Adds four tests pinning the deferred items on #785: - ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when cleaning ops are active. - ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken ``filter_active`` gate currently mislabels cleaning-affected stats as ``filtered_*`` when no search filter is active. The right gate is on chain-shape diff (``filt != clean``). - ``test_filter_and_clean_both_emit_correctly`` — both scopes layered without cross-talk; ``cleaned_null_count`` reflects the clean df, ``filtered_null_count`` reflects the search-nulled df. - ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2 pin: a swap of ``analysis_klasses`` must invalidate the per-scope SD cache so the new stat klass's keys surface in ``merged_sd``. All four are expected to fail on this commit; fixes follow.
Captures the punted work that travels with this PR so reviewers + future maintainers see what's intentionally not in scope and how each item slots in cleanly later. - plans/0785-post-processing-known-issues.md — pp × scope edge-case coverage punted; mechanism (per-scope pp application + cache key including post_processing_method) is structurally sound. - plans/0785-cleaning-scope-known-issues.md — cleaned_* keys in merged_sd deferred; coupled with the filter_active gate bug (xfail'd test pins it). Right gate is chain-shape diff between filt and clean, which only makes sense once cleaned_* is layered. - plans/0785-codex-p2-analysis-klasses.md — analysis_klasses not in cache key (Codex P2); one-line fold-into-extra fix, deferred because the bug fires only in dev mutation workflows. - plans/0785-xorq-cache-delegation.md — for xorq scopes, lean on xorq's expression cache instead of duplicating in summary_stats_cache. Refinement, not correctness. No code changes — branch notes only. Each file ends with a "How this slots in cleanly" / "What un-punting looks like" section. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…ation Adds four tests pinning the deferred items on #785: - ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when cleaning ops are active. - ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken ``filter_active`` gate currently mislabels cleaning-affected stats as ``filtered_*`` when no search filter is active. The right gate is on chain-shape diff (``filt != clean``). - ``test_filter_and_clean_both_emit_correctly`` — both scopes layered without cross-talk; ``cleaned_null_count`` reflects the clean df, ``filtered_null_count`` reflects the search-nulled df. - ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2 pin: a swap of ``analysis_klasses`` must invalidate the per-scope SD cache so the new stat klass's keys surface in ``merged_sd``. All four are expected to fail on this commit; fixes follow.
…ation Adds four tests pinning the deferred items on #785: - ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when cleaning ops are active. - ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken ``filter_active`` gate currently mislabels cleaning-affected stats as ``filtered_*`` when no search filter is active. The right gate is on chain-shape diff (``filt != clean``). - ``test_filter_and_clean_both_emit_correctly`` — both scopes layered without cross-talk; ``cleaned_null_count`` reflects the clean df, ``filtered_null_count`` reflects the search-nulled df. - ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2 pin: a swap of ``analysis_klasses`` must invalidate the per-scope SD cache so the new stat klass's keys surface in ``merged_sd``. All four are expected to fail on this commit; fixes follow.
Adds an optional ``?filtered_histogram`` pinned row to polars's default styling. The ``?`` prefix (#777) means JS hides the row when no column has the ``filtered_histogram`` key in ``merged_sd``, so a no-filter widget keeps a one-line ``[dtype, histogram]`` header. When the user runs a search, ``filtered_histogram`` gets layered onto each column's merged_sd entry (#785) and the row renders. Polars-only on purpose: - xorq skips computing ``filtered_histogram`` entirely (this PR's ``skip_stats_by_scope``), so the optional row never appears. - Pandas keeps the original ``[dtype, histogram]`` default. New ``pinned_filtered_histogram()`` helper in ``styling_helpers``, mirroring the existing ``pinned_histogram()``. New ``PolarsMainStyling(DefaultMainStyling)`` in ``polars_buckaroo`` overrides ``pinned_rows`` and replaces ``DefaultMainStyling`` in ``local_analysis_klasses``. Config-only — no behavior tests, just a new default in the styling pipeline. The optional-pinned-row plumbing (#777) is already exercised by ``gridUtils.test.ts``. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Wires #783's keyed SD cache into
_merged_sdso the dataflow emits the single prefixed-keyall_statsshape that #777's?keyJS already consumes. This is the reconciliation of the two competing scoped-SD designs (per the discussion that led #778 to be closed in favor of going through #783).mean,histogram_bins, …) come from the raw scope SD in the cache.filtered_*keys are layered on top from the filt scope SD whenfilt_sd_key != raw_sd_key(search filter active).?keyprefix marks an optional pinned row #777 — the mergedmerged_sdflows through the existingdf_data_dict.all_statspath. The cache + pointer traits stay un-synced; they're Python-internal memoization.Breaking change
merged_sd[col]["mean"]now refers to the pre-filter (raw) value, not the post-everything view. Post-filter values are available asfiltered_meanetc. when a filter is active.What it corrects in #783
_scope_cache_keynow foldsid(sampled_df)andpost_processing_methodinto the hash, so araw_dfswap (or sample-method flip, or post-processing change) with an unchanged chain no longer reuses a stale entry. Newtest_raw_df_change_invalidates_scoped_sdpins this._compute_scope_dfnow applies the active post-processing method to the raw/clean base df, so when post-processing replaces the frame entirely (e.g.hide_post→SENTINEL_DF), the raw-scope bare keys carry the new df's column metadata. Restores thetest_hide_column_config_post_processing/test_add_analysisinvariants.merged_sd— so the parquet-b64 conversion is dead weight here._merged_sdreads cached dicts directly.Deferred (separate issue)
analysis_klassesnot in cache identity. Same fold-into-extra-arg pattern as P1; filed as a follow-up issue.cleaned_*scope (third scope) inmerged_sd. The cache already computes the clean scope SD; wiring it intomerged_sdascleaned_*keys is straightforward but deferred to keep this PR focused on the raw + filt invariants. Same?keymechanism handles it on the frontend.Test plan
tests/unit/dataflow/scoped_summary_stats_test.pycovering no-cleaning/no-filter baseline, filter activatesfiltered_*keys, barelengthreflects raw dataset, raw_df swap invalidates cache.Replaces
This branch supersedes the closed #778. The old branch was based on a parallel design (5-tuple
handle_ops_and_cleanreturning both filtered and unfiltered dfs); the reconciled shape here reuses #783's per-scope cache as the substrate instead.🤖 Generated with Claude Code