Skip to content

feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd#785

Merged
paddymul merged 5 commits into
mainfrom
feat/scoped-sd-merged-prefix-v2
May 21, 2026
Merged

feat(scoped-sd): merge raw + filt scope SDs into prefixed merged_sd#785
paddymul merged 5 commits into
mainfrom
feat/scoped-sd-merged-prefix-v2

Conversation

@paddymul
Copy link
Copy Markdown
Collaborator

Summary

Wires #783's keyed SD cache into _merged_sd so the dataflow emits the single prefixed-key all_stats shape that #777's ?key JS already consumes. This is the reconciliation of the two competing scoped-SD designs (per the discussion that led #778 to be closed in favor of going through #783).

  • Bare keys (mean, histogram_bins, …) come from the raw scope SD in the cache.
  • filtered_* keys are layered on top from the filt scope SD when filt_sd_key != raw_sd_key (search filter active).
  • Frontend rendering is unchanged from feat(pinned-rows): ?key prefix marks an optional pinned row #777 — the merged merged_sd flows through the existing df_data_dict.all_stats path. The cache + pointer traits stay un-synced; they're Python-internal memoization.

Breaking change

merged_sd[col]["mean"] now refers to the pre-filter (raw) value, not the post-everything view. Post-filter values are available as filtered_mean etc. when a filter is active.

What it corrects in #783

  1. Codex P1 — raw_df invalidation. _scope_cache_key now folds id(sampled_df) and post_processing_method into the hash, so a raw_df swap (or sample-method flip, or post-processing change) with an unchanged chain no longer reuses a stale entry. New test_raw_df_change_invalidates_scoped_sd pins this.
  2. Post-processing reflection in raw scope. _compute_scope_df now applies the active post-processing method to the raw/clean base df, so when post-processing replaces the frame entirely (e.g. hide_postSENTINEL_DF), the raw-scope bare keys carry the new df's column metadata. Restores the test_hide_column_config_post_processing / test_add_analysis invariants.
  3. Cache stores dicts, not parquet-b64. The cache is no longer synced — the frontend consumes only merged_sd — so the parquet-b64 conversion is dead weight here. _merged_sd reads cached dicts directly.

Deferred (separate issue)

  • Codex P2 — analysis_klasses not in cache identity. Same fold-into-extra-arg pattern as P1; filed as a follow-up issue.
  • cleaned_* scope (third scope) in merged_sd. The cache already computes the clean scope SD; wiring it into merged_sd as cleaned_* keys is straightforward but deferred to keep this PR focused on the raw + filt invariants. Same ?key mechanism handles it on the frontend.

Test plan

  • 4/4 new tests in tests/unit/dataflow/scoped_summary_stats_test.py covering no-cleaning/no-filter baseline, filter activates filtered_* keys, bare length reflects raw dataset, raw_df swap invalidates cache.
  • Full Python suite: 954 passed (one MCP test deselected — the known empty-static-artifact-in-worktree issue called out in feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783's PR body, unrelated to this change).
  • CI green

Replaces

This branch supersedes the closed #778. The old branch was based on a parallel design (5-tuple handle_ops_and_clean returning both filtered and unfiltered dfs); the reconciled shape here reuses #783's per-scope cache as the substrate instead.

🤖 Generated with Claude Code

paddymul and others added 2 commits May 20, 2026 18:10
…d_sd

Four tests covering the dataflow-level shape of the scope-merged
``merged_sd`` that sits on top of #783's keyed-SD cache:

- no cleaning, no filter → only bare-key raw scope (passes today)
- filter active → bare-key raw + ``filtered_*`` scope (fails today)
- bare ``length`` reflects raw 5-row dataset, not post-filter 3-row
  view (fails today — deliberate breaking change)
- raw_df swap surfaces the new dataset's stats (passes today; will
  fail and need codex's P1 fix once the cache becomes the bare-key
  source)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Wires #783's keyed SD cache into ``_merged_sd`` so the dataflow emits
the single prefixed-key ``all_stats`` shape that #777's ``?key`` JS
already consumes:

- Bare keys (``mean``, ``histogram_bins``, …) come from the raw scope
  SD in the cache.
- ``filtered_*`` keys are layered on top from the filt scope SD when
  ``filt_sd_key != raw_sd_key`` (i.e. a search filter is active).

Frontend rendering is unchanged from #777 — the merged ``merged_sd``
flows through the existing ``df_data_dict.all_stats`` path. The cache
and pointer traits stay un-synced; they're Python-internal
memoization, not a wire-format primitive.

Deliberate breaking change: ``merged_sd[col]["mean"]`` now refers to
the pre-filter (raw) value, not the post-everything view. Post-filter
values are available as ``filtered_mean`` etc. when a filter is
active.

Three substantive corrections to the cache shape from #783:

1. **Codex P1 — raw_df invalidation.** ``_scope_cache_key`` now folds
   ``id(sampled_df)`` and ``post_processing_method`` into the hash, so
   a ``raw_df`` swap (or sample-method flip, or post-processing change)
   with an unchanged chain no longer reuses a stale entry. The new
   ``test_raw_df_change_invalidates_scoped_sd`` test pins this.
2. **Post-processing reflection in raw scope.** ``_compute_scope_df``
   now applies the active post-processing method to the raw/clean
   base df, so when post-processing replaces the frame entirely (e.g.
   ``hide_post`` → ``SENTINEL_DF``), the raw-scope bare keys carry
   the new df's column metadata. Restores the
   ``test_hide_column_config_post_processing`` and
   ``test_add_analysis`` invariants from
   ``customizable_dataflow_test.py``.
3. **Cache stores dicts, not parquet-b64.** The cache is no longer
   synced — the frontend consumes only ``merged_sd`` — so the
   parquet-b64 conversion is dead weight here. ``_merged_sd`` reads
   the cached dicts directly. (The parquet-b64 wire form continues to
   apply at the ``df_data_dict.all_stats`` boundary.)

Codex P2 (``analysis_klasses`` not in cache identity) is deferred —
filed as a follow-up issue. The same pattern (fold into the ``extra``
arg of ``hash_chain``) will apply.

Test plan:
- 4/4 new tests in ``tests/unit/dataflow/scoped_summary_stats_test.py``
- Full Python suite: 954 passed, 0 failures (one MCP test was
  deselected — the known empty-static-artifact-in-worktree issue
  called out in #783's PR body, unrelated to this change).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Copy link
Copy Markdown
Contributor

github-actions Bot commented May 20, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26198305638

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26198305638" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 62cf525b21

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

"""
sampled_id = id(self.sampled_df) if self.sampled_df is not None else 0
pp = self.post_processing_method or ''
return hash_chain(chain, extra=f"{sampled_id}|{pp}")
Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 Badge Include analysis class set in scope cache key

When analysis_klasses changes (for example via add_analysis) but the op chain, sampled dataframe identity, and post-processing method stay the same, _scope_cache_key returns the same hash and _populate_sd_cache treats all scopes as cache hits. Because _merged_sd now reads from summary_stats_cache rather than summary_sd, the merged stats can stay stale and omit newly added analysis outputs until some unrelated state change forces a new key. This breaks correctness for dynamic analysis updates in a single widget session.

Useful? React with 👍 / 👎.

Pre-asserts the bug: with cleaning_method='default' and no search
filter, the current filter_active gate (filt_sd_key != raw_sd_key)
fires because the filt chain differs from the empty raw chain. Result
is nine spurious filtered_* keys (filtered_mean, filtered_length, ...)
carrying cleaning-affected stats. Semantically those keys mean "search
filter is on", not "cleaning is on".

Expected to fail on CI; fix in the next commit reroutes the gate to
chains['filt'] != chains['clean'].

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The previous gate fired whenever filt_sd_key != raw_sd_key, which is
true any time cleaning ops exist (the filt chain is the clean chain
plus quick-commands; the raw chain is always empty). Result: with
cleaning_method='default' and no search filter, ~9 cleaning-affected
stats per column were exported under filtered_* keys, when those keys
semantically mean "search filter is active".

Re-derive the gate from the chains directly: chains['filt'] !=
chains['clean'] is true iff at least one quick-command op is present —
exactly the "search filter applied on top of cleaning" condition.

The deferred cleaned_* scope (PR #785 body) will give cleaning-affected
stats their own keys; until then, omitting them from the merged_sd is
strictly better than mislabelling them.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 21, 2026
…ation

Adds four tests pinning the deferred items on #785:

- ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD
  must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when
  cleaning ops are active.
- ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken
  ``filter_active`` gate currently mislabels cleaning-affected stats
  as ``filtered_*`` when no search filter is active. The right gate is
  on chain-shape diff (``filt != clean``).
- ``test_filter_and_clean_both_emit_correctly`` — both scopes layered
  without cross-talk; ``cleaned_null_count`` reflects the clean df,
  ``filtered_null_count`` reflects the search-nulled df.
- ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2
  pin: a swap of ``analysis_klasses`` must invalidate the per-scope
  SD cache so the new stat klass's keys surface in ``merged_sd``.

All four are expected to fail on this commit; fixes follow.
Captures the punted work that travels with this PR so reviewers + future
maintainers see what's intentionally not in scope and how each item
slots in cleanly later.

- plans/0785-post-processing-known-issues.md — pp × scope edge-case
  coverage punted; mechanism (per-scope pp application + cache key
  including post_processing_method) is structurally sound.
- plans/0785-cleaning-scope-known-issues.md — cleaned_* keys in
  merged_sd deferred; coupled with the filter_active gate bug
  (xfail'd test pins it). Right gate is chain-shape diff between
  filt and clean, which only makes sense once cleaned_* is layered.
- plans/0785-codex-p2-analysis-klasses.md — analysis_klasses not in
  cache key (Codex P2); one-line fold-into-extra fix, deferred
  because the bug fires only in dev mutation workflows.
- plans/0785-xorq-cache-delegation.md — for xorq scopes, lean on
  xorq's expression cache instead of duplicating in
  summary_stats_cache. Refinement, not correctness.

No code changes — branch notes only. Each file ends with a "How this
slots in cleanly" / "What un-punting looks like" section.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
paddymul added a commit that referenced this pull request May 21, 2026
…ation

Adds four tests pinning the deferred items on #785:

- ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD
  must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when
  cleaning ops are active.
- ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken
  ``filter_active`` gate currently mislabels cleaning-affected stats
  as ``filtered_*`` when no search filter is active. The right gate is
  on chain-shape diff (``filt != clean``).
- ``test_filter_and_clean_both_emit_correctly`` — both scopes layered
  without cross-talk; ``cleaned_null_count`` reflects the clean df,
  ``filtered_null_count`` reflects the search-nulled df.
- ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2
  pin: a swap of ``analysis_klasses`` must invalidate the per-scope
  SD cache so the new stat klass's keys surface in ``merged_sd``.

All four are expected to fail on this commit; fixes follow.
@paddymul paddymul added this pull request to the merge queue May 21, 2026
Merged via the queue into main with commit 7a75b5b May 21, 2026
27 checks passed
paddymul added a commit that referenced this pull request May 21, 2026
…ation

Adds four tests pinning the deferred items on #785:

- ``test_cleaned_keys_appear_when_cleaning_active`` — clean scope SD
  must be layered into ``merged_sd`` with a ``cleaned_*`` prefix when
  cleaning ops are active.
- ``test_cleaning_only_does_not_emit_filtered_keys`` — the broken
  ``filter_active`` gate currently mislabels cleaning-affected stats
  as ``filtered_*`` when no search filter is active. The right gate is
  on chain-shape diff (``filt != clean``).
- ``test_filter_and_clean_both_emit_correctly`` — both scopes layered
  without cross-talk; ``cleaned_null_count`` reflects the clean df,
  ``filtered_null_count`` reflects the search-nulled df.
- ``test_analysis_klasses_change_invalidates_scoped_sd`` — Codex P2
  pin: a swap of ``analysis_klasses`` must invalidate the per-scope
  SD cache so the new stat klass's keys surface in ``merged_sd``.

All four are expected to fail on this commit; fixes follow.
paddymul added a commit that referenced this pull request May 21, 2026
Adds an optional ``?filtered_histogram`` pinned row to polars's default
styling. The ``?`` prefix (#777) means JS hides the row when no column
has the ``filtered_histogram`` key in ``merged_sd``, so a no-filter
widget keeps a one-line ``[dtype, histogram]`` header. When the user
runs a search, ``filtered_histogram`` gets layered onto each column's
merged_sd entry (#785) and the row renders.

Polars-only on purpose:
- xorq skips computing ``filtered_histogram`` entirely (this PR's
  ``skip_stats_by_scope``), so the optional row never appears.
- Pandas keeps the original ``[dtype, histogram]`` default.

New ``pinned_filtered_histogram()`` helper in ``styling_helpers``,
mirroring the existing ``pinned_histogram()``. New
``PolarsMainStyling(DefaultMainStyling)`` in ``polars_buckaroo``
overrides ``pinned_rows`` and replaces ``DefaultMainStyling`` in
``local_analysis_klasses``.

Config-only — no behavior tests, just a new default in the styling
pipeline. The optional-pinned-row plumbing (#777) is already
exercised by ``gridUtils.test.ts``.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant