feat(dataflow): skip_stats_by_scope — per-scope stat skiplist; xorq skips histogram on filt+clean by paddymul · Pull Request #829 · buckaroo-data/buckaroo

paddymul · 2026-05-21T20:46:45Z

Summary

Adds a generic DataFlow.skip_stats_by_scope class attribute that lets a backend declare "don't run stat X when computing scope Y's SD". The motivating case (and only override shipping here): XorqDataflow skips histogram on the filt and clean scopes — on xorq the per-column histogram queries dominate state_change latency on remote tables, and only the bare (raw) histogram is rendered in merged_sd. Running the histogram on filt + clean too was wasted engine work that didn't show in the UI. Pandas / polars dataflows keep the default empty dict, so their filt/clean histograms continue to compute and render.

class XorqDataflow(CustomizableDataflow):
    skip_stats_by_scope = {'filt': {'histogram'}, 'clean': {'histogram'}}
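
A minimal sketch of how a class-level skiplist like this can be declared and overridden. The `Dataflow` base class and the `skips_for` helper below are hypothetical stand-ins for illustration, not buckaroo's actual classes; only the attribute name and the xorq override value come from this PR.

```python
class Dataflow:
    # Hypothetical stand-in for the real base class.
    # Default: an empty map, i.e. no stat is skipped in any scope.
    skip_stats_by_scope = {}

    @classmethod
    def skips_for(cls, scope):
        # Hypothetical helper: stat names to omit for one scope
        # ('raw', 'clean' or 'filt'). Returns a copy so callers
        # cannot mutate the class-level default.
        return set(cls.skip_stats_by_scope.get(scope, ()))


class XorqLikeDataflow(Dataflow):
    # Same value this PR ships on XorqDataflow.
    skip_stats_by_scope = {'filt': {'histogram'}, 'clean': {'histogram'}}


class PolarsLikeDataflow(Dataflow):
    pass  # inherits the empty default, so every stat still runs


print(XorqLikeDataflow.skips_for('filt'))
print(XorqLikeDataflow.skips_for('raw'))
print(PolarsLikeDataflow.skips_for('clean'))
```

Because the attribute lives on the class, a backend opts in by overriding one line, and backends that don't override it see no behaviour change.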

Adapted from the working draft on smorg/post-785-playground (4949e56c), stripped of references to the closed #787 spike (recompute_summary_sd, _defer_summary_sd) and the not-yet-merged #809 (cost_classes).

Plumbing

skip_stat_names threads from the dataflow into the underlying pipelines:

_summary_sd (filt scope) applies the filt skip when quick_command_args is truthy. It can't consult operations, because the cascade hasn't updated operations yet when this observer fires.
_populate_sd_cache (raw + clean scopes) uses _effective_skip(scope, chains), which returns None when the scope is degenerate (no filter → filt == clean; no cleaning → clean == raw). This keeps the no-filter / no-cleaning case collapsing raw + clean + filt under one cache key, so the lazy-postprocessor invariant on XorqBuckarooInfiniteWidget (TestLazyPostprocessor) still holds: the postprocessor runs once per state, not three times.
_scope_cache_key incorporates the effective skip, so raw + filt cache entries don't collide when their skips differ.
StatPipeline.process_column skips at iteration (stat-func name match) and strips at output (covers v1 ColAnalysis wrappers where one StatFunc provides many keys).
XorqStatPipeline.process_table skips in both batch-aggregate and per-column phases — actual engine queries are elided, not just the output.
DfStats (v1) accepts the kwarg for API parity and strips at output (v1 AnalysisPipeline has no per-stat filtering).
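
The degenerate-scope rule and the cache-key change can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the `chains` shape (scope name mapped to a tuple of op names), the module-level `SKIP_STATS_BY_SCOPE`, and both function bodies are hypothetical, and how the real helper resolves partially degenerate cases is not described here.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Chain = Tuple[str, ...]   # hypothetical: ordered op names applied to reach a scope
Chains = Dict[str, Chain]

SKIP_STATS_BY_SCOPE = {'filt': {'histogram'}, 'clean': {'histogram'}}


def effective_skip(scope: str, chains: Chains) -> Optional[FrozenSet[str]]:
    """Skip set for `scope`, or None when the scope is degenerate."""
    if scope == 'filt' and chains['filt'] == chains['clean']:
        return None  # no filter applied: filt sees the same data as clean
    if scope == 'clean' and chains['clean'] == chains['raw']:
        return None  # no cleaning applied: clean sees the same data as raw
    skip = SKIP_STATS_BY_SCOPE.get(scope)
    return frozenset(skip) if skip else None


def scope_cache_key(scope: str, chains: Chains):
    # Key on the chain plus the effective skip, so two scopes share a cache
    # entry only when they see the same data AND compute the same stats.
    return (chains[scope], effective_skip(scope, chains))


no_ops = {'raw': (), 'clean': (), 'filt': ()}
filtered = {'raw': (), 'clean': (), 'filt': ('search',)}

# No filter, no cleaning: all three scopes collapse to a single key.
print({scope_cache_key(s, no_ops) for s in ('raw', 'clean', 'filt')})
# Active filter: raw and filt get distinct keys because their skips differ.
print(scope_cache_key('raw', filtered), scope_cache_key('filt', filtered))
```

Returning None for degenerate scopes is what lets the shared entry hold the full, unskipped stats; keying on the skip is what prevents a skipped filt result from being served as raw.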

Tests

Failing-then-fix split per repo TDD policy. CI saw 5 failing tests on the first push (commit d79ad48e) and green on the fix (38d76795):

test_skip_stats_by_scope_excludes_named_stat_from_filt — skip mean on filt; filtered_mean absent, bare mean present.
test_skip_stats_by_scope_excludes_from_raw_and_clean — skip on raw/clean; bare mean absent, filtered_mean present.
test_skip_stats_by_scope_default_empty_runs_all_stats — no override → unchanged.
test_xorq_dataflow_skips_histogram_on_filt_and_clean — pins XorqDataflow's default.
test_polars_dataflow_keeps_histogram — Polars keeps the default empty dict.

Full tests/unit/ suite green locally (1049 passed, 6 skipped) — including TestLazyPostprocessor which caught an early version of the cache-key change that re-ran big_a 3× because raw/clean/filt no longer shared a key.

Test plan

Failing test commit pushed first; CI runs the new tests red
Fix commit makes them pass
Full unit suite green locally (1049 passed)
Verify in JupyterLab against the boston restaurant data that filtered/cleaned histogram rows are gone and bare histogram still renders
Measure state_change latency on boston-restaurant before/after — expect the ~6.5s histogram-query slice to drop to ~0.5s (per the 4949e56c commit message claim of ~13× speedup)
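
For reference, the arithmetic behind the expected numbers, using only the approximate figures quoted in this PR (the 9 s total comes from the failing-test commit message). The projection assumes the rest of state_change is unaffected, which is an assumption until the measurement above is actually run.

```python
# Figures quoted in this PR (all approximate).
state_change_s = 9.0        # total state_change latency before the change
histogram_before_s = 6.5    # histogram queries across raw + clean + filt
histogram_after_s = 0.5     # histogram queries once filt + clean are skipped

# Speedup of the histogram slice alone.
slice_speedup = histogram_before_s / histogram_after_s

# Projected end-to-end latency, assuming nothing else changes.
projected_state_change_s = state_change_s - histogram_before_s + histogram_after_s
overall_speedup = state_change_s / projected_state_change_s

print(f"histogram slice: {slice_speedup:.0f}x")
print(f"projected state_change: {projected_state_change_s:.1f} s "
      f"({overall_speedup:.0f}x overall)")
```

The ~13× figure applies to the histogram-query slice only; under the same assumption, end-to-end state_change would go from about 9 s to about 3 s.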

🤖 Generated with Claude Code

Pins the contract for a new ``DataFlow.skip_stats_by_scope`` class attribute: a ``{scope_name: {stat_name, ...}}`` map that tells ``_summary_sd`` and ``_populate_sd_cache`` to omit the named stats when computing each scope's SD.

Five tests across two files:

scoped_summary_stats_test.py

- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: a subclass with ``skip_stats_by_scope = {'filt': {'mean'}}`` must not emit ``filtered_mean`` while ``mean`` (raw) and ``filtered_length`` remain.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on ``raw`` + ``clean`` removes the bare stat key while filt is unaffected.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: existing CustomizableDataflow subclasses (no override) keep behaviour unchanged.

test_xorq_buckaroo_widget.py

- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins ``XorqDataflow.skip_stats_by_scope = {'filt': {'histogram'}, 'clean': {'histogram'}}``, the user-requested default to keep the per-column engine queries off the hot path while still rendering the bare histogram from raw.
- ``test_polars_dataflow_keeps_histogram``: Polars side keeps the default empty dict so filt/clean histograms continue to render.

Motivation (per smorg/post-785-playground ``4949e56c``): on the boston restaurant data ~6.5 s of every 9 s state_change is spent in xorq's per-column histogram queries across the three scopes. Skipping histogram on filt+clean drops that to ~0.5 s without losing the visible histogram (still computed in raw and read from the bare key in merged_sd).

Fix follows in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…kips histogram on filt+clean

Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a backend declare "don't run stat X when computing scope Y's SD". The motivating case (and only override shipping here): ``XorqDataflow`` sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}``. On xorq the per-column histogram queries dominate state_change latency on remote tables, and the user only renders the bare (raw) histogram in ``merged_sd``. Running the histogram on filt + clean too was wasted work that didn't show up in the UI. Pandas / polars dataflows keep the default empty dict, so their filt/clean histograms continue to appear.

## Plumbing

``skip_stat_names`` threads from the dataflow into the underlying pipelines:

- ``_summary_sd`` (filt scope): applies filt skip when ``quick_command_args`` is truthy (the cascade hasn't updated ``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes): uses ``_effective_skip(scope, chains)``, which returns ``None`` when the scope is degenerate (filt == clean → no filter; clean == raw → no cleaning). Keeps the no-filter / no-cleaning case collapsing raw + clean + filt under one cache key so the lazy-postprocessor ``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget`` still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` / ``process_df_v1_compat``: skip at iteration (stat-func name match) plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers where one ``StatFunc`` provides many keys: ``DefaultSummaryStats__series`` provides ``mean``, ``max``, ``std``, ...; we can't skip the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``: skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the output level since v1 ``AnalysisPipeline`` doesn't have per-stat filtering.

## Tests (failing-then-fix split per repo TDD policy)

Five new tests in the previous commit, all green now:

- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: skip ``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present, other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: no override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram``: Polars side still has empty skiplist.

## Regression

Full ``tests/unit/`` suite green (1049 passed, 6 skipped), including ``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget``, which caught the early version of the cache-key change that recomputed raw+clean scopes whenever skip differed from filt.

Adapted from the working draft on smorg/post-785-playground (``4949e56c``), stripped of references to the closed #787 spike (``recompute_summary_sd`` / ``_defer_summary_sd``) and the not-yet-merged #809 (``cost_classes``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
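
The two-stage filtering described under Plumbing (skip at iteration, then strip at output) can be sketched as below. This is a simplified illustration, not ``StatPipeline.process_column`` itself: the ``(name, provides, func)`` tuple shape, the ``summary_wrapper`` name, and the toy stat functions are all hypothetical.

```python
def summary(values):
    # Toy multi-key stat, standing in for a v1 wrapper that provides many keys.
    return {'mean': sum(values) / len(values), 'max': max(values)}


calls = []  # records which expensive stat functions actually ran


def histogram(values):
    calls.append('histogram')
    return {'histogram': sorted(values)}


# Hypothetical registry: (stat-func name, keys it provides, callable).
STAT_FUNCS = [
    ('summary_wrapper', ('mean', 'max'), summary),
    ('histogram', ('histogram',), histogram),
]


def process_column(values, stat_funcs, skip_stat_names=None):
    skip = set(skip_stat_names or ())
    out = {}
    for name, provides, func in stat_funcs:
        if name in skip:
            continue  # skip at iteration: the function never runs
        result = func(values)
        out.update({key: result[key] for key in provides})
    # Strip at output: removes skipped keys that came from a multi-key
    # wrapper, which cannot be skipped as a whole for the sake of one key.
    return {key: value for key, value in out.items() if key not in skip}


skipped = process_column([1, 2, 3], STAT_FUNCS, {'histogram', 'mean'})
print(skipped, calls)
```

Skipping ``histogram`` avoids the call entirely, while skipping ``mean`` still runs the wrapper (it also provides ``max``) and only drops the key afterwards.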

github-actions · 2026-05-21T20:51:05Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26252900166" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38d7679568


chatgpt-codex-connector · 2026-05-21T20:51:50Z

+        # this observer via ``processed_result``, *then* sets
+        # ``self.operations``). ``quick_command_args`` is the upstream
+        # trait, set before ``_operation_result`` runs.
+        filter_active = bool(self.quick_command_args)


Determine filt-skip from realized filter ops

Using bool(self.quick_command_args) here can misclassify "no filter" states as filtered. generate_quick_ops explicitly treats payloads like {'search': ['']} as a no-op, so quick_command_args may be non-empty while no quick-command op exists. In that case _summary_sd still applies the filt skip. Because the raw/clean/filt chains collapse to one key, _populate_sd_cache can then store the skipped summary_sd under the shared key and skip recomputing raw, dropping bare stats (e.g. the raw histogram) after the search is cleared.
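
The truthiness problem raised in this review comment can be reproduced in isolation. The function below is a hypothetical stand-in for generate_quick_ops, written only to mirror the one behaviour the review states (a {'search': ['']} payload yields no ops); the real function's signature and output are not shown in this PR.

```python
def generate_quick_ops_sketch(quick_command_args):
    # Hypothetical stand-in: drop empty terms, emit an op only if any remain.
    ops = []
    for command, args in quick_command_args.items():
        terms = [arg for arg in args if arg not in ('', None)]
        if terms:
            ops.append((command, terms))
    return ops


cleared_search = {'search': ['']}

# A non-empty dict is truthy even though it describes no filter at all.
filter_active_by_truthiness = bool(cleared_search)

# Deriving the flag from the realized ops gives the intended answer.
filter_active_by_ops = bool(generate_quick_ops_sketch(cleared_search))

print(filter_active_by_truthiness, filter_active_by_ops)
```

Under these assumptions, deciding the filt skip from the realized ops rather than from the raw payload would avoid applying the skip to a state whose scopes all share one cache key.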


paddymul and others added 2 commits May 21, 2026 16:28

paddymul added the triaged label (Reviewed and triaged) May 21, 2026

paddymul temporarily deployed to testpypi May 21, 2026 20:50 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 21, 2026


paddymul temporarily deployed to testpypi May 21, 2026 20:58 — with GitHub Actions Inactive

paddymul mentioned this pull request May 21, 2026

feat(polars): default pinned_rows include ?filtered_histogram #830

Merged

paddymul force-pushed the feat/skip-stats-by-scope branch from ed80f57 to 38d7679 Compare May 21, 2026 21:01

paddymul temporarily deployed to testpypi May 21, 2026 21:03 — with GitHub Actions Inactive

paddymul wants to merge 2 commits into main from feat/skip-stats-by-scope
