
feat(dataflow): skip_stats_by_scope — per-scope stat skiplist; xorq skips histogram on filt+clean #829

Open: paddymul wants to merge 2 commits into main from feat/skip-stats-by-scope


@paddymul
Collaborator

Summary

Adds a generic DataFlow.skip_stats_by_scope class attribute that lets a backend declare "don't run stat X when computing scope Y's SD". The motivating case, and the only override shipping here, is XorqDataflow, which skips histogram on the filt and clean scopes. On xorq, the per-column histogram queries dominate state_change latency on remote tables, yet only the bare (raw) histogram is rendered from merged_sd. Running the histogram on filt and clean as well was wasted engine work that never showed in the UI. The pandas and polars dataflows keep the default empty dict, so their filt/clean histograms continue to compute and render.

class XorqDataflow(CustomizableDataflow):
    skip_stats_by_scope = {'filt': {'histogram'}, 'clean': {'histogram'}}
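
As a minimal sketch of how such a class attribute might be consulted: the `DataFlow` stand-in, the `PolarsLikeDataflow` subclass, and the `skip_for` helper below are hypothetical names for illustration, not buckaroo's API. Only the attribute name and the xorq override come from this PR.

```python
class DataFlow:
    # Default: empty map, so every stat runs in every scope.
    skip_stats_by_scope = {}

    @classmethod
    def skip_for(cls, scope):
        """Hypothetical helper: stat names to omit when computing ``scope``."""
        return frozenset(cls.skip_stats_by_scope.get(scope, ()))


class XorqDataflow(DataFlow):
    # The override described in this PR.
    skip_stats_by_scope = {"filt": {"histogram"}, "clean": {"histogram"}}


class PolarsLikeDataflow(DataFlow):
    pass  # hypothetical backend that inherits the empty default


print(XorqDataflow.skip_for("filt"))        # histogram is skipped on filt
print(XorqDataflow.skip_for("raw"))         # nothing skipped on raw
print(PolarsLikeDataflow.skip_for("filt"))  # nothing skipped at all
```

Because the map is a class attribute, a backend opts in by overriding one line, and backends that do nothing keep the existing behaviour.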

Adapted from the working draft on smorg/post-785-playground (4949e56c), stripped of references to the closed #787 spike (recompute_summary_sd, _defer_summary_sd) and the not-yet-merged #809 (cost_classes).

Plumbing

skip_stat_names threads from the dataflow into the underlying pipelines:

  • _summary_sd (filt scope) applies the filt skip when quick_command_args is truthy. It can't use operations, because the cascade hasn't updated operations yet when this observer fires.
  • _populate_sd_cache (raw + clean scopes) uses _effective_skip(scope, chains), which returns None when the scope is degenerate (no filter → filt == clean; no cleaning → clean == raw). This keeps the no-filter / no-cleaning case collapsing raw + clean + filt under one cache key, so the lazy-postprocessor invariant on XorqBuckarooInfiniteWidget (TestLazyPostprocessor) still holds.
  • _scope_cache_key incorporates the effective skip, so raw + filt cache entries don't collide when their skips differ.
  • StatPipeline.process_column skips at iteration (stat-func name match) and strips at output (covers v1 ColAnalysis wrappers where one StatFunc provides many keys).
  • XorqStatPipeline.process_table skips in both batch-aggregate and per-column phases — actual engine queries are elided, not just the output.
  • DfStats (v1) accepts the kwarg for API parity and strips at output (v1 AnalysisPipeline has no per-stat filtering).
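
The degenerate-scope rule and its effect on cache keys can be sketched as follows. This follows the description in the list above rather than the PR's actual code: the function names lack the leading underscore on purpose, and the representation of `chains` as a dict of operation tuples is an assumption.

```python
from typing import Optional

# Skip map as declared on XorqDataflow in this PR.
SKIP_BY_SCOPE = {"filt": {"histogram"}, "clean": {"histogram"}}


def effective_skip(scope, chains, skip_by_scope=SKIP_BY_SCOPE) -> Optional[frozenset]:
    """Return the skip set for ``scope``, or None when the scope is degenerate.

    ``chains`` maps scope name -> tuple of operations applied to reach that
    scope (an assumed representation).
    """
    if scope == "filt" and chains["filt"] == chains["clean"]:
        return None  # no filter active: filt is the same data as clean
    if scope == "clean" and chains["clean"] == chains["raw"]:
        return None  # no cleaning active: clean is the same data as raw
    skip = skip_by_scope.get(scope)
    return frozenset(skip) if skip else None


def scope_cache_key(scope, chains, skip_by_scope=SKIP_BY_SCOPE):
    # The skip is part of the key, so entries computed with different
    # skips never overwrite each other.
    return (chains[scope], effective_skip(scope, chains, skip_by_scope))


no_ops = {"raw": (), "clean": (), "filt": ()}
filtered = {"raw": (), "clean": (), "filt": ("search",)}

# No filter and no cleaning: all three scopes share one key.
print({scope_cache_key(s, no_ops) for s in no_ops})
# Filter active: filt gets its own key; raw and clean still share one.
print(scope_cache_key("filt", filtered) != scope_cache_key("raw", filtered))
```

Returning None for degenerate scopes is what lets the three scopes collapse to a single cache entry, so an expensive postprocessor runs once rather than once per scope.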

Tests

Failing-then-fix split per repo TDD policy. CI showed 5 failing tests on the first push (commit d79ad48e) and went green on the fix (38d76795):

  • test_skip_stats_by_scope_excludes_named_stat_from_filt — skip mean on filt; filtered_mean absent, bare mean present.
  • test_skip_stats_by_scope_excludes_from_raw_and_clean — skip on raw/clean; bare mean absent, filtered_mean present.
  • test_skip_stats_by_scope_default_empty_runs_all_stats — no override → unchanged.
  • test_xorq_dataflow_skips_histogram_on_filt_and_clean — pins XorqDataflow's default.
  • test_polars_dataflow_keeps_histogram — Polars keeps the default empty dict.
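
The contract pinned by the first three tests can be illustrated with a toy single-column model. Everything here (`STATS`, `summarize`, `merged_sd`) is a hypothetical stand-in, not buckaroo code; only the `filtered_` key prefix and the shape of the skip map come from the PR.

```python
from statistics import mean

# Toy stat registry: stat name -> function over a list of values.
STATS = {"length": len, "mean": mean}


def summarize(values, skip=frozenset()):
    """Compute every registered stat whose name is not in ``skip``."""
    return {name: fn(values) for name, fn in STATS.items() if name not in skip}


def merged_sd(raw_values, filt_values, skip_by_scope):
    """Bare keys come from the raw scope, ``filtered_*`` keys from filt."""
    merged = summarize(raw_values, skip_by_scope.get("raw", frozenset()))
    filt = summarize(filt_values, skip_by_scope.get("filt", frozenset()))
    merged.update({f"filtered_{key}": value for key, value in filt.items()})
    return merged


# Skip mean on filt only: filtered_mean disappears, bare mean stays.
sd = merged_sd([1, 2, 3, 4], [3, 4], {"filt": {"mean"}})
print(sorted(sd))

# Empty skip map: every stat appears in both scopes.
sd_default = merged_sd([1, 2, 3, 4], [3, 4], {})
print(sorted(sd_default))
```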

The full tests/unit/ suite is green locally (1049 passed, 6 skipped). That includes TestLazyPostprocessor, which caught an early version of the cache-key change that re-ran big_a 3× because raw/clean/filt no longer shared a key.

Test plan

  • Failing test commit pushed first; CI runs the new tests red
  • Fix commit makes them pass
  • Full unit suite green locally (1049 passed)
  • Verify in JupyterLab against the boston restaurant data that filtered/cleaned histogram rows are gone and bare histogram still renders
  • Measure state_change latency on boston-restaurant before/after — expect the ~6.5s histogram-query slice to drop to ~0.5s (per the 4949e56c commit message claim of ~13× speedup)
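
The expected numbers can be cross-checked with a little arithmetic. The sketch below takes the ~9 s total per state_change quoted in the first commit message and assumes the non-histogram work is unaffected by the change; the resulting total and overall ratio are derived estimates, not measurements.

```python
total_before = 9.0  # s per state_change, as quoted in the commit message
hist_before = 6.5   # s spent in per-column histogram queries
hist_after = 0.5    # s expected once filt + clean histograms are skipped

other_work = total_before - hist_before       # assumed unchanged by this PR
total_after = other_work + hist_after

hist_speedup = hist_before / hist_after       # speedup of the histogram slice
overall_speedup = total_before / total_after  # speedup of a whole state_change

print(f"histogram slice: {hist_speedup:.0f}x")
print(f"state_change: {total_before:.1f}s -> {total_after:.1f}s ({overall_speedup:.0f}x)")
```

Under these assumptions the ~13× figure applies to the histogram slice only, while a whole state_change would go from about 9 s to about 3 s.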

🤖 Generated with Claude Code

paddymul and others added 2 commits May 21, 2026 16:28
Pins the contract for a new ``DataFlow.skip_stats_by_scope`` class
attribute: a ``{scope_name: {stat_name, ...}}`` map that tells
``_summary_sd`` and ``_populate_sd_cache`` to omit the named stats
when computing each scope's SD.

Five tests across two files:

scoped_summary_stats_test.py
- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: a sub-
  class with ``skip_stats_by_scope = {'filt': {'mean'}}`` must not
  emit ``filtered_mean`` while ``mean`` (raw) and ``filtered_length``
  remain.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on
  ``raw`` + ``clean`` removes the bare stat key while filt is
  unaffected.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: existing
  CustomizableDataflow subclasses (no override) keep behaviour
  unchanged.

test_xorq_buckaroo_widget.py
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins
  ``XorqDataflow.skip_stats_by_scope = {'filt': {'histogram'},
  'clean': {'histogram'}}`` — the user-requested default to keep
  the per-column engine queries off the hot path while still
  rendering the bare histogram from raw.
- ``test_polars_dataflow_keeps_histogram``: Polars side keeps the
  default empty dict so filt/clean histograms continue to render.

Motivation (per smorg/post-785-playground ``4949e56c``):
on the boston restaurant data ~6.5 s of every 9 s state_change is
spent in xorq's per-column histogram queries across the three
scopes. Skipping histogram on filt+clean drops that to ~0.5 s
without losing the visible histogram (still computed in raw and
read from the bare key in merged_sd).

Fix follows in the next commit.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kips histogram on filt+clean

Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a
backend declare "don't run stat X when computing scope Y's SD". The
motivating case (and only override shipping here): ``XorqDataflow``
sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}`` — on xorq
the per-column histogram queries dominate state_change latency on
remote tables, and the user only renders the bare (raw) histogram in
``merged_sd``. Running the histogram on filt + clean too was wasted
work that didn't show up in the UI. Pandas / polars dataflows keep
the default empty dict, so their filt/clean histograms continue to
appear.

## Plumbing

``skip_stat_names`` threads from the dataflow into the underlying
pipelines:

- ``_summary_sd`` (filt scope) — applies filt skip when
  ``quick_command_args`` is truthy (the cascade hasn't updated
  ``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes) — uses
  ``_effective_skip(scope, chains)`` which returns ``None`` when the
  scope is degenerate (filt == clean → no filter; clean == raw → no
  cleaning). Keeps the no-filter / no-cleaning case collapsing raw +
  clean + filt under one cache key so the lazy-postprocessor
  ``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget``
  still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache
  doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` /
  ``process_df_v1_compat``: skip at iteration (stat-func name match)
  plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers
  where one ``StatFunc`` provides many keys, e.g.
  ``DefaultSummaryStats__series`` provides ``mean``, ``max``, ``std``,
  ...; we can't skip the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``:
  skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the
  output level since v1 ``AnalysisPipeline`` doesn't have per-stat
  filtering.
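
A minimal sketch of the two-level skip described for ``StatPipeline.process_column``, assuming stat functions can be modelled as ``(name, fn)`` pairs where ``fn`` returns a dict of stat keys. The ``process_column`` signature, ``STAT_FUNCS``, and both stat functions are hypothetical and do not mirror buckaroo's real classes.

```python
calls = []  # records which stat functions actually ran


def histogram(values):
    calls.append("histogram")
    return {"histogram": sorted(set(values))}


def default_summary(values):
    # One wrapper providing many keys, like a v1 ColAnalysis wrapper.
    calls.append("default_summary")
    return {"mean": sum(values) / len(values), "max": max(values)}


STAT_FUNCS = [("histogram", histogram), ("default_summary", default_summary)]


def process_column(values, stat_funcs, skip_stat_names=None):
    skip = frozenset(skip_stat_names or ())
    out = {}
    for name, fn in stat_funcs:
        if name in skip:
            continue  # skip at iteration: the function never runs
        out.update(fn(values))
    # Strip at output: removes single keys produced by multi-key wrappers.
    return {key: value for key, value in out.items() if key not in skip}


result = process_column([1, 2, 3], STAT_FUNCS, skip_stat_names={"histogram", "mean"})
print(result)  # only "max" survives
print(calls)   # histogram was never called; default_summary still ran
```

The iteration-level skip is what saves work; the output-level strip only hides keys, which is why the wrapper above still runs when just one of its keys is skipped.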

## Tests (failing-then-fix split per repo TDD policy)

Five new tests in the previous commit, all green now:

- ``test_skip_stats_by_scope_excludes_named_stat_from_filt`` — skip
  ``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present,
  other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean`` — skip on
  raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats`` — no
  override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean`` — pins
  XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram`` — Polars side still has
  empty skiplist.

## Regression

Full ``tests/unit/`` suite green (1049 passed, 6 skipped) — including
``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget`` which
caught the early version of the cache-key change that recomputed
raw+clean scopes whenever skip differed from filt.

Adapted from the working draft on smorg/post-785-playground
(``4949e56c``), stripped of references to the closed #787 spike
(``recompute_summary_sd`` / ``_defer_summary_sd``) and the
not-yet-merged #809 (``cost_classes``).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paddymul paddymul added the triaged label (Reviewed and triaged) May 21, 2026
@github-actions
Contributor

github-actions Bot commented May 21, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26252900166" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview


@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment


💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 38d7679568


# this observer via ``processed_result``, *then* sets
# ``self.operations``). ``quick_command_args`` is the upstream
# trait, set before ``_operation_result`` runs.
filter_active = bool(self.quick_command_args)

P1: Determine filt-skip from realized filter ops

Using bool(self.quick_command_args) here can misclassify "no filter" states as filtered. generate_quick_ops explicitly treats payloads like {'search': ['']} as a no-op, so quick_command_args may be non-empty while no quick-command op exists. In that case _summary_sd still applies the filt skip. Because the raw/clean/filt chains collapse to one key, _populate_sd_cache can then store the skipped summary_sd under the shared key and skip recomputing raw, which drops bare stats (e.g. the raw histogram) after the search is cleared.
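
The truthiness pitfall can be reproduced in isolation. In the sketch below, `generate_quick_ops_stub` is a hypothetical stand-in that only imitates the no-op behaviour described in this review for `{'search': ['']}`; the real `generate_quick_ops` signature and return type are not shown on this page.

```python
def generate_quick_ops_stub(quick_command_args):
    """Hypothetical stand-in: emit an op only for commands with a non-blank argument."""
    ops = []
    for command, args in quick_command_args.items():
        if args and any(str(arg).strip() for arg in args):
            ops.append((command, list(args)))
    return ops


cleared_search = {"search": [""]}  # what remains after the search box is cleared

# Truthiness of the raw trait: a non-empty dict is truthy even though
# it describes no actual filter.
naive_filter_active = bool(cleared_search)

# Deriving the flag from the realized ops reflects what will actually run.
realized_filter_active = bool(generate_quick_ops_stub(cleared_search))

print(naive_filter_active, realized_filter_active)
```

Basing the flag on the realized ops is one way to follow the suggestion in the review title; whether it fits the observer ordering noted in the code comment above is for the PR author to confirm.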

