feat(dataflow): skip_stats_by_scope — per-scope stat skiplist; xorq skips histogram on filt+clean#829
paddymul wants to merge 2 commits into
Pins the contract for a new ``DataFlow.skip_stats_by_scope`` class
attribute: a ``{scope_name: {stat_name, ...}}`` map that tells
``_summary_sd`` and ``_populate_sd_cache`` to omit the named stats
when computing each scope's SD.
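A minimal sketch of the attribute shape described above, using toy classes rather than buckaroo's real ones. ``ToyDataFlow``, ``ToyXorqDataflow`` and ``skip_for`` are hypothetical names introduced for illustration; only the ``skip_stats_by_scope`` attribute and its ``{scope_name: {stat_name, ...}}`` shape come from this commit message.

```python
from typing import Dict, Set


class ToyDataFlow:
    # Default: empty map, so no scope skips anything and every stat runs.
    skip_stats_by_scope: Dict[str, Set[str]] = {}

    def skip_for(self, scope: str) -> Set[str]:
        """Hypothetical helper: stat names to omit when computing `scope`."""
        return set(self.skip_stats_by_scope.get(scope, set()))


class ToyXorqDataflow(ToyDataFlow):
    # Mirrors the override this PR describes for the xorq backend.
    skip_stats_by_scope = {"filt": {"histogram"}, "clean": {"histogram"}}


base = ToyDataFlow()
xorq = ToyXorqDataflow()
```

Because the map is a class attribute, a backend opts in by overriding it in a subclass, and subclasses that do not override it keep the empty default.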
Five tests across two files:
scoped_summary_stats_test.py
- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: a
subclass with ``skip_stats_by_scope = {'filt': {'mean'}}`` must not
emit ``filtered_mean`` while ``mean`` (raw) and ``filtered_length``
remain.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on
``raw`` + ``clean`` removes the bare stat key while filt is
unaffected.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: existing
CustomizableDataflow subclasses (no override) keep behaviour
unchanged.
test_xorq_buckaroo_widget.py
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins
``XorqDataflow.skip_stats_by_scope = {'filt': {'histogram'},
'clean': {'histogram'}}`` — the user-requested default to keep
the per-column engine queries off the hot path while still
rendering the bare histogram from raw.
- ``test_polars_dataflow_keeps_histogram``: Polars side keeps the
default empty dict so filt/clean histograms continue to render.
Motivation (per smorg/post-785-playground ``4949e56c``):
on the boston restaurant data ~6.5 s of every 9 s state_change is
spent in xorq's per-column histogram queries across the three
scopes. Skipping histogram on filt+clean drops that to ~0.5 s
without losing the visible histogram (still computed in raw and
read from the bare key in merged_sd).
Fix follows in the next commit.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…kips histogram on filt+clean
Adds a ``DataFlow.skip_stats_by_scope`` class attribute that lets a
backend declare "don't run stat X when computing scope Y's SD". The
motivating case (and only override shipping here): ``XorqDataflow``
sets ``{'filt': {'histogram'}, 'clean': {'histogram'}}`` — on xorq
the per-column histogram queries dominate state_change latency on
remote tables, and the user only renders the bare (raw) histogram in
``merged_sd``. Running the histogram on filt + clean too was wasted
work that didn't show up in the UI. Pandas / polars dataflows keep
the default empty dict, so their filt/clean histograms continue to
appear.
## Plumbing
``skip_stat_names`` threads from the dataflow into the underlying
pipelines:
- ``_summary_sd`` (filt scope) — applies filt skip when
``quick_command_args`` is truthy (the cascade hasn't updated
``operations`` yet when this fires, so we can't use the chains).
- ``_populate_sd_cache`` (raw + clean scopes) — uses
``_effective_skip(scope, chains)`` which returns ``None`` when the
scope is degenerate (filt == clean → no filter; clean == raw → no
cleaning). Keeps the no-filter / no-cleaning case collapsing raw +
clean + filt under one cache key so the lazy-postprocessor
``add_processing(big_a)`` invariant on ``XorqBuckarooInfiniteWidget``
still holds (postprocessor runs once per state, not three times).
- ``_scope_cache_key`` incorporates the effective skip so the cache
doesn't collide raw and filt entries when their skips differ.
- ``StatPipeline.process_column`` / ``process_df`` /
``process_df_v1_compat``: skip at iteration (stat-func name match)
plus a final output-level strip (covers v1 ``ColAnalysis`` wrappers
where one ``StatFunc`` provides many keys: ``DefaultSummaryStats__series``
provides ``mean``, ``max``, ``std``, ...; we can't skip
the whole wrapper for one key).
- ``XorqStatPipeline.process_table`` / ``process_table_v1_compat``:
skip in both the batch-aggregate phase and the per-column phase.
- ``DfStatsV2``, ``PlDfStatsV2``, ``XorqDfStatsV2``: accept and forward.
- ``DfStats`` (v1) accepts the kwarg for API parity but strips at the
output level since v1 ``AnalysisPipeline`` doesn't have per-stat
filtering.
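The two-level skip described for ``StatPipeline`` (skip at iteration, then strip at output) can be sketched roughly as follows. This is a toy model, not buckaroo's code: ``ToyStatFunc``, this ``process_column`` signature, and the dict-of-keys layout are assumptions made for illustration.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ToyStatFunc:
    name: str  # matched against the skiplist at iteration time
    compute: Callable[[List[float]], Dict[str, object]]  # may provide many keys


def process_column(values, stat_funcs, skip_stat_names=frozenset()):
    out: Dict[str, object] = {}
    for sf in stat_funcs:
        if sf.name in skip_stat_names:
            continue  # level 1: the function never runs
        out.update(sf.compute(values))
    # Level 2: strip keys provided by a multi-key function we could not skip whole.
    return {k: v for k, v in out.items() if k not in skip_stat_names}


calls: List[str] = []  # records which expensive functions actually ran


def summary(values):
    return {"mean": sum(values) / len(values), "max": max(values)}


def histogram(values):
    calls.append("histogram")
    return {"histogram": len(values)}


stat_funcs = [
    ToyStatFunc("default_summary_stats", summary),  # hypothetical wrapper name
    ToyStatFunc("histogram", histogram),
]

skipped = process_column([1.0, 2.0, 3.0], stat_funcs, {"mean", "histogram"})
calls_after_skip = list(calls)
full = process_column([1.0, 2.0, 3.0], stat_funcs)
```

In this sketch ``histogram`` is elided before it runs, while ``mean`` is only removed from the output because the wrapper that provides it also provides ``max``.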
## Tests (failing-then-fix split per repo TDD policy)
Five new tests in the previous commit, all green now:
- ``test_skip_stats_by_scope_excludes_named_stat_from_filt`` — skip
``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present,
other ``filtered_*`` keys present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean`` — skip on
raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats`` — no
override means behaviour unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean`` — pins
XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram`` — Polars side still has
empty skiplist.
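As a rough illustration of what the first test asserts, the sketch below merges a raw-scope SD and a filt-scope SD into one per-column dict. ``merge_scopes`` is a hypothetical stand-in for however ``merged_sd`` is really built; the only detail taken from this PR is that filt-scope stats appear under a ``filtered_`` prefix while raw stats use the bare key.

```python
def merge_scopes(raw_sd, filt_sd):
    """Toy merge: bare keys from raw, ``filtered_``-prefixed keys from filt."""
    merged = {col: dict(stats) for col, stats in raw_sd.items()}
    for col, stats in filt_sd.items():
        prefixed = {f"filtered_{key}": value for key, value in stats.items()}
        merged.setdefault(col, {}).update(prefixed)
    return merged


# 'mean' was skipped on the filt scope, so filt_sd simply has no such key.
raw_sd = {"a": {"mean": 2.0, "length": 3}}
filt_sd = {"a": {"length": 2}}

merged = merge_scopes(raw_sd, filt_sd)
```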
## Regression
Full ``tests/unit/`` suite green (1049 passed, 6 skipped) — including
``TestLazyPostprocessor`` on ``XorqBuckarooInfiniteWidget`` which
caught the early version of the cache-key change that recomputed
raw+clean scopes whenever skip differed from filt.
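The cache-key invariant that ``TestLazyPostprocessor`` guards can be modelled with a small sketch. The functions below are simplified stand-ins for ``_effective_skip`` and ``_scope_cache_key``, and representing ``chains`` as a dict of operation tuples is an assumption; the real signatures and data structures may differ.

```python
from typing import Dict, FrozenSet, Optional, Tuple

Chain = Tuple[str, ...]

SKIPS = {"filt": {"histogram"}, "clean": {"histogram"}}


def effective_skip(scope: str, chains: Dict[str, Chain],
                   skips=SKIPS) -> Optional[FrozenSet[str]]:
    # Degenerate scopes get no skip so they can share a key with their parent.
    if scope == "filt" and chains["filt"] == chains["clean"]:
        return None  # no filter active
    if scope == "clean" and chains["clean"] == chains["raw"]:
        return None  # no cleaning active
    names = skips.get(scope)
    return frozenset(names) if names else None


def scope_cache_key(scope: str, chains: Dict[str, Chain], skips=SKIPS):
    # The key includes the effective skip, so different skips never collide.
    return (chains[scope], effective_skip(scope, chains, skips))


scopes = ("raw", "clean", "filt")

# No cleaning and no filter: all three scopes collapse to one key.
idle = {"raw": (), "clean": (), "filt": ()}
idle_keys = {scope_cache_key(s, idle) for s in scopes}

# Cleaning plus a filter: three distinct keys, histogram skipped on two.
busy = {"raw": (), "clean": ("dropna",), "filt": ("dropna", "search")}
busy_keys = {scope_cache_key(s, busy) for s in scopes}
```

With a single shared key in the idle case, a per-state postprocessor only needs to run once, which is the behaviour the regression test checks.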
Adapted from the working draft on smorg/post-785-playground
(``4949e56c``), stripped of references to the closed #787 spike
(``recompute_summary_sd`` / ``_defer_summary_sd``) and the
not-yet-merged #809 (``cost_classes``).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package published
pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166
or with uv:
uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26252900166
MCP server for Claude Code
claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26252900166" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table
📖 Docs preview
🎨 Storybook preview
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 38d7679568
    # this observer via ``processed_result``, *then* sets
    # ``self.operations``). ``quick_command_args`` is the upstream
    # trait, set before ``_operation_result`` runs.
    filter_active = bool(self.quick_command_args)
Determine filt-skip from realized filter ops
Using ``bool(self.quick_command_args)`` here can misclassify a "no filter" state as filtered. ``generate_quick_ops`` explicitly treats payloads like ``{'search': ['']}`` as a no-op, so ``quick_command_args`` may be non-empty while no quick-command op exists. In that case ``_summary_sd`` still applies the filt skip. Because the raw/clean/filt chains collapse to one key, ``_populate_sd_cache`` can then store the skipped ``summary_sd`` under the shared key and skip recomputing raw, which drops bare stats (e.g. the raw histogram) after the search is cleared.
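The mismatch the review points at can be reproduced in a few lines. ``realized_quick_ops`` below is a hypothetical stand-in for ``generate_quick_ops``; the rule that a command whose arguments are all empty strings yields no op is an assumption based on the ``{'search': ['']}`` example in the comment, not the library's actual logic.

```python
def realized_quick_ops(quick_command_args):
    """Toy stand-in: keep only commands with at least one non-empty argument."""
    ops = []
    for command, args in quick_command_args.items():
        if any(arg != "" for arg in args):
            ops.append((command, list(args)))
    return ops


cleared_search = {"search": [""]}
active_search = {"search": ["pizza"]}

# Truthiness of the payload says "filtered" even though nothing is filtered.
naive_cleared = bool(cleared_search)
realized_cleared = bool(realized_quick_ops(cleared_search))

# With a real search term the two checks agree.
naive_active = bool(active_search)
realized_active = bool(realized_quick_ops(active_search))
```

Deriving ``filter_active`` from the realized ops rather than from the raw payload would make the two cases agree, at the cost of running the op generation before the observer fires.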
ed80f57 to 38d7679
## Summary
Adds a generic ``DataFlow.skip_stats_by_scope`` class attribute that lets a backend declare "don't run stat X when computing scope Y's SD". The motivating case (and only override shipping here): ``XorqDataflow`` skips ``histogram`` on the ``filt`` and ``clean`` scopes. On xorq the per-column histogram queries dominate state_change latency on remote tables, and only the bare (raw) histogram is rendered in ``merged_sd``. Running the histogram on filt + clean too was wasted engine work that didn't show in the UI. Pandas / polars dataflows keep the default empty dict, so their filt/clean histograms continue to compute and render.

Adapted from the working draft on ``smorg/post-785-playground`` (``4949e56c``), stripped of references to the closed #787 spike (``recompute_summary_sd``, ``_defer_summary_sd``) and the not-yet-merged #809 (``cost_classes``).
## Plumbing
``skip_stat_names`` threads from the dataflow into the underlying pipelines:
- ``_summary_sd`` (filt scope) applies the filt skip when ``quick_command_args`` is truthy. It can't use ``operations`` because the cascade hasn't updated it yet when this fires.
- ``_populate_sd_cache`` (raw + clean scopes) uses ``_effective_skip(scope, chains)``, which returns ``None`` when the scope is degenerate (no filter → filt == clean; no cleaning → clean == raw). This keeps the no-filter / no-cleaning case collapsing raw + clean + filt under one cache key, so the lazy-postprocessor invariant on ``XorqBuckarooInfiniteWidget`` (``TestLazyPostprocessor``) still holds.
- ``_scope_cache_key`` incorporates the effective skip, so raw + filt cache entries don't collide when their skips differ.
- ``StatPipeline.process_column`` skips at iteration (stat-func name match) and strips at output (covers v1 ``ColAnalysis`` wrappers where one ``StatFunc`` provides many keys).
- ``XorqStatPipeline.process_table`` skips in both batch-aggregate and per-column phases, so actual engine queries are elided, not just the output.
- ``DfStats`` (v1) accepts the kwarg for API parity and strips at output (v1 ``AnalysisPipeline`` has no per-stat filtering).
## Tests
Failing-then-fix split per repo TDD policy. CI saw 5 failing tests on the first push (commit ``d79ad48e``) and green on the fix (``38d76795``):
- ``test_skip_stats_by_scope_excludes_named_stat_from_filt``: skip ``mean`` on filt; ``filtered_mean`` absent, bare ``mean`` present.
- ``test_skip_stats_by_scope_excludes_from_raw_and_clean``: skip on raw/clean; bare ``mean`` absent, ``filtered_mean`` present.
- ``test_skip_stats_by_scope_default_empty_runs_all_stats``: no override → unchanged.
- ``test_xorq_dataflow_skips_histogram_on_filt_and_clean``: pins XorqDataflow's default.
- ``test_polars_dataflow_keeps_histogram``: Polars keeps the default empty dict.

Full ``tests/unit/`` suite green locally (1049 passed, 6 skipped), including ``TestLazyPostprocessor``, which caught an early version of the cache-key change that re-ran ``big_a`` 3× because raw/clean/filt no longer shared a key.
## Test plan
``4949e56c`` commit message claim of ~13× speedup)
🤖 Generated with Claude Code