feat(sd-encoding): wide-typed parquet layout — native scalars, no JSON round-trip by paddymul · Pull Request #782 · buckaroo-data/buckaroo

paddymul · 2026-05-20T20:19:03Z

Summary

Rewrites sd_to_parquet_b64 to emit one parquet column per
{short_col}__{stat_name} pair, with native parquet types for
numeric and bool scalars. Strings and list/dict values stay
JSON-encoded so every string cell can be JSON.parsed unambiguously
on the JS side.

In the vein of #648, but built off main and addressing the gap I
flagged on that PR: numbers and bools no longer JSON-round-trip on
either side. a__mean: 50.0 rides parquet's double column
directly and arrives as a JS Number — no JSON.parse("50.0")
hop, no precision loss, no NaN-as-string ambiguity (NaN → parquet
null).

What's actually different vs. #648

Encoding is per-value typed, not blanket JSON-string. Schema for a
typical SD now looks like:

a__dtype:       string   (JSON-encoded "float64")
a__mean:        double   (native — was a JSON string in #648)
a__length:      int64    (native — was a JSON string)
a__is_numeric:  bool     (native — was a JSON string)
a__histogram:   string   (JSON-encoded list, unchanged)

layout: 'wide' is carried as a discriminator field on
ParquetB64Payload. The fallback path (parquet-write failure
→ row JSON) omits the field, so the JS decoder picks the row
branch by default.

Adds resolve_summary_stats_payload (Python) and
pivotWideSummaryStats (JS) as mirrored inverse helpers. The
four widget test files that hand-rolled a row-decode helper now
import the shared resolver.
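
A minimal sketch of the wide layout and its inverse, in plain Python with no parquet involved. The names `encode_cell`, `flatten_wide`, `pivot_wide` and the `data` key are hypothetical stand-ins, not the PR's `sd_to_parquet_b64` or `resolve_summary_stats_payload`. Splitting on the first `__` assumes short column names never contain a double underscore.

```python
import json
import math


def encode_cell(val):
    """Keep numbers and bools native; JSON-encode strings, lists and dicts."""
    if val is None:
        return None
    if isinstance(val, bool):  # bool is a subclass of int, so test it first
        return val
    if isinstance(val, int):
        return val
    if isinstance(val, float):
        return None if math.isnan(val) else val  # NaN becomes a null cell
    return json.dumps(val)


def flatten_wide(sd):
    """Turn {col: {stat: value}} into one row keyed by col__stat."""
    row = {}
    for col, stats in sd.items():
        for stat, val in stats.items():
            row[f"{col}__{stat}"] = encode_cell(val)
    return {"layout": "wide", "data": row}


def pivot_wide(payload):
    """Inverse of flatten_wide: rebuild {col: {stat: value}}."""
    if payload.get("layout") != "wide":
        raise ValueError("not a wide payload; use the row decoder instead")
    sd = {}
    for key, cell in payload["data"].items():
        col, stat = key.split("__", 1)
        # Only string cells were JSON-encoded, so only they need parsing.
        value = json.loads(cell) if isinstance(cell, str) else cell
        sd.setdefault(col, {})[stat] = value
    return sd


summary = {"a": {"dtype": "float64", "mean": 50.0, "length": 3,
                 "is_numeric": True, "histogram": [1, 2], "min": float("nan")}}
payload = flatten_wide(summary)
decoded = pivot_wide(payload)
```

Because every string cell is JSON-encoded, the decoder never has to guess whether a string is a raw value or an encoded one: it parses all of them and passes everything else through untouched.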

Why this matters

JSON-encoding every numeric cell was costing:

a json.dumps on the Python side per stat
a JSON.parse on the JS side per stat
precision (1.234567890123 survives float64 round-trip;
through JSON it depends on serializer fidelity)
the ability to distinguish NaN from the string "NaN"

For a 50-col × 10-stat SD, that's 500 JSON.parse calls on every
state change, all to recover values parquet already knows the type
of. This change drops the ones for numerics and bools.
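
As a rough illustration of the NaN point, using only Python's standard `json` module (this is not code from the PR): strict JSON has no NaN literal, so an encoder either emits a non-standard token, refuses, or substitutes a string that the reader cannot tell apart from real string data.

```python
import json

nan = float("nan")

# Python's default encoder emits the bare token NaN. That is not valid JSON
# (RFC 8259 has no NaN literal), so a strict parser such as JS JSON.parse
# rejects it.
lenient = json.dumps(nan)

# When strict output is requested, the encoder refuses outright.
try:
    json.dumps(nan, allow_nan=False)
    strict_error = None
except ValueError as exc:
    strict_error = exc

# A common workaround is to send the string "NaN" instead, at which point the
# reader can no longer tell a missing number from a genuine string value.
as_string = json.dumps("NaN")

# Cell count for the example above: 50 columns x 10 stats.
cells = 50 * 10
```

A parquet null in a `double` column sidesteps all three cases, which is the behavior the summary describes.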

Trade-off worth naming

Wide-column layout has higher per-payload parquet schema overhead
(one column header per stat × col). For tiny SDs that's a wash or
slight regression on wire size; for numeric-heavy SDs the native
scalars pay off. The CPU win on the JS side is unconditional.

Test plan

Python: 11/11 new tests in test_sd_to_parquet_b64.py pass
Python: full suite (1016 passed, unrelated mcp_uvx test
deselected for missing built static)
JS: 248 tests pass (18 in resolveDFData.test.ts)
tsc -b clean
ruff check clean
pre-push CI lint gate green
CI green

🤖 Generated with Claude Code

Failing-tests commit for the new sd_to_parquet_b64 encoding; must be seen red on CI before the matching impl lands.

The tests assert:

- payload carries ``layout: 'wide'``
- numeric and bool scalars ride native parquet types (no JSON round-trip)
- strings + lists + dicts remain JSON-encoded
- NaN floats and None serialize as parquet null
- column names follow the ``{short_col}__{stat_name}`` convention

JS-side tests + fixture land with the impl in the next commit, because tsc enforces cross-file consistency at commit time and won't let a forward-referenced test land alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Rewrite sd_to_parquet_b64 to emit one parquet column per ``{short_col}__{stat_name}`` pair, with native parquet types for numeric and bool scalars. The headline win is that numbers no longer JSON-round-trip on either side: ``a__mean: 50.0`` rides parquet's ``double`` column type and arrives as a JS Number directly.

Strings and list/dict values (histograms, value_counts) remain JSON-encoded so the JS side can JSON.parse every string cell unambiguously.

Payload carries ``layout: 'wide'`` as a discriminator; the row-format fallback (kept for parquet-write failures) doesn't set the field, so the JS decoder picks the row branch by default.

Adds ``resolve_summary_stats_payload`` and ``pivotWideSummaryStats`` as the Python and JS inverse helpers, mirrors of each other. The four widget test files that hand-rolled a row-decode helper now import the shared resolver instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T20:21:13Z

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190461339" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

chatgpt-codex-connector

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1398e496cb


chatgpt-codex-connector · 2026-05-20T20:35:10Z

+    if isinstance(val, (int, np.integer)):
+        return pa.array([int(val)], type=pa.int64())


Guard out-of-range ints before encoding as int64

Do not coerce every int/np.integer to pa.int64() unconditionally here. Values outside signed 64-bit range (for example np.uint64 stats such as a column max) will overflow in Arrow, and this conversion happens before sd_to_parquet_b64 enters its try, so the JSON fallback path is never reached. In that scenario summary serialization raises instead of degrading gracefully, which can break rendering for datasets containing large unsigned integers.
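
The control-flow problem described above can be sketched without Arrow. Everything here is a hypothetical stand-in: `to_int64_cell` imitates an int64 conversion that raises on overflow, and the two `encode_*` functions imitate a serializer whose row fallback lives in an `except` branch. Moving the conversion inside the `try` is only one possible remedy and is shown purely to make the reachability issue visible.

```python
INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1


def to_int64_cell(val):
    """Stand-in for an int64 conversion: raises when val does not fit."""
    if not INT64_MIN <= val <= INT64_MAX:
        raise OverflowError(f"{val} does not fit in int64")
    return val


def encode_unguarded(stats):
    # Conversion runs before the try, so an overflow escapes to the caller.
    cells = {name: to_int64_cell(v) for name, v in stats.items()}
    try:
        return {"layout": "wide", "data": cells}  # stand-in for the parquet write
    except Exception:
        return {"data": [dict(stats)]}  # never reached for conversion errors


def encode_guarded(stats):
    # Conversion runs inside the try, so failures degrade to the row fallback.
    try:
        cells = {name: to_int64_cell(v) for name, v in stats.items()}
        return {"layout": "wide", "data": cells}
    except Exception:
        return {"data": [dict(stats)]}


big = {"a__max": 2**64 - 1}  # e.g. the max of a uint64 column
```

With `big`, `encode_unguarded` raises while `encode_guarded` returns a payload without the `layout` field, which is the row branch a decoder would pick by default.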


Codex flagged on #782 that _stat_value_to_pa_array coerces every int to pa.int64() before sd_to_parquet_b64 enters its try/except, so a uint64 stat like a column max overflows in Arrow and crashes summary serialization instead of degrading gracefully.

These three tests pin down the desired behavior:

- uint64 above int64 max must not raise
- uint64 stats round-trip via parquet uint64
- ints beyond both int64 and uint64 range fall back to JSON-encoded string

Expected to fail on CI before the fix lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…le uint64

Addresses codex P1 on #782. uint64 stats above int64 max (e.g. a column max on an unsigned dtype) used to overflow inside _stat_value_to_pa_array, which runs *before* sd_to_parquet_b64's try/except, so summary serialization would raise instead of falling back to the JSON payload.

New behavior:

- int in int64 range → pa.int64() (unchanged)
- non-negative int above int64 max but within uint64 → pa.uint64()
- anything wider → JSON-encoded string (JS already JSON.parses string cells)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
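
The three-way rule in this commit message can be sketched as a pure-Python range check. `pick_int_encoding` is a hypothetical helper, not the PR's `_stat_value_to_pa_array`; it returns a type name in place of building a `pa.int64()` or `pa.uint64()` array, and it assumes the input has already been converted to a Python `int`.

```python
import json

INT64_MIN = -(2**63)
INT64_MAX = 2**63 - 1
UINT64_MAX = 2**64 - 1


def pick_int_encoding(val):
    """Return (type_name, cell) for an integer stat value."""
    if INT64_MIN <= val <= INT64_MAX:
        return "int64", val
    if INT64_MAX < val <= UINT64_MAX:
        return "uint64", val
    # Wider than either 64-bit type: ship it as a JSON-encoded string.
    return "string", json.dumps(val)
```

For example, `pick_int_encoding(2**63)` selects `uint64`, while `pick_int_encoding(2**64)` and anything below the int64 minimum fall through to the string branch. A reader that parses such a string into a JS Number cannot represent every integer of that size exactly, so the string branch preserves availability rather than exactness.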

paddymul and others added 2 commits May 20, 2026 16:15

paddymul temporarily deployed to testpypi May 20, 2026 20:20 — with GitHub Actions Inactive

chatgpt-codex-connector Bot reviewed May 20, 2026

View reviewed changes

paddymul temporarily deployed to testpypi May 20, 2026 21:15 — with GitHub Actions Inactive

paddymul temporarily deployed to testpypi May 20, 2026 21:17 — with GitHub Actions Inactive

paddymul mentioned this pull request May 20, 2026

feat(sd-cache): keyed summary-stats cache (raw / clean / filt scopes) #783

Merged

5 tasks

paddymul added this pull request to the merge queue May 20, 2026

Merged via the queue into main with commit 868c7ec May 20, 2026
26 of 27 checks passed

paddymul merged 4 commits into main from feat/sd-wide-typed-encoding
