feat(sd-encoding): wide-typed parquet layout — native scalars, no JSON round-trip#782
Conversation
Failing-tests commit for the new sd_to_parquet_b64 encoding — must be
seen red on CI before the matching impl lands.
The tests assert:
- payload carries ``layout: 'wide'``
- numeric and bool scalars ride native parquet types (no JSON round-trip)
- strings + lists + dicts remain JSON-encoded
- NaN floats and None serialize as parquet null
- column names follow the ``{short_col}__{stat_name}`` convention
JS-side tests + fixture land with the impl in the next commit, because
tsc enforces cross-file consistency at commit time and won't let a
forward-referenced test land alone.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite sd_to_parquet_b64 to emit one parquet column per
``{short_col}__{stat_name}`` pair, with native parquet types for numeric
and bool scalars. The headline win is that numbers no longer
JSON-round-trip on either side: ``a__mean: 50.0`` rides parquet's
``double`` column type and arrives as a JS Number directly. Strings
and list/dict values (histograms, value_counts) remain JSON-encoded
so the JS side can JSON.parse every string cell unambiguously.
Payload carries ``layout: 'wide'`` as a discriminator; the row-format
fallback (kept for parquet-write failures) doesn't set the field, so
the JS decoder picks the row branch by default.
Adds ``resolve_summary_stats_payload`` and ``pivotWideSummaryStats``
as the Python and JS inverse helpers, mirrors of each other. The four
widget test files that hand-rolled a row-decode helper now import the
shared resolver instead.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
📦 TestPyPI package publishedpip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339or with uv: uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339MCP server for Claude Codeclaude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190461339" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table📖 Docs preview🎨 Storybook preview |
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 1398e496cb
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if isinstance(val, (int, np.integer)): | ||
| return pa.array([int(val)], type=pa.int64()) |
There was a problem hiding this comment.
Guard out-of-range ints before encoding as int64
Do not coerce every int/np.integer to pa.int64() unconditionally here. Values outside signed 64-bit range (for example np.uint64 stats such as a column max) will overflow in Arrow, and this conversion happens before sd_to_parquet_b64 enters its try, so the JSON fallback path is never reached. In that scenario summary serialization raises instead of degrading gracefully, which can break rendering for datasets containing large unsigned integers.
Useful? React with 👍 / 👎.
Codex flagged on #782 that _stat_value_to_pa_array coerces every int to pa.int64() before sd_to_parquet_b64 enters its try/except — so a uint64 stat like a column max overflows in Arrow and crashes summary serialization instead of degrading gracefully. These three tests pin down the desired behavior: - uint64 above int64 max must not raise - uint64 stats round-trip via parquet uint64 - ints beyond both int64 and uint64 range fall back to JSON-encoded string Expected to fail on CI before the fix lands. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le uint64 Addresses codex P1 on #782. uint64 stats above int64 max (e.g. a column max on an unsigned dtype) used to overflow inside _stat_value_to_pa_array, which runs *before* sd_to_parquet_b64's try/except — so summary serialization would raise instead of falling back to the JSON payload. New behavior: - int in int64 range → pa.int64() (unchanged) - non-negative int above int64 max but within uint64 → pa.uint64() - anything wider → JSON-encoded string (JS already JSON.parses string cells) Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Summary
Rewrites
sd_to_parquet_b64to emit one parquet column per{short_col}__{stat_name}pair, with native parquet types fornumeric and bool scalars. Strings and list/dict values stay
JSON-encoded so every string cell can be JSON.parsed unambiguously
on the JS side.
In the vein of #648, but built off
mainand addressing the gap Iflagged on that PR: numbers and bools no longer JSON-round-trip on
either side.
a__mean: 50.0rides parquet'sdoublecolumndirectly and arrives as a JS Number — no
JSON.parse("50.0")hop, no precision loss, no NaN-as-string ambiguity (NaN → parquet
null).
What's actually different vs. #648
typical SD now looks like:
layout: 'wide'is carried as a discriminator field onParquetB64Payload. The fallback path (parquet-write failure→ row JSON) omits the field, so the JS decoder picks the row
branch by default.
resolve_summary_stats_payload(Python) andpivotWideSummaryStats(JS) as mirrored inverse helpers. Thefour widget test files that hand-rolled a row-decode helper now
import the shared resolver.
Why this matters
JSON-encoding every numeric cell was costing:
JSON.stringifyon the Python side per statJSON.parseon the JS side per stat1.234567890123survives float64 round-trip;through JSON it depends on serializer fidelity)
"NaN"For a 50-col × 10-stat SD, that's 500
JSON.parsecalls on everystate change, all to recover values parquet already knows the type
of. This change drops the ones for numerics and bools.
Trade-off worth naming
Wide-column layout has higher per-payload parquet schema overhead
(one column header per stat × col). For tiny SDs that's a wash or
slight regression on wire size; for numeric-heavy SDs the native
scalars pay off. The CPU win on the JS side is unconditional.
Test plan
test_sd_to_parquet_b64.pypassdeselected for missing built static)
resolveDFData.test.ts)tsc -bcleanruff checkclean🤖 Generated with Claude Code