feat(sd-encoding): wide-typed parquet layout — native scalars, no JSON round-trip#782

Merged
paddymul merged 4 commits into main from feat/sd-wide-typed-encoding
May 20, 2026

@paddymul
Collaborator

Summary

Rewrites sd_to_parquet_b64 to emit one parquet column per
{short_col}__{stat_name} pair, with native parquet types for
numeric and bool scalars. Strings and list/dict values stay
JSON-encoded so every string cell can be JSON.parsed unambiguously
on the JS side.

In the vein of #648, but built off main and addressing the gap I
flagged on that PR: numbers and bools no longer JSON-round-trip on
either side. a__mean: 50.0 rides parquet's double column
directly and arrives as a JS Number — no JSON.parse("50.0")
hop, no precision loss, no NaN-as-string ambiguity (NaN → parquet
null).
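
A minimal sketch of the per-value typing rule described above, written in plain Python so it runs without pyarrow. The helper name `encode_stat_cell` and the type labels it returns are hypothetical and only mirror the behavior stated in this description; the real code builds Arrow arrays, and the column type used for `None` is an assumption here.

```python
import json
import math


def encode_stat_cell(val):
    """Return (type_label, cell) for one summary stat value (hypothetical helper)."""
    if val is None:
        # Assumption: None becomes a null cell; the label is a placeholder.
        return ("null", None)
    # bool must be tested before int because bool is a subclass of int.
    if isinstance(val, bool):
        return ("bool", val)
    if isinstance(val, int):
        return ("int64", val)
    if isinstance(val, float):
        # NaN is stored as a null in a double column rather than as a string.
        return ("double", None if math.isnan(val) else val)
    # Strings, lists and dicts are JSON-encoded so that every string cell
    # can be parsed the same way on the reading side.
    return ("string", json.dumps(val))


print(encode_stat_cell(50.0))
print(encode_stat_cell("float64"))
print(encode_stat_cell(float("nan")))
```

Because strings are always JSON-encoded, a reader never has to guess whether a string cell holds a bare string or an encoded list.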

What's actually different vs. #648

  • Encoding is per-value typed, not blanket JSON-string. Schema for a
    typical SD now looks like:
    a__dtype:       string   (JSON-encoded "float64")
    a__mean:        double   (native — was a JSON string in #648)
    a__length:      int64    (native — was a JSON string)
    a__is_numeric:  bool     (native — was a JSON string)
    a__histogram:   string   (JSON-encoded list, unchanged)
    
  • layout: 'wide' is carried as a discriminator field on
    ParquetB64Payload. The fallback path (parquet-write failure
    → row JSON) omits the field, so the JS decoder picks the row
    branch by default.
  • Adds resolve_summary_stats_payload (Python) and
    pivotWideSummaryStats (JS) as mirrored inverse helpers. The
    four widget test files that hand-rolled a row-decode helper now
    import the shared resolver.
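
The wide layout and its inverse can be sketched roughly as follows. This is not the code from this PR: `flatten_wide`, `pivot_wide`, `resolve`, and the `"row"` payload key are hypothetical stand-ins, the sketch works on an already-decoded dict rather than base64 parquet bytes, and it assumes short column names never contain `__`.

```python
import json


def flatten_wide(summary):
    """{col: {stat: value}} -> {"col__stat": cell}, JSON-encoding non-scalars."""
    row = {}
    for col, stats in summary.items():
        for stat, val in stats.items():
            if isinstance(val, (str, list, dict)):
                val = json.dumps(val)
            row[f"{col}__{stat}"] = val
    return row


def pivot_wide(row):
    """Inverse of flatten_wide: split names on the first '__', parse string cells."""
    out = {}
    for name, cell in row.items():
        col, stat = name.split("__", 1)
        if isinstance(cell, str):
            cell = json.loads(cell)
        out.setdefault(col, {})[stat] = cell
    return out


def resolve(payload, decode_rows):
    """Dispatch on the layout discriminator; a missing field means row format."""
    if payload.get("layout") == "wide":
        return pivot_wide(payload["row"])  # "row" is a hypothetical key
    return decode_rows(payload)


summary = {"a": {"dtype": "float64", "mean": 50.0, "is_numeric": True,
                 "histogram": [1, 2, 3]}}
row = flatten_wide(summary)
print(row)
print(pivot_wide(row) == summary)
```

Treating a missing `layout` field as the row format is what lets the fallback payload stay unchanged.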

Why this matters

JSON-encoding every numeric cell was costing:

  • a JSON.stringify on the Python side per stat
  • a JSON.parse on the JS side per stat
  • precision (1.234567890123 survives float64 round-trip;
    through JSON it depends on serializer fidelity)
  • the ability to distinguish NaN from the string "NaN"

For a 50-col × 10-stat SD, that's 500 JSON.parse calls on every
state change, all to recover values parquet already knows the type
of. This change drops the ones for numerics and bools.
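
The NaN point can be illustrated with Python's standard `json` module. The `to_json_cell` helper below is a hypothetical workaround, shown only to make the collision concrete; it is not claimed to be what any earlier encoding did.

```python
import json
import math

nan = float("nan")

# Python's encoder emits the bare token NaN by default, which is not valid
# JSON; a strict parser such as JavaScript's JSON.parse rejects it.
lenient = json.dumps(nan)

# In strict mode the encoder refuses to serialize NaN at all.
try:
    json.dumps(nan, allow_nan=False)
    strict_ok = True
except ValueError:
    strict_ok = False


def to_json_cell(val):
    """Hypothetical workaround: represent NaN as the JSON string "NaN"."""
    if isinstance(val, float) and math.isnan(val):
        return json.dumps("NaN")
    return json.dumps(val)


# The workaround makes a NaN float indistinguishable from the string "NaN".
collides = to_json_cell(nan) == to_json_cell("NaN")

# 50 columns x 10 stats means one parse per cell under blanket JSON encoding.
parse_calls = 50 * 10

print(lenient, strict_ok, collides, parse_calls)
```

A native double column with a null for NaN avoids both the invalid token and the collision.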

Trade-off worth naming

Wide-column layout has higher per-payload parquet schema overhead
(one column header per stat × col). For tiny SDs that's a wash or
slight regression on wire size; for numeric-heavy SDs the native
scalars pay off. The CPU win on the JS side is unconditional.

Test plan

  • Python: 11/11 new tests in test_sd_to_parquet_b64.py pass
  • Python: full suite (1016 passed, unrelated mcp_uvx test
    deselected for missing built static)
  • JS: 248 tests pass (18 in resolveDFData.test.ts)
  • tsc -b clean
  • ruff check clean
  • pre-push CI lint gate green
  • CI green

🤖 Generated with Claude Code

paddymul and others added 2 commits May 20, 2026 16:15
Failing-tests commit for the new sd_to_parquet_b64 encoding — must be
seen red on CI before the matching impl lands.

The tests assert:
- payload carries ``layout: 'wide'``
- numeric and bool scalars ride native parquet types (no JSON round-trip)
- strings + lists + dicts remain JSON-encoded
- NaN floats and None serialize as parquet null
- column names follow the ``{short_col}__{stat_name}`` convention

JS-side tests + fixture land with the impl in the next commit, because
tsc enforces cross-file consistency at commit time and won't let a
forward-referenced test land alone.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Rewrite sd_to_parquet_b64 to emit one parquet column per
``{short_col}__{stat_name}`` pair, with native parquet types for numeric
and bool scalars. The headline win is that numbers no longer
JSON-round-trip on either side: ``a__mean: 50.0`` rides parquet's
``double`` column type and arrives as a JS Number directly. Strings
and list/dict values (histograms, value_counts) remain JSON-encoded
so the JS side can JSON.parse every string cell unambiguously.

Payload carries ``layout: 'wide'`` as a discriminator; the row-format
fallback (kept for parquet-write failures) doesn't set the field, so
the JS decoder picks the row branch by default.

Adds ``resolve_summary_stats_payload`` and ``pivotWideSummaryStats``
as the Python and JS inverse helpers, mirrors of each other. The four
widget test files that hand-rolled a row-decode helper now import the
shared resolver instead.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions
Contributor

github-actions Bot commented May 20, 2026

📦 TestPyPI package published

pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339

or with uv:

uv pip install --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo==0.14.3.dev26190461339

MCP server for Claude Code

claude mcp add buckaroo-table -- uvx --from "buckaroo[mcp]==0.14.3.dev26190461339" --index-strategy unsafe-best-match --index-url https://test.pypi.org/simple/ --extra-index-url https://pypi.org/simple/ buckaroo-table

📖 Docs preview

🎨 Storybook preview

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 1398e496cb


Comment thread buckaroo/serialization_utils.py Outdated
Comment on lines +373 to +374
if isinstance(val, (int, np.integer)):
    return pa.array([int(val)], type=pa.int64())

P1 Badge Guard out-of-range ints before encoding as int64

Do not coerce every int/np.integer to pa.int64() unconditionally here. Values outside signed 64-bit range (for example np.uint64 stats such as a column max) will overflow in Arrow, and this conversion happens before sd_to_parquet_b64 enters its try, so the JSON fallback path is never reached. In that scenario summary serialization raises instead of degrading gracefully, which can break rendering for datasets containing large unsigned integers.
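
The control-flow problem in this comment can be sketched without Arrow. Everything here is a stand-in: `to_int64_cell` imitates a conversion that rejects out-of-range values, and the two `serialize_*` functions and the payload dicts are hypothetical. The sketch shows only why a conversion placed before the `try` bypasses the fallback; it does not reproduce the actual function or the fix that landed.

```python
INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1


def to_int64_cell(val):
    """Stand-in for the typed conversion: rejects values outside int64."""
    if not INT64_MIN <= val <= INT64_MAX:
        raise OverflowError(f"{val} does not fit in int64")
    return val


def json_fallback(stats):
    return {"format": "row-json", "data": stats}


def serialize_unguarded(stats):
    # The conversion runs before the try, so its exception escapes.
    cells = {name: to_int64_cell(v) for name, v in stats.items()}
    try:
        return {"format": "parquet", "data": cells}
    except Exception:
        return json_fallback(stats)


def serialize_guarded(stats):
    # The conversion runs inside the try, so a failure reaches the fallback.
    try:
        cells = {name: to_int64_cell(v) for name, v in stats.items()}
        return {"format": "parquet", "data": cells}
    except Exception:
        return json_fallback(stats)


big = {"a__max": 2**63}  # one above the int64 maximum
print(serialize_guarded(big)["format"])
```

Calling `serialize_unguarded(big)` raises `OverflowError` instead of returning the fallback payload.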

Codex flagged on #782 that _stat_value_to_pa_array coerces every int to
pa.int64() before sd_to_parquet_b64 enters its try/except — so a uint64
stat like a column max overflows in Arrow and crashes summary
serialization instead of degrading gracefully.

These three tests pin down the desired behavior:
- uint64 above int64 max must not raise
- uint64 stats round-trip via parquet uint64
- ints beyond both int64 and uint64 range fall back to JSON-encoded string

Expected to fail on CI before the fix lands.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…le uint64

Addresses codex P1 on #782. uint64 stats above int64 max (e.g. a column max
on an unsigned dtype) used to overflow inside _stat_value_to_pa_array,
which runs *before* sd_to_parquet_b64's try/except — so summary
serialization would raise instead of falling back to the JSON payload.

New behavior:
- int in int64 range → pa.int64() (unchanged)
- non-negative int above int64 max but within uint64 → pa.uint64()
- anything wider → JSON-encoded string (JS already JSON.parses string cells)
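
The three-way rule above can be sketched in plain Python. The function name `choose_int_encoding` and the returned type labels are hypothetical; the real code returns Arrow arrays rather than tuples.

```python
import json

INT64_MIN, INT64_MAX = -(2**63), 2**63 - 1
UINT64_MAX = 2**64 - 1


def choose_int_encoding(val):
    """Pick a column type for an integer stat (hypothetical helper)."""
    if INT64_MIN <= val <= INT64_MAX:
        return ("int64", val)
    if INT64_MAX < val <= UINT64_MAX:
        return ("uint64", val)
    # Too wide for either 64-bit type: fall back to a JSON-encoded string.
    return ("string", json.dumps(val))


print(choose_int_encoding(5))
print(choose_int_encoding(2**63))
print(choose_int_encoding(2**64))
```

Negative values below the int64 minimum also take the string branch, since uint64 cannot hold them.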

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@paddymul paddymul added this pull request to the merge queue May 20, 2026
Merged via the queue into main with commit 868c7ec May 20, 2026
26 of 27 checks passed