feat(converters): add parquet to csv conversion support #553

Open
pedro3pv wants to merge 1 commit into C4illin:main from pedro3pv:main

Conversation

@pedro3pv pedro3pv commented Apr 9, 2026

This change introduces support for converting Parquet files to CSV using hyparquet for better version 2 support and memory-efficient streaming. It includes:

  • Registration of the parquet converter in main.ts
  • Implementation of row-group based processing in parquet.ts
  • Comprehensive tests for the new converter

Summary by cubic

Adds Parquet→CSV conversion using hyparquet, streaming via csv-stringify with compressor support from hyparquet-compressors.

  • New Features
    • parquet converter streams row groups to CSV with headers, supports Snappy/Zstd, handles backpressure with reliable error cleanup; tests cover success plus read/metadata failures.

Written for commit 8701ee1. Summary will update on new commits.

Copilot AI review requested due to automatic review settings April 9, 2026 18:51
@github-actions github-actions Bot added Feature and removed Feature labels Apr 9, 2026

Copilot AI left a comment


Pull request overview

Adds a new Parquet → CSV converter to the converters subsystem, wiring it into the central converter registry and introducing dependencies needed for Parquet parsing and CSV output.

Changes:

  • Register the new parquet converter in src/converters/main.ts.
  • Implement Parquet row-group reading and CSV streaming in src/converters/parquet.ts using hyparquet + csv-stringify.
  • Add a basic Bun test for the parquet converter and update dependencies/lockfile.

Reviewed changes

Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.

Summary per file:

  • tests/converters/parquet.test.ts: Adds initial tests for the new converter (currently minimal and has an async assertion issue).
  • src/converters/parquet.ts: New Parquet→CSV converter implementation using hyparquet metadata + row-group reads and CSV stringification.
  • src/converters/main.ts: Registers the parquet converter so it can be selected/auto-matched by the main conversion flow.
  • package.json: Adds csv-stringify, hyparquet, and (currently unused) duckdb.
  • bun.lock: Lockfile updates for the new dependencies.



@cubic-dev-ai cubic-dev-ai Bot left a comment


4 issues found across 5 files

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/converters/parquet.ts">

<violation number="1" location="src/converters/parquet.ts:46">
P1: The `finish` and `error` event handlers are `async` and `await fileHandle.close()`. If `close()` rejects, the thrown error prevents `resolve`/`reject` from being called, leaving the outer Promise permanently pending and producing an unhandled rejection. Use non-async handlers that always settle the outer promise, e.g. `fileHandle.close().catch(() => {}).finally(() => resolve(...))`.</violation>

<violation number="2" location="src/converters/parquet.ts:74">
P1: Row writes ignore stream backpressure, which can cause excessive buffering/memory growth on large parquet files.</violation>
</file>

<file name="package.json">

<violation number="1" location="package.json:24">
P2: `duckdb` is added as a dependency but is not imported or used anywhere in the codebase. Since it's a large native dependency with a native build step (`node-gyp`), it should be removed to avoid unnecessary install time, binary size, and potential build failures.</violation>
</file>

<file name="tests/converters/parquet.test.ts">

<violation number="1" location="tests/converters/parquet.test.ts:67">
P2: Rejection assertion is not awaited/returned, so the async failure-path test may pass without validating the expected rejection.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@github-actions github-actions Bot added Feature and removed Feature labels Apr 17, 2026

@cubic-dev-ai cubic-dev-ai Bot left a comment


1 issue found across 4 files (changes from recent commits).

Prompt for AI agents (unresolved issues)

Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.


<file name="src/converters/parquet.ts">

<violation number="1" location="src/converters/parquet.ts:118">
P2: On conversion errors, the output write stream is not explicitly closed/destroyed, which can leave the destination file descriptor open or partial output file lingering.</violation>
</file>

Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.

@github-actions github-actions Bot added Feature and removed Feature labels Apr 17, 2026

C4illin commented May 7, 2026

Tested with https://github.com/kaysush/sample-parquet-files/blob/main/part-00000-a9e77425-5fb4-456f-ba52-f821123bd193-c000.snappy.parquet

and got:

error: write after end
  code: "ERR_STREAM_WRITE_AFTER_END"
  at internal:streams/writable:239:29

@github-actions github-actions Bot added Feature and removed Feature labels May 8, 2026

pedro3pv commented May 8, 2026

@C4illin: the write-after-end error has been fixed. The root cause was that hyparquet's parquetRead invokes the onComplete callback without awaiting it, so CSV writes could still be pending when stringifier.end() was called. I also took the opportunity to add support for additional compression codecs via hyparquet-compressors, so the converter now handles ZSTD, GZIP, BROTLI, and LZ4.
