feat(converters): add parquet to csv conversion support #553
Conversation
Pull request overview
Adds a new Parquet → CSV converter to the converters subsystem, wiring it into the central converter registry and introducing dependencies needed for Parquet parsing and CSV output.
Changes:
- Register the new `parquet` converter in `src/converters/main.ts`.
- Implement Parquet row-group reading and CSV streaming in `src/converters/parquet.ts` using `hyparquet` + `csv-stringify`.
- Add a basic Bun test for the parquet converter and update dependencies/lockfile.
Reviewed changes
Copilot reviewed 4 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| tests/converters/parquet.test.ts | Adds initial tests for the new converter (currently minimal and has an async assertion issue). |
| src/converters/parquet.ts | New Parquet→CSV converter implementation using hyparquet metadata + row-group reads and CSV stringification. |
| src/converters/main.ts | Registers the parquet converter so it can be selected/auto-matched by the main conversion flow. |
| package.json | Adds csv-stringify, hyparquet, and (currently unused) duckdb. |
| bun.lock | Lockfile updates for the new dependencies. |
4 issues found across 5 files
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/converters/parquet.ts">
<violation number="1" location="src/converters/parquet.ts:46">
P1: The `finish` and `error` event handlers are `async` and `await fileHandle.close()`. If `close()` rejects, the thrown error prevents `resolve`/`reject` from being called, leaving the outer Promise permanently pending and producing an unhandled rejection. Use non-async handlers that always settle the outer promise, e.g. `fileHandle.close().catch(() => {}).finally(() => resolve(...))`.</violation>
<violation number="2" location="src/converters/parquet.ts:74">
P1: Row writes ignore stream backpressure, which can cause excessive buffering/memory growth on large parquet files.</violation>
</file>
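The settlement pattern suggested in violation 1 can be sketched as follows. This is a minimal, self-contained illustration, not the PR's actual code; `waitForFinish` and the mock file handle are hypothetical stand-ins for the converter's write stream and `fileHandle`:

```typescript
import { EventEmitter } from "node:events";

// Non-async handlers: even if close() rejects, the outer promise settles.
function waitForFinish(
  stream: EventEmitter,
  fileHandle: { close(): Promise<void> },
): Promise<void> {
  return new Promise((resolve, reject) => {
    stream.on("finish", () => {
      // Swallow close() errors so resolve() always runs.
      fileHandle.close().catch(() => {}).finally(() => resolve());
    });
    stream.on("error", (err: Error) => {
      fileHandle.close().catch(() => {}).finally(() => reject(err));
    });
  });
}

// A close() that rejects no longer leaves the promise pending forever.
const stream = new EventEmitter();
const handle = { close: () => Promise.reject(new Error("EBADF")) };
const done = waitForFinish(stream, handle);
stream.emit("finish");
done.then(() => console.log("settled")); // prints "settled"
```

An `async` handler that does `await fileHandle.close()` before `resolve()` would throw out of the handler on rejection, skipping the settlement entirely; chaining `.catch().finally()` guarantees exactly one of `resolve`/`reject` fires.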
<file name="package.json">
<violation number="1" location="package.json:24">
P2: `duckdb` is added as a dependency but is not imported or used anywhere in the codebase. Since it's a large native dependency with a native build step (`node-gyp`), it should be removed to avoid unnecessary install time, binary size, and potential build failures.</violation>
</file>
<file name="tests/converters/parquet.test.ts">
<violation number="1" location="tests/converters/parquet.test.ts:67">
P2: Rejection assertion is not awaited/returned, so the async failure-path test may pass without validating the expected rejection.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
1 issue found across 4 files (changes from recent commits).
Prompt for AI agents (unresolved issues)
Check if these issues are valid — if so, understand the root cause of each and fix them. If appropriate, use sub-agents to investigate and fix each issue separately.
<file name="src/converters/parquet.ts">
<violation number="1" location="src/converters/parquet.ts:118">
P2: On conversion errors, the output write stream is not explicitly closed/destroyed, which can leave the destination file descriptor open or partial output file lingering.</violation>
</file>
Reply with feedback, questions, or to request a fix. Tag @cubic-dev-ai to re-run a review.
@C4illin — the write-after-end issue has been fixed. The root cause was that hyparquet's parquetRead calls the onComplete callback without await, so CSV writes could still be pending when stringifier.end() was called. I also took the opportunity to add support for additional compression codecs via hyparquet-compressors, so the converter now handles ZSTD, GZIP, BROTLI, and LZ4.
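The two streaming fixes described here — respecting backpressure and never calling `end()` while writes are still pending — can be sketched generically. This is an illustrative pattern only; `writeRow` and `writeAllRows` are hypothetical names, not the converter's functions:

```typescript
import { Writable } from "node:stream";
import { once } from "node:events";

// Backpressure-aware write: pause until 'drain' when the buffer is full.
async function writeRow(out: Writable, chunk: string): Promise<void> {
  if (!out.write(chunk)) {
    await once(out, "drain");
  }
}

// Every write is awaited before end(), so no write lands after the close.
async function writeAllRows(out: Writable, rows: string[][]): Promise<void> {
  for (const row of rows) {
    await writeRow(out, row.join(",") + "\n");
  }
  out.end();
  await once(out, "finish");
}

// Collect output in memory to show the pattern end to end.
const chunks: string[] = [];
const sink = new Writable({
  write(chunk, _enc, cb) {
    chunks.push(String(chunk));
    cb();
  },
});
await writeAllRows(sink, [["a", "b"], ["1", "2"]]);
console.log(JSON.stringify(chunks.join(""))); // prints "a,b\n1,2\n"
```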
This change introduces support for converting Parquet files to CSV, using hyparquet for better Parquet version 2 support and memory-efficient streaming.
Summary by cubic
Adds Parquet→CSV conversion using `hyparquet`, streaming via `csv-stringify` with compressor support from `hyparquet-compressors`. The `parquet` converter streams row groups to CSV with headers, supports Snappy/Zstd, and handles backpressure with reliable error cleanup; tests cover success plus read/metadata failures.

Written for commit 8701ee1. Summary will update on new commits.