Skip to content

fix(arrow/array): reject nulls in non-nullable fields when building records#858

Open
zeroshade wants to merge 2 commits into
apache:mainfrom
zeroshade:gh-372-nonnullable-validation
Open

fix(arrow/array): reject nulls in non-nullable fields when building records#858
zeroshade wants to merge 2 commits into
apache:mainfrom
zeroshade:gh-372-nonnullable-validation

Conversation

@zeroshade

Copy link
Copy Markdown
Member

Rationale for this change

array.RecordBuilder.NewRecordBatch (and the deprecated NewRecord) silently
produced records whose field was declared Nullable: false but whose column
actually contained nulls. The Rust (RecordBatch::try_new) and C++
implementations reject this; arrow-go accepted it, yielding spec-invalid
records. Fixes #372.

What changes are included in this PR?

  • RecordBuilder: NewRecordBatch now rejects a record whose top-level
    field is non-nullable but whose column contains nulls (it panics, consistent
    with its existing whole-builder length-mismatch guard). A new
    RecordBuilder.NewRecordBatchChecked() (arrow.RecordBatch, error) returns the
    violation as an error instead of panicking.
  • The check is scoped to top-level record columns only. A non-nullable child
    of a nullable struct is unaffected (its nulls are legitimate where the parent
    slot is null), so existing nested data keeps working. array.NewRecordBatch
    and the lower-level constructors are unchanged.
  • Readers: the CSV, JSON, and Avro readers build through
    NewRecordBatchChecked and surface a typed read error instead of building an
    invalid record or panicking.

Are these changes tested?

Yes. A new TestRecordBuilderNonNullableNulls covers rejection (error + panic),
nullable fields with nulls allowed, non-nullable without nulls allowed, a
nullable struct with a non-nullable child allowed, and the empty-builder case.
Pre-existing fixtures that declared non-nullable fields while appending nulls
were corrected to mark those fields nullable. The full arrow/... and
parquet/... suites pass.

Are there any user-facing changes?

Yes:

  • New RecordBuilder.NewRecordBatchChecked() (arrow.RecordBatch, error).
  • RecordBuilder.NewRecordBatch() / NewRecord() now reject (panic) a top-level
    non-nullable column that contains nulls; previously such records were built
    silently. This is a behavior change for code that relied on the prior lax
    behavior — the fix is to declare the field Nullable: true.
  • The CSV, JSON, and Avro readers now return an error when input would place a
    null into a non-nullable top-level column.

Potential follow-ups (intentionally out of scope here): recursive validation
of nested non-nullable fields, and logical-null detection for non-nullable
run-end-encoded columns.

Closes #372

…ecords

RecordBuilder.NewRecordBatch (and the deprecated NewRecord) now reject a
record whose top-level field is declared non-nullable but whose column
contains nulls, matching the Rust and C++ implementations. The new
RecordBuilder.NewRecordBatchChecked returns the violation as an error
instead of panicking.

The check is scoped to top-level record columns only: a non-nullable child
of a nullable struct is unaffected, so existing nested data keeps working.
The CSV, JSON, and Avro readers route through NewRecordBatchChecked and
surface a read error rather than building an invalid record.

Closes apache#372
@zeroshade zeroshade requested a review from lidavidm June 19, 2026 21:46
Comment thread arrow/array/record.go
Comment on lines +410 to +411
// NewRecordBatch panics if the fields' builders do not have the same length or if
// a non-nullable field contains nulls. Use NewRecordBatchChecked for an error.

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While not strictly breaking, this might be a pretty painful change, if NewRecordBatch now "randomly" panics on things it would've accepted before; maybe we should deprecate this too as a stronger signal?

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I dug into deprecating it, but it turns out to be disproportionate for this PR. staticcheck's SA1019 fires on cross-package use, so deprecating NewRecordBatch would mean migrating ~76 call sites across ~20 files to NewRecordBatchChecked (golangci-lint caps output at 3 per message, which hid the real count at first). That set also includes the new TestRecordBuilderNonNullableNulls, which deliberately calls NewRecordBatch() to assert it panics, so it would need a //nolint:staticcheck,SA1019.

Since NewRecordBatch already panics today on a whole-builder length mismatch, rejecting nulls in a non-nullable column reads as a consistent extension of that existing contract rather than brand-new panic behavior — and NewRecordBatchChecked is there for callers who'd rather have an error. So for now I've left NewRecordBatch as-is with the clarified doc. Happy to do the deprecation + full caller migration as a dedicated follow-up if you think the stronger signal is worth it; it just felt too sweeping to fold into this fix. WDYT?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Um, I'm not sure if I'm talking to you or a bot, but I'd think accidentally inserting a null (especially because nullable defaults to false due to false being the zero value) is far more common, and a panic is fairly unwelcome.

Comment thread arrow/array/record.go Outdated
Address review feedback: drop the manual per-row null scan (and the
nullCountInRows helper) in favor of the cached Array.NullN(). We only
need to know whether any nulls are present, not the exact count, and
NullN is already tracked, so this avoids iterating the column.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

RecordBuilder should confirm non-nullable fields do not contain null rows when building Record

2 participants