fix(arrow/array): reject nulls in non-nullable fields when building records#858
fix(arrow/array): reject nulls in non-nullable fields when building records#858zeroshade wants to merge 2 commits into
Conversation
…ecords RecordBuilder.NewRecordBatch (and the deprecated NewRecord) now reject a record whose top-level field is declared non-nullable but whose column contains nulls, matching the Rust and C++ implementations. The new RecordBuilder.NewRecordBatchChecked returns the violation as an error instead of panicking. The check is scoped to top-level record columns only: a non-nullable child of a nullable struct is unaffected, so existing nested data keeps working. The CSV, JSON, and Avro readers route through NewRecordBatchChecked and surface a read error rather than building an invalid record. Closes apache#372
| // NewRecordBatch panics if the fields' builders do not have the same length or if | ||
| // a non-nullable field contains nulls. Use NewRecordBatchChecked for an error. |
There was a problem hiding this comment.
While not strictly breaking, this might be a pretty painful change, if NewRecordBatch now "randomly" panics on things it would've accepted before; maybe we should deprecate this too as a stronger signal?
There was a problem hiding this comment.
I dug into deprecating it, but it turns out to be disproportionate for this PR. staticcheck's SA1019 fires on cross-package use, so deprecating NewRecordBatch would mean migrating ~76 call sites across ~20 files to NewRecordBatchChecked (golangci-lint caps output at 3 per message, which hid the real count at first). That set also includes the new TestRecordBuilderNonNullableNulls, which deliberately calls NewRecordBatch() to assert it panics, so it would need a //nolint:staticcheck,SA1019.
Since NewRecordBatch already panics today on a whole-builder length mismatch, rejecting nulls in a non-nullable column reads as a consistent extension of that existing contract rather than brand-new panic behavior — and NewRecordBatchChecked is there for callers who'd rather have an error. So for now I've left NewRecordBatch as-is with the clarified doc. Happy to do the deprecation + full caller migration as a dedicated follow-up if you think the stronger signal is worth it; it just felt too sweeping to fold into this fix. WDYT?
There was a problem hiding this comment.
Um, I'm not sure if I'm talking to you or a bot, but I'd think accidentally inserting a null (especially because nullable defaults to false due to false being the zero value) is far more common, and a panic is fairly unwelcome.
Address review feedback: drop the manual per-row null scan (and the nullCountInRows helper) in favor of the cached Array.NullN(). We only need to know whether any nulls are present, not the exact count, and NullN is already tracked, so this avoids iterating the column.
Rationale for this change
array.RecordBuilder.NewRecordBatch(and the deprecatedNewRecord) silentlyproduced records whose field was declared
Nullable: falsebut whose columnactually contained nulls. The Rust (
RecordBatch::try_new) and C++implementations reject this; arrow-go accepted it, yielding spec-invalid
records. Fixes #372.
What changes are included in this PR?
RecordBuilder:NewRecordBatchnow rejects a record whose top-levelfield is non-nullable but whose column contains nulls (it panics, consistent
with its existing whole-builder length-mismatch guard). A new
RecordBuilder.NewRecordBatchChecked() (arrow.RecordBatch, error)returns theviolation as an error instead of panicking.
of a nullable struct is unaffected (its nulls are legitimate where the parent
slot is null), so existing nested data keeps working.
array.NewRecordBatchand the lower-level constructors are unchanged.
NewRecordBatchCheckedand surface a typed read error instead of building aninvalid record or panicking.
Are these changes tested?
Yes. A new
TestRecordBuilderNonNullableNullscovers rejection (error + panic),nullable fields with nulls allowed, non-nullable without nulls allowed, a
nullable struct with a non-nullable child allowed, and the empty-builder case.
Pre-existing fixtures that declared non-nullable fields while appending nulls
were corrected to mark those fields nullable. The full
arrow/...andparquet/...suites pass.Are there any user-facing changes?
Yes:
RecordBuilder.NewRecordBatchChecked() (arrow.RecordBatch, error).RecordBuilder.NewRecordBatch()/NewRecord()now reject (panic) a top-levelnon-nullable column that contains nulls; previously such records were built
silently. This is a behavior change for code that relied on the prior lax
behavior — the fix is to declare the field
Nullable: true.null into a non-nullable top-level column.
Potential follow-ups (intentionally out of scope here): recursive validation
of nested non-nullable fields, and logical-null detection for non-nullable
run-end-encoded columns.
Closes #372