Skip to content

Draft: feat(format): schema evolution for the Java row codec#3714

Draft
stevenschlansker wants to merge 13 commits into
apache:mainfrom
stevenschlansker:row-codec-schema-versions
Draft

Draft: feat(format): schema evolution for the Java row codec#3714
stevenschlansker wants to merge 13 commits into
apache:mainfrom
stevenschlansker:row-codec-schema-versions

Conversation

@stevenschlansker
Copy link
Copy Markdown
Contributor

Opt in with .withSchemaEvolution() on any row, array, or map codec builder. Fields carry @ForyVersion(since, until); removed fields are listed on a nested interface referenced from
@ForySchema(removedFields = ...), which preserves parameterized types like List<String>. Older payloads are dispatched at read time; nothing changes when the flag is off. Standard and compact formats supported.

Why?

Currently changing row format schema definition in any way invalidates all records

What does this PR do?

Propose a new concept of format versions, each succeeding version may add or remove fields from types, and deserialization machinery picks version based on schema hash

AI Contribution Checklist

AI Usage Disclosure

  • substantial_ai_assistance: yes
  • scope: all
  • affected_files_or_subsystems: row format Java
  • ai_review: <line-by-line self-review completed; summarize the two-reviewer loop and final no-further-comments result>
  • ai_review_artifacts:
  • human_verification: <checks run locally or in CI + pass/fail summary + contributor reviewed results>
  • performance_verification: <N/A or benchmark/regression evidence summary>
  • provenance_license_confirmation: <Apache-2.0-compatible provenance confirmed; no incompatible third-party code introduced>
  • If yes, I included a completed AI Contribution Checklist in this PR description and the required AI Usage Disclosure.
  • If yes, my PR description includes the required ai_review summary and screenshot evidence of the final clean AI review results from both fresh reviewers on the current PR diff or current HEAD after the latest code changes.

Does this PR introduce any user-facing change?

New codec option: schema evolution. Some small annotations and a builder method.
Existing row format compatibility unchanged

Benchmark

Includes performance optimization of 20-30% less memory allocation in row codec decoding

Claude (on behalf of Steven Schlansker) added 13 commits May 28, 2026 22:04
Opt in with `.withSchemaEvolution()` on any row, array, or map codec
builder. Fields carry `@ForyVersion(since, until)`; removed fields are
listed on a nested interface referenced from
`@ForySchema(removedFields = ...)`, which preserves parameterized types
like `List<String>`. Older payloads are dispatched at read time; nothing
changes when the flag is off. Standard and compact formats supported;
interface-typed beans included.
…pointTo

BinaryRowEncoder.decode consumes an 8-byte schema hash before pointing the row
at the rest of the buffer, but passed the full payload size to BinaryRow.pointTo.
The row's recorded sizeInBytes was therefore 8 too large, so any subsequent
copy(), toBytes(), or getSizeInBytes() call on a row obtained through this path
would include 8 trailing bytes of unrelated buffer content.

The bug was latent because BinaryRowEncoder.decode does not expose the row to
callers; the buffer's reader index was already advanced correctly by size - 8.
Fix is local to BinaryRowEncoder.decode. Adds a regression test that injects a
size-recording BinaryRow and asserts the encoder passes the row body size.
…p codecs

BinaryArrayEncoder.encode(T) and BinaryMapEncoder.encode(T) previously composed
the hash-prefixed payload through MemoryUtils.buffer + writeInt64 + writeBytes +
getBytes, allocating three byte[] copies and a MemoryBuffer per call. Build the
result directly into a single byte[]: wrap it to write the 8-byte hash header,
then System.arraycopy the body in. The non-evolution paths are unchanged.
…computation

CompactBinaryArray.getStruct and CompactBinaryRow.getStruct previously rebuilt
the element schema's fixedOffsets array and walked allNotNullable twice on
every call, even though the schema itself was already cached at the slot. For
a list of 32 structs that meant 32 int[] allocations plus duplicate field-list
iterations.

Move the per-schema work into a new CompactRowLayout holder. CompactBinaryArray
computes its element's layout once at construction (the element type is fixed
for the lifetime of the array); CompactBinaryRow.getStruct caches a layout per
slot in extData. Add a package-private CompactBinaryRow constructor that takes
precomputed offsets / nullability / bitmap width so the row allocation skips
the two redundant iterations.

Measured allocation reduction with org.apache.fory.format.perf.RowFormatAllocationProbe
(included for repeatability). Bytes allocated per decode op:

  scenario   standard   compact (before -> after)
  root         72868     110170  ->  87516   (-21%)
  array         6639       9982  ->   7856   (-21%)
  matrix       52704      79664  ->  62968   (-21%)
  map           6080       8168  ->   7216   (-12%)

Standard format unchanged (control). Per-element saving ~65 bytes, consistent
with the eliminated int[fields+1] alloc plus avoided iteration work.
The strict schema hash already recurses through StructType, so two payloads
whose inner-struct shapes differ produce different outer hashes. The
implementation gap was in SchemaHistory.build, which only enumerated the
outer bean's own version boundaries — projection codecs for "outer V=K with
inner V=L" weren't generated, so older inner shapes failed to deserialize
even though the hash distinguished them.

Implementation:

- SchemaHistory.build now recurses into nested-bean fields whose type
  carries schema-evolution annotations, builds each inner's history, and
  cross-products over inner versions when enumerating outer versions. Each
  VersionedSchema now carries a map of (nested bean class -> chosen inner
  version) so the codec builder can wire the right inner projection codec.

- RowCodecBuilder.evolvingBuildForWriter emits one projection codec class
  per cross-product combination, using a per-nested-bean-type suffix map
  passed down through Encoding/RowEncoderBuilder. BaseBinaryEncoderBuilder
  exposes a `nestedBeanSuffix(TypeRef)` hook that the projection builder
  overrides to look up each nested bean's right suffix.

- Inner projection classes are generated recursively from
  nestedSuffixesFor(), so a deeply-nested versioned bean produces the
  required class tree at outer-build time.

Class-count complexity is O(product of versions across nesting), but each
projection class is small (decode-only) and only those reachable from the
outer's enumeration are generated.

Regression test nestedInnerEvolution_readerInnerNewerThanWriter and the
two-axis crossOuterAndInnerEvolution both pass. 138 tests in fory-format
green.
…decs

Array and map evolution paths were generating per-outer-version projection
classes named with only the outer version suffix and instantiated without an
inner-version routing map. When the element bean contained a versioned
nested bean, multiple cross-product entries collided on the codegen cache:
the projection always read inner beans at whichever version was compiled
first. The row codec already did this correctly; lift its suffix and nested-
suffix logic into a shared ProjectionRouting helper and reuse it from
ArrayCodecBuilder and MapCodecBuilder. Add array/map regression tests that
fail before the fix and pass after.
… codecs

The existing row test (evolutionFlagAsymmetryFailsLoud) had no array or map
equivalent. Add both. The evolution-on consumer reading evolution-off bytes
direction is loud (ClassNotCompatibleException); the reverse direction is
undefined per the wire format but must not silently return a structurally
plausible value. Rename isVersionedBeanElement/Value to isBeanElement/Value
with a doc comment, since the predicate is just isBean — calling it
"versioned" suggested the unversioned-bean case was excluded.
…ures collapse

bySignature.putIfAbsent could store a non-all-current cross-product combination
under the signature that build() later marks as the writer-side current. The
stored VS's nestedBeanVersions would then misreport at least one inner bean
as living at a non-current version, violating the documented contract on
current().nestedBeanVersions(). Reachable only if two combinations canonicalize
to the same outer signature, which today's inner-bySignature collapse prevents,
but the contract should not depend on that. Add a contract test that asserts
the invariant for a deeply nested versioned bean.
@ForyVersion declares RECORD_COMPONENT as a valid target but no test exercised
the record path. Add three cases in fory-latest-jdk-tests: a record with a
String field added at v2, a record with the @ForySchema-removed-field History
interface, and a record with a primitive int field added at v2 (verifying the
0 default).
Tighten the row-format schema-evolution doc to reflect the actual flag-mismatch
behavior (loud in one direction, undefined in the reverse for array/map) and
add a note that the projection codec class count grows as the product of
per-bean version counts in a composition, with retiring history entries as
the way to bound it.
The class-level javadoc had a trailing "<p>, ganrunsheng" line, evidently a
truncated tag. Reduce the class doc to its one useful sentence.
decode(byte[]) takes bytes.length as the payload size unconditionally, paired
with encode(T) which writes no size prefix either. The MemoryBuffer overloads
respect sizeEmbedded for stream framing; the byte[] overloads do not, because
they handle a single message. Comment the three byte[] decode methods so the
asymmetry isn't read as a bug.
Three small edits in the row-format schema-evolution section: name all
primitive defaults (0, 0.0, false), fold the "parameterized types are
expressed naturally" assertion into the lead-in to the removed-field example,
and drop the trailing sentence that restated what the example already showed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant