
[spark] Make LakeSplit extend Serializable to simplify Spark serialization #3123

Open
YannByron wants to merge 6 commits into apache:main from YannByron:main-lakesplit

Conversation

@YannByron
Contributor

Summary

  • Make LakeSplit extend java.io.Serializable so that Spark can transport splits directly, instead of manually serializing and deserializing them at the byte level via SimpleVersionedSerializer
  • Replace lakeSplitBytes: Array[Byte] with lakeSplit: LakeSplit / lakeSplits: java.util.List[LakeSplit] in the InputPartition case classes
  • Remove serializeLakeSplits/deserializeLakeSplits from FlussLakeUtils and splitSerializer from the batch/reader method chains
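The shape of the change can be sketched as follows. This is a simplified stand-alone sketch, not the actual Fluss code: the names `TestingLakeSplit` and `SparkInputPartition` follow the PR description, but the real input partitions are Scala case classes and the real splits wrap lake-format types.

```java
import java.io.Serializable;
import java.util.Collections;
import java.util.List;

public class Main {
    // Sketch only: LakeSplit now extends Serializable, so split
    // implementations travel to executors via plain Java serialization.
    interface LakeSplit extends Serializable {
        List<String> partition();
    }

    // Stand-in for a real split; the real ones wrap Paimon's DataSplit or
    // Iceberg's FileScanTask, which are already java.io.Serializable.
    static class TestingLakeSplit implements LakeSplit {
        private static final long serialVersionUID = 1L;
        private final List<String> partition;

        TestingLakeSplit(List<String> partition) {
            this.partition = partition;
        }

        public List<String> partition() {
            return partition;
        }
    }

    // The input partition carries the split itself, not lakeSplitBytes.
    static class SparkInputPartition implements Serializable {
        private static final long serialVersionUID = 1L;
        final LakeSplit lakeSplit;

        SparkInputPartition(LakeSplit lakeSplit) {
            this.lakeSplit = lakeSplit;
        }
    }

    public static void main(String[] args) {
        SparkInputPartition p = new SparkInputPartition(
                new TestingLakeSplit(Collections.singletonList("dt=2026-04-21")));
        // No serializer involved: the split is read straight off the partition.
        System.out.println(p.lakeSplit.partition().get(0));
    }
}
```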

Closes #3122

Test plan

  • LakeSplitSerializationTest — verifies TestingLakeSplit round-trips through Java serialization
  • fluss-common, fluss-lake-paimon, fluss-lake-iceberg, fluss-spark-common all compile
  • Spark lake integration tests in fluss-spark/fluss-spark-ut/
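What LakeSplitSerializationTest verifies can be sketched as a plain Java serialization round trip. The helpers and the `TestingLakeSplit` stand-in below are illustrative, not the actual test fixture:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.ObjectInputStream;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.util.Arrays;
import java.util.List;

public class Main {
    interface LakeSplit extends Serializable {
        List<String> partition();
    }

    static class TestingLakeSplit implements LakeSplit {
        private static final long serialVersionUID = 1L;
        private final List<String> partition;

        TestingLakeSplit(List<String> partition) {
            this.partition = partition;
        }

        public List<String> partition() {
            return partition;
        }
    }

    // Write the object graph to bytes, as Spark's Java serializer would.
    static byte[] serialize(Object o) throws IOException {
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        try (ObjectOutputStream oos = new ObjectOutputStream(bos)) {
            oos.writeObject(o);
        }
        return bos.toByteArray();
    }

    // Read it back on the "executor" side.
    static Object deserialize(byte[] bytes) throws IOException, ClassNotFoundException {
        try (ObjectInputStream ois = new ObjectInputStream(new ByteArrayInputStream(bytes))) {
            return ois.readObject();
        }
    }

    public static void main(String[] args) throws Exception {
        LakeSplit original = new TestingLakeSplit(Arrays.asList("dt=2026-04-21"));
        LakeSplit copy = (LakeSplit) deserialize(serialize(original));
        // Round-trips without any SimpleVersionedSerializer plumbing.
        System.out.println(copy.partition());
    }
}
```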

🤖 Generated with Claude Code

…ation

LakeSplit objects were previously serialized to Array[Byte] via
SimpleVersionedSerializer on the Spark driver side, stored in
InputPartition case classes, then deserialized on executors using a
re-created serializer. This added unnecessary complexity and coupling.

Since both Paimon's DataSplit and Iceberg's FileScanTask are already
java.io.Serializable, LakeSplit can safely extend Serializable, allowing
Spark to transport splits directly via Java serialization.

Changes:
- LakeSplit extends java.io.Serializable; PaimonSplit/TestingLakeSplit add serialVersionUID
- InputPartition case classes use LakeSplit directly instead of Array[Byte]
- Remove splitSerializer from batch/reader method chains
- Remove serializeLakeSplits/deserializeLakeSplits from FlussLakeUtils
- Add LakeSplitSerializationTest

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@leonardBang self-requested a review on April 21, 2026 at 08:20
Comment thread on fluss-common/src/test/java/org/apache/fluss/lake/source/TestingLakeSplit.java (outdated)
@beryllw
Contributor

beryllw commented Apr 21, 2026

Thanks for the pr. LGTM!

YannByron and others added 3 commits on April 21, 2026 at 22:05
Use a unique table name in testJavaSerializationRoundTrip to avoid
AlreadyExistsException when running alongside testSerializeAndDeserialize,
since both tests share a static Iceberg catalog without per-test cleanup.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…lation

Same issue as the Iceberg counterpart: testJavaSerializationRoundTrip
shared DEFAULT_TABLE with testSerializeAndDeserialize. While Paimon's
createTable uses ignoreIfExists=true so it wouldn't throw, the second
test would silently append to the existing table, breaking test isolation.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
…nique table names

Drop the default table before each test method in IcebergSourceTestBase
and PaimonSourceTestBase to ensure test isolation. This is a more robust
approach than requiring each test to use a unique table name.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
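The isolation fix described above, dropping the shared default table before each test method, can be illustrated with a minimal sketch. The in-memory map below is a hypothetical stand-in for the static Iceberg/Paimon catalog, and `beforeEach` mirrors the role of a per-test setup hook in the test bases:

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class Main {
    // Hypothetical in-memory "catalog" shared across tests, standing in for
    // the static catalog in IcebergSourceTestBase / PaimonSourceTestBase.
    static final Map<String, List<String>> catalog = new HashMap<>();
    static final String DEFAULT_TABLE = "default_table";

    // Mirrors a per-test setup hook: drop the default table if it exists,
    // so every test starts from a clean slate.
    static void beforeEach() {
        catalog.remove(DEFAULT_TABLE);
    }

    // A "test" that creates the table if needed and appends one row.
    static void testWrite(String row) {
        beforeEach();
        catalog.computeIfAbsent(DEFAULT_TABLE, k -> new ArrayList<>()).add(row);
        // Always 1: without beforeEach(), the second test would silently
        // append to the first test's table and see 2 rows.
        System.out.println(catalog.get(DEFAULT_TABLE).size());
    }

    public static void main(String[] args) {
        testWrite("a");
        testWrite("b");
    }
}
```

This is also why dropping the table in setup is more robust than unique table names: no test can forget the convention.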