[spark] support persist source data to avoid loading data repeatedly by Stefanietry · Pull Request #8081 · apache/paimon

Stefanietry · 2026-06-02T12:09:03Z

Purpose
Purpose: In the UpdateAction mode, it avoids redundant calculations during the process of computing dataSplits and performing join concatenation by persisting the source data.
Linked issue: #8080

Tests
Add SparkDataEvolutionITCase

JingsongLi · 2026-06-02T15:29:15Z

                                    + "outweighs the benefit of pruning untouched files.");

+    public static final ConfigOption<Boolean> DATA_EVOLUTION_DATA_SOURCE_PERSIST_ENABLED =
+            key("data-evolution.data.source.persist.enabled")


data-evolution.merge-into.source-persist

Done, the conf has been modified as suggested.

JingsongLi · 2026-06-03T13:20:28Z

+                        + "    'manifest.compression' = 'snappy',\n"
+                        + "    'row-tracking.enabled' = 'true',\n"
+                        + "    'data-evolution.enabled' = 'true',\n"
+                        + "    'data-evolution.data.source.persist.enabled' = 'true',\n"


This still uses the old option name. The PR adds data-evolution.merge-into.source-persist, so this table keeps the new option at its default false and the test never exercises the persist path. Please switch this to the new key.

Thanks for your reminding, it has been corrected.

JingsongLi · 2026-06-03T13:20:46Z

      val sourceTableProjExprs =
        allReadFieldsOnSource.toSeq :+ Alias(TrueLiteral, ROW_FROM_SOURCE)()
-      val sourceTableProj = Project(sourceTableProjExprs, sourceTable)
+      val sourceChild = persistSourceDss.map(_.queryExecution.logical).getOrElse(sourceTable)


This only wires the cached source into the matched/update path. For a MERGE that has both matched and not-matched clauses, insertActionInvoke still builds its left-anti join from sourceTable, so the source is scanned again after the update path. Could you pass the persisted source into the insert path too, so the new option avoids repeated source loading for the whole merge action?

Done, the persisted source has been successfully passed to the insert path as per the suggestion, thereby avoiding redundant calculations.

JingsongLi · 2026-06-03T13:21:15Z

The Spark 4.0 implementation also needs the same change. paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/paimon/spark/commands/MergeIntoPaimonDataEvolutionTable.scala is a version-specific copy of this command, and it still creates sourceDss from sourceTable and joins sourceTable directly. As a result, data-evolution.merge-into.source-persist has no effect for the Spark 4.0 artifact unless this file is updated as well.

Stefanietry · 2026-06-04T08:24:03Z

The Spark 4.0 implementation also needs the same change. paimon-spark/paimon-spark-4.0/src/main/scala/org/apache/paimon/spark/commands/MergeIntoPaimonDataEvolutionTable.scala is a version-specific copy of this command, and it still creates sourceDss from sourceTable and joins sourceTable directly. As a result, data-evolution.merge-into.source-persist has no effect for the Spark 4.0 artifact unless this file is updated as well.

Done, the optimization have been implemented in Spark 4.0.

JingsongLi · 2026-06-04T08:47:32Z

-    if (plan.snapshotId() != null) {
-      writer.rowIdCheckConflict(plan.snapshotId())
+    val persistSourceDss: Option[Dataset[Row]] =
+      if (table.coreOptions().dataEvolutionMergeIntoSourcePersist() && matchedActions.nonEmpty) {


This guard means the new option has no effect for insert-only MERGE statements (WHEN NOT MATCHED without any matched update). That path can still load the source twice when file pruning is enabled: targetRelatedSplits builds sourceDss to find touched splits, and insertActionInvoke then builds the left-anti join from sourceTable again. Since this option is described as persisting the source for merge-into, could we enable it whenever the source may be reused, e.g. table.coreOptions().dataEvolutionMergeIntoSourcePersist() && (matchedActions.nonEmpty || notMatchedActions.nonEmpty) or otherwise document that it is intentionally update-only?

Stefanietry force-pushed the support_persist_for_data_evolution branch 2 times, most recently from 4e99876 to cdf7d4c Compare June 2, 2026 13:00

JingsongLi reviewed Jun 2, 2026

View reviewed changes

Stefanietry force-pushed the support_persist_for_data_evolution branch 4 times, most recently from a3b196c to abfcebd Compare June 3, 2026 06:15

JingsongLi closed this Jun 3, 2026

JingsongLi reopened this Jun 3, 2026

Stefanietry closed this Jun 3, 2026

Stefanietry reopened this Jun 3, 2026

Stefanietry force-pushed the support_persist_for_data_evolution branch from abfcebd to 95e9f9b Compare June 3, 2026 09:32

JingsongLi reviewed Jun 3, 2026

View reviewed changes

Stefanietry force-pushed the support_persist_for_data_evolution branch from 95e9f9b to 287eab9 Compare June 4, 2026 06:01

JingsongLi reviewed Jun 4, 2026

View reviewed changes

[spark] support persist source data to avoid loading data repeatedly

58f4ecd

Stefanietry force-pushed the support_persist_for_data_evolution branch from 287eab9 to 58f4ecd Compare June 4, 2026 09:52

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[spark] support persist source data to avoid loading data repeatedly#8081

[spark] support persist source data to avoid loading data repeatedly#8081
Stefanietry wants to merge 1 commit into
apache:masterfrom
Stefanietry:support_persist_for_data_evolution

Stefanietry commented Jun 2, 2026

Uh oh!

JingsongLi Jun 2, 2026

Uh oh!

Stefanietry Jun 3, 2026

Uh oh!

JingsongLi Jun 3, 2026

Uh oh!

Stefanietry Jun 4, 2026

Uh oh!

JingsongLi Jun 3, 2026

Uh oh!

Stefanietry Jun 4, 2026

Uh oh!

JingsongLi commented Jun 3, 2026

Uh oh!

Stefanietry commented Jun 4, 2026

Uh oh!

JingsongLi Jun 4, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

Stefanietry commented Jun 2, 2026

Uh oh!

JingsongLi Jun 2, 2026

Choose a reason for hiding this comment

Uh oh!

Stefanietry Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Stefanietry Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi Jun 3, 2026

Choose a reason for hiding this comment

Uh oh!

Stefanietry Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

JingsongLi commented Jun 3, 2026

Uh oh!

Stefanietry commented Jun 4, 2026

Uh oh!

JingsongLi Jun 4, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants