[spark] support distributed execution of vector search on spark by Stefanietry · Pull Request #8108 · apache/paimon

Stefanietry · 2026-06-03T13:15:08Z

Purpose
Purpose: Currently, the vector search operation is executed on a single node within the driver, which can become a performance bottleneck on large datasets. This PR implements distributed execution of vector search on Spark.
Linked issue: #8107

Tests
Add a distributed vector search test, enabled via the parameter vector-search.distribute.enabled, to org.apache.paimon.spark.SparkMultimodalITCase#testVector

JingsongLi · 2026-06-03T13:31:50Z

+        Broadcast<RoaringNavigableMap64> preFilterBroadcast =
+                preFilter == null ? null : engineContext.broadcast(preFilter);
+
+        SerializableFunction<List<byte[]>, Optional<byte[]>> task =


This distributed path returns java.util.Optional<byte[]> from the Spark task and then collects it back to the driver. java.util.Optional is not Serializable in Java 8, so Spark will fail serializing the task result with NotSerializableException once this branch actually runs. Could we return a serializable value instead, for example byte[] with null meaning empty, or a small serializable wrapper?
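
To make the serialization concern concrete, here is a minimal, standalone sketch. It only uses plain Java serialization (it does not touch Spark or Paimon), and `MaybeBytes` is a hypothetical wrapper name chosen for illustration, not something from this PR. It shows that an `Optional` is rejected by `ObjectOutputStream`, while a small `Serializable` wrapper with `null` meaning "empty" is accepted.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.NotSerializableException;
import java.io.ObjectOutputStream;
import java.io.Serializable;
import java.io.UncheckedIOException;
import java.util.Optional;

final class TaskResultSketch {

    /** Hypothetical wrapper (not part of the PR): a Serializable stand-in for Optional<byte[]>. */
    static final class MaybeBytes implements Serializable {
        private static final long serialVersionUID = 1L;
        private final byte[] bytes; // null means "no result"

        private MaybeBytes(byte[] bytes) {
            this.bytes = bytes;
        }

        static MaybeBytes of(byte[] bytes) {
            return new MaybeBytes(bytes);
        }

        static MaybeBytes empty() {
            return new MaybeBytes(null);
        }

        boolean isPresent() {
            return bytes != null;
        }

        byte[] get() {
            return bytes;
        }
    }

    /** Returns true if plain Java serialization accepts the value. */
    static boolean javaSerializable(Object value) {
        try (ObjectOutputStream out = new ObjectOutputStream(new ByteArrayOutputStream())) {
            out.writeObject(value);
            return true;
        } catch (NotSerializableException e) {
            return false;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // java.util.Optional does not implement Serializable.
        System.out.println(javaSerializable(Optional.of(new byte[] {1, 2}))); // false
        // The wrapper serializes whether or not it holds a value.
        System.out.println(javaSerializable(MaybeBytes.of(new byte[] {1, 2}))); // true
        System.out.println(javaSerializable(MaybeBytes.empty())); // true
    }
}
```

Returning a bare `byte[]` with `null` for the empty case would work the same way and avoids the extra class; the wrapper only makes the "empty" case explicit at the call site.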

JingsongLi · 2026-06-03T13:32:16Z

        assertThat(df.columns()).hasSize(4);
        rows = df.collectAsList();
        assertThat(rows).hasSize(5);
+        spark.sql("set spark.paimon.vector-search.distribute.enabled = true;");


This assertion does not seem to exercise the new Spark-distributed path: the table only has a small number of vector splits, while SparkVectorReadImpl falls back to super.read unless splits.size() >= global-index.thread-num * 2 (default 64 splits). Because of that, the serialization/distributed execution code can be broken and this test would still pass. Could we force the distributed branch in this test, for example by setting spark.paimon.global-index.thread-num=1 or by creating enough index shards/splits?
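
A small sketch of the fallback condition as described above may help show why the test stays on the local path. The method name `usesDistributedPath` is hypothetical and the split count of 3 is made up for illustration; the default thread number of 32 is inferred from the "default 64 splits" figure in this comment, not read from the code.

```java
final class DistributeDecisionSketch {

    /** Condition as described in the review: distribute only when splits >= threadNum * 2. */
    static boolean usesDistributedPath(int splitCount, int globalIndexThreadNum) {
        return splitCount >= globalIndexThreadNum * 2;
    }

    public static void main(String[] args) {
        int smallTableSplits = 3;      // assumption: a handful of splits, as in a small test table
        int inferredDefaultThreads = 32; // inferred from "thread-num * 2 (default 64 splits)"

        // With the inferred default, the small table falls back to the local read.
        System.out.println(usesDistributedPath(smallTableSplits, inferredDefaultThreads)); // false

        // With thread-num set to 1, the same table crosses the threshold (3 >= 2).
        System.out.println(usesDistributedPath(smallTableSplits, 1)); // true
    }
}
```

Under these assumptions, adding something like `spark.sql("set spark.paimon.global-index.thread-num = 1;")` before the distributed assertion would be one way to push the test onto the new branch.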

JingsongLi · 2026-06-03T13:33:34Z

+        return dataOutputSerializer.getCopyOfBuffer();
+    }
+
+    public ScoredGlobalIndexResult deserialize(byte[] data) throws IOException {


This helper cannot round-trip an empty ScoredGlobalIndexResult. serialize() writes only scoreSize=0 for scored results whose bitmap is empty, and the existing deserializer interprets scoreSize == 0 as a plain GlobalIndexResult; this deserialize(byte[]) method then fails the instanceof ScoredGlobalIndexResult check. In the distributed reader, a split group can legitimately produce an empty scored result when the scalar pre-filter excludes all rows in that group, so this can make filtered distributed searches fail even though the local path handles empty optionals. We probably need an explicit scored/non-scored marker in the serialization format, or avoid serializing empty scored results as successful task results.
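
As an illustration of the first option, here is a simplified, standalone sketch of a format with an explicit scored/non-scored marker. It is not Paimon's actual serialization layout: `Decoded` is a hypothetical stand-in for the result types, and the bitmap is reduced to a map of row id to score (a plain result is represented here by an empty map). The point is only that the marker is written even when there are no scores, so an empty scored result survives the round trip.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.LinkedHashMap;
import java.util.Map;

final class ScoredMarkerSketch {

    /** Hypothetical decoded form; not Paimon's GlobalIndexResult types. */
    static final class Decoded {
        final boolean scored;
        final Map<Long, Float> scores; // row id -> score

        Decoded(boolean scored, Map<Long, Float> scores) {
            this.scored = scored;
            this.scores = scores;
        }
    }

    static byte[] serialize(boolean scored, Map<Long, Float> scores) {
        ByteArrayOutputStream bytes = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(bytes)) {
            out.writeBoolean(scored); // explicit marker, written even when scores is empty
            out.writeInt(scores.size());
            for (Map.Entry<Long, Float> entry : scores.entrySet()) {
                out.writeLong(entry.getKey());
                out.writeFloat(entry.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return bytes.toByteArray();
    }

    static Decoded deserialize(byte[] data) {
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(data))) {
            boolean scored = in.readBoolean(); // the kind comes from the marker, not from size == 0
            int size = in.readInt();
            Map<Long, Float> scores = new LinkedHashMap<>();
            for (int i = 0; i < size; i++) {
                long rowId = in.readLong();
                float score = in.readFloat();
                scores.put(rowId, score);
            }
            return new Decoded(scored, scores);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // An empty scored result keeps its kind after a round trip.
        Decoded empty = deserialize(serialize(true, new LinkedHashMap<>()));
        System.out.println(empty.scored + " " + empty.scores.size()); // true 0
    }
}
```

The cost is one extra byte per result and a format change that readers of previously written data would have to account for, which is why skipping empty scored results on the task side may be the less invasive alternative.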

[spark] support distributed execution of vector search on spark

93353be

JingsongLi reviewed Jun 3, 2026

Stefanietry wants to merge 1 commit into apache:master from Stefanietry:opt_vector_search_on_spark
