[globalindex] Support multi-column GlobalIndex framework#7933
CrownChu wants to merge 19 commits into
394d600 to 682e613
JingsongLi left a comment
Review: [globalindex] Support multi-column GlobalIndex framework
Overall this is a well-structured change that cleanly extends the single-column SPI to support multi-column indexes. The backward compatibility is handled well via default methods. I have a few concerns about correctness and robustness:
1. [Bug] Flink BuildIndexOperator: Multi-column writer receives row with extra _ROW_ID field
File: paimon-flink/paimon-flink-common/src/main/java/org/apache/paimon/flink/globalindex/GenericIndexTopoBuilder.java
The projected row type is built as:
List<String> readColumns = new ArrayList<>(indexColumns);
readColumns.add(SpecialFields.ROW_ID.name());
RowType projectedRowType = SpecialFields.rowTypeWithRowId(rowType).project(readColumns);
So the InternalRow read from data has the layout [indexCol1, indexCol2, ..., _ROW_ID]. In the multi-column branch:
if (multiColumn) {
    ((GlobalIndexMultiColumnWriter) indexWriter).write(row);
}
The entire row (including the trailing _ROW_ID field) is passed to the writer. But the writer was created with only the index fields via createIndexWriter(table, indexType, indexFields, mergedOptions), and the Javadoc on GlobalIndexMultiColumnWriter.write() states: "The row layout matches the fields order passed to GlobalIndexerFactory#create(List, Options)".
This is a contract violation. Writer implementations that iterate over all row fields or check row.getFieldCount() will see an unexpected extra column. You should either:
- Create a sub-projection of the row that excludes _ROW_ID before passing it to the writer, or
- Clearly document that the row may contain trailing fields beyond the indexed columns (and update the Javadoc accordingly).
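The first option can be illustrated with a minimal, self-contained sketch. It does not use Paimon's InternalRow or ProjectedRow; the class RowIdStrippingSketch and the method indexColumnsOnly are hypothetical names, and a plain Object[] stands in for the projected row with the layout [indexCol1, indexCol2, ..., _ROW_ID].

```java
import java.util.Arrays;

// Hypothetical stand-in: a plain Object[] plays the role of the projected row.
final class RowIdStrippingSketch {

    /** Keeps only the leading indexFieldCount values, dropping trailing fields such as _ROW_ID. */
    static Object[] indexColumnsOnly(Object[] projectedRow, int indexFieldCount) {
        if (projectedRow.length < indexFieldCount) {
            throw new IllegalArgumentException(
                    "Row has " + projectedRow.length
                            + " fields, expected at least " + indexFieldCount);
        }
        return Arrays.copyOf(projectedRow, indexFieldCount);
    }

    public static void main(String[] args) {
        // Layout assumed here: [indexCol1, indexCol2, _ROW_ID]
        Object[] projected = {"title text", "content text", 42L};
        Object[] forWriter = indexColumnsOnly(projected, 2);
        System.out.println("Fields passed to the writer: " + forWriter.length);
    }
}
```

In the real operator the same effect would presumably be achieved with a row projection rather than an array copy, so that no per-row allocation is needed.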
2. [Correctness] No null-field handling in multi-column mode (Flink)
In single-column mode, if the indexed field is null, the shard stops early:
Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
if (fieldData == null) {
    LOG.info("Null value at rowId={}, stopping shard [{}, {}].", ...);
    break;
}
In multi-column mode, the row is passed directly without any null check on individual fields. If one of the indexed columns is null in a multi-column row, the behavior depends entirely on the writer implementation. This asymmetry could lead to:
- Writer failures (NPE inside Lucene, for example)
- Silent corruption of the index
Consider at minimum documenting this contract difference, or adding validation that checks indexed columns for null before passing to the multi-column writer.
3. [Robustness] Unsafe cast without instanceof check
Files: GenericIndexTopoBuilder.java (Flink), DefaultGlobalIndexBuilder.java (Spark)
Both paths cast the writer based solely on the multiColumn flag:
((GlobalIndexMultiColumnWriter) indexWriter).write(row);
However, GlobalIndexerFactory.create(List<DataField>, Options) has a default implementation that falls back to single-column:
default GlobalIndexer create(List<DataField> fields, Options options) {
    return create(fields.get(0), options);
}
If an existing factory has not been updated to support multi-column (and relies on this default), it will return a GlobalIndexSingletonWriter, and the cast will fail with a ClassCastException at runtime.
Suggestion: Add a validation check after writer creation:
if (multiColumn && !(indexWriter instanceof GlobalIndexMultiColumnWriter)) {
    throw new UnsupportedOperationException(
            "Index type '" + indexType + "' does not support multi-column indexing. " +
            "The factory must override create(List<DataField>, Options) and return a GlobalIndexMultiColumnWriter.");
}
4. [Design] Interface default method creates silent data-loss path
File: paimon-spark/paimon-spark-common/src/main/java/org/apache/paimon/spark/globalindex/GlobalIndexTopologyBuilder.java
The new default method on the interface:
default List<CommitMessage> buildIndex(..., List<DataField> indexFields, ...) {
    return buildIndex(..., indexFields.get(0), options);
}
This silently discards all fields beyond the first for any implementation that has not been updated to override the multi-field method. Combined with DefaultGlobalIndexTopoBuilder overriding the single-field method to delegate to multi-field, this creates an unusual delegation pattern that is correct for DefaultGlobalIndexTopoBuilder but could be a trap for other implementations.
Consider adding a log warning in the default method when indexFields.size() > 1, or throwing UnsupportedOperationException instead of silently dropping fields.
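A fail-fast default method could be sketched as follows. This is a simplified illustration under stated assumptions, not the project's SPI: IndexerFactorySketch and SingleColumnOnlyFactory are hypothetical names, and plain strings stand in for DataField and GlobalIndexer.

```java
import java.util.List;

// Hypothetical, simplified stand-in for a factory SPI with a single-field and a multi-field overload.
interface IndexerFactorySketch {

    String create(String field);

    /** Delegates for exactly one field; otherwise fails fast instead of dropping fields. */
    default String create(List<String> fields) {
        if (fields.size() != 1) {
            throw new UnsupportedOperationException(
                    "This index type does not support multi-column indexing, got: " + fields);
        }
        return create(fields.get(0));
    }
}

// An implementation that was never updated for multi-column support.
final class SingleColumnOnlyFactory implements IndexerFactorySketch {
    @Override
    public String create(String field) {
        return "indexer(" + field + ")";
    }
}
```

With this shape, an implementation has to override the list overload to opt in, and callers that pass several fields to an unmodified implementation get an exception rather than an index over the first field only.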
5. [Minor] resolveFields assumes metadata consistency across range groups
File: paimon-core/src/main/java/org/apache/paimon/globalindex/GlobalIndexScanner.java
private List<DataField> resolveFields(Map<Range, List<IndexFileMeta>> metas, RowType rowType) {
    GlobalIndexMeta firstMeta =
            checkNotNull(metas.values().iterator().next().get(0).globalIndexMeta());
    ...
}
This takes the first index file's metadata from an arbitrary range to determine the field list. If different ranges have inconsistent metadata (e.g., during a transition where an old single-column index coexists with a new multi-column index for the same type), this could resolve incorrect fields. A defensive check that all ranges share the same field list would prevent hard-to-diagnose query errors.
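The defensive check suggested here might look like the sketch below. It is self-contained and deliberately avoids the real Range, IndexFileMeta, and GlobalIndexMeta types: FieldConsistencySketch is a hypothetical class, range keys are plain strings, and each value is the list of field ids recorded for that range.

```java
import java.util.List;
import java.util.Map;

final class FieldConsistencySketch {

    /**
     * Returns the field-id list shared by every range, or throws if any range disagrees.
     * Keys are hypothetical range labels; values are the field ids stored in that range's metadata.
     */
    static List<Integer> resolveConsistentFieldIds(Map<String, List<Integer>> fieldIdsByRange) {
        if (fieldIdsByRange.isEmpty()) {
            throw new IllegalArgumentException("No index metadata to resolve fields from");
        }
        List<Integer> expected = null;
        for (Map.Entry<String, List<Integer>> entry : fieldIdsByRange.entrySet()) {
            if (expected == null) {
                expected = entry.getValue();
            } else if (!expected.equals(entry.getValue())) {
                throw new IllegalStateException(
                        "Inconsistent index fields in range " + entry.getKey()
                                + ": expected " + expected + " but found " + entry.getValue());
            }
        }
        return expected;
    }
}
```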
Summary
The most critical issue is #1 (the _ROW_ID field leaking into the multi-column writer). Issues #2 and #3 are important for production robustness. Issues #4 and #5 are lower priority but worth addressing for long-term maintainability.
cab2375 to 52a3cb5
Fixes and Additions in This Round
1. [Bug] Flink BuildIndexOperator: Multi-column writer receives row with extra _ROW_ID field
Fix: The read projection includes
2. [Correctness] No null-field handling in multi-column mode (Flink)
Fix: In multi-column mode, each index field is checked individually. If any field is null, the current shard stops writing immediately (break). Only rows where all indexed columns are non-null are
3. [Robustness] Unsafe cast without instanceof check
Fix: Added
4. [Design] Interface default method creates silent data-loss path
Fix: Runtime instanceof check + exception ensures that when a factory does not properly implement multi-column support, it fails fast rather than silently falling through to SingletonWriter and losing
5. [Minor] resolveFields assumes metadata consistency across range groups
Fix: For multi-column indexes (
6. Multi-column minimum rowId for schema evolution
7. Multi-column GlobalIndexMeta storage convention
In multi-column mode,
8. New ES Topology Builder (es-multi-index)
9. Multi-condition reader predicate push-down
        task.shardRange.to);
break;
if (multiColumn) {
    ((GlobalIndexMultiColumnWriter) indexWriter).write(row);
Bug: missing null check in multi-column path.
The single-column path below checks for null and breaks the shard, and ESIndexTopoBuilder.BuildESIndexOperator.processElement() in this same PR also checks all columns for null before writing. But this multi-column path writes row directly with no null check.
Known implementations like LuminaVectorGlobalIndexWriter.write() throw IllegalArgumentException on null input — so a multi-column index containing a vector column will crash the Flink job if any row has a null value in the indexed columns.
Suggested fix — add the same null-field guard that ESIndexTopoBuilder has:
if (multiColumn) {
    boolean hasNull = false;
    for (InternalRow.FieldGetter getter : indexFieldGetters) {
        if (getter.getFieldOrNull(row) == null) {
            hasNull = true;
            break;
        }
    }
    if (hasNull) {
        LOG.info(
                "Null value in indexed columns at rowId={}, stopping shard [{}, {}].",
                currentRowId,
                task.shardRange.from,
                task.shardRange.to);
        break;
    }
    ((GlobalIndexMultiColumnWriter) indexWriter).write(row);
}
jerry-024 left a comment
The multi-column GlobalIndex SPI framework changes look good. However, the ES-specific code (ESIndexTopoBuilder, ESGlobalIndexTopoBuilder, ES routing in procedures, prefix matching in utils, and SPI registration) is a separate feature that happens to use the multi-column framework. Suggest splitting it into its own PR to keep this one focused.
Also, findMinNonIndexableRowId and filterEntriesBefore are copy-pasted across GenericIndexTopoBuilder, ESIndexTopoBuilder, and ESGlobalIndexTopoBuilder — consider extracting them into a shared utility.
 * with multi-column support.
 */
public class ESIndexTopoBuilder {
This entire file is an ES-specific topology builder, unrelated to the multi-column GlobalIndex SPI framework. Consider moving it to a separate PR to keep this one focused on the framework changes.
Also, findMinNonIndexableRowId, filterEntriesBefore, computeShardTasks, and closeWriterQuietly are copy-pasted from GenericIndexTopoBuilder — consider extracting them into a shared utility (e.g. GlobalIndexBuilderUtils).
 * parallelism. Supports both single-column and multi-column indexing.
 */
public class ESGlobalIndexTopoBuilder implements GlobalIndexTopologyBuilder {
Same as ESIndexTopoBuilder on the Flink side — this ES-specific Spark topology builder is not part of the multi-column framework and should go in a separate PR.
findMinNonIndexableRowId and filterEntriesBefore are again duplicated here (third copy). Please extract into a shared utility.
    return new String[] {
        "BTree global index created successfully for table: " + table.name()
    };
} else if (indexType.startsWith(ESIndexTopoBuilder.ES_INDEX_TYPE_PREFIX)) {
This ES routing branch (ESIndexTopoBuilder.ES_INDEX_TYPE_PREFIX) is ES-specific and not related to multi-column support. Consider moving it to the ES-specific PR.
if (builder != null) {
    return builder;
}
// Prefix match: e.g. "es-multi-index-diskbbq" matches registered "es-multi-index"
The prefix matching logic here is ES-specific — the comment even says "es-multi-index-diskbbq" matches registered "es-multi-index". Consider moving this to the ES-specific PR.
# limitations under the License.

org.apache.paimon.spark.globalindex.btree.BTreeIndexTopoBuilder
org.apache.paimon.spark.globalindex.ESGlobalIndexTopoBuilder
Registration of ESGlobalIndexTopoBuilder should go with the ES-specific PR.
136209d to a28f05f
jerry-024 left a comment
Thanks for splitting out the ES code and extracting the shared utilities — much cleaner now. Found a few issues in the latest version:
@@ -475,7 +475,7 @@ void testAppendFilterOldFilesBeforeNewFiles() {
        GenericIndexTopoBuilder.filterEntriesBefore(
                entries,
                GenericIndexTopoBuilder.findMinNonIndexableRowId(
Compile error: findMinNonIndexableRowId and filterEntriesBefore were moved from GenericIndexTopoBuilder to GlobalIndexBuilderUtils in commit 0cfc7ef, but this test still references them via GenericIndexTopoBuilder.findMinNonIndexableRowId(...) and GenericIndexTopoBuilder.filterEntriesBefore(...). This will fail to compile.
Should be:
GlobalIndexBuilderUtils.filterEntriesBefore(
        entries,
        GlobalIndexBuilderUtils.findMinNonIndexableRowId(
                schemaManager, entries, Collections.singletonList("vec")));

} else {
    Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
    if (fieldData == null) {
        LOG.info(
Multi-column writer receives extra _ROW_ID column: In the multi-column path, row is passed directly to GlobalIndexMultiColumnWriter.write(row), but this row comes from projectedRowType which is indexColumns + _ROW_ID. The writer's contract (javadoc on GlobalIndexMultiColumnWriter.write) says the row layout should match the fields passed to GlobalIndexerFactory.create(List<DataField>, Options) — which doesn't include _ROW_ID.
The previous ES-specific code handled this with a ProjectedRow that stripped _ROW_ID before writing. Consider adding a similar projection here, or clarifying the writer contract.
GlobalIndexMultiColumnWriter multiWriter =
        (GlobalIndexMultiColumnWriter) indexWriter;
rows.forEachRemaining(
        row -> {
Same _ROW_ID issue as the Flink side: rows come from a reader using readType = indexColumns + _ROW_ID, but the multi-column writer expects only index columns. The row passed to multiWriter.write(row) includes the extra _ROW_ID field.
@@ -97,7 +106,7 @@ public String[] call(
        BTreeIndexTopoBuilder.buildIndexAndExecute(
                procedureContext.getExecutionEnvironment(),
                table,
BTree silently drops extra columns: when a user passes "col1,col2" with index type btree, only indexColumns.get(0) is used — no error, no warning. Consider adding a validation:
if ("btree".equalsIgnoreCase(indexType)) {
    checkArgument(indexColumns.size() == 1,
            "BTree index only supports single column, got: %s", indexColumns);
}

    }
    return minRowId;
}
Minor: the old filterEntriesBefore in GenericIndexTopoBuilder had a LOG.info("Filtered {} files ...") line for observability. This was lost during extraction since GlobalIndexBuilderUtils has no logger. Consider adding one — this log is useful for debugging index build issues in production.
jerry-024 left a comment
Found additional high-risk issues in code paths NOT modified by this PR but broken by the introduction of MULTI_COLUMN_INDEX_FIELD_ID = -1:
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
High risk — MERGE path crash: MULTI_COLUMN_INDEX_FIELD_ID = -1 breaks existing code that calls rowType.getField(globalIndexMeta.indexFieldId()) without guarding against -1:
- MergeIntoUpdateChecker.java:104 (Flink): scans index manifest entries and does rowType.getField(globalIndexMeta.indexFieldId()), which will throw when encountering a multi-column index.
- MergeIntoPaimonDataEvolutionTable.scala:514 (Spark): same pattern, rowType.getField(globalIndexMeta.indexFieldId()).name().
Once a table has a multi-column global index, any MERGE INTO that touches indexed columns will crash with "Cannot find field by field id: -1".
Fix: these callers need to handle MULTI_COLUMN_INDEX_FIELD_ID by reading extraFieldIds() to get the actual column list.
Fix:
Added getIndexedFieldNames helper in both Flink and Spark paths:
- When indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID (-1): resolve column names from extraFieldIds()
- Otherwise: use the original single-column logic (rowType.getField(indexFieldId) + optional extraFieldIds)
Both the index filter (which entries are affected) and the error reporting (conflicted column names) now correctly handle multi-column indexes.
Affected files:
- paimon-flink/.../dataevolution/MergeIntoUpdateChecker.java
- paimon-spark/paimon-spark-common/.../MergeIntoPaimonDataEvolutionTable.scala
- paimon-spark/paimon-spark-4.0/.../MergeIntoPaimonDataEvolutionTable.scala
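A helper of the kind described in this reply could be sketched as below. It is an illustration only: IndexedFieldNamesSketch is a hypothetical class, a Map from field id to name stands in for RowType, and the sentinel value -1 is taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;

final class IndexedFieldNamesSketch {

    // Sentinel discussed in this review round for multi-column indexes.
    static final int MULTI_COLUMN_INDEX_FIELD_ID = -1;

    /** Resolves column names, skipping the sentinel and appending any extra field ids. */
    static List<String> indexedFieldNames(
            int indexFieldId, int[] extraFieldIds, Map<Integer, String> fieldNamesById) {
        List<Integer> ids = new ArrayList<>();
        if (indexFieldId != MULTI_COLUMN_INDEX_FIELD_ID) {
            ids.add(indexFieldId);
        }
        if (extraFieldIds != null) {
            for (int id : extraFieldIds) {
                ids.add(id);
            }
        }
        List<String> names = new ArrayList<>();
        for (int id : ids) {
            String name = fieldNamesById.get(id);
            if (name == null) {
                throw new IllegalArgumentException("Cannot find field by field id: " + id);
            }
            names.add(name);
        }
        return names;
    }
}
```

Joining the result with String.join(",", names) would give a display string such as title,vec for a two-column index.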
        if (textColumn.id() == id) {
            return true;
        }
    }
High risk — read path mismatch: FullTextScanImpl now correctly selects multi-column index files via the extraFieldIds check added in this PR. However, the corresponding read path in FullTextReadImpl.java:72-74 still creates the reader with:
GlobalIndexerFactoryUtils.load(indexType).create(textColumn, options)
This uses the single-column factory method. When the selected index file was built with create(List<DataField>[text, vector], options), reading it with a single-column reader will either fail to decode the file or produce incorrect results.
The scan/filter path and the read path are now inconsistent — scan discovers multi-column files, but read doesn't know how to open them. Need to use resolveFields-style metadata lookup + create(List<DataField>, Options) in the read path as well.
Before creating the GlobalIndexer, inspect the first index file's GlobalIndexMeta to determine the file's column layout:
- indexFieldId == -1 (multi-column): resolve all field IDs from extraFieldIds(), create reader with factory.create(List, options)
- Otherwise (single-column): use original factory.create(textColumn, options)
This ensures the reader matches the format used when the index file was built.
Affected file:
- paimon-core/src/main/java/org/apache/paimon/table/source/FullTextReadImpl.java
Same approach as FullTextReadImpl — inspect the first index file's GlobalIndexMeta before creating the reader:
- indexFieldId == -1 (multi-column): resolve fields from extraFieldIds(), use factory.create(List, options)
- Otherwise (single-column): use original factory.create(vectorColumn, options)
Affected file:
- paimon-core/src/main/java/org/apache/paimon/table/source/VectorReadImpl.java
for (int id : globalIndex.extraFieldIds()) {
    if (vectorColumn.id() == id || filterFieldIds.contains(id)) {
        return true;
    }
Same read path issue as FullText: VectorReadImpl.java:87-89 creates the reader with create(vectorColumn, options) (single-column), but this scan filter now discovers multi-column index files. The read side needs the same multi-column awareness.
} else {
    Object fieldData = indexFieldGetters[0].getFieldOrNull(row);
    if (fieldData == null) {
        LOG.info(
Still unresolved from last review: multi-column writer receives the full projected row including _ROW_ID. Need to project down to index-only columns before write(row). See ProjectedRow approach.
The commit had not been pushed earlier; it is pushed now.
@@ -97,7 +106,7 @@ public String[] call(
        BTreeIndexTopoBuilder.buildIndexAndExecute(
                procedureContext.getExecutionEnvironment(),
                table,
Still unresolved: BTree and other index types that don't support multi-column will silently drop extra columns. When indexColumns.size() > 1 and the factory doesn't override create(List<DataField>, Options), the default implementation falls back to fields.get(0) without any error.
Suggest: throw UnsupportedOperationException in the default GlobalIndexerFactory.create(List<DataField>, Options) when fields.size() > 1, instead of silently falling back to create(fields.get(0), options). This forces implementations to explicitly opt-in to multi-column support.
d7d7b42 to 55a445a
+1
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
Risk — IndexManifestFileHandler overlap detection false positive with multi-column indexes
IndexManifestFileHandler.java:243 uses indexFieldId to determine whether two index files belong to the same field:
retainedMeta.indexFieldId() != addedMeta.indexFieldId()
With MULTI_COLUMN_INDEX_FIELD_ID = -1, all multi-column indexes share the same sentinel value. Two indexes on different column sets (e.g. [title, vec] vs [content, embedding]) will both have indexFieldId == -1, so the handler treats them as "same field". If their row ranges overlap, it throws IllegalStateException and rejects the commit, even though they are logically independent indexes.
Suggestion: add extraFieldIds comparison (e.g. Arrays.equals) to the overlap check, or compare indexType as well.
Thanks for catching this!
Fixed in IndexManifestFileHandler.validateRetainedIndexFiles().
Split the overlap detection into two branches:
- Single-column: keep original indexFieldId comparison
- Multi-column (indexFieldId == -1): use Arrays.equals(extraFieldIds) to distinguish different column groups
This way two multi-column indexes on different column sets (e.g. [title, vec] vs [content, embedding]) won't trigger false positive overlap errors.
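The two-branch comparison described in this reply can be illustrated with a small self-contained sketch. OverlapCheckSketch and its method names are hypothetical, row ranges are assumed to be inclusive on both ends, and the sentinel -1 is the one used at this stage of the PR.

```java
import java.util.Arrays;

final class OverlapCheckSketch {

    static final int MULTI_COLUMN_INDEX_FIELD_ID = -1;

    /** Two metas describe the same index if their field ids match; for the sentinel, compare extras. */
    static boolean sameIndex(int fieldIdA, int[] extraA, int fieldIdB, int[] extraB) {
        if (fieldIdA != fieldIdB) {
            return false;
        }
        if (fieldIdA == MULTI_COLUMN_INDEX_FIELD_ID) {
            return Arrays.equals(extraA, extraB);
        }
        return true;
    }

    /** Inclusive range overlap test (an assumption of this sketch). */
    static boolean rangesOverlap(long fromA, long toA, long fromB, long toB) {
        return fromA <= toB && fromB <= toA;
    }

    /** A conflict requires both the same index and overlapping row ranges. */
    static boolean conflicts(
            int fieldIdA, int[] extraA, long fromA, long toA,
            int fieldIdB, int[] extraB, long fromB, long toB) {
        return sameIndex(fieldIdA, extraA, fieldIdB, extraB)
                && rangesOverlap(fromA, toA, fromB, toB);
    }
}
```

Under this rule, two sentinel-keyed indexes over different column groups no longer conflict even when their row ranges overlap.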
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
Minor — TableIndexesTable shows null for multi-column index field name
TableIndexesTable.java:238 does logicalRowType.getField(globalMeta.indexFieldId()) which throws when indexFieldId == -1. The exception is caught, but index_field_name silently displays null to users.
Suggestion: when indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID, resolve names from extraFieldIds() and join them with commas.
Thanks! Fixed in TableIndexesTable.toRow(). When indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID, resolve field names from extraFieldIds() and join with commas (e.g. "title,vec"). Single-column path unchanged.
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
Suggestion — add isMultiColumn() helper to GlobalIndexMeta
The sentinel check indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID is now scattered across many modules (MergeIntoUpdateChecker, MergeIntoPaimonDataEvolutionTable x2, FullTextReadImpl, VectorReadImpl, GlobalIndexScanner, GlobalIndexBuilderUtils, etc.). This is fragile — new code that touches indexFieldId() can easily forget the guard and crash on -1.
Consider adding a convenience method to GlobalIndexMeta:
public boolean isMultiColumn() {
    return indexFieldId == MULTI_COLUMN_INDEX_FIELD_ID;
}
Then all call sites replace meta.indexFieldId() == MULTI_COLUMN_INDEX_FIELD_ID with meta.isMultiColumn(), which is more readable and harder to miss.
Done. Added GlobalIndexMeta.isMultiColumn() and replaced all sentinel checks across the following classes:
- GlobalIndexScanner
- FullTextReadImpl
- VectorReadImpl
- MergeIntoUpdateChecker
- TableIndexesTable
        }
        return names;
    }
}
Code duplication — getIndexedFieldNames is copy-pasted 3 times
This same helper logic exists in:
- Here: MergeIntoUpdateChecker.java (Flink, Java)
- MergeIntoPaimonDataEvolutionTable.scala (Spark common, Scala)
- MergeIntoPaimonDataEvolutionTable.scala (Spark 4.0, Scala)
Consider extracting it as a static method in GlobalIndexMeta or GlobalIndexBuilderUtils so all three call sites can reuse a single implementation. This also reduces the risk of future inconsistencies when the logic evolves.
Done. Extracted as an instance method GlobalIndexMeta.getIndexedFieldNames(RowType) and replaced all three call sites.
Affected classes: GlobalIndexMeta, MergeIntoUpdateChecker, MergeIntoPaimonDataEvolutionTable.scala (Spark common), MergeIntoPaimonDataEvolutionTable.scala (Spark 4.0)
@@ -170,7 +198,7 @@ public InternalRow[] call(InternalRow args) {
        } catch (Exception e) {
Nit: this error message uses column (the raw comma-separated input string) instead of indexColumns (the parsed List<String>). The Flink procedure uses indexColumns in its error message — should be consistent.
// current
String.format("Failed to create %s index for columns '%s' on table '%s'.", indexType, column, tableIdent)
// suggested
String.format("Failed to create %s index for columns '%s' on table '%s'.", indexType, indexColumns, tableIdent)

Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes:
- GlobalIndexerFactory/GlobalIndexer: add List<DataField> create overloads
- GlobalIndexMultiColumnWriter: new interface for multi-column writes
- GlobalIndexBuilderUtils: toIndexFileMetas/createIndexWriter accept List<DataField>
- GlobalIndexScanner: route extraFieldIds to same reader group
- VectorScanImpl/FullTextScanImpl: match against extraFieldIds
- GenericIndexTopoBuilder (Flink): multi-column projection and writer dispatch
- DefaultGlobalIndexBuilder/TopoBuilder (Spark): multi-column support
- All single-column APIs preserved for backward compatibility
Allow index_column parameter to accept comma-separated column names (e.g. "title,embedding") for both Flink and Spark procedures. Add List<String> overload for GenericIndexTopoBuilder.buildIndexAndExecute.
…e into GlobalIndexBuilderUtils
…n, and restore observability logs
… index (indexFieldId=-1)
…-column for unsupported index types
…k, and multi-column guard
…count is unlimited
0ac045a to 6e79d86
…ion, and display fix
- Add GlobalIndexMeta.isMultiColumn() helper to replace scattered sentinel checks
- Fix IndexManifestFileHandler overlap detection for multi-column indexes
- Fix TableIndexesTable showing null for multi-column index field names
- Replace all MULTI_COLUMN_INDEX_FIELD_ID == checks with isMultiColumn()
… error message
- Add GlobalIndexMeta.getIndexedFieldNames(RowType) to eliminate copy-pasted helper
- Replace local getIndexedFieldNames in MergeIntoUpdateChecker (Flink)
- Replace local getIndexedFieldNames in MergeIntoPaimonDataEvolutionTable (Spark common & 4.0)
- Fix Spark CreateGlobalIndexProcedure error message to use indexColumns instead of column
indexColumns was declared inside the try block but referenced in the catch block's error message, which is out of scope. Hoist the parsing before the try so the catch can access it.
I reviewed the latest change and found two issues that look worth fixing before merge:
… shard
Breaking out of the shard loop on the first null indexed value dropped all later rows in the shard from the index and broke row-id alignment. Pass every row through the writer instead: a null field advances the logical row id without indexing a value, so later non-null rows are still indexed.
- Flink single-column: restore null pass-through (was a regression)
- Flink/Spark multi-column: pass the projected row through; each index type decides how to handle null fields
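The row-id alignment argument in this commit message can be made concrete with a toy example. NullPassThroughSketch is a hypothetical class, and a list of nullable strings stands in for the values of one shard; the real writers decide for themselves how to treat a null field.

```java
import java.util.ArrayList;
import java.util.List;

final class NullPassThroughSketch {

    /**
     * Returns the shard-relative row ids that would be indexed.
     * A null value still advances the row id, so later rows keep their positions.
     */
    static List<Long> indexedRowIds(List<String> shardValues) {
        List<Long> indexed = new ArrayList<>();
        long relativeRowId = 0;
        for (String value : shardValues) {
            if (value != null) {
                indexed.add(relativeRowId);
            }
            relativeRowId++; // advance for every row, null or not
        }
        return indexed;
    }
}
```

For a shard holding the values a, null, c, the indexed ids are 0 and 2. Breaking out of the loop at the null would have indexed only row 0 and silently dropped row 2.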
…groups
The scanner mapped each field id to a single multi-column group, so a field shared by several multi-column indexes (e.g. (a,b) and (a,c)) threw "Inconsistent extraFieldIds" or silently dropped readers. Model fieldId -> list of groups instead. For evaluation, every index covering a single field returns the same matching row ids, so pick one index rather than running them all: prefer the single-column index, otherwise fall back to one multi-column group.
JingsongLi left a comment
The current design follows a primary/secondary model: a primary field indicates ownership of the index, and auxiliary fields assist it. It is best not to change this design.
It can be seen that current APIs such as GlobalIndexMeta, VectorSearchBuilder, FullTextSearchBuilder, etc. are designed around this concept.
…ndexes
Previously a multi-column index stored indexFieldId=-1 and put all field ids in extraFieldIds, treating columns as parallel. Switch to a primary-column model: indexFieldId is always the first (primary) column and extraFieldIds holds the remaining columns. A primary column can own at most one index.
- GlobalIndexMeta: isMultiColumn() based on extraFieldIds; add getIndexedFieldIds() and getIndexedFields(); unify getIndexedFieldNames()
- GlobalIndexBuilderUtils: drop MULTI_COLUMN_INDEX_FIELD_ID; first column becomes the primary, rest become extraFieldIds
- GlobalIndexScanner: key indexes by primary field id; reject conflicting indexes that share a primary with different columns
- IndexManifestFileHandler: reject added index files sharing a primary with an existing one over an overlapping row range
- FullText/VectorReadImpl: resolve the full column list via getIndexedFields()
- TableIndexesTable: show all indexed column names; log when names cannot resolve
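The primary-column model described in this commit can be sketched in a few lines. PrimaryColumnMetaSketch is a hypothetical, simplified stand-in and not the actual GlobalIndexMeta class; only the rule "first column is the primary, the rest are extras" is taken from the commit message.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical, simplified stand-in for index metadata under the primary-column model.
final class PrimaryColumnMetaSketch {

    final int indexFieldId;    // primary column id
    final int[] extraFieldIds; // remaining column ids, empty for a single-column index

    private PrimaryColumnMetaSketch(int indexFieldId, int[] extraFieldIds) {
        this.indexFieldId = indexFieldId;
        this.extraFieldIds = extraFieldIds;
    }

    /** The first column becomes the primary; the rest become extra field ids. */
    static PrimaryColumnMetaSketch fromColumns(List<Integer> columnIds) {
        if (columnIds.isEmpty()) {
            throw new IllegalArgumentException("At least one index column is required");
        }
        int[] extras = new int[columnIds.size() - 1];
        for (int i = 1; i < columnIds.size(); i++) {
            extras[i - 1] = columnIds.get(i);
        }
        return new PrimaryColumnMetaSketch(columnIds.get(0), extras);
    }

    boolean isMultiColumn() {
        return extraFieldIds.length > 0;
    }

    /** Primary id first, then the extras in their stored order. */
    List<Integer> indexedFieldIds() {
        List<Integer> ids = new ArrayList<>();
        ids.add(indexFieldId);
        for (int id : extraFieldIds) {
            ids.add(id);
        }
        return ids;
    }
}
```

Because indexFieldId always holds a real column id in this model, existing callers that look the field up by id keep working without a sentinel guard.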
…overlap checks
- GlobalIndexScanner: split single-/multi-column lookups (IndexMetaFileGroup), single-column index takes priority, fall back to the first multi-column index that has the field as an extra; reject a primary owning multiple indexes
- GlobalIndexMultiColumnWriter.write now takes the shard-relative row id; the builders pass projected index columns plus that id
- DefaultGlobalIndexBuilder (Spark): multi-column skips rows outside the shard range so the relative row id stays valid for boundary-spanning files
- IndexManifestFileHandler: same-primary indexes with different columns always conflict, same columns only conflict on overlapping ranges
- FullText/VectorScanImpl: match indexes by their primary column
JingsongLi left a comment
I found one issue that should be addressed before this is considered ready.
create_global_index now accepts multi-column global indexes for every index type except btree, and then passes the multi-column indexFields into the topology builder. However, the default GlobalIndexerFactory#create(List<DataField>, Options) still throws UnsupportedOperationException when more than one field is provided, and the existing real factories such as lumina and tantivy-fulltext only implement the single-column create(DataField, Options) method.
That means a user can submit a multi-column lumina / tantivy-fulltext global index creation request, but the job will fail later at runtime when DefaultGlobalIndexBuilder calls createIndexWriter(..., indexFields, ...) and the factory rejects the list. The procedure should either reject unsupported multi-column index types up front, or the intended index types need to implement the multi-column factory/writer path fully.
… time
Add GlobalIndexerFactory.supportsMultiColumn() (default false). CreateGlobalIndexProcedure (Spark and Flink) now checks it up front and fails fast with a clear message when a multi-column index is requested for a type whose factory does not support it, instead of failing later in the build job when create(List) throws.
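The up-front capability check could be sketched as follows. Only the method name supportsMultiColumn() and its default of false come from the commit message; MultiColumnCapabilitySketch, CreateIndexValidationSketch, checkColumns, and the error wording are hypothetical.

```java
import java.util.List;

// Hypothetical stand-in for the factory SPI; only supportsMultiColumn() mirrors the commit message.
interface MultiColumnCapabilitySketch {
    default boolean supportsMultiColumn() {
        return false;
    }
}

final class CreateIndexValidationSketch {

    /** Rejects a multi-column request before any build job is submitted. */
    static void checkColumns(
            String indexType, List<String> indexColumns, MultiColumnCapabilitySketch factory) {
        if (indexColumns.size() > 1 && !factory.supportsMultiColumn()) {
            throw new IllegalArgumentException(
                    "Index type '" + indexType
                            + "' does not support multi-column indexes, got columns: "
                            + indexColumns);
        }
    }
}
```

A factory that implements the multi-column writer path would override supportsMultiColumn() to return true, and the same check would then let its requests through.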
Extend the GlobalIndex SPI, build path, and query path to support one index builder handling multiple columns (e.g. Lucene indexing title + content + tags together). Key changes:
Purpose
Some index engines (e.g. Lucene) can build a single index over multiple columns, such as full-text on title and vector on embedding in the same index file. Previously the GlobalIndex SPI only supported one column per indexer: GlobalIndexerFactory.create(DataField, Options) and GlobalIndexSingletonWriter.write(Object). This meant multi-column engines had to create separate index files per column, losing co-located search benefits and doubling I/O.
This PR adds a multi-column path through the entire stack:
accepts InternalRow with all indexed columns projected in field order.
field list from metadata and passes it to the factory. Extra field IDs are registered in the indexMetas map so queries against any column in the group find the same reader.
includes embedding as an extra field.
findMinNonIndexableRowId checks containsAll(indexColumns) for schema evolution safety.
All existing single-column callers are unchanged — new APIs have default implementations that delegate to the original single-column methods.
Tests
queries on same index file; also tests SPI discovery path via GlobalIndexer.create("lucene", ...) (2 tests)
VectorSearchBuilder/FullTextSearchBuilder → ReadBuilder.newScan().withGlobalIndexResult(), reads back rows and asserts correctness (3 tests)