[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient training rows. #64082

Open
kaka11chen wants to merge 7 commits into apache:master from kaka11chen:ann-build-full-buffer-no-spill

@kaka11chen
Contributor

@kaka11chen kaka11chen commented Jun 3, 2026

What problem does this PR solve?

Issue Number: None

Related PR: None

Problem Summary:

This PR fixes several ANN index build issues:

  1. ANN index writer previously pre-reserved ann_index_build_chunk_size * dim floats during init, which could allocate excessive memory immediately for high-dimensional vectors.
  2. For train-required indexes such as IVF, IVF_ON_DISK, and PQ-quantized indexes, chunk-level training could train FAISS with only part of the segment data, causing poor or even zero recall.
  3. IVF_ON_DISK did not use nlist as its minimum FAISS training row requirement.
  4. Segments with fewer rows than the minimum training requirement could fail during build instead of skipping ANN index persistence.
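The arithmetic behind item 1 can be sketched as follows. This is only an illustration: `build_buffer_bytes` is a hypothetical helper, not a Doris function, and the chunk size and dimension used in the note below are made-up values rather than Doris defaults.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Bytes needed to hold `rows` float32 vectors of dimension `dim`.
// Hypothetical helper for illustration; it mirrors the
// "ann_index_build_chunk_size * dim floats" reservation described above.
constexpr std::uint64_t build_buffer_bytes(std::uint64_t rows, std::uint64_t dim) {
    return rows * dim * static_cast<std::uint64_t>(sizeof(float));
}
```

With 4-byte floats, a chunk of 1,000,000 rows at dimension 1024 works out to 4,096,000,000 bytes reserved at init, before a single row has been written.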

This PR changes the build behavior as follows:

  1. Remove init-time large build-buffer reservation.
  2. Buffer the segment's vectors and train indexes that require training exactly once, using the full segment data.
  3. Skip persisting ANN indexes for empty or too-small segments.
  4. Add ann_index_build_min_segment_rows so small ANN indexes can be skipped by a Doris-side row threshold.
  5. Treat IVF_ON_DISK minimum training rows consistently with IVF.
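A minimal sketch of the skip rule described in items 3 and 5, assuming the minimum training row count for both IVF and IVF_ON_DISK is `nlist`. The enum and both helpers are hypothetical names for illustration; the additional requirement for PQ-quantized indexes is deliberately left out because the description does not state it.

```cpp
#include <cassert>
#include <cstdint>

// Hypothetical index kinds for illustration only.
enum class AnnIndexType { HNSW, FLAT, IVF, IVF_ON_DISK };

// Assumed rule: IVF and IVF_ON_DISK both need at least `nlist` training rows;
// index types that need no training have no minimum.
inline std::int64_t min_train_rows(AnnIndexType type, std::int64_t nlist) {
    switch (type) {
    case AnnIndexType::IVF:
    case AnnIndexType::IVF_ON_DISK:
        return nlist;
    default:
        return 0;
    }
}

// Empty or too-small segments skip index persistence instead of failing.
inline bool should_skip_index(std::int64_t segment_rows, AnnIndexType type,
                              std::int64_t nlist) {
    return segment_rows == 0 || segment_rows < min_train_rows(type, nlist);
}
```

Under these assumptions, a 100-row segment with `nlist = 128` is skipped for both IVF variants, while a 128-row segment is built.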

Release note

Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient training rows.

Check List (For Author)

  • Test

    • Regression test
      • ./run-regression-test.sh --run -d ann_index_p0 -s ivf_pq_full_buffer_train_recall
    • Unit Test
    • Manual test
      • run buildall
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes. ANN indexes that require training now train once with full segment data. Segments with insufficient training rows skip ANN index build instead of failing.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

Copilot AI review requested due to automatic review settings June 3, 2026 12:01
@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@kaka11chen kaka11chen changed the title from "Ann build full buffer no spill" to "[fix](ann-index) Fix ivf recall zero and oom." on Jun 3, 2026
@kaka11chen
Contributor Author

run buildall

Contributor

Copilot AI left a comment

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@kaka11chen kaka11chen force-pushed the ann-build-full-buffer-no-spill branch from a66b5d7 to 582071f on June 3, 2026 13:21
@kaka11chen
Contributor Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29269 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 582071f0eeb96aa5a7754df2d0ef4ec12745e462, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17827	4092	4069	4069
q2	q3	10781	1462	831	831
q4	4692	477	345	345
q5	7544	876	599	599
q6	181	170	137	137
q7	773	873	627	627
q8	9412	1593	1572	1572
q9	5712	4555	4464	4464
q10	6760	1803	1534	1534
q11	440	277	253	253
q12	634	440	289	289
q13	18142	3384	2813	2813
q14	267	264	240	240
q15	q16	833	782	713	713
q17	1020	947	950	947
q18	6921	5860	5590	5590
q19	1362	1186	1034	1034
q20	518	413	253	253
q21	6177	2870	2637	2637
q22	472	376	322	322
Total cold run time: 100468 ms
Total hot run time: 29269 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5113	4833	4845	4833
q2	q3	4887	5277	4748	4748
q4	2119	2219	1416	1416
q5	4790	4861	4656	4656
q6	228	176	127	127
q7	1850	1827	1561	1561
q8	2405	2136	2159	2136
q9	7877	7608	7402	7402
q10	4720	4691	4229	4229
q11	534	386	352	352
q12	727	756	522	522
q13	3013	3363	2818	2818
q14	281	278	263	263
q15	q16	675	713	612	612
q17	1287	1259	1259	1259
q18	7369	6856	6835	6835
q19	1129	1122	1135	1122
q20	2225	2206	1942	1942
q21	5278	4622	4494	4494
q22	516	476	419	419
Total cold run time: 57023 ms
Total hot run time: 51746 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169468 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 582071f0eeb96aa5a7754df2d0ef4ec12745e462, data reload: false

query5	4322	636	479	479
query6	443	203	184	184
query7	4814	565	303	303
query8	378	220	208	208
query9	8783	4008	4018	4008
query10	475	331	259	259
query11	5927	2321	2229	2229
query12	158	103	100	100
query13	1264	625	455	455
query14	6423	5360	4995	4995
query14_1	4369	4344	4350	4344
query15	211	201	177	177
query16	1017	449	434	434
query17	921	679	576	576
query18	2452	468	342	342
query19	198	176	138	138
query20	110	111	104	104
query21	218	137	115	115
query22	13736	13525	13321	13321
query23	17626	16887	16554	16554
query23_1	16584	16352	16254	16254
query24	7673	1716	1296	1296
query24_1	1323	1310	1338	1310
query25	572	469	418	418
query26	1294	332	172	172
query27	2675	530	351	351
query28	4453	2033	2003	2003
query29	1095	641	508	508
query30	305	239	201	201
query31	1130	1092	957	957
query32	121	68	62	62
query33	546	322	274	274
query34	1188	1126	646	646
query35	764	789	686	686
query36	1379	1457	1231	1231
query37	157	107	93	93
query38	3192	3133	3056	3056
query39	920	921	890	890
query39_1	889	903	872	872
query40	223	158	101	101
query41	65	62	62	62
query42	98	96	93	93
query43	320	321	282	282
query44	
query45	196	187	178	178
query46	1094	1250	770	770
query47	2344	2422	2254	2254
query48	411	401	306	306
query49	628	468	359	359
query50	994	351	265	265
query51	4359	4337	4220	4220
query52	87	90	77	77
query53	237	271	199	199
query54	262	213	195	195
query55	80	77	69	69
query56	219	239	246	239
query57	1460	1395	1338	1338
query58	247	215	212	212
query59	1580	1620	1415	1415
query60	281	283	226	226
query61	160	160	155	155
query62	705	671	591	591
query63	224	181	186	181
query64	2559	767	628	628
query65	
query66	1821	469	348	348
query67	29740	29774	29591	29591
query68	
query69	430	295	263	263
query70	988	929	962	929
query71	305	225	251	225
query72	3008	2772	2419	2419
query73	843	740	465	465
query74	5129	4932	4768	4768
query75	2662	2589	2264	2264
query76	2329	1141	765	765
query77	365	371	294	294
query78	12462	12331	12014	12014
query79	1234	1056	767	767
query80	514	480	385	385
query81	445	282	241	241
query82	233	159	123	123
query83	275	278	248	248
query84	294	139	111	111
query85	833	531	434	434
query86	346	298	304	298
query87	3352	3345	3236	3236
query88	3608	2729	2721	2721
query89	414	384	328	328
query90	2190	183	179	179
query91	173	167	137	137
query92	63	61	54	54
query93	1481	1478	849	849
query94	549	356	322	322
query95	664	373	344	344
query96	1098	825	340	340
query97	2721	2699	2575	2575
query98	213	206	203	203
query99	1159	1175	1053	1053
Total cold run time: 251293 ms
Total hot run time: 169468 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 96.20% (76/79) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.91% (21032/39013)
Line Coverage 37.57% (199809/531817)
Region Coverage 33.67% (156804/465736)
Branch Coverage 34.63% (68582/198021)

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 96.20% (76/79) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.92% (21036/39013)
Line Coverage 37.61% (199993/531817)
Region Coverage 33.69% (156893/465736)
Branch Coverage 34.65% (68611/198021)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 96.25% (77/80) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.88% (27478/38229)
Line Coverage 55.48% (294351/530545)
Region Coverage 52.30% (245926/470246)
Branch Coverage 53.41% (106186/198814)

- Test: Regression test
    - ./run-regression-test.sh --run -d ann_index_p0 -s ivf_pq_full_buffer_train_recall
- Behavior changed: No
- Does this need documentation: No
@kaka11chen
Contributor Author

run buildall

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 96.20% (76/79) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.92% (21045/39030)
Line Coverage 37.57% (199993/532279)
Region Coverage 33.66% (156901/466109)
Branch Coverage 34.62% (68635/198252)

### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: Clarify why ANN index writer swaps the buffered vectors with an empty PODArray instead of using clear(). The swap intentionally releases the full-segment training buffer before saving the index, while clear() would keep the allocated capacity.

### Release note

None

### Check List (For Author)

- Test: No need to test (comment-only change)
- Behavior changed: No
- Does this need documentation: No
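The difference between `clear()` and swapping with an empty container, which the commit message above relies on, can be shown with `std::vector` as a stand-in for Doris's `PODArray` (an assumption: the two are different types, and only the capacity behavior is being illustrated). The helper names are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// clear() drops the elements but keeps the allocation:
// size becomes 0 while capacity stays at least n.
inline std::size_t capacity_after_clear(std::size_t n) {
    std::vector<float> buf(n, 0.0f);
    buf.clear();
    return buf.capacity();
}

// Swapping with an empty temporary hands the storage to the temporary,
// which frees it when it is destroyed at the end of the statement.
inline std::size_t capacity_after_swap(std::size_t n) {
    std::vector<float> buf(n, 0.0f);
    std::vector<float>().swap(buf);
    return buf.capacity();
}
```

On mainstream standard library implementations a default-constructed vector has zero capacity, so the swapped buffer ends up holding no memory, whereas the cleared one still holds the full-segment allocation while the index is being saved.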
@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 96.25% (77/80) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.93% (27512/38246)
Line Coverage 55.52% (294804/531007)
Region Coverage 52.27% (246069/470731)
Branch Coverage 53.42% (106351/199101)

### What problem does this PR solve?

Issue Number: None

Related PR: apache#64082

Problem Summary: Remove the redundant ANN writer `_skip_build` state. The flag was only set from `close_on_error()`, while normal index skip behavior is already driven by zero rows or by the segment row count being smaller than the index training requirement. Keeping the writer state explicit avoids carrying an abort flag into regular add and finish paths.

### Release note

None

### Check List (For Author)

- Test: Unit Test
    - `ENABLE_PCH=OFF ./run-be-ut.sh --run --filter=AnnIndexWriterTest.*`
- Behavior changed: No
- Does this need documentation: No
@hello-stephen
Contributor

TPC-H: Total hot run time: 29312 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 04d8a048174595050f3fb6792f07bf1a7aceee6b, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17632	4048	4021	4021
q2	q3	10781	1436	837	837
q4	4689	493	366	366
q5	7556	903	581	581
q6	185	177	140	140
q7	770	874	634	634
q8	9371	1659	1548	1548
q9	5898	4562	4546	4546
q10	6776	1815	1556	1556
q11	432	274	254	254
q12	620	437	294	294
q13	18193	3433	2764	2764
q14	276	265	242	242
q15	q16	834	775	717	717
q17	973	896	876	876
q18	6876	5875	5679	5679
q19	1319	1246	1091	1091
q20	520	403	263	263
q21	6325	2963	2591	2591
q22	476	379	312	312
Total cold run time: 100502 ms
Total hot run time: 29312 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5108	4828	4776	4776
q2	q3	4855	5355	4822	4822
q4	2113	2241	1419	1419
q5	4809	4819	4694	4694
q6	237	177	132	132
q7	1921	1817	1594	1594
q8	2418	2128	2096	2096
q9	7914	7449	7379	7379
q10	4777	4721	4263	4263
q11	534	395	359	359
q12	738	748	531	531
q13	3059	3380	2769	2769
q14	273	277	254	254
q15	q16	686	710	625	625
q17	1299	1267	1263	1263
q18	7380	6959	6861	6861
q19	1125	1133	1126	1126
q20	2240	2223	1950	1950
q21	5363	4627	4519	4519
q22	524	453	422	422
Total cold run time: 57373 ms
Total hot run time: 51854 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169349 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 04d8a048174595050f3fb6792f07bf1a7aceee6b, data reload: false

query5	4317	622	485	485
query6	450	201	183	183
query7	4865	567	308	308
query8	375	222	210	210
query9	8782	4092	4111	4092
query10	457	327	273	273
query11	5843	2368	2195	2195
query12	163	106	102	102
query13	1280	636	467	467
query14	6416	5407	5077	5077
query14_1	4438	4437	4365	4365
query15	205	194	177	177
query16	976	454	435	435
query17	912	699	563	563
query18	2494	488	343	343
query19	203	186	143	143
query20	108	111	104	104
query21	218	143	124	124
query22	13600	13557	13500	13500
query23	17287	16515	16245	16245
query23_1	16364	16332	16395	16332
query24	7577	1753	1324	1324
query24_1	1324	1345	1319	1319
query25	557	461	388	388
query26	1287	335	178	178
query27	2671	555	329	329
query28	4495	2049	2063	2049
query29	1066	619	480	480
query30	323	241	197	197
query31	1119	1084	946	946
query32	117	64	56	56
query33	535	319	258	258
query34	1209	1177	656	656
query35	753	799	674	674
query36	1379	1393	1249	1249
query37	160	104	92	92
query38	3216	3165	3057	3057
query39	943	925	892	892
query39_1	888	881	862	862
query40	223	125	103	103
query41	69	60	62	60
query42	95	93	93	93
query43	332	336	286	286
query44	
query45	193	190	178	178
query46	1093	1217	759	759
query47	2335	2393	2217	2217
query48	408	430	282	282
query49	634	463	370	370
query50	1025	353	263	263
query51	4350	4310	4318	4310
query52	90	90	80	80
query53	250	272	197	197
query54	272	218	199	199
query55	80	74	71	71
query56	238	227	219	219
query57	1447	1405	1316	1316
query58	258	223	216	216
query59	1628	1718	1454	1454
query60	285	254	232	232
query61	163	176	187	176
query62	703	690	581	581
query63	244	222	191	191
query64	2553	793	633	633
query65	
query66	1804	473	339	339
query67	29818	29748	29100	29100
query68	
query69	422	313	270	270
query70	984	973	965	965
query71	304	222	213	213
query72	2969	2718	2432	2432
query73	871	762	448	448
query74	5137	4991	4793	4793
query75	2716	2602	2278	2278
query76	2339	1173	796	796
query77	370	401	319	319
query78	12501	12381	11940	11940
query79	1274	1043	784	784
query80	547	504	425	425
query81	455	290	249	249
query82	242	164	126	126
query83	284	299	265	265
query84	298	158	124	124
query85	952	624	527	527
query86	336	323	295	295
query87	3382	3341	3219	3219
query88	3621	2813	2739	2739
query89	415	378	332	332
query90	2220	186	193	186
query91	176	164	136	136
query92	67	65	54	54
query93	1555	1464	845	845
query94	533	340	330	330
query95	712	490	352	352
query96	1069	837	349	349
query97	2707	2729	2569	2569
query98	209	210	204	204
query99	1174	1164	1037	1037
Total cold run time: 251246 ms
Total hot run time: 169349 ms

…added no-train indexes during segment writing. This made the build strategy harder to reason about and could still spend CPU/memory building small HNSW/FLAT segments that should be skipped by a Doris-side row threshold. This change removes the chunk add configs, buffers ANN vectors for the whole segment, applies effective_min_rows = max(vector_index->get_min_train_rows(), config::ann_index_build_min_segment_rows) in finish(), and then trains when needed, adds once, releases the build buffer, and saves the index. Empty segments or segments below the effective threshold delete only the current index entry instead of persisting an ANN index.

Add BE config ann_index_build_min_segment_rows to skip persisting ANN indexes for small segments. Remove ann_index_build_add_chunk_size and ann_index_build_add_chunk_bytes.
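The `finish()` flow described in this commit message can be sketched roughly as below. `ToyAnnWriter` and its members are hypothetical stand-ins: `min_train_rows` plays the role of `vector_index->get_min_train_rows()`, `min_segment_rows` plays the role of `config::ann_index_build_min_segment_rows`, and training, adding, and saving are reduced to flags and counters. It is not the actual Doris writer.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

enum class FinishResult { Skipped, Saved };

// Toy writer for illustration only.
struct ToyAnnWriter {
    std::int64_t dim;
    std::int64_t min_train_rows;    // stand-in for get_min_train_rows()
    std::int64_t min_segment_rows;  // stand-in for the BE config value
    bool needs_training;
    std::vector<float> buffer;      // whole-segment build buffer
    bool trained;
    std::int64_t rows_added;

    ToyAnnWriter(std::int64_t d, std::int64_t train_rows,
                 std::int64_t segment_rows, bool train)
            : dim(d), min_train_rows(train_rows), min_segment_rows(segment_rows),
              needs_training(train), trained(false), rows_added(0) {}

    // Buffer vectors for the whole segment; nothing is trained or added yet.
    void add(const std::vector<float>& vectors) {
        buffer.insert(buffer.end(), vectors.begin(), vectors.end());
    }

    FinishResult finish() {
        const std::int64_t rows = static_cast<std::int64_t>(buffer.size()) / dim;
        const std::int64_t effective_min_rows =
                std::max(min_train_rows, min_segment_rows);
        if (rows == 0 || rows < effective_min_rows) {
            std::vector<float>().swap(buffer);  // release memory, persist nothing
            return FinishResult::Skipped;
        }
        if (needs_training) {
            trained = true;  // real code would train once on the full buffer
        }
        rows_added = rows;                  // add all rows in a single call
        std::vector<float>().swap(buffer);  // release the buffer before saving
        return FinishResult::Saved;
    }
};
```

In this sketch a no-train index with 4 buffered rows and a row threshold of 10 is skipped, which is the case the Doris-side threshold is meant to cover.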
@kaka11chen
Contributor Author

run buildall

@kaka11chen kaka11chen changed the title from "[fix](ann-index) Fix ivf recall zero and oom." to "[fix](ann-index) Fix ANN IVF/PQ recall, avoid init-time large ANN build-buffer reservation, and skip ANN index build for segments with insufficient training rows." on Jun 4, 2026