[fix](filecache) Make file cache sync clear wait for async cleanup #63602

Open
freemandealer wants to merge 6 commits into apache:master from freemandealer:task-master-file-cache-clear-sync-wrapper

@freemandealer
Member

Problem Summary: The old sync file cache clear path bypassed the normal FileBlock lifecycle and was not safe with concurrent cache users, while the async clear path only marked or enqueued deletes and returned before held blocks, recycle-queue deletes, and file-cache meta deletes were finished. This change makes the factory sync clear path use a synchronous wrapper over the async clear semantics. The wrapper serializes clear operations, pauses the TTL manager during clear, marks existing blocks for deletion, drains recycled blocks, waits for held deleting blocks to be released, and waits for file-cache meta delete fences before reporting completion. The direct clear helper is kept only for BE tests. HTTP sync clear now runs through a cancellable async reply path so client disconnects can cancel waiting without adding a clear timeout config.
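
The wrapper steps described above can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in this PR: `ToyCache`, `clear_sync`, and the counter fields are hypothetical stand-ins for the real cache state, and the 10 ms poll interval is borrowed from the review discussion rather than from the diff.

```cpp
#include <atomic>
#include <chrono>
#include <functional>
#include <mutex>
#include <thread>

// Hypothetical stand-in for the file cache; not the real BlockFileCache API.
struct ToyCache {
    std::mutex clear_mutex;                    // serializes clear operations
    std::atomic<int> held_deleting_blocks{0};  // marked deleting, still held by readers
    std::atomic<int> recycle_queue_size{0};    // queued deletes not yet executed
    std::atomic<int> pending_meta_deletes{0};  // meta deletes whose fence is not complete
    std::atomic<bool> ttl_paused{false};

    // Stand-in for draining recycled blocks.
    void drain_recycle_queue() { recycle_queue_size.store(0); }

    bool fully_drained() const {
        return held_deleting_blocks.load() == 0 && recycle_queue_size.load() == 0 &&
               pending_meta_deletes.load() == 0;
    }
};

enum class ClearResult { kDone, kCancelled };

// Synchronous wrapper: wait until the async-style cleanup has actually
// finished, or until the caller cancels (for example on client disconnect).
// Marking existing blocks for deletion is omitted; the counters model its effect.
ClearResult clear_sync(ToyCache& cache, const std::function<bool()>& cancelled) {
    std::lock_guard<std::mutex> guard(cache.clear_mutex);
    cache.ttl_paused.store(true);
    ClearResult result = ClearResult::kDone;
    while (true) {
        cache.drain_recycle_queue();
        if (cache.fully_drained()) break;
        if (cancelled()) {
            result = ClearResult::kCancelled;
            break;
        }
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
    cache.ttl_paused.store(false);  // resume TTL management on every exit path
    return result;
}
```

Passing cancellation in as a callback keeps the wait loop free of any timeout setting, which matches the stated goal of not adding a clear timeout config.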

Release note

None

Check List (For Author)

  • Test

    • Regression test
    • Unit Test
    • Manual test (add detailed scripts or steps below)
    • No need to test or manual test. Explain why:
      • This is a refactor/code format and no logic has been changed.
      • Previous test can cover this change.
      • No code files have been changed.
      • Other reason
  • Behavior changed:

    • No.
    • Yes.
  • Does this need documentation?

    • No.
    • Yes.

Check List (For Reviewer who merge this PR)

  • Confirm the release note
  • Confirm test cases
  • Confirm document
  • Add branch pick label

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

Issue Number: N/A

Related PR: N/A

Problem Summary: The old sync file cache clear path bypassed the normal FileBlock lifecycle and was not safe with concurrent cache users, while the async clear path only marked or enqueued deletes and returned before held blocks, recycle-queue deletes, and file-cache meta deletes were finished. This change makes the factory sync clear path use a synchronous wrapper over the async clear semantics. The wrapper serializes clear operations, pauses the TTL manager during clear, marks existing blocks for deletion, drains recycled blocks, waits for held deleting blocks to be released, and waits for file-cache meta delete fences before reporting completion. The direct clear helper is kept only for BE tests. HTTP sync clear now runs through a cancellable async reply path so client disconnects can cancel waiting without adding a clear timeout config.

None

- Test:
    - Unit Test: `DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:HttpRequestAsyncReplyTest.*'`
    - Unit Test: `DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='FileCacheActionTest.*'`
    - Manual test: `git diff --check`
- Behavior changed: Yes. File cache sync clear now waits for the async cleanup path to finish instead of using the unsafe direct clear path.
- Does this need documentation: No
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
@freemandealer force-pushed the task-master-file-cache-clear-sync-wrapper branch from e169458 to ab16f05 on June 2, 2026 10:21
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found two blocking issues in the new synchronous file-cache clear path.

Critical checkpoint conclusions:

  • Goal/test: The PR aims to make sync file-cache clear wait for holder/recycle/meta cleanup and adds unit tests for held blocks, cancellation, meta fences, and recycle queue draining. The tests do not cover concurrent cache creation during a sync clear or BE thread memory tracking for the HTTP worker.
  • Scope/focus: The change is mostly focused on file-cache clear and HTTP cancellation, but the sync-clear semantics changed from an atomic in-memory/storage clear to a staged async clear plus wait loop.
  • Concurrency: This change is concurrency-sensitive. _clear_mutex serializes clear requests only; normal get_or_set paths use _mutex and can create blocks while clear_file_cache_sync() waits. The new detached HTTP thread also introduces a new execution context.
  • Lifecycle/static initialization: No new static initialization issue found. The detached thread relies on HttpRequest lifetime being held by wait_finish_send_reply(), but the thread context itself still needs BE runtime initialization.
  • Config/incompatible changes/protocol: No new config, storage format, or FE-BE protocol compatibility issue found.
  • Parallel paths: Async clear, sync clear, recycle GC, and direct test-only clear were checked. The sync path now differs materially from old production direct clear under concurrent readers.
  • Special conditions: Cancellation is checked in the wait loop, but not enough to address the two issues below.
  • Test coverage/results: Added tests cover several serial and holder scenarios, but miss concurrent cache insertion during sync clear and memory-context requirements for the new thread.
  • Observability: New result logging is useful; no additional observability blocker found.
  • Transaction/persistence/data writes: File-cache metadata deletion fences were considered; no separate persistence blocker found beyond the concurrent clear semantics.
  • Memory safety: The detached std::thread enters BE work without SCOPED_INIT_THREAD_CONTEXT, leaving allocations in that worker outside the expected memory-tracking context.
  • Performance: The wait loop is bounded by 10ms sleeps and drains recycle work; no additional performance blocker found.
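
The memory-safety point can be illustrated with a small scoped guard. The sketch below is not the Doris `SCOPED_INIT_THREAD_CONTEXT` macro; `ThreadContext`, `ScopedThreadContext`, and `track_allocation` are hypothetical names that only show the general pattern of installing a per-thread context before a worker does any tracked work.

```cpp
#include <cstddef>

// Hypothetical per-thread accounting context; a stand-in for whatever the
// real macro installs.
struct ThreadContext {
    std::size_t tracked_bytes = 0;
};

// Per-thread slot holding the currently installed context, if any.
ThreadContext*& current_context() {
    thread_local ThreadContext* ctx = nullptr;
    return ctx;
}

// RAII guard: installs a context for the current thread and restores the
// previous one when the scope ends.
class ScopedThreadContext {
public:
    ScopedThreadContext() : previous_(current_context()) { current_context() = &own_; }
    ~ScopedThreadContext() { current_context() = previous_; }
    ScopedThreadContext(const ScopedThreadContext&) = delete;
    ScopedThreadContext& operator=(const ScopedThreadContext&) = delete;

private:
    ThreadContext own_;
    ThreadContext* previous_;
};

// Records an allocation against the current thread's context. Returns false
// when no context is installed, which is the untracked case described above.
bool track_allocation(std::size_t bytes) {
    ThreadContext* ctx = current_context();
    if (ctx == nullptr) return false;
    ctx->tracked_bytes += bytes;
    return true;
}
```

A freshly spawned thread starts with an empty slot, so in this model the guard has to be the first statement in the thread body for the worker's allocations to be attributed.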

User focus: no additional user-provided review focus was specified.

Comment thread be/src/io/cache/block_file_cache.cpp
Comment thread be/src/service/http/action/file_cache_action.cpp
### What problem does this PR solve?

Issue Number: N/A

Problem Summary: Sync file cache clear could finish while normal readers created new cache blocks during the clear window, and the HTTP sync clear worker did not initialize a BE thread context. This change makes cache misses and holes return transient SKIP_CACHE blocks while sync clear is draining, waits for async cache loading before the sync clear barrier, and initializes thread context in the detached HTTP worker.

### Release note

None

### Check List (For Author)

- Test:

    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:HttpRequestAsyncReplyTest.*'

    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='FileCacheActionTest.*'

    - Manual test: git diff --check

    - Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh

- Behavior changed: Yes. During sync file cache clear, cache misses skip caching until the clear barrier finishes.

- Does this need documentation: No
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

Review conclusion: request changes. The current sync-clear design fixes the previously raised new-block insertion window and now initializes the detached HTTP worker thread context, but it introduces a separate liveness failure when cache async loading terminates with an error.

Critical checkpoint conclusions:

  • Goal and proof: The PR aims to make sync file-cache clear wait for async cleanup. The normal success path is covered by new BE unit tests, but async-load failure before clear is not covered and can hang.
  • Scope and clarity: The change is mostly focused on file-cache clear and HTTP async reply handling.
  • Concurrency: The new clear path uses _clear_mutex, _mutex, _recycle_keys_mutex, and atomic cancellation. Lock ordering is mostly consistent, but the pre-barrier wait on _async_open_done has no failure terminal state and can block forever while serializing all clears.
  • Lifecycle: Background cache loading can exit through the catch path without setting _async_open_done; sync clear now depends on that flag for progress.
  • Configuration: No new config items.
  • Compatibility: No storage/protocol compatibility issue found.
  • Parallel paths: Async clear and BE-test direct clear remain separate; the sync path now depends on async-load completion.
  • Special checks: Cancellation is handled for HTTP disconnects, but a normal waiting client or direct caller has no timeout/failure exit when async loading failed.
  • Test coverage: New unit tests cover normal wait, cancellation, recycle queue draining, meta-store fence, and skip-cache during clear. Missing negative coverage for async-load failure.
  • Observability: Existing logs show async-load exceptions, but sync clear does not surface that state to the caller.
  • Transaction/persistence/data-write concerns: Not applicable beyond file-cache metadata cleanup; meta delete fences are waited on in the normal path.
  • Performance: No blocking performance issue found beyond the liveness problem.
  • User focus: No additional user-provided review focus was supplied.
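
To make the liveness concern concrete, here is a hedged sketch of a loader that always records a terminal state, so a waiter can tell "failed" apart from "still running". It is not how `_async_open_done` is implemented in Doris; `LoadState`, `AsyncLoadStatus`, `run_loader`, and `wait_for_loader` are hypothetical names introduced only for illustration.

```cpp
#include <atomic>
#include <chrono>
#include <exception>
#include <functional>
#include <stdexcept>
#include <thread>

enum class LoadState { kRunning, kSucceeded, kFailed };

struct AsyncLoadStatus {
    std::atomic<LoadState> state{LoadState::kRunning};
};

// Runs the loader body and records a terminal state on both the normal path
// and the exception path.
void run_loader(AsyncLoadStatus& status, const std::function<void()>& body) {
    try {
        body();
        status.state.store(LoadState::kSucceeded);
    } catch (const std::exception&) {
        status.state.store(LoadState::kFailed);
    }
}

// Waits until the loader reaches a terminal state or the caller cancels.
// Returning kRunning means the wait was cancelled before the loader finished.
LoadState wait_for_loader(const AsyncLoadStatus& status,
                          const std::function<bool()>& cancelled) {
    while (true) {
        const LoadState s = status.state.load();
        if (s != LoadState::kRunning) return s;
        if (cancelled()) return LoadState::kRunning;
        std::this_thread::sleep_for(std::chrono::milliseconds(10));
    }
}
```

With a three-valued state, a sync clear that sees `kFailed` could report the error to its caller instead of waiting; whether that case can occur in practice is a separate question for the real loader.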

Comment thread be/src/io/cache/block_file_cache.cpp
@freemandealer dismissed github-actions[bot]’s stale review on June 2, 2026 15:44

This will not happen, as _async_open_done will always succeed.

@hello-stephen
Contributor

TPC-H: Total hot run time: 29443 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit ab16f05e675116f57e14e654cc9196e664b1ee45, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17584	4064	4047	4047
q2	q3	10754	1457	833	833
q4	4691	480	354	354
q5	7550	897	602	602
q6	181	174	137	137
q7	763	861	642	642
q8	9370	1587	1558	1558
q9	5909	4544	4479	4479
q10	6794	1813	1535	1535
q11	443	273	259	259
q12	637	433	292	292
q13	18148	3443	2813	2813
q14	263	265	242	242
q15	q16	833	772	712	712
q17	946	915	932	915
q18	6954	5814	5510	5510
q19	2474	1394	1161	1161
q20	565	417	258	258
q21	6376	2804	2770	2770
q22	470	374	324	324
Total cold run time: 101705 ms
Total hot run time: 29443 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5107	4777	4837	4777
q2	q3	4829	5353	4646	4646
q4	2135	2212	1390	1390
q5	4842	4894	4764	4764
q6	235	185	128	128
q7	1894	1783	1622	1622
q8	2479	2111	2095	2095
q9	8074	7861	7454	7454
q10	4717	4663	4245	4245
q11	540	411	371	371
q12	740	734	521	521
q13	3073	3504	2767	2767
q14	266	293	251	251
q15	q16	679	703	610	610
q17	1279	1263	1258	1258
q18	7333	6863	6753	6753
q19	1129	1074	1097	1074
q20	2219	2226	1944	1944
q21	5312	4610	4533	4533
q22	527	459	400	400
Total cold run time: 57409 ms
Total hot run time: 51603 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 29287 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 7bc50781b41f13b104f5b496747d135b30b0a50a, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17752	4117	4189	4117
q2	q3	10836	1399	820	820
q4	4691	485	343	343
q5	7546	890	587	587
q6	183	177	139	139
q7	767	845	661	661
q8	9398	1560	1711	1560
q9	5962	4505	4536	4505
q10	6697	1841	1528	1528
q11	448	280	253	253
q12	628	428	295	295
q13	18174	3362	2763	2763
q14	270	266	246	246
q15	q16	820	780	714	714
q17	968	902	1007	902
q18	7026	5708	5527	5527
q19	1301	1282	1054	1054
q20	531	407	279	279
q21	6426	2840	2664	2664
q22	456	373	330	330
Total cold run time: 100880 ms
Total hot run time: 29287 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5051	4927	4838	4838
q2	q3	4859	5232	4721	4721
q4	2166	2183	1396	1396
q5	4869	4869	4704	4704
q6	235	188	137	137
q7	1858	1725	1630	1630
q8	2421	2172	2099	2099
q9	8029	7930	7468	7468
q10	4780	4699	4260	4260
q11	530	388	360	360
q12	730	740	524	524
q13	3054	3398	2820	2820
q14	274	280	258	258
q15	q16	687	696	612	612
q17	1295	1265	1269	1265
q18	7264	6998	6979	6979
q19	1175	1092	1126	1092
q20	2217	2201	1957	1957
q21	5296	4597	4456	4456
q22	515	481	437	437
Total cold run time: 57305 ms
Total hot run time: 52013 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169529 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit ab16f05e675116f57e14e654cc9196e664b1ee45, data reload: false

query5	4326	636	473	473
query6	471	205	185	185
query7	4821	544	306	306
query8	379	220	199	199
query9	8747	4063	4057	4057
query10	435	328	260	260
query11	5967	2352	2222	2222
query12	158	103	98	98
query13	1298	638	425	425
query14	7011	5438	5072	5072
query14_1	4415	4441	4413	4413
query15	206	204	179	179
query16	1012	460	450	450
query17	1157	708	600	600
query18	2490	486	349	349
query19	205	189	152	152
query20	117	113	107	107
query21	216	139	120	120
query22	13564	13566	13383	13383
query23	17177	16514	16244	16244
query23_1	16265	16132	16247	16132
query24	7616	1805	1293	1293
query24_1	1321	1326	1340	1326
query25	568	483	405	405
query26	1303	309	181	181
query27	2677	564	349	349
query28	4509	2049	2018	2018
query29	1089	652	501	501
query30	309	238	205	205
query31	1169	1089	951	951
query32	104	65	63	63
query33	523	333	309	309
query34	1159	1148	646	646
query35	745	795	683	683
query36	1433	1393	1187	1187
query37	154	105	90	90
query38	3208	3138	3054	3054
query39	934	931	911	911
query39_1	867	870	872	870
query40	214	123	101	101
query41	64	63	60	60
query42	96	94	93	93
query43	315	318	278	278
query44	
query45	198	186	177	177
query46	1130	1203	757	757
query47	2388	2375	2278	2278
query48	413	436	300	300
query49	630	477	362	362
query50	1051	359	266	266
query51	4346	4366	4248	4248
query52	88	89	79	79
query53	239	272	195	195
query54	269	222	192	192
query55	81	74	68	68
query56	231	229	215	215
query57	1432	1401	1331	1331
query58	258	220	213	213
query59	1610	1699	1467	1467
query60	281	242	228	228
query61	156	163	156	156
query62	706	660	581	581
query63	232	183	187	183
query64	2560	793	614	614
query65	
query66	1792	469	337	337
query67	29751	29761	29638	29638
query68	
query69	461	305	268	268
query70	956	958	944	944
query71	297	222	207	207
query72	3045	2754	2418	2418
query73	855	790	463	463
query74	5160	4961	4766	4766
query75	2675	2599	2260	2260
query76	2318	1132	762	762
query77	356	377	291	291
query78	12399	12382	12014	12014
query79	1324	1106	766	766
query80	546	472	395	395
query81	451	284	255	255
query82	244	165	121	121
query83	352	281	249	249
query84	261	142	114	114
query85	856	543	442	442
query86	348	310	281	281
query87	3362	3343	3185	3185
query88	3603	2727	2704	2704
query89	416	391	329	329
query90	2150	187	176	176
query91	211	167	143	143
query92	63	62	59	59
query93	1406	1406	833	833
query94	525	359	299	299
query95	687	412	438	412
query96	1126	798	338	338
query97	2680	2706	2565	2565
query98	213	207	212	207
query99	1173	1160	1030	1030
Total cold run time: 251645 ms
Total hot run time: 169529 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170473 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 7bc50781b41f13b104f5b496747d135b30b0a50a, data reload: false

query5	4356	630	476	476
query6	451	201	182	182
query7	4817	540	285	285
query8	367	216	205	205
query9	8783	4116	4116	4116
query10	472	315	269	269
query11	5938	2389	2256	2256
query12	167	104	98	98
query13	1270	607	426	426
query14	6396	5422	5045	5045
query14_1	4447	4388	4432	4388
query15	208	201	176	176
query16	1023	475	458	458
query17	1132	735	596	596
query18	2563	485	379	379
query19	205	183	135	135
query20	125	106	104	104
query21	216	147	118	118
query22	13767	13548	13585	13548
query23	17279	16521	16104	16104
query23_1	16282	16414	16635	16414
query24	8234	1800	1305	1305
query24_1	1294	1323	1286	1286
query25	555	455	383	383
query26	1292	315	173	173
query27	2674	535	350	350
query28	4422	2048	2045	2045
query29	1057	612	486	486
query30	320	235	194	194
query31	1132	1083	969	969
query32	127	60	58	58
query33	509	314	247	247
query34	1161	1137	690	690
query35	763	780	685	685
query36	1358	1476	1259	1259
query37	160	104	98	98
query38	3215	3137	3076	3076
query39	944	911	899	899
query39_1	885	866	901	866
query40	234	122	102	102
query41	70	62	66	62
query42	97	93	94	93
query43	320	325	284	284
query44	
query45	197	185	181	181
query46	1115	1234	733	733
query47	2344	2376	2241	2241
query48	401	406	309	309
query49	621	469	363	363
query50	989	342	271	271
query51	4304	4282	4257	4257
query52	88	88	78	78
query53	234	276	197	197
query54	269	229	203	203
query55	80	80	74	74
query56	246	246	232	232
query57	1461	1399	1304	1304
query58	281	225	230	225
query59	1595	1708	1458	1458
query60	308	258	255	255
query61	183	176	181	176
query62	700	661	594	594
query63	235	192	190	190
query64	2580	845	678	678
query65	
query66	1798	495	352	352
query67	29798	29725	29542	29542
query68	
query69	441	354	265	265
query70	955	911	958	911
query71	287	224	209	209
query72	2974	2896	2437	2437
query73	819	793	411	411
query74	5155	4918	4807	4807
query75	2655	2594	2235	2235
query76	2314	1149	803	803
query77	359	375	295	295
query78	12451	12398	11980	11980
query79	1486	1090	793	793
query80	589	476	392	392
query81	457	276	249	249
query82	801	161	123	123
query83	356	293	253	253
query84	258	141	109	109
query85	925	561	433	433
query86	374	304	274	274
query87	3389	3343	3247	3247
query88	3647	2736	2740	2736
query89	427	394	334	334
query90	1982	186	176	176
query91	179	168	142	142
query92	68	63	57	57
query93	1532	1599	854	854
query94	564	362	310	310
query95	675	390	344	344
query96	1028	784	355	355
query97	2720	2710	2578	2578
query98	215	206	204	204
query99	1189	1210	1032	1032
Total cold run time: 252417 ms
Total hot run time: 170473 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 72.66% (311/428) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.21% (21172/39057)
Line Coverage 37.73% (201152/533066)
Region Coverage 33.84% (158289/467749)
Branch Coverage 34.81% (69120/198539)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.27% (335/428) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 74.00% (28300/38245)
Line Coverage 57.95% (308077/531618)
Region Coverage 54.82% (258830/472122)
Branch Coverage 56.21% (112021/199277)

@freemandealer
Member Author

run nonConcurrent

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 78.27% (335/428) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.99% (28298/38245)
Line Coverage 57.95% (308057/531618)
Region Coverage 54.82% (258823/472122)
Branch Coverage 56.21% (112015/199277)

### What problem does this PR solve?

Issue Number: N/A

Problem Summary: The sync clear drain path needs to wait for file cache meta delete fences, but normal background GC should keep the original async metadata delete behavior. This change makes recycle-key removal wait for the meta delete fence only when called from clear_file_cache_sync(), while run_background_gc() keeps using the normal async remove path. It also adds a non-clear regression test for remove_if_cached_async() with a held deleting block.
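
The split between a fenced and an unfenced remove can be sketched as below. This is a single-threaded toy under stated assumptions: `ToyMetaStore` and `remove_recycle_key` are hypothetical, the "fence" is modeled as a sequence number, and because the toy has no worker thread the waiting caller drives the pending queue itself.

```cpp
#include <cstdint>
#include <deque>
#include <set>
#include <string>

// Toy metadata store: deletes are queued and applied later, the way an async
// delete path would behave. All names here are hypothetical.
class ToyMetaStore {
public:
    void put(const std::string& key) { keys_.insert(key); }
    bool contains(const std::string& key) const { return keys_.count(key) != 0; }

    // Queues a delete and returns its sequence number, used as the fence.
    std::uint64_t delete_async(const std::string& key) {
        pending_.push_back(key);
        return ++enqueued_;
    }

    // Applies one queued delete in FIFO order; a background worker would
    // normally call this. Returns false when nothing is pending.
    bool apply_one() {
        if (pending_.empty()) return false;
        keys_.erase(pending_.front());
        pending_.pop_front();
        ++applied_;
        return true;
    }

    bool fence_reached(std::uint64_t seq) const { return applied_ >= seq; }

private:
    std::set<std::string> keys_;
    std::deque<std::string> pending_;
    std::uint64_t enqueued_ = 0;
    std::uint64_t applied_ = 0;
};

// wait_for_fence=true models the sync clear caller; false models background GC,
// which returns as soon as the delete is queued.
void remove_recycle_key(ToyMetaStore& meta, const std::string& key, bool wait_for_fence) {
    const std::uint64_t fence = meta.delete_async(key);
    if (!wait_for_fence) return;
    // No worker thread in this toy, so the waiter drains the queue itself.
    while (!meta.fence_reached(fence) && meta.apply_one()) {
    }
}
```

Because deletes are applied in order, waiting on a later fence also covers every earlier queued delete.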

### Release note

None

### Check List (For Author)

- Test:

    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'

    - Manual test: git diff --check

    - Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh

- Behavior changed: No. Normal background GC keeps async metadata deletion; sync file cache clear still waits for drained recycle-key metadata fences.

- Does this need documentation: No
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I found one additional blocking issue beyond the already-open sync-clear threads.

Critical checkpoint conclusions:

  • Goal/test evidence: The PR aims to make sync file-cache clear wait for outstanding deletes while avoiding newly cached blocks. The added tests cover new misses during clear, but not new readers hitting old blocks that were already marked deleting.
  • Scope/focus: The implementation is focused on file-cache clear behavior, but the clear barrier is incomplete because existing deleting blocks remain visible to new get_or_set callers.
  • Concurrency/lifecycle: This path is explicitly concurrent with normal cache readers. Existing review threads already cover the async-open terminal state, the original new-insertion race, and the detached-thread memory context; I did not duplicate those. The new issue below is a distinct reader-lifecycle problem for old deleting blocks.
  • Configuration/compatibility: No new config item or serialization compatibility concern found in the current PR files.
  • Parallel paths: Background GC keeps async metadata deletion while sync clear waits for fences; that split is intentional and covered.
  • Test coverage/results: Missing coverage for readers that arrive after the clear barrier and hit a block already marked deleting.
  • Observability/performance: Logs/metrics are adequate for the modified paths, but sync clear can be starved by hot readers.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/io/cache/block_file_cache.cpp Outdated
### What problem does this PR solve?

Issue Number: N/A

Problem Summary: New readers could still get existing file cache blocks after the sync clear barrier opened, including blocks that clear_file_cache_sync was trying to drain. Continuous traffic to a hot held block could therefore keep extending the clear wait. This change simplifies the barrier behavior: while sync clear is running, get_or_set() returns transient SKIP_CACHE blocks for the whole requested range and does not read or insert _files entries. Existing pre-barrier holders are still waited on, but post-barrier readers cannot pin blocks being cleared.
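
A minimal sketch of this barrier behavior, assuming a much simpler cache than the real one: `ToyBlockCache`, `Block`, and `BlockState` are hypothetical names, and only the "bypass while clearing" rule is modeled.

```cpp
#include <cstddef>
#include <map>
#include <mutex>
#include <string>
#include <utility>

enum class BlockState { kEmpty, kSkipCache };

struct Block {
    std::size_t offset;
    std::size_t size;
    BlockState state;
};

// Hypothetical toy cache keyed by (file key, offset).
class ToyBlockCache {
public:
    Block get_or_set(const std::string& key, std::size_t offset, std::size_t size) {
        std::lock_guard<std::mutex> lock(mutex_);
        if (sync_clear_running_) {
            // Transient block: the entry map is neither read nor modified, so
            // the caller cannot pin or create anything the clear must drain.
            return Block{offset, size, BlockState::kSkipCache};
        }
        auto result = files_.emplace(std::make_pair(key, offset),
                                     Block{offset, size, BlockState::kEmpty});
        return result.first->second;
    }

    void set_sync_clear_running(bool running) {
        std::lock_guard<std::mutex> lock(mutex_);
        sync_clear_running_ = running;
    }

    std::size_t cached_entries() {
        std::lock_guard<std::mutex> lock(mutex_);
        return files_.size();
    }

private:
    std::mutex mutex_;
    bool sync_clear_running_ = false;
    std::map<std::pair<std::string, std::size_t>, Block> files_;
};
```

Checking the flag under the same mutex that guards the entry map is what closes the window in this toy: a reader either sees the flag or finishes its insert before the flag is set.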

### Release note

None

### Check List (For Author)

- Test:

    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'

    - Manual test: git diff --check

    - Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh

- Behavior changed: Yes. While sync file cache clear is running, new get_or_set() calls bypass the cache and return transient SKIP_CACHE blocks.

- Does this need documentation: No
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

PR goal: make synchronous file-cache clear use the normal async cleanup lifecycle and wait until the cache data/meta cleanup is complete. The main path is focused and has useful BE unit coverage for held blocks, cancellation, meta fences, recycle draining, and skip-cache behavior during the barrier.

Critical checkpoint conclusions:

  • Goal/test: mostly covered, but one concurrent path still lets sync clear return before metadata cleanup, so the goal is not fully achieved.
  • Scope: changes are generally focused on file-cache clear and HTTP cancellation.
  • Concurrency: clear/read insertion is guarded by _mutex and _clear_mutex, but recycle-key draining races with background GC as noted inline.
  • Lifecycle/static initialization: no new cross-TU static lifecycle issue found. Detached HTTP worker now initializes thread context.
  • Config/compatibility: no new config, storage format, or FE/BE protocol compatibility issue found.
  • Parallel paths: background GC is a parallel recycle-key consumer and must preserve sync clear's fence guarantee.
  • Tests: added tests cover the direct sync clear drain path, but not the race where background GC dequeues during sync clear.
  • Observability/performance: existing logs/metrics remain sufficient for this path; no additional blocking performance issue found beyond the race.
  • Transaction/persistence/data writes: file-cache metadata persistence is affected; sync clear can report completion before RocksDB meta delete if background GC consumes the key.

User focus: no additional user-provided review focus was specified.

Comment thread be/src/io/cache/block_file_cache.cpp
### What problem does this PR solve?

Issue Number: N/A

Problem Summary: Background GC shared the same file-cache recycle queue with sync clear and could consume a key while clear_file_cache_sync was running. Because background GC intentionally uses async metadata delete, sync clear could observe the recycle queue as idle and return before the meta delete fence completed. This change keeps recycle keys in the queue for the sync clear drain path while sync clear is active, so clear_file_cache_sync removes them with metadata delete fences. Normal background GC keeps the async remove behavior outside the sync clear window.
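
The queue-ownership rule can be sketched like this. `RecycleQueue` and its members are hypothetical; the only thing carried over from the discussion is the idea that a dequeue from the background GC caller returns false while a sync clear is active.

```cpp
#include <atomic>
#include <cstddef>
#include <deque>
#include <mutex>
#include <string>
#include <utility>

// Hypothetical recycle queue shared by background GC and sync clear.
struct RecycleQueue {
    std::mutex mutex;
    std::deque<std::string> keys;
    std::atomic<bool> sync_clear_running{false};

    void push(std::string key) {
        std::lock_guard<std::mutex> lock(mutex);
        keys.push_back(std::move(key));
    }

    std::size_t size() {
        std::lock_guard<std::mutex> lock(mutex);
        return keys.size();
    }

    // from_sync_clear=false models the background GC caller. While a sync
    // clear is running, GC leaves keys in place so that the sync clear drain
    // path removes them together with their metadata delete fences.
    bool try_dequeue(std::string& out, bool from_sync_clear) {
        std::lock_guard<std::mutex> lock(mutex);
        if (!from_sync_clear && sync_clear_running.load()) return false;
        if (keys.empty()) return false;
        out = std::move(keys.front());
        keys.pop_front();
        return true;
    }
};
```

Leaving the key in the queue, rather than letting GC take it and then notifying the clear, means the clear's "queue is empty" check stays a valid completion signal.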

### Release note

None

### Check List (For Author)

- Test:

    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'

    - Manual test: git diff --check

    - Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh

- Behavior changed: Yes. While sync file cache clear is running, background GC leaves recycle keys for the sync clear drain path.

- Does this need documentation: No
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

Contributor

@github-actions github-actions Bot left a comment

I reviewed the live PR files reported by GitHub for PR 63602 and did not find an additional distinct inline issue beyond the already-open review threads.

Critical checkpoint conclusions:

  • Goal/test proof: the goal is to make synchronous file-cache clear wait for held blocks/recycle cleanup while supporting request cancellation. The added BE tests cover held deleting blocks, skip-cache during clear, recycle draining, metadata fences, and HTTP cancellation.
  • Scope/focus: the live GitHub PR is focused on BE file-cache/http changes. Existing review threads already cover the main remaining sync-clear risks and should be resolved before merge.
  • Concurrency: the latest head prevents background GC from stealing recycle keys during sync clear by making try_dequeue_recycle_key(..., false) return false while _clear_file_cache_sync_running is active. Existing threads still cover other concurrency/lifecycle concerns and should not be duplicated here.
  • Lifecycle: the detached sync-clear HTTP worker now uses SCOPED_INIT_THREAD_CONTEXT; request cancellation waits through HttpRequest::wait_finish_send_reply(). The existing async-open terminal-state thread remains known review context.
  • Configuration/compatibility: no new configuration or storage-format compatibility issue found in the live PR files.
  • Parallel paths: clear paths for all cache instances go through clear_file_cache_sync() or async clear consistently.
  • Tests: BE unit tests were added for the important sync-clear lifecycle cases. I did not run tests in this review runner.
  • Observability/performance: no additional observability or hot-path performance blocker found beyond the existing comments.
  • Transaction/persistence/data correctness: the new metadata delete fence waits for RocksDB delete completion on the sync-clear drain path; the already-open thread covers ensuring all relevant async meta deletes participate in that contract.

User focus: no additional user-provided review focus was specified.

@hello-stephen
Contributor

TPC-H: Total hot run time: 29403 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit bc1603ed999bac8c4ef4f877780b84a8e1b931e5, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17659	3985	4045	3985
q2	q3	10760	1414	801	801
q4	4694	482	349	349
q5	7609	879	573	573
q6	185	176	138	138
q7	785	857	631	631
q8	9565	1615	1639	1615
q9	5925	4577	4573	4573
q10	6757	1826	1533	1533
q11	449	269	247	247
q12	629	425	286	286
q13	18214	3419	2786	2786
q14	261	258	242	242
q15	q16	821	767	706	706
q17	998	984	980	980
q18	7021	5737	5571	5571
q19	1351	1290	1083	1083
q20	519	427	254	254
q21	6375	2875	2738	2738
q22	464	369	312	312
Total cold run time: 101041 ms
Total hot run time: 29403 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5059	4755	4664	4664
q2	q3	4865	5278	4767	4767
q4	2190	2191	1367	1367
q5	4823	4892	4678	4678
q6	230	171	124	124
q7	1824	1797	1589	1589
q8	2419	2151	2061	2061
q9	7939	7626	7437	7437
q10	4726	4648	4274	4274
q11	534	381	355	355
q12	739	741	524	524
q13	3032	3350	2858	2858
q14	266	273	259	259
q15	q16	669	702	611	611
q17	1278	1247	1239	1239
q18	7418	6819	6980	6819
q19	1109	1069	1113	1069
q20	2221	2209	1925	1925
q21	5297	4555	4367	4367
q22	513	462	413	413
Total cold run time: 57151 ms
Total hot run time: 51400 ms

@hello-stephen
Contributor

TPC-H: Total hot run time: 29459 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 8cd1dc407ff6aa83827d4d50c354e81951660656, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17620	4044	4016	4016
q2	q3	10744	1448	827	827
q4	4691	483	347	347
q5	7546	871	590	590
q6	192	180	142	142
q7	775	909	655	655
q8	9354	1621	1628	1621
q9	5875	4561	4512	4512
q10	6779	1889	1557	1557
q11	429	276	252	252
q12	627	438	296	296
q13	18128	3423	2762	2762
q14	277	265	252	252
q15	q16	824	791	710	710
q17	1004	917	954	917
q18	6808	5779	5623	5623
q19	1172	1281	1116	1116
q20	529	417	279	279
q21	6035	2872	2659	2659
q22	476	375	326	326
Total cold run time: 99885 ms
Total hot run time: 29459 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5109	4822	4856	4822
q2	q3	4909	5309	4712	4712
q4	2147	2232	1628	1628
q5	4758	4909	4687	4687
q6	231	180	134	134
q7	1829	1993	1582	1582
q8	2395	2184	2071	2071
q9	7910	7632	7397	7397
q10	4757	4658	4194	4194
q11	529	384	361	361
q12	735	747	530	530
q13	3003	3359	2829	2829
q14	280	283	258	258
q15	q16	674	703	619	619
q17	1293	1258	1253	1253
q18	7206	6990	6802	6802
q19	1087	1081	1109	1081
q20	2219	2203	1923	1923
q21	5301	4658	4543	4543
q22	528	449	408	408
Total cold run time: 56900 ms
Total hot run time: 51834 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168902 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit bc1603ed999bac8c4ef4f877780b84a8e1b931e5, data reload: false

query5	4327	628	494	494
query6	487	200	190	190
query7	4934	552	300	300
query8	393	215	205	205
query9	8752	4019	4004	4004
query10	435	318	258	258
query11	5905	2377	2166	2166
query12	164	101	101	101
query13	1312	602	424	424
query14	6393	5425	5044	5044
query14_1	4429	4358	4497	4358
query15	209	199	179	179
query16	1039	445	409	409
query17	978	695	568	568
query18	2446	474	343	343
query19	210	179	138	138
query20	116	107	106	106
query21	232	141	122	122
query22	13599	13793	13447	13447
query23	17319	16568	16165	16165
query23_1	16341	16340	16346	16340
query24	7556	1792	1306	1306
query24_1	1299	1335	1289	1289
query25	568	487	411	411
query26	1351	330	170	170
query27	2655	538	335	335
query28	4489	2049	2015	2015
query29	1105	603	506	506
query30	309	245	203	203
query31	1134	1083	972	972
query32	111	64	59	59
query33	530	329	270	270
query34	1197	1191	673	673
query35	759	794	677	677
query36	1379	1405	1277	1277
query37	162	105	93	93
query38	3229	3168	3027	3027
query39	954	928	912	912
query39_1	900	878	870	870
query40	232	124	106	106
query41	73	68	69	68
query42	100	98	98	98
query43	322	328	284	284
query44	
query45	204	191	187	187
query46	1108	1186	734	734
query47	2419	2351	2251	2251
query48	414	432	297	297
query49	648	516	364	364
query50	945	371	260	260
query51	4314	4326	4336	4326
query52	87	88	85	85
query53	238	271	190	190
query54	289	223	203	203
query55	83	77	71	71
query56	251	232	222	222
query57	1427	1392	1327	1327
query58	234	210	209	209
query59	1570	1648	1409	1409
query60	273	242	225	225
query61	154	154	149	149
query62	707	665	590	590
query63	234	183	185	183
query64	2587	780	626	626
query65	
query66	1789	463	346	346
query67	29697	29707	29466	29466
query68	
query69	472	313	263	263
query70	992	950	883	883
query71	289	218	199	199
query72	2922	2678	2365	2365
query73	840	740	425	425
query74	5175	4968	4766	4766
query75	2690	2580	2222	2222
query76	2302	1120	770	770
query77	354	381	316	316
query78	12416	12488	11873	11873
query79	1307	1067	742	742
query80	532	459	390	390
query81	450	288	245	245
query82	251	156	121	121
query83	358	267	248	248
query84	334	150	117	117
query85	895	526	434	434
query86	410	300	293	293
query87	3390	3295	3182	3182
query88	3619	2732	2724	2724
query89	420	377	329	329
query90	1955	179	175	175
query91	176	163	134	134
query92	62	61	58	58
query93	1445	1427	853	853
query94	567	359	317	317
query95	670	469	362	362
query96	1118	773	348	348
query97	2677	2726	2571	2571
query98	218	205	203	203
query99	1154	1170	1014	1014
Total cold run time: 251109 ms
Total hot run time: 168902 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169292 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 8cd1dc407ff6aa83827d4d50c354e81951660656, data reload: false

query5	4322	627	479	479
query6	435	198	184	184
query7	4818	546	305	305
query8	361	222	212	212
query9	8828	3996	4007	3996
query10	450	315	250	250
query11	5944	2351	2164	2164
query12	158	102	102	102
query13	1289	611	420	420
query14	6370	5344	5048	5048
query14_1	4384	4397	4361	4361
query15	202	199	172	172
query16	1004	447	410	410
query17	929	680	577	577
query18	2414	479	338	338
query19	194	180	139	139
query20	112	104	106	104
query21	213	138	112	112
query22	13694	13563	13372	13372
query23	17377	16461	16158	16158
query23_1	16320	16222	16376	16222
query24	7680	1779	1307	1307
query24_1	1321	1317	1306	1306
query25	579	494	417	417
query26	1319	315	165	165
query27	2737	582	340	340
query28	4497	2026	2057	2026
query29	1115	642	505	505
query30	327	234	204	204
query31	1117	1074	957	957
query32	116	66	66	66
query33	545	331	264	264
query34	1176	1143	677	677
query35	758	784	678	678
query36	1443	1408	1219	1219
query37	166	114	91	91
query38	3220	3149	3037	3037
query39	936	941	920	920
query39_1	872	900	869	869
query40	227	129	106	106
query41	71	70	70	70
query42	98	99	99	99
query43	325	332	278	278
query44	
query45	204	194	184	184
query46	1116	1178	753	753
query47	2392	2376	2253	2253
query48	380	434	284	284
query49	647	481	388	388
query50	960	348	271	271
query51	4361	4271	4291	4271
query52	91	90	83	83
query53	247	262	194	194
query54	286	241	231	231
query55	94	76	73	73
query56	261	250	252	250
query57	1421	1398	1336	1336
query58	257	249	226	226
query59	1613	1683	1517	1517
query60	329	248	229	229
query61	168	158	153	153
query62	710	655	577	577
query63	229	200	186	186
query64	2560	795	631	631
query65	
query66	1801	454	348	348
query67	29802	29675	29659	29659
query68	
query69	424	307	275	275
query70	990	969	946	946
query71	309	229	219	219
query72	3029	2759	2422	2422
query73	849	803	411	411
query74	5131	4951	4794	4794
query75	2658	2599	2244	2244
query76	2338	1157	769	769
query77	371	358	286	286
query78	12288	12512	11915	11915
query79	1433	1060	801	801
query80	593	470	395	395
query81	448	275	242	242
query82	572	161	128	128
query83	370	287	248	248
query84	304	146	152	146
query85	886	546	441	441
query86	368	298	297	297
query87	3360	3357	3172	3172
query88	3618	2751	2710	2710
query89	422	373	324	324
query90	1999	179	178	178
query91	177	165	141	141
query92	68	65	67	65
query93	1534	1433	977	977
query94	547	342	317	317
query95	695	483	348	348
query96	1048	806	331	331
query97	2700	2707	2534	2534
query98	214	207	209	207
query99	1131	1176	1043	1043
Total cold run time: 251704 ms
Total hot run time: 169292 ms

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 79.11% (337/426) 🎉

Increment coverage report
Complete coverage report

Category	Coverage
Function Coverage	71.87% (27516/38284)
Line Coverage	55.43% (294616/531549)
Region Coverage	52.15% (245716/471159)
Branch Coverage	53.29% (106223/199314)
