[improvement](filecache) Adapt file cache queue consumption #63504

Open
freemandealer wants to merge 6 commits into apache:master from freemandealer:task-master-pick-file-cache-adaptive-queue-consu

@freemandealer
Member

Problem Summary: File cache background consumers used fixed intervals and batch sizes for LRU recorder log replay and _need_update_lru_blocks updates. When producers outpaced those consumers, backlog growth was hard to observe and could increase memory pressure. This change adds queue length metrics for LRU recorder log queues, exposes queue-size accessors, supports bounded LRU log replay, and makes both background consumers adapt their interval and batch size according to backlog watermarks. It also slices block LRU update work into smaller lock-hold batches and skips LRU log recording when tail-record retention is disabled.

Release note: None

  • Test: Unit Test
    • CCACHE_DISABLE=1 DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 EXTRA_CXX_FLAGS='-Wno-error=deprecated-literal-operator' sh run-be-ut.sh --run --filter=BlockFileCacheTest.test_lru_log_replay_bound_and_disable_record
    • build-support/check-format.sh
    • git diff --check
    • Tried build-support/run-clang-tidy.sh --base origin/master --build-dir be/ut_build_ASAN; it was blocked by pre-existing/file-level diagnostics and system header lookup errors before producing a clean result.
  • Behavior changed: Yes. File cache background queue consumers can increase consume frequency and batch size when backlog crosses configured watermarks.
  • Does this need documentation: No
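
The watermark-driven adaptation described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `AdaptiveConsumeConfig`, `ConsumePlan`, `plan_consume`, and every threshold value are hypothetical names and numbers chosen for the example, and the real consumers may use different tiers or formulas.

```cpp
#include <cassert>
#include <chrono>
#include <cstddef>

// Hypothetical tuning knobs; names and defaults are illustrative only,
// they are not Doris config items.
struct AdaptiveConsumeConfig {
    std::size_t low_watermark = 1000;     // below this, the backlog is considered healthy
    std::size_t high_watermark = 100000;  // at or above this, consume as fast as allowed
    std::size_t base_batch = 1000;
    std::size_t max_batch = 50000;
    std::chrono::milliseconds base_interval{1000};
    std::chrono::milliseconds min_interval{10};
};

struct ConsumePlan {
    std::chrono::milliseconds interval;  // how long the consumer sleeps before the next round
    std::size_t batch_size;              // how many queue entries it handles in that round
};

// Pick the next interval and batch size from the current queue length.
inline ConsumePlan plan_consume(std::size_t backlog, const AdaptiveConsumeConfig& cfg) {
    if (backlog >= cfg.high_watermark) {
        // Heavy backlog: shortest interval, largest batch.
        return {cfg.min_interval, cfg.max_batch};
    }
    if (backlog >= cfg.low_watermark) {
        // Moderate backlog: midpoint between the base and the aggressive settings.
        return {(cfg.base_interval + cfg.min_interval) / 2,
                (cfg.base_batch + cfg.max_batch) / 2};
    }
    // Healthy backlog: keep the fixed defaults.
    return {cfg.base_interval, cfg.base_batch};
}
```

A background consumer would call something like `plan_consume(queue.size(), cfg)` once per round, so the published queue length metric and the chosen plan come from the same measurement.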

@hello-stephen
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@freemandealer
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 31083 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit a3410c382f1b8f3a57c58bb9792961b0c76a58b4, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17677	3867	3893	3867
q2	q3	10803	1407	802	802
q4	4682	472	349	349
q5	7621	2292	2087	2087
q6	383	175	141	141
q7	960	767	662	662
q8	9370	1672	1592	1592
q9	6971	4907	4922	4907
q10	6438	2089	1802	1802
q11	444	279	254	254
q12	636	427	290	290
q13	18156	3363	2764	2764
q14	264	253	237	237
q15	q16	825	779	721	721
q17	1008	921	981	921
q18	7028	5757	5577	5577
q19	1185	1289	1084	1084
q20	515	415	270	270
q21	5505	2556	2438	2438
q22	431	362	318	318
Total cold run time: 100902 ms
Total hot run time: 31083 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4306	4134	4176	4134
q2	q3	4543	4923	4344	4344
q4	2100	2222	1395	1395
q5	4395	4280	4315	4280
q6	232	185	212	185
q7	2160	1813	1641	1641
q8	2585	2188	2130	2130
q9	7886	7765	7678	7678
q10	4565	4487	4068	4068
q11	591	463	487	463
q12	731	738	526	526
q13	3385	3645	3068	3068
q14	299	310	311	310
q15	q16	721	752	647	647
q17	1363	1326	1290	1290
q18	8046	7346	7150	7150
q19	1127	1104	1115	1104
q20	2218	2240	1943	1943
q21	5312	4612	4488	4488
q22	547	467	422	422
Total cold run time: 57112 ms
Total hot run time: 51266 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170549 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit a3410c382f1b8f3a57c58bb9792961b0c76a58b4, data reload: false

query5	4338	656	515	515
query6	337	230	209	209
query7	4272	552	320	320
query8	331	244	221	221
query9	8903	4052	4027	4027
query10	479	352	298	298
query11	5818	2394	2196	2196
query12	189	128	129	128
query13	1308	621	438	438
query14	5979	5347	5068	5068
query14_1	4367	4358	4363	4358
query15	210	203	184	184
query16	986	457	514	457
query17	1021	748	601	601
query18	2474	502	373	373
query19	221	205	172	172
query20	136	135	128	128
query21	217	144	121	121
query22	13568	13573	13279	13279
query23	17147	16416	16052	16052
query23_1	16095	16197	16191	16191
query24	7430	1729	1300	1300
query24_1	1326	1305	1305	1305
query25	597	506	441	441
query26	1305	330	180	180
query27	2674	548	358	358
query28	4481	1985	1971	1971
query29	991	644	522	522
query30	311	244	200	200
query31	1112	1070	971	971
query32	94	76	71	71
query33	535	349	286	286
query34	1165	1106	648	648
query35	756	776	677	677
query36	1342	1388	1184	1184
query37	153	109	101	101
query38	3192	3133	3085	3085
query39	926	919	902	902
query39_1	873	865	866	865
query40	224	143	122	122
query41	64	64	62	62
query42	108	105	107	105
query43	327	324	278	278
query44	
query45	207	201	198	198
query46	1056	1165	723	723
query47	2322	2339	2231	2231
query48	405	390	288	288
query49	640	477	370	370
query50	1055	342	259	259
query51	4335	4273	4220	4220
query52	105	106	91	91
query53	256	279	205	205
query54	303	264	250	250
query55	92	91	82	82
query56	294	303	303	303
query57	1409	1406	1346	1346
query58	300	260	257	257
query59	1529	1619	1408	1408
query60	321	323	297	297
query61	161	158	148	148
query62	666	629	564	564
query63	239	199	204	199
query64	2390	813	618	618
query65	
query66	1745	485	369	369
query67	29342	29969	29695	29695
query68	
query69	454	334	308	308
query70	1039	1004	974	974
query71	306	272	268	268
query72	2976	2663	2418	2418
query73	854	781	380	380
query74	5038	4889	4723	4723
query75	2638	2629	2246	2246
query76	2311	1135	765	765
query77	390	401	331	331
query78	12167	12099	11664	11664
query79	1448	995	729	729
query80	642	542	456	456
query81	456	276	239	239
query82	1379	156	127	127
query83	363	271	252	252
query84	297	143	107	107
query85	870	527	459	459
query86	403	341	338	338
query87	3393	3402	3211	3211
query88	3514	2677	2654	2654
query89	436	384	335	335
query90	2011	180	179	179
query91	178	167	137	137
query92	80	80	74	74
query93	1506	1404	847	847
query94	531	340	295	295
query95	687	373	348	348
query96	1012	774	336	336
query97	2669	2729	2568	2568
query98	234	228	235	228
query99	1133	1087	959	959
Total cold run time: 251885 ms
Total hot run time: 170549 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 86.30% (126/146) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.65% (20744/38667)
Line Coverage 37.23% (196431/527596)
Region Coverage 33.55% (153980/458920)
Branch Coverage 34.57% (67090/194082)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 86.30% (126/146) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 71.90% (27245/37892)
Line Coverage 55.25% (290897/526546)
Region Coverage 52.23% (242116/463520)
Branch Coverage 53.56% (104369/194861)

Comment thread be/src/io/cache/block_file_cache.cpp Outdated
Comment thread be/src/io/cache/block_file_cache.cpp Outdated
Comment thread be/src/io/cache/block_file_cache.cpp Outdated
Comment thread be/src/io/cache/block_file_cache.cpp
Comment thread be/src/common/config.cpp Outdated
### What problem does this PR solve?

Issue Number: N/A

Related PR: N/A

Problem Summary: File cache hit ratio metrics are derived from global file cache read bytes, but warmup reads from manual warmup, periodic warmup, event-driven warmup, and rebalance-triggered warmup used to update the same counters as query reads. This polluted the query hit ratio. Mixed hit/miss reads could also be attributed to one source for the whole request. This change skips warmup updates to global file cache read metrics while preserving per-IOContext profile stats, records local/remote/peer bytes by actual returned bytes, and avoids updating metrics for failed reads. It also fixes direct-read partial continuation and no-warmup miss-only hit ratio refresh. After rebase, the warmup metrics UT exposed a separate ASAN issue because the test snapshot helper triggered all metric hooks, including a stale StorageEngine hook that captured a destroyed engine. The test now snapshots FileCacheMetrics directly, and StorageEngine deregisters its hook on destruction.
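
One way to picture the accounting rule in this summary is the sketch below. It is an illustration under assumptions, not the reader code from this change: `ReadBytes`, `GlobalReadMetrics`, `ReadOutcome`, and `publish_read` are hypothetical names, and the sketch assumes a failed read updates neither the profile stats nor the global counters.

```cpp
#include <cassert>
#include <cstdint>

// Bytes actually returned by one read, split by where they came from.
struct ReadBytes {
    int64_t local = 0;
    int64_t remote = 0;
    int64_t peer = 0;
};

// Hypothetical stand-in for the global counters behind the query hit ratio.
struct GlobalReadMetrics {
    int64_t local_bytes = 0;
    int64_t remote_bytes = 0;
    int64_t peer_bytes = 0;
};

struct ReadOutcome {
    bool ok = false;         // whether the read succeeded
    bool is_warmup = false;  // manual, periodic, event-driven, or rebalance warmup
    ReadBytes bytes;         // per-source byte counts for this request
};

// Returns true when the global counters were updated.
inline bool publish_read(GlobalReadMetrics& global, ReadBytes& profile, const ReadOutcome& r) {
    if (!r.ok) {
        return false;  // assumption: failed reads publish nothing at all
    }
    // Per-context profile stats are kept for every successful read, warmup included.
    profile.local += r.bytes.local;
    profile.remote += r.bytes.remote;
    profile.peer += r.bytes.peer;
    if (r.is_warmup) {
        return false;  // warmup must not pollute the query hit ratio
    }
    // A mixed hit/miss read is attributed per source, not to a single source.
    global.local_bytes += r.bytes.local;
    global.remote_bytes += r.bytes.remote;
    global.peer_bytes += r.bytes.peer;
    return true;
}
```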

### Release note

File cache hit ratio metrics now exclude warmup reads.

### Check List (For Author)

- Test: Regression test / Unit Test
    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter=BlockFileCacheTest.cached_remote_file_reader_warmup_does_not_update_global_metrics
    - Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.cached_remote_file_reader*'
- Behavior changed: Yes. Warmup reads no longer contribute to global file cache read metrics used for query hit ratio; per-IOContext profile stats are preserved.
- Does this need documentation: No
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
@freemandealer force-pushed the task-master-pick-file-cache-adaptive-queue-consu branch from a3410c3 to 97a715b on June 1, 2026 14:37
@freemandealer
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29197 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 97a715b6d34747dea06b7c43a4c66fb192458afa, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17619	3972	3970	3970
q2	q3	10741	1454	833	833
q4	4722	486	341	341
q5	7851	905	605	605
q6	206	173	143	143
q7	830	877	643	643
q8	10383	1659	1629	1629
q9	7196	4554	4566	4554
q10	6822	1829	1506	1506
q11	441	288	250	250
q12	641	434	294	294
q13	18154	3498	2738	2738
q14	263	258	241	241
q15	q16	836	780	714	714
q17	934	884	988	884
q18	6824	5894	5535	5535
q19	1162	1258	1128	1128
q20	517	406	278	278
q21	6252	2812	2588	2588
q22	451	365	323	323
Total cold run time: 102845 ms
Total hot run time: 29197 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4880	4742	4686	4686
q2	q3	5098	5349	4685	4685
q4	2148	2199	1388	1388
q5	4809	4821	4733	4733
q6	235	188	136	136
q7	1902	1782	1597	1597
q8	2399	2162	1969	1969
q9	7457	7477	7454	7454
q10	4752	4714	4257	4257
q11	536	391	356	356
q12	734	751	530	530
q13	3046	3464	2813	2813
q14	278	296	248	248
q15	q16	669	694	603	603
q17	1282	1259	1256	1256
q18	7331	6848	6808	6808
q19	1148	1106	1131	1106
q20	2212	2215	1932	1932
q21	5265	4569	4398	4398
q22	525	451	401	401
Total cold run time: 56706 ms
Total hot run time: 51356 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 170457 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 97a715b6d34747dea06b7c43a4c66fb192458afa, data reload: false

query5	4327	662	524	524
query6	331	232	204	204
query7	4239	589	309	309
query8	332	235	220	220
query9	8802	3996	4013	3996
query10	478	372	296	296
query11	5782	2350	2134	2134
query12	190	133	130	130
query13	1305	618	438	438
query14	6100	5446	5133	5133
query14_1	4435	4472	4424	4424
query15	216	205	182	182
query16	1002	426	430	426
query17	1174	750	612	612
query18	2613	497	366	366
query19	220	214	185	185
query20	141	141	132	132
query21	220	144	120	120
query22	13678	13664	13375	13375
query23	17523	16514	16232	16232
query23_1	16358	16352	16404	16352
query24	7520	1781	1331	1331
query24_1	1344	1316	1327	1316
query25	602	520	473	473
query26	1345	330	175	175
query27	2643	597	357	357
query28	4432	2051	2046	2046
query29	1019	671	569	569
query30	303	236	197	197
query31	1137	1099	947	947
query32	87	72	72	72
query33	528	351	286	286
query34	1177	1144	649	649
query35	764	801	708	708
query36	1408	1451	1281	1281
query37	156	103	90	90
query38	3186	3174	3052	3052
query39	937	919	899	899
query39_1	878	888	890	888
query40	238	147	126	126
query41	65	64	61	61
query42	117	112	110	110
query43	334	341	285	285
query44	
query45	202	203	194	194
query46	1081	1242	739	739
query47	2433	2343	2280	2280
query48	407	405	286	286
query49	644	498	393	393
query50	1003	335	250	250
query51	4345	4326	4220	4220
query52	106	107	96	96
query53	261	284	204	204
query54	315	268	269	268
query55	97	89	88	88
query56	329	301	312	301
query57	1435	1402	1305	1305
query58	301	276	272	272
query59	1536	1672	1407	1407
query60	324	324	316	316
query61	164	159	159	159
query62	706	662	592	592
query63	251	201	211	201
query64	2210	806	637	637
query65	
query66	1633	507	359	359
query67	29883	29703	29561	29561
query68	
query69	458	340	285	285
query70	1007	966	987	966
query71	317	276	275	275
query72	3260	2631	2388	2388
query73	849	793	428	428
query74	5155	4912	4782	4782
query75	2695	2613	2264	2264
query76	2311	1153	782	782
query77	404	399	330	330
query78	12437	12720	11895	11895
query79	1466	1062	766	766
query80	1180	531	472	472
query81	502	284	241	241
query82	1345	156	125	125
query83	350	282	249	249
query84	269	143	112	112
query85	931	529	455	455
query86	434	349	347	347
query87	3433	3372	3217	3217
query88	3631	2725	2731	2725
query89	453	411	345	345
query90	1767	185	181	181
query91	179	168	138	138
query92	75	80	71	71
query93	1594	1402	947	947
query94	629	357	314	314
query95	674	472	343	343
query96	1009	779	345	345
query97	2741	2721	2622	2622
query98	237	230	233	230
query99	1151	1180	1056	1056
Total cold run time: 255112 ms
Total hot run time: 170457 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 83.17% (173/208) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.11% (21069/38934)
Line Coverage 37.61% (199573/530587)
Region Coverage 33.87% (156291/461482)
Branch Coverage 34.84% (67942/195015)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 87.50% (182/208) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.80% (28135/38123)
Line Coverage 57.80% (305909/529210)
Region Coverage 54.99% (256191/465909)
Branch Coverage 56.43% (110466/195765)

@freemandealer
Member Author

/review

@github-actions[bot]
Contributor

I found one correctness issue in the bounded LRU log replay path.

Critical checkpoint conclusions:

  • Goal/test: The PR improves file-cache queue consumption and metric accounting, and adds BE/regression coverage, but the bounded replay change can make persisted LRU dumps miss already-recorded events.
  • Scope/minimality: The final version is mostly focused and removed the previously raised extra knobs/metrics.
  • Concurrency/lifecycle: The issue is a lifecycle interaction between the replay thread and dump thread over the shadow queue and update counter; both use the recorder mutex for the shadow queue, but dump can still observe an intentionally partially replayed state.
  • Config/compatibility: No remaining new configs in the final head; no storage format or FE/BE protocol compatibility concern found.
  • Parallel paths: The concern applies to all four LRU queues because they all use the same bounded replay and dump path.
  • Tests: Unit tests cover bounded replay itself, but do not cover dumping while replay backlog remains pending.
  • Observability/performance: The new backlog metric/logging helps observe backlog, but does not prevent stale dumps.
  • Transaction/persistence: This is persistence-related for file-cache LRU tail restore; a crash after a stale dump can restore an out-of-date LRU order.

User focus: No additional user-provided review focus was specified.

Comment thread be/src/io/cache/block_file_cache.cpp
Signed-off-by: zhengyu <zhangzhengyu@selectdb.com>
@freemandealer
Member Author

run buildall

@freemandealer
Member Author

/review

@github-actions[bot]
Contributor

Automated review summary for PR 63504 at 12268c6:

I did not find additional blocking issues beyond the existing review threads. The previously raised LRU dump/replay race is addressed in the current diff by draining pending recorder logs under _mutex_lru_log before collecting dump entries and by subtracting only the counted updates after a successful dump.
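
The drain-before-dump fix described above can be illustrated with a simplified model. This is a sketch under assumptions, not the Doris implementation: `LruRecorderSketch` and its members are hypothetical, the single `_mutex` plays the role of `_mutex_lru_log`, and plain `int` keys stand in for real LRU log records.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <deque>
#include <mutex>
#include <vector>

// Illustrative stand-in for the LRU recorder; all names here are hypothetical.
class LruRecorderSketch {
public:
    // Producer side: enqueue one LRU event.
    void record(int key) {
        std::lock_guard<std::mutex> lock(_mutex);
        _pending.push_back(key);
    }

    // Bounded replay: apply at most max_entries pending events to the shadow queue.
    std::size_t replay(std::size_t max_entries) {
        std::lock_guard<std::mutex> lock(_mutex);
        return replay_locked(max_entries);
    }

    // Dump collection: drain the whole backlog first so the snapshot is not stale,
    // then report how many updates this snapshot accounts for.
    std::vector<int> collect_dump(std::size_t* counted_updates) {
        std::lock_guard<std::mutex> lock(_mutex);
        replay_locked(_pending.size());
        *counted_updates = _updates_since_dump;
        return std::vector<int>(_shadow.begin(), _shadow.end());
    }

    // Call only after the dump was written successfully; the disk IO itself
    // happens without holding the mutex.
    void on_dump_success(std::size_t counted_updates) {
        std::lock_guard<std::mutex> lock(_mutex);
        _updates_since_dump -= std::min(counted_updates, _updates_since_dump);
    }

    std::size_t pending_size() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _pending.size();
    }

    std::size_t updates_since_dump() {
        std::lock_guard<std::mutex> lock(_mutex);
        return _updates_since_dump;
    }

private:
    // Caller must hold _mutex.
    std::size_t replay_locked(std::size_t max_entries) {
        std::size_t applied = 0;
        while (applied < max_entries && !_pending.empty()) {
            _shadow.push_back(_pending.front());
            _pending.pop_front();
            ++applied;
        }
        _updates_since_dump += applied;
        return applied;
    }

    std::mutex _mutex;
    std::deque<int> _pending;  // recorded but not yet replayed events
    std::deque<int> _shadow;   // replayed state that a dump persists
    std::size_t _updates_since_dump = 0;
};
```

Subtracting only the counted updates, instead of resetting the counter to zero, keeps any events replayed while the dump was being written eligible for the next dump.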

Critical checkpoint conclusions:

  • Goal and proof: The PR narrows file-cache read metrics so warmup reads do not affect global read metrics, splits mixed local/remote/peer read bytes, converts queue length metrics to bvar::Status, and bounds/adapts LRU update/replay draining. Added BE unit tests and a cloud docker regression cover these paths.
  • Small and focused: The actual GitHub PR diff is focused on BE file-cache metrics/LRU behavior and related tests.
  • Concurrency: LRU recorder replay/dump/counter updates are protected by _mutex_lru_log. Block LRU updates drain outside the cache mutex and slice work under _mutex, keeping lock holds bounded. No new lock-order issue found in the reviewed paths.
  • Lifecycle/static initialization: No new cross-TU static initialization dependency or special ownership cycle found. StorageEngine hook deregistration is paired in destructor.
  • Configuration: No new config item remains in the actual PR diff; existing dynamic config reads are used directly.
  • Compatibility: No storage format, Thrift, or FE/BE protocol compatibility issue found.
  • Parallel paths: Direct read, indirect cached read, remote fallback, warmup, and mixed hit/miss paths are covered in code/tests.
  • Special conditions: Warmup exclusion from global metrics is explicit; failure paths avoid publishing partial global stats.
  • Test coverage: BE unit tests cover metric deltas, warmup exclusion, mixed reads, failed reads, and LRU dump/replay behavior. The cloud docker regression covers manual/periodic/event/rebalance warmup metric behavior. I did not run tests locally in this review.
  • Observability: Queue backlog status metrics and warning logs are present; no additional observability gap found.
  • Transaction/persistence/data correctness: The LRU dump persistence path now drains pending records before dumping, preserving restore consistency for counted records. No transaction path changes.
  • Performance: Adaptive draining and lock slicing reduce backlog/lock pressure. Dump collection is done under the recorder mutex and disk IO is outside it.
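
The lock slicing mentioned in the concurrency and performance checkpoints can be sketched as below. This is a generic illustration, not the code under review: `apply_in_slices` is a hypothetical helper, and the real update path operates on file cache blocks under the cache mutex.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <mutex>
#include <vector>

// Apply `apply` to every item, holding `mutex` for at most `slice_size` items
// at a time so that other threads can take the lock between slices.
// Returns the number of lock acquisitions, which is handy for tests.
template <typename T, typename Fn>
std::size_t apply_in_slices(const std::vector<T>& items, std::size_t slice_size,
                            std::mutex& mutex, Fn&& apply) {
    if (slice_size == 0) {
        slice_size = 1;  // guard against a loop that never advances
    }
    std::size_t lock_acquisitions = 0;
    for (std::size_t begin = 0; begin < items.size(); begin += slice_size) {
        const std::size_t end = std::min(begin + slice_size, items.size());
        std::lock_guard<std::mutex> lock(mutex);  // released at the end of each slice
        ++lock_acquisitions;
        for (std::size_t i = begin; i < end; ++i) {
            apply(items[i]);
        }
    }
    return lock_acquisitions;
}
```

The items are assumed to have been drained from the queue before the first slice, so only the application step runs under the mutex.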

User focus: No additional user-provided review focus was supplied.

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 80.52% (215/267) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 54.16% (21142/39039)
Line Coverage 37.69% (200821/532772)
Region Coverage 33.78% (157933/467507)
Branch Coverage 34.78% (69001/198406)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 83.90% (224/267) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 73.92% (28260/38229)
Line Coverage 57.86% (307475/531396)
Region Coverage 54.40% (256744/471928)
Branch Coverage 56.01% (111543/199152)

@freemandealer dismissed stale reviews from gavinchou and github-actions[bot] on June 2, 2026 14:51

done fix

@hello-stephen
Contributor

TPC-H: Total hot run time: 29742 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 12268c64e90d12389c7a69e3ca30dd2b2fcbcdcd, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17610	4205	4236	4205
q2	q3	10817	1415	837	837
q4	4687	481	348	348
q5	7552	908	585	585
q6	191	176	142	142
q7	781	862	653	653
q8	9394	1709	1673	1673
q9	5905	4533	4537	4533
q10	6765	1858	1569	1569
q11	447	279	261	261
q12	650	427	300	300
q13	18128	3422	2790	2790
q14	266	258	243	243
q15	q16	840	777	698	698
q17	1011	974	905	905
q18	7098	5918	5541	5541
q19	1523	1360	1142	1142
q20	525	407	255	255
q21	6362	2850	2734	2734
q22	460	389	328	328
Total cold run time: 101012 ms
Total hot run time: 29742 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	5220	4882	4919	4882
q2	q3	4830	5300	4678	4678
q4	2149	2232	1405	1405
q5	4819	4880	4703	4703
q6	239	181	128	128
q7	1909	1810	1583	1583
q8	2427	2167	2140	2140
q9	7959	7586	7472	7472
q10	4775	4697	4240	4240
q11	542	383	354	354
q12	734	743	532	532
q13	3001	3335	2774	2774
q14	287	282	249	249
q15	q16	685	706	615	615
q17	1277	1277	1283	1277
q18	7454	6820	6883	6820
q19	1140	1078	1099	1078
q20	2221	2225	1940	1940
q21	5323	4646	4431	4431
q22	556	462	394	394
Total cold run time: 57547 ms
Total hot run time: 51695 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 169839 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 12268c64e90d12389c7a69e3ca30dd2b2fcbcdcd, data reload: false

query5	4340	646	494	494
query6	447	203	193	193
query7	4838	573	301	301
query8	375	226	205	205
query9	8772	4125	4083	4083
query10	462	311	266	266
query11	5943	2374	2148	2148
query12	158	105	102	102
query13	1244	600	450	450
query14	6942	5392	5127	5127
query14_1	4392	4380	4372	4372
query15	206	197	176	176
query16	1008	454	426	426
query17	1088	681	558	558
query18	2427	477	337	337
query19	203	185	141	141
query20	110	111	104	104
query21	218	142	117	117
query22	13586	13523	13350	13350
query23	17257	16532	16245	16245
query23_1	16251	16352	16335	16335
query24	7624	1804	1311	1311
query24_1	1294	1303	1320	1303
query25	550	474	382	382
query26	1299	321	175	175
query27	2656	591	341	341
query28	4471	2036	1998	1998
query29	1099	621	508	508
query30	313	249	198	198
query31	1148	1086	978	978
query32	106	66	62	62
query33	521	349	264	264
query34	1184	1175	650	650
query35	765	805	676	676
query36	1423	1390	1247	1247
query37	163	111	100	100
query38	3232	3167	3055	3055
query39	939	950	908	908
query39_1	878	878	912	878
query40	223	129	110	110
query41	72	70	72	70
query42	105	97	98	97
query43	329	331	286	286
query44	
query45	203	190	196	190
query46	1082	1206	748	748
query47	2377	2333	2281	2281
query48	399	406	313	313
query49	653	485	371	371
query50	1008	363	261	261
query51	4337	4294	4241	4241
query52	90	92	81	81
query53	252	280	203	203
query54	283	241	219	219
query55	82	78	74	74
query56	262	257	230	230
query57	1434	1395	1343	1343
query58	267	228	224	224
query59	1578	1658	1443	1443
query60	297	274	247	247
query61	187	182	184	182
query62	742	663	581	581
query63	229	187	188	187
query64	2543	774	632	632
query65	
query66	1794	467	347	347
query67	29809	29727	29555	29555
query68	
query69	424	311	258	258
query70	940	958	939	939
query71	302	220	215	215
query72	2997	2670	2457	2457
query73	855	742	480	480
query74	5181	4954	4766	4766
query75	2684	2591	2248	2248
query76	2333	1150	817	817
query77	355	375	293	293
query78	12492	12414	11890	11890
query79	1470	1044	786	786
query80	1107	477	398	398
query81	504	280	237	237
query82	581	161	122	122
query83	359	274	255	255
query84	267	144	117	117
query85	935	529	439	439
query86	416	306	287	287
query87	3382	3320	3199	3199
query88	3641	2747	2712	2712
query89	436	386	339	339
query90	1775	184	185	184
query91	179	162	138	138
query92	62	61	55	55
query93	1436	1429	887	887
query94	632	358	311	311
query95	691	382	347	347
query96	1147	810	356	356
query97	2749	2709	2547	2547
query98	225	212	203	203
query99	1192	1163	1036	1036
Total cold run time: 252880 ms
Total hot run time: 169839 ms

  }

- if (_no_warmup_num_hit_blocks->get_value() > 0) {
+ if (_no_warmup_num_read_blocks->get_value() > 0) {
Contributor

why change these?

Member Author

A quick fix for a divide-by-zero bug.
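
For context, the guard discussed in this thread amounts to checking the denominator instead of the numerator. A minimal sketch, assuming the ratio is hit blocks divided by read blocks; `hit_ratio` is a hypothetical helper, not a function from the diff:

```cpp
#include <cassert>
#include <cstdint>

// Compute a hit ratio that is safe when nothing has been read yet.
inline double hit_ratio(int64_t hit_blocks, int64_t read_blocks) {
    if (read_blocks <= 0) {
        return 0.0;  // no reads recorded, so avoid dividing by zero
    }
    // Guarding on read_blocks also refreshes the ratio for miss-only windows,
    // where hit_blocks is 0 but read_blocks is positive.
    return static_cast<double>(hit_blocks) / static_cast<double>(read_blocks);
}
```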

Comment thread be/src/io/cache/block_file_cache.cpp
@freemandealer
Member Author

run buildall

@hello-stephen
Contributor

TPC-H: Total hot run time: 29264 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpch-tools
Tpch sf100 test result on commit 37bc18e46f1321ba8c261c7698fd238d0b1aa77c, data reload: false

------ Round 1 ----------------------------------
orders	Doris	NULL	NULL	0	0	0	NULL	0	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	17753	4027	3991	3991
q2	q3	10739	1452	817	817
q4	4688	479	355	355
q5	7706	986	630	630
q6	196	179	139	139
q7	794	845	664	664
q8	10139	1555	1728	1555
q9	6027	4638	4666	4638
q10	6788	1811	1529	1529
q11	439	270	247	247
q12	631	427	289	289
q13	18447	3349	2808	2808
q14	262	259	242	242
q15	q16	837	786	720	720
q17	1009	962	926	926
q18	7151	5722	5719	5719
q19	1340	1225	978	978
q20	510	406	258	258
q21	5906	2636	2458	2458
q22	442	364	301	301
Total cold run time: 101804 ms
Total hot run time: 29264 ms

----- Round 2, with runtime_filter_mode=off -----
orders	Doris	NULL	NULL	150000000	42	6422171781	NULL	22778155	NULL	NULL	2023-12-26 18:27:23	2023-12-26 18:42:55	NULL	utf-8	NULL	NULL	
============================================
q1	4333	4267	4255	4255
q2	q3	4557	4949	4376	4376
q4	2100	2204	1373	1373
q5	4427	4294	4333	4294
q6	225	174	129	129
q7	1750	2245	1784	1784
q8	2565	2194	2183	2183
q9	8199	8307	7989	7989
q10	4835	4729	4325	4325
q11	586	427	389	389
q12	772	793	544	544
q13	3333	3651	2986	2986
q14	290	296	276	276
q15	q16	724	751	670	670
q17	1378	1334	1464	1334
q18	8120	7346	7157	7157
q19	1150	1124	1092	1092
q20	2227	2201	1933	1933
q21	5267	4555	4439	4439
q22	521	441	400	400
Total cold run time: 57359 ms
Total hot run time: 51928 ms

@hello-stephen
Contributor

TPC-DS: Total hot run time: 168476 ms
machine: 'aliyun_ecs.c7a.8xlarge_32C64G'
scripts: https://github.com/apache/doris/tree/master/tools/tpcds-tools
TPC-DS sf100 test result on commit 37bc18e46f1321ba8c261c7698fd238d0b1aa77c, data reload: false

query5	4309	612	495	495
query6	444	199	177	177
query7	4899	545	296	296
query8	358	211	210	210
query9	8779	4020	4020	4020
query10	457	301	249	249
query11	5916	2358	2151	2151
query12	149	109	107	107
query13	1277	612	412	412
query14	6429	5355	5040	5040
query14_1	4415	4381	4377	4377
query15	210	197	180	180
query16	991	459	424	424
query17	979	716	569	569
query18	2467	503	350	350
query19	215	189	149	149
query20	112	108	108	108
query21	223	143	122	122
query22	13594	13600	13376	13376
query23	17359	16540	16273	16273
query23_1	16357	16197	16264	16197
query24	7529	1750	1306	1306
query24_1	1297	1296	1294	1294
query25	555	442	384	384
query26	1287	299	162	162
query27	2733	555	338	338
query28	4519	2014	2016	2014
query29	1062	606	479	479
query30	310	232	196	196
query31	1112	1064	951	951
query32	113	63	64	63
query33	520	326	250	250
query34	1181	1164	655	655
query35	732	765	688	688
query36	1427	1407	1237	1237
query37	146	109	88	88
query38	3198	3134	3031	3031
query39	924	922	897	897
query39_1	882	890	866	866
query40	213	122	101	101
query41	64	63	62	62
query42	93	96	93	93
query43	320	319	278	278
query44	
query45	192	184	177	177
query46	1095	1182	752	752
query47	2376	2422	2236	2236
query48	403	422	287	287
query49	666	462	365	365
query50	982	345	249	249
query51	4352	4278	4273	4273
query52	97	90	76	76
query53	247	262	189	189
query54	275	216	198	198
query55	84	74	69	69
query56	236	217	224	217
query57	1435	1415	1300	1300
query58	250	215	210	210
query59	1559	1614	1409	1409
query60	295	249	230	230
query61	163	161	164	161
query62	686	649	586	586
query63	238	193	185	185
query64	2555	785	610	610
query65	
query66	1809	462	342	342
query67	29959	29661	29377	29377
query68	
query69	419	341	254	254
query70	992	938	985	938
query71	297	226	211	211
query72	2935	2671	2344	2344
query73	836	799	452	452
query74	5082	4949	4782	4782
query75	2670	2562	2242	2242
query76	2312	1162	772	772
query77	333	399	278	278
query78	12448	12376	11852	11852
query79	1225	1005	715	715
query80	520	500	378	378
query81	442	275	242	242
query82	232	164	131	131
query83	268	318	248	248
query84	284	138	109	109
query85	866	532	449	449
query86	319	303	284	284
query87	3354	3337	3181	3181
query88	3591	2741	2729	2729
query89	413	383	322	322
query90	2168	179	178	178
query91	177	161	138	138
query92	60	64	57	57
query93	1520	1377	917	917
query94	517	352	321	321
query95	665	475	345	345
query96	1072	791	362	362
query97	2723	2697	2611	2611
query98	210	205	204	204
query99	1144	1157	1037	1037
Total cold run time: 250645 ms
Total hot run time: 168476 ms

@hello-stephen
Contributor

BE UT Coverage Report

Increment line coverage 75.85% (201/265) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 53.92% (21042/39027)
Line Coverage 37.58% (199931/532005)
Region Coverage 33.65% (156814/465999)
Branch Coverage 34.64% (68603/198069)

@freemandealer
Member Author

run p0

@freemandealer
Member Author

run cloud_p0

@freemandealer
Member Author

run nonConcurrent

@freemandealer
Member Author

/review

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.48% (197/261) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.79% (22098/38241)
Line Coverage 41.29% (219133/530729)
Region Coverage 37.52% (176535/470507)
Branch Coverage 38.29% (76153/198862)

Contributor

@github-actions (bot) left a comment


Review conclusion: no additional blocking issue found in the actual GitHub PR diff.

Critical checkpoint conclusions:

  • Goal/test proof: The PR addresses adaptive file-cache queue consumption and queue visibility. BE unit tests cover bounded LRU log replay, dump-counter preservation, hard-cap drops, need-update LRU block behavior, and file-cache read metrics. A docker regression test is added for warmup read metrics.
  • Scope/minimality: The GitHub PR file list is focused on BE file-cache code plus related tests. The earlier local-only unrelated FE/README diff was not part of GitHub's accepted PR diff.
  • Concurrency: Reviewed the LRU recorder mutex usage, bounded replay, dump-drain-before-reset flow, block LRU update slicing under the cache lock, and the close condition-variable loops. No new distinct concurrency blocker found beyond the already-known/replied review threads.
  • Lifecycle: Background threads are still joined in the cache destructor; the added storage-engine metric deregistration is appropriate for destructor lifecycle.
  • Configuration: Extra adaptive knobs from earlier review context are no longer present; the remaining adaptive thresholds are internal constants. Existing mutable config values are still read in the background loops.
  • Compatibility/storage/transactions: No storage format, FE/BE protocol, EditLog, or transaction compatibility issue identified.
  • Parallel paths: File-cache read accounting now splits local/remote/peer bytes for direct, indirect, warmup, fallback, and mixed-hit paths; tests exercise these paths.
  • Tests: Relevant BE unit coverage is present. I did not run tests in this review runner.
  • Observability/performance: Queue length/drop metrics and high-backlog warnings are present; lock slicing reduces long cache-lock holds. No additional observability blocker found.
  • User focus: No additional user-provided review focus was specified.

Existing review threads were checked first; I did not repeat the previously raised metrics, wait interval, or dump/replay race comments.
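
For readers following the adaptive-consumption and lock-slicing points above, here is a minimal sketch of the two ideas. It is an illustration only and is not taken from the PR diff: the watermark values, the `ConsumePlan` struct, and the functions `plan_for_backlog` and `apply_in_slices` are all hypothetical names and numbers chosen for the example. The PR's actual thresholds are internal constants that are not quoted in this thread.

```cpp
#include <algorithm>
#include <cassert>
#include <chrono>
#include <cstddef>
#include <mutex>
#include <vector>

// Hypothetical backlog watermarks; the real constants are not shown in this thread.
constexpr std::size_t kLowWatermark = 1000;
constexpr std::size_t kHighWatermark = 10000;

// How a background consumer should behave for one round (hypothetical type).
struct ConsumePlan {
    std::chrono::milliseconds interval; // sleep time before the next round
    std::size_t batch_size;             // maximum items consumed in one round
};

// Choose a shorter interval and a larger batch as the backlog grows.
inline ConsumePlan plan_for_backlog(std::size_t backlog) {
    if (backlog >= kHighWatermark) {
        return ConsumePlan{std::chrono::milliseconds(10), 4096};
    }
    if (backlog >= kLowWatermark) {
        return ConsumePlan{std::chrono::milliseconds(100), 1024};
    }
    return ConsumePlan{std::chrono::milliseconds(1000), 256};
}

// Apply every pending update while holding the lock for at most `slice`
// items at a time, so other threads can acquire the lock between slices.
// Returns the number of lock acquisitions (slices) that were needed.
template <typename T, typename Fn>
std::size_t apply_in_slices(const std::vector<T>& pending, std::mutex& cache_mu,
                            std::size_t slice, Fn&& apply) {
    assert(slice > 0);
    std::size_t done = 0;
    std::size_t slices = 0;
    while (done < pending.size()) {
        const std::size_t end = std::min(pending.size(), done + slice);
        std::lock_guard<std::mutex> guard(cache_mu); // released at end of this iteration
        for (; done < end; ++done) {
            apply(pending[done]);
        }
        ++slices;
    }
    return slices;
}
```

In this sketch a consumer loop would call `plan_for_backlog(queue_size)` each round, take at most `batch_size` items, and pass them to `apply_in_slices` with a small slice size. The trade-off is more lock acquisitions in exchange for shorter individual lock holds.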

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.48% (197/261) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.79% (22098/38241)
Line Coverage 41.29% (219133/530729)
Region Coverage 37.52% (176535/470507)
Branch Coverage 38.29% (76153/198862)

@hello-stephen
Contributor

BE Regression && UT Coverage Report

Increment line coverage 75.48% (197/261) 🎉

Increment coverage report
Complete coverage report

Category Coverage
Function Coverage 57.77% (22090/38241)
Line Coverage 41.29% (219152/530729)
Region Coverage 37.53% (176598/470507)
Branch Coverage 38.30% (76161/198862)

@hello-stephen
Contributor

skip buildall
