[fix](filecache) Make file cache sync clear wait for async cleanup#63602
[fix](filecache) Make file cache sync clear wait for async cleanup#63602freemandealer wants to merge 6 commits into
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
Issue Number: N/A
Related PR: N/A
Problem Summary: The old sync file cache clear path bypassed the normal FileBlock lifecycle and was not safe with concurrent cache users, while the async clear path only marked or enqueued deletes and returned before held blocks, recycle-queue deletes, and file-cache meta deletes were finished. This change makes the factory sync clear path use a synchronous wrapper over the async clear semantics. The wrapper serializes clear operations, pauses the TTL manager during clear, marks existing blocks for deletion, drains recycled blocks, waits for held deleting blocks to be released, and waits for file-cache meta delete fences before reporting completion. The direct clear helper is kept only for BE tests. HTTP sync clear now runs through a cancellable async reply path so client disconnects can cancel waiting without adding a clear timeout config.
None
- Test:
- Unit Test: `DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:HttpRequestAsyncReplyTest.*'`
- Unit Test: `DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='FileCacheActionTest.*'`
- Manual test: `git diff --check`
- Behavior changed: Yes. File cache sync clear now waits for the async cleanup path to finish instead of using the unsafe direct clear path.
- Does this need documentation: No
e169458 to
ab16f05
Compare
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found two blocking issues in the new synchronous file-cache clear path.
Critical checkpoint conclusions:
- Goal/test: The PR aims to make sync file-cache clear wait for holder/recycle/meta cleanup and adds unit tests for held blocks, cancellation, meta fences, and recycle queue draining. The tests do not cover concurrent cache creation during a sync clear or BE thread memory tracking for the HTTP worker.
- Scope/focus: The change is mostly focused on file-cache clear and HTTP cancellation, but the sync-clear semantics changed from an atomic in-memory/storage clear to a staged async clear plus wait loop.
- Concurrency: This change is concurrency-sensitive.
_clear_mutexserializes clear requests only; normalget_or_setpaths use_mutexand can create blocks whileclear_file_cache_sync()waits. The new detached HTTP thread also introduces a new execution context. - Lifecycle/static initialization: No new static initialization issue found. The detached thread relies on
HttpRequestlifetime being held bywait_finish_send_reply(), but the thread context itself still needs BE runtime initialization. - Config/incompatible changes/protocol: No new config, storage format, or FE-BE protocol compatibility issue found.
- Parallel paths: Async clear, sync clear, recycle GC, and direct test-only clear were checked. The sync path now differs materially from old production direct clear under concurrent readers.
- Special conditions: Cancellation is checked in the wait loop, but not enough to address the two issues below.
- Test coverage/results: Added tests cover several serial and holder scenarios, but miss concurrent cache insertion during sync clear and memory-context requirements for the new thread.
- Observability: New result logging is useful; no additional observability blocker found.
- Transaction/persistence/data writes: File-cache metadata deletion fences were considered; no separate persistence blocker found beyond the concurrent clear semantics.
- Memory safety: The detached
std::threadenters BE work withoutSCOPED_INIT_THREAD_CONTEXT, leaving allocations in that worker outside the expected memory-tracking context. - Performance: The wait loop is bounded by 10ms sleeps and drains recycle work; no additional performance blocker found.
User focus: no additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: N/A
Problem Summary: Sync file cache clear could finish while normal readers created new cache blocks during the clear window, and the HTTP sync clear worker did not initialize a BE thread context. This change makes cache misses and holes return transient SKIP_CACHE blocks while sync clear is draining, waits for async cache loading before the sync clear barrier, and initializes thread context in the detached HTTP worker.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:HttpRequestAsyncReplyTest.*'
- Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='FileCacheActionTest.*'
- Manual test: git diff --check
- Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh
- Behavior changed: Yes. During sync file cache clear, cache misses skip caching until the clear barrier finishes.
- Does this need documentation: No
|
run buildall |
|
/review |
There was a problem hiding this comment.
Review conclusion: request changes. The current sync-clear design fixes the previously raised new-block insertion window and now initializes the detached HTTP worker thread context, but it introduces a separate liveness failure when cache async loading terminates with an error.
Critical checkpoint conclusions:
- Goal and proof: The PR aims to make sync file-cache clear wait for async cleanup. The normal success path is covered by new BE unit tests, but async-load failure before clear is not covered and can hang.
- Scope and clarity: The change is mostly focused on file-cache clear and HTTP async reply handling.
- Concurrency: The new clear path uses
_clear_mutex,_mutex,_recycle_keys_mutex, and atomic cancellation. Lock ordering is mostly consistent, but the pre-barrier wait on_async_open_donehas no failure terminal state and can block forever while serializing all clears. - Lifecycle: Background cache loading can exit through the catch path without setting
_async_open_done; sync clear now depends on that flag for progress. - Configuration: No new config items.
- Compatibility: No storage/protocol compatibility issue found.
- Parallel paths: Async clear and BE-test direct clear remain separate; the sync path now depends on async-load completion.
- Special checks: Cancellation is handled for HTTP disconnects, but a normal waiting client or direct caller has no timeout/failure exit when async loading failed.
- Test coverage: New unit tests cover normal wait, cancellation, recycle queue draining, meta-store fence, and skip-cache during clear. Missing negative coverage for async-load failure.
- Observability: Existing logs show async-load exceptions, but sync clear does not surface that state to the caller.
- Transaction/persistence/data-write concerns: Not applicable beyond file-cache metadata cleanup; meta delete fences are waited on in the normal path.
- Performance: No blocking performance issue found beyond the liveness problem.
- User focus: No additional user-provided review focus was supplied.
this will not happen as _async_open_done will always succeed
TPC-H: Total hot run time: 29443 ms |
TPC-H: Total hot run time: 29287 ms |
TPC-DS: Total hot run time: 169529 ms |
TPC-DS: Total hot run time: 170473 ms |
BE UT Coverage ReportIncrement line coverage Increment coverage report
|
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
|
run nonConcurrent |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
### What problem does this PR solve?
Issue Number: N/A
Problem Summary: The sync clear drain path needs to wait for file cache meta delete fences, but normal background GC should keep the original async metadata delete behavior. This change makes recycle-key removal wait for the meta delete fence only when called from clear_file_cache_sync(), while run_background_gc() keeps using the normal async remove path. It also adds a non-clear regression test for remove_if_cached_async() with a held deleting block.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'
- Manual test: git diff --check
- Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh
- Behavior changed: No. Normal background GC keeps async metadata deletion; sync file cache clear still waits for drained recycle-key metadata fences.
- Does this need documentation: No
|
run buildall |
|
/review |
There was a problem hiding this comment.
I found one additional blocking issue beyond the already-open sync-clear threads.
Critical checkpoint conclusions:
- Goal/test evidence: The PR aims to make sync file-cache clear wait for outstanding deletes while avoiding newly cached blocks. The added tests cover new misses during clear, but not new readers hitting old blocks that were already marked deleting.
- Scope/focus: The implementation is focused on file-cache clear behavior, but the clear barrier is incomplete because existing deleting blocks remain visible to new get_or_set callers.
- Concurrency/lifecycle: This path is explicitly concurrent with normal cache readers. Existing review threads already cover the async-open terminal state, the original new-insertion race, and the detached-thread memory context; I did not duplicate those. The new issue below is a distinct reader-lifecycle problem for old deleting blocks.
- Configuration/compatibility: No new config item or serialization compatibility concern found in the current PR files.
- Parallel paths: Background GC keeps async metadata deletion while sync clear waits for fences; that split is intentional and covered.
- Test coverage/results: Missing coverage for readers that arrive after the clear barrier and hit a block already marked deleting.
- Observability/performance: Logs/metrics are adequate for the modified paths, but sync clear can be starved by hot readers.
User focus: No additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: N/A
Problem Summary: New readers could still get existing file cache blocks after the sync clear barrier opened, including blocks that clear_file_cache_sync was trying to drain. Continuous traffic to a hot held block could therefore keep extending the clear wait. This change simplifies the barrier behavior: while sync clear is running, get_or_set() returns transient SKIP_CACHE blocks for the whole requested range and does not read or insert _files entries. Existing pre-barrier holders are still waited on, but post-barrier readers cannot pin blocks being cleared.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'
- Manual test: git diff --check
- Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh
- Behavior changed: Yes. While sync file cache clear is running, new get_or_set() calls bypass the cache and return transient SKIP_CACHE blocks.
- Does this need documentation: No
|
run buildall |
|
/review |
There was a problem hiding this comment.
PR goal: make synchronous file-cache clear use the normal async cleanup lifecycle and wait until the cache data/meta cleanup is complete. The main path is focused and has useful BE unit coverage for held blocks, cancellation, meta fences, recycle draining, and skip-cache behavior during the barrier.
Critical checkpoint conclusions:
- Goal/test: mostly covered, but one concurrent path still lets sync clear return before metadata cleanup, so the goal is not fully achieved.
- Scope: changes are generally focused on file-cache clear and HTTP cancellation.
- Concurrency: clear/read insertion is guarded by
_mutexand_clear_mutex, but recycle-key draining races with background GC as noted inline. - Lifecycle/static initialization: no new cross-TU static lifecycle issue found. Detached HTTP worker now initializes thread context.
- Config/compatibility: no new config, storage format, or FE/BE protocol compatibility issue found.
- Parallel paths: background GC is a parallel recycle-key consumer and must preserve sync clear's fence guarantee.
- Tests: added tests cover the direct sync clear drain path, but not the race where background GC dequeues during sync clear.
- Observability/performance: existing logs/metrics remain sufficient for this path; no additional blocking performance issue found beyond the race.
- Transaction/persistence/data writes: file-cache metadata persistence is affected; sync clear can report completion before RocksDB meta delete if background GC consumes the key.
User focus: no additional user-provided review focus was specified.
### What problem does this PR solve?
Issue Number: N/A
Problem Summary: Background GC shared the same file-cache recycle queue with sync clear and could consume a key while clear_file_cache_sync was running. Because background GC intentionally uses async metadata delete, sync clear could observe the recycle queue as idle and return before the meta delete fence completed. This change keeps recycle keys in the queue for the sync clear drain path while sync clear is active, so clear_file_cache_sync removes them with metadata delete fences. Normal background GC keeps the async remove behavior outside the sync clear window.
### Release note
None
### Check List (For Author)
- Test:
- Unit Test: DORIS_TOOLCHAIN=clang DISABLE_BE_JAVA_EXTENSIONS=ON ENABLE_INJECTION_POINT=ON ENABLE_CACHE_LOCK_DEBUG=0 ENABLE_PCH=0 sh run-be-ut.sh --run --filter='BlockFileCacheTest.clear_file_cache_sync*:BlockFileCacheTest.remove_if_cached_async_recycles_held_deleting_block:HttpRequestAsyncReplyTest.*'
- Manual test: git diff --check
- Manual test: PATH=/mnt/disk1/zhangzhengyu/build-dep/ldb_toolchain.back/bin:$PATH build-support/check-format.sh
- Behavior changed: Yes. While sync file cache clear is running, background GC leaves recycle keys for the sync clear drain path.
- Does this need documentation: No
|
run buildall |
|
/review |
There was a problem hiding this comment.
I reviewed the live PR files reported by GitHub for PR 63602 and did not find an additional distinct inline issue beyond the already-open review threads.
Critical checkpoint conclusions:
- Goal/test proof: the goal is to make synchronous file-cache clear wait for held blocks/recycle cleanup while supporting request cancellation. The added BE tests cover held deleting blocks, skip-cache during clear, recycle draining, metadata fences, and HTTP cancellation.
- Scope/focus: the live GitHub PR is focused on BE file-cache/http changes. Existing review threads already cover the main remaining sync-clear risks and should be resolved before merge.
- Concurrency: the latest head prevents background GC from stealing recycle keys during sync clear by making
try_dequeue_recycle_key(..., false)return false while_clear_file_cache_sync_runningis active. Existing threads still cover other concurrency/lifecycle concerns and should not be duplicated here. - Lifecycle: the detached sync-clear HTTP worker now uses
SCOPED_INIT_THREAD_CONTEXT; request cancellation waits throughHttpRequest::wait_finish_send_reply(). The existing async-open terminal-state thread remains known review context. - Configuration/compatibility: no new configuration or storage-format compatibility issue found in the live PR files.
- Parallel paths: clear paths for all cache instances go through
clear_file_cache_sync()or async clear consistently. - Tests: BE unit tests were added for the important sync-clear lifecycle cases. I did not run tests in this review runner.
- Observability/performance: no additional observability or hot-path performance blocker found beyond the existing comments.
- Transaction/persistence/data correctness: the new metadata delete fence waits for RocksDB delete completion on the sync-clear drain path; the already-open thread covers ensuring all relevant async meta deletes participate in that contract.
User focus: no additional user-provided review focus was specified.
TPC-H: Total hot run time: 29403 ms |
TPC-H: Total hot run time: 29459 ms |
TPC-DS: Total hot run time: 168902 ms |
TPC-DS: Total hot run time: 169292 ms |
BE Regression && UT Coverage ReportIncrement line coverage Increment coverage report
|
Problem Summary: The old sync file cache clear path bypassed the normal FileBlock lifecycle and was not safe with concurrent cache users, while the async clear path only marked or enqueued deletes and returned before held blocks, recycle-queue deletes, and file-cache meta deletes were finished. This change makes the factory sync clear path use a synchronous wrapper over the async clear semantics. The wrapper serializes clear operations, pauses the TTL manager during clear, marks existing blocks for deletion, drains recycled blocks, waits for held deleting blocks to be released, and waits for file-cache meta delete fences before reporting completion. The direct clear helper is kept only for BE tests. HTTP sync clear now runs through a cancellable async reply path so client disconnects can cancel waiting without adding a clear timeout config.
Release note
None
Check List (For Author)
Test
Behavior changed:
Does this need documentation?
Check List (For Reviewer who merge this PR)