fix: lookup cache retirement implemented#19548
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 0 |
| P3 | 0 |
| Total | 2 |
Reviewed 20 of 20 changed files.
This is an automated review by Codex GPT-5.5
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 0 |
| P3 | 0 |
| Total | 1 |
Reviewed 22 of 22 changed files.
This is an automated review by Codex GPT-5.5
| { | ||
| final LookupExtractorFactory lookupExtractorFactory = container.getLookupExtractorFactory(); | ||
| final Optional<RetainedLookupExtractor> retainedLookupExtractor = | ||
| lookupExtractorFactory.acquireRetainedLookupExtractor(); |
There was a problem hiding this comment.
[P1] Close retained lookup joinables deterministically
After the incremental change, buildLookupJoinable acquires a RetainedLookupExtractor once and stores it in the joinable, but LookupJoinable.acquireReference() now returns only a no-op closeable. JoinDataSource.createSegmentMapFunction has no close hook for the captured joinable list, so HashJoinSegment.close() never closes this retained extractor; it is released only by the Cleaner after GC. For cached namespace lookups, one retained retired version is enough to make CacheScheduler skip further refreshes with the default maxRetiredCacheEntries=1, so lookup-join queries can stall later lookup updates until GC or timeout. Please attach the retained handle to an explicit query/joinable lifecycle or use a real ref-counted closeable instead of the no-op reference.
There was a problem hiding this comment.
Yeah, acquireReference() is per mapped segment, but the pinned lookup lives at the joinable/query level, so closing it there is too narrow and can be wrong if the joinable is reused.
Deterministic cleanup would need a real close hook for the Joinable/JoinableClause lifecycle, not just wiring the retained handle into LookupJoinable.acquireReference(), which will mean the join api change and I'm not sure if we want it under this pull request, would it be reasonable to leave the Cleaner fallback here and handle deterministic cleanup as a follow up?
There was a problem hiding this comment.
I agree that wiring this into LookupJoinable.acquireReference() would be the wrong lifecycle, but I don't think the Cleaner-only path is reasonable for lookup joins as-is. The current code still retains the namespace cache in LookupJoinableFactory and the join path has no deterministic close for that retained extractor, while the scheduler intentionally skips further loads once a retired retained cache reaches the default max of 1. So a lookup-join query can keep later namespace updates from loading until GC or the retired-cache timeout. I'd either keep joins on the non-retained get() path until the joinable/query lifecycle can own a close hook, or include that lifecycle fix in this PR.
Reviewed 13 files total, including 12 of 22 changed files.
There was a problem hiding this comment.
Fair, leaving it to GC isn’t ideal. But I also don’t want to leave joins without lookup retirement since that’s kinda the goal of this PR, so I added deterministic closure for those joinables and tried to keep the diff minimal, wdyt?
There was a problem hiding this comment.
Thanks, this addresses the joinable lifecycle issue I was worried about. The retained lookup is now owned by the query-scoped SegmentMapFunction, and the server, realtime, and MSQ callers I checked close that function after the mapped segments/segment references, so the join path no longer appears to rely on the Cleaner for normal completion.
Reviewed 32 of 32 changed files.
This is an automated review by Codex GPT-5.5
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 0 |
| P3 | 0 |
| Total | 1 |
Reviewed 32 of 32 changed files.
This is an automated review by Codex GPT-5.5
|
|
||
| // ExtractionFn has no close hook. The RetainedLookupExtractor cleaner releases this reference when the | ||
| // LookupExtractionFn that owns it becomes unreachable. | ||
| return retainedLookupExtractor.<LookupExtractor>map(retained -> retained).orElseGet(lookupExtractorFactory); |
There was a problem hiding this comment.
[P1] Do not retain lookup dimension extraction functions without a close path
getExtractionFn() now acquires a RetainedLookupExtractor and returns it inside a plain LookupExtractionFn, but that API has no close hook. Several callers do not cache or close the returned function, for example TopNQueryQueryToolChest calls topNQuery.getDimensionSpec().getExtractionFn().apply(...) while post-processing each result value, and GroupByQueryQueryToolChest.extractionsToRewrite calls it just to inspect getExtractionType(). For named lookup dimensions, each call increments the namespace cache reference and then depends on Cleaner/GC to release it, so with the new default maxRetiredCacheEntries=1 a query can make later namespace refreshes skip until GC or the 15 minute timeout. Please avoid retaining from LookupDimensionSpec.getExtractionFn() unless the returned handle has a deterministic query/result lifecycle, or update these call sites to use a scoped retained wrapper that is explicitly closed.
Description
Fixed a race condition in off-heap cached lookups
This PR fixes a race between lookup cache refresh and concurrent query execution for cached-global lookups, using off-heap mmap-backed storage.
Previously, query paths could obtain a
LookupExtractorbacked by the current cache version while a lookup refresh concurrently replaced and deleted that cache version. In off-heap mode, this could leave in-flight query work referencing disposed mmap-backed data.This PR adds lookup cache retirement and retained lookup extractors. When a cache refresh replaces a lookup cache version, the old version is retired instead of closed immediately. Retired cache versions remain alive while retained references exist, and are disposed once those references are released or after a configured timeout.
Added retained lookup extractor lifecycle support
LookupExtractorFactorynow has an optionalacquireRetainedLookupExtractor()method. Factories that support retained backing resources can return aRetainedLookupExtractor, which wraps the actualLookupExtractorand closes the retained backing reference when query work is finished.RetainedLookupExtractorsupports explicitclose()for query paths with a lifecycle hook, and also uses aCleanerfallback for paths that hand lookup extractors to APIs without an explicit close hook.Retained lookup extractors are now used across lookup query paths, including:
Added cache retirement to cached-global lookups
CacheScheduler.VersionedCachenow tracks cache lifecycle states: live, retired, and disposed. Cache versions can be retained by active query work. When a refresh swaps in a new cache version, the previous version is retired and tracked until it can be safely disposed.To avoid unbounded retention, refreshes are skipped if the number of retained retired cache versions reaches the configured limit. Retired cache entries may also be forcibly disposed after a timeout to prevent abandoned references from blocking future refreshes indefinitely.
Two new cached-global lookup configuration properties were added:
druid.lookup.namespace.maxRetiredCacheEntries1druid.lookup.namespace.retiredCacheEntryTimeoutMillis900,000Release note
Cached-global lookups now retain retired cache versions while in-flight queries are still using them, preventing queries from referencing disposed off-heap lookup cache data during concurrent lookup refreshes.
Two new configuration properties control this behavior:
druid.lookup.namespace.maxRetiredCacheEntriesanddruid.lookup.namespace.retiredCacheEntryTimeoutMillis.Key changed/added classes in this PR
RetainedLookupExtractorLookupExtractorFactoryNamespaceLookupExtractorFactoryCacheSchedulerNamespaceExtractionConfigLookupDimensionSpecRegisteredLookupExtractionFnLookupSegmentLookupJoinableLookupJoinableFactoryThis PR has: