fix: lookup cache retirement implemented#19548
Conversation
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 2 |
| P2 | 0 |
| P3 | 0 |
| Total | 2 |
Reviewed 20 of 20 changed files.
This is an automated review by Codex GPT-5.5
FrankChen021
left a comment
There was a problem hiding this comment.
| Severity | Findings |
|---|---|
| P0 | 0 |
| P1 | 1 |
| P2 | 0 |
| P3 | 0 |
| Total | 1 |
Reviewed 22 of 22 changed files.
This is an automated review by Codex GPT-5.5
| { | ||
| final LookupExtractorFactory lookupExtractorFactory = container.getLookupExtractorFactory(); | ||
| final Optional<RetainedLookupExtractor> retainedLookupExtractor = | ||
| lookupExtractorFactory.acquireRetainedLookupExtractor(); |
There was a problem hiding this comment.
[P1] Close retained lookup joinables deterministically
After the incremental change, buildLookupJoinable acquires a RetainedLookupExtractor once and stores it in the joinable, but LookupJoinable.acquireReference() now returns only a no-op closeable. JoinDataSource.createSegmentMapFunction has no close hook for the captured joinable list, so HashJoinSegment.close() never closes this retained extractor; it is released only by the Cleaner after GC. For cached namespace lookups, one retained retired version is enough to make CacheScheduler skip further refreshes with the default maxRetiredCacheEntries=1, so lookup-join queries can stall later lookup updates until GC or timeout. Please attach the retained handle to an explicit query/joinable lifecycle or use a real ref-counted closeable instead of the no-op reference.
There was a problem hiding this comment.
Yeah, acquireReference() is per mapped segment, but the pinned lookup lives at the joinable/query level, so closing it there is too narrow and can be wrong if the joinable is reused.
Deterministic cleanup would need a real close hook for the Joinable/JoinableClause lifecycle, not just wiring the retained handle into LookupJoinable.acquireReference(), which will mean the join api change and I'm not sure if we want it under this pull request, would it be reasonable to leave the Cleaner fallback here and handle deterministic cleanup as a follow up?
There was a problem hiding this comment.
I agree that wiring this into LookupJoinable.acquireReference() would be the wrong lifecycle, but I don't think the Cleaner-only path is reasonable for lookup joins as-is. The current code still retains the namespace cache in LookupJoinableFactory and the join path has no deterministic close for that retained extractor, while the scheduler intentionally skips further loads once a retired retained cache reaches the default max of 1. So a lookup-join query can keep later namespace updates from loading until GC or the retired-cache timeout. I'd either keep joins on the non-retained get() path until the joinable/query lifecycle can own a close hook, or include that lifecycle fix in this PR.
Reviewed 13 files total, including 12 of 22 changed files.
Description
Fixed a race condition in off-heap cached lookups
This PR fixes a race between lookup cache refresh and concurrent query execution for cached-global lookups, using off-heap mmap-backed storage.
Previously, query paths could obtain a
LookupExtractorbacked by the current cache version while a lookup refresh concurrently replaced and deleted that cache version. In off-heap mode, this could leave in-flight query work referencing disposed mmap-backed data.This PR adds lookup cache retirement and retained lookup extractors. When a cache refresh replaces a lookup cache version, the old version is retired instead of closed immediately. Retired cache versions remain alive while retained references exist, and are disposed once those references are released or after a configured timeout.
Added retained lookup extractor lifecycle support
LookupExtractorFactorynow has an optionalacquireRetainedLookupExtractor()method. Factories that support retained backing resources can return aRetainedLookupExtractor, which wraps the actualLookupExtractorand closes the retained backing reference when query work is finished.RetainedLookupExtractorsupports explicitclose()for query paths with a lifecycle hook, and also uses aCleanerfallback for paths that hand lookup extractors to APIs without an explicit close hook.Retained lookup extractors are now used across lookup query paths, including:
Added cache retirement to cached-global lookups
CacheScheduler.VersionedCachenow tracks cache lifecycle states: live, retired, and disposed. Cache versions can be retained by active query work. When a refresh swaps in a new cache version, the previous version is retired and tracked until it can be safely disposed.To avoid unbounded retention, refreshes are skipped if the number of retained retired cache versions reaches the configured limit. Retired cache entries may also be forcibly disposed after a timeout to prevent abandoned references from blocking future refreshes indefinitely.
Two new cached-global lookup configuration properties were added:
druid.lookup.namespace.maxRetiredCacheEntries1druid.lookup.namespace.retiredCacheEntryTimeoutMillis900,000Release note
Cached-global lookups now retain retired cache versions while in-flight queries are still using them, preventing queries from referencing disposed off-heap lookup cache data during concurrent lookup refreshes.
Two new configuration properties control this behavior:
druid.lookup.namespace.maxRetiredCacheEntriesanddruid.lookup.namespace.retiredCacheEntryTimeoutMillis.Key changed/added classes in this PR
RetainedLookupExtractorLookupExtractorFactoryNamespaceLookupExtractorFactoryCacheSchedulerNamespaceExtractionConfigLookupDimensionSpecRegisteredLookupExtractionFnLookupSegmentLookupJoinableLookupJoinableFactoryThis PR has: