Skip to content

fix: lookup cache retirement implemented#19548

Open
nozjkoitop wants to merge 2 commits into
apache:masterfrom
deep-bi:feature-fix-lookup-synchronization
Open

fix: lookup cache retirement implemented#19548
nozjkoitop wants to merge 2 commits into
apache:masterfrom
deep-bi:feature-fix-lookup-synchronization

Conversation

@nozjkoitop

Copy link
Copy Markdown
Contributor

Description

Fixed a race condition in off-heap cached lookups

This PR fixes a race between lookup cache refresh and concurrent query execution for cached-global lookups, using off-heap mmap-backed storage.

Previously, query paths could obtain a LookupExtractor backed by the current cache version while a lookup refresh concurrently replaced and deleted that cache version. In off-heap mode, this could leave in-flight query work referencing disposed mmap-backed data.

This PR adds lookup cache retirement and retained lookup extractors. When a cache refresh replaces a lookup cache version, the old version is retired instead of closed immediately. Retired cache versions remain alive while retained references exist, and are disposed once those references are released or after a configured timeout.

Added retained lookup extractor lifecycle support

LookupExtractorFactory now has an optional acquireRetainedLookupExtractor() method. Factories that support retained backing resources can return a RetainedLookupExtractor, which wraps the actual LookupExtractor and closes the retained backing reference when query work is finished.

RetainedLookupExtractor supports explicit close() for query paths with a lifecycle hook, and also uses a Cleaner fallback for paths that hand lookup extractors to APIs without an explicit close hook.

Retained lookup extractors are now used across lookup query paths, including:

  • registered lookup extraction functions
  • lookup dimension specs
  • lookup joins
  • lookup segment iteration

Added cache retirement to cached-global lookups

CacheScheduler.VersionedCache now tracks cache lifecycle states: live, retired, and disposed. Cache versions can be retained by active query work. When a refresh swaps in a new cache version, the previous version is retired and tracked until it can be safely disposed.

To avoid unbounded retention, refreshes are skipped if the number of retained retired cache versions reaches the configured limit. Retired cache entries may also be forcibly disposed after a timeout to prevent abandoned references from blocking future refreshes indefinitely.

Two new cached-global lookup configuration properties were added:

Property Description Default
druid.lookup.namespace.maxRetiredCacheEntries The maximum number of retired lookup cache versions to keep while they are still retained by in-flight query work. Further cache refreshes are skipped until the retired cache versions are released or time out. 1
druid.lookup.namespace.retiredCacheEntryTimeoutMillis The amount of time after retirement before a retained cache version may be disposed even if its retained references have not been closed yet. This prevents abandoned retained references from blocking future cache refreshes indefinitely. 900,000

Release note

Cached-global lookups now retain retired cache versions while in-flight queries are still using them, preventing queries from referencing disposed off-heap lookup cache data during concurrent lookup refreshes.

Two new configuration properties control this behavior: druid.lookup.namespace.maxRetiredCacheEntries and druid.lookup.namespace.retiredCacheEntryTimeoutMillis.


Key changed/added classes in this PR
  • RetainedLookupExtractor
  • LookupExtractorFactory
  • NamespaceLookupExtractorFactory
  • CacheScheduler
  • NamespaceExtractionConfig
  • LookupDimensionSpec
  • RegisteredLookupExtractionFn
  • LookupSegment
  • LookupJoinable
  • LookupJoinableFactory

This PR has:

  • been self-reviewed.
  • added documentation for new or modified features or behaviors.
  • a release note entry in the PR description.
  • added Javadocs for most classes and all non-trivial methods. Linked related entities via Javadoc links.
  • added or updated version, license, or notice information in licenses.yaml
  • added comments explaining the "why" and the intent of the code wherever would not be obvious for an unfamiliar reader.
  • added unit tests or modified existing tests to cover new code paths, ensuring the threshold for code coverage is met.
  • added integration tests.
  • been tested in a test Druid cluster.

@nozjkoitop nozjkoitop changed the title lookup cache retirement implemented fix: lookup cache retirement implemented Jun 3, 2026

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 2
P2 0
P3 0
Total 2

Reviewed 20 of 20 changed files.


This is an automated review by Codex GPT-5.5

Comment thread server/src/main/java/org/apache/druid/segment/join/LookupJoinableFactory.java Outdated
@Nullable
default ExtractionFn getExtractionFnForMetadata()
{
return getExtractionFn();

@FrankChen021 FrankChen021 left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Severity Findings
P0 0
P1 1
P2 0
P3 0
Total 1

Reviewed 22 of 22 changed files.


This is an automated review by Codex GPT-5.5

{
final LookupExtractorFactory lookupExtractorFactory = container.getLookupExtractorFactory();
final Optional<RetainedLookupExtractor> retainedLookupExtractor =
lookupExtractorFactory.acquireRetainedLookupExtractor();

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

[P1] Close retained lookup joinables deterministically

After the incremental change, buildLookupJoinable acquires a RetainedLookupExtractor once and stores it in the joinable, but LookupJoinable.acquireReference() now returns only a no-op closeable. JoinDataSource.createSegmentMapFunction has no close hook for the captured joinable list, so HashJoinSegment.close() never closes this retained extractor; it is released only by the Cleaner after GC. For cached namespace lookups, one retained retired version is enough to make CacheScheduler skip further refreshes with the default maxRetiredCacheEntries=1, so lookup-join queries can stall later lookup updates until GC or timeout. Please attach the retained handle to an explicit query/joinable lifecycle or use a real ref-counted closeable instead of the no-op reference.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah, acquireReference() is per mapped segment, but the pinned lookup lives at the joinable/query level, so closing it there is too narrow and can be wrong if the joinable is reused.

Deterministic cleanup would need a real close hook for the Joinable/JoinableClause lifecycle, not just wiring the retained handle into LookupJoinable.acquireReference(), which will mean the join api change and I'm not sure if we want it under this pull request, would it be reasonable to leave the Cleaner fallback here and handle deterministic cleanup as a follow up?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that wiring this into LookupJoinable.acquireReference() would be the wrong lifecycle, but I don't think the Cleaner-only path is reasonable for lookup joins as-is. The current code still retains the namespace cache in LookupJoinableFactory and the join path has no deterministic close for that retained extractor, while the scheduler intentionally skips further loads once a retired retained cache reaches the default max of 1. So a lookup-join query can keep later namespace updates from loading until GC or the retired-cache timeout. I'd either keep joins on the non-retained get() path until the joinable/query lifecycle can own a close hook, or include that lifecycle fix in this PR.

Reviewed 13 files total, including 12 of 22 changed files.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants