Description:
Following up on the recently merged ReaderUtil.partitionByLeaf API (#15507), we're planning a second task to build a generic utility class for efficiently retrieving doc-values for an int[] of global doc IDs.
The doc-values retriever would:
- Take an IndexReader, int[] globalDocIds, and DoubleValuesSource[] (or abstract LeafVisitor factory)
- Take an Executor/TaskExecutor for concurrent per-leaf retrieval
- Return values in 1:1 correspondence with the input doc ID order
This requires a scatter/gather pattern:
- Scatter: Partition global doc IDs by leaf (using partitionByLeaf)
- Map: Retrieve doc-values per leaf (concurrent)
- Gather: Reassemble results back to original input order
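As a rough illustration of the Map step only, here is a sketch that fans out one task per non-empty leaf and collects the per-leaf results in leaf order. It uses the JDK's ExecutorService as a stand-in for Lucene's TaskExecutor, and LeafLookup and retrievePerLeaf are hypothetical names invented for this example; a real implementation would read values through DoubleValuesSource per LeafReaderContext.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Future;

public class PerLeafMapSketch {

  /** Hypothetical stand-in for a per-leaf doc-values lookup. */
  @FunctionalInterface
  public interface LeafLookup {
    double[] lookup(int leafIndex, int[] leafDocIds) throws Exception;
  }

  /** Runs one lookup task per non-empty leaf and returns the values grouped by leaf. */
  public static double[][] retrievePerLeaf(
      int[][] docIdsByLeaf, LeafLookup lookup, ExecutorService executor)
      throws InterruptedException, ExecutionException {
    List<Future<double[]>> futures = new ArrayList<>(docIdsByLeaf.length);
    for (int leaf = 0; leaf < docIdsByLeaf.length; leaf++) {
      final int leafIndex = leaf;
      final int[] docs = docIdsByLeaf[leaf];
      if (docs.length == 0) {
        futures.add(null); // nothing to fetch from this leaf, skip the task
        continue;
      }
      Callable<double[]> task = () -> lookup.lookup(leafIndex, docs);
      futures.add(executor.submit(task));
    }
    // Collect in leaf order so the result lines up with docIdsByLeaf.
    double[][] valuesByLeaf = new double[docIdsByLeaf.length][];
    for (int leaf = 0; leaf < valuesByLeaf.length; leaf++) {
      Future<double[]> future = futures.get(leaf);
      valuesByLeaf[leaf] = (future == null) ? new double[0] : future.get();
    }
    return valuesByLeaf;
  }
}
```

The output here is still grouped by leaf, not by input position; restoring input order is the job of the Gather step.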
The current partitionByLeaf(ScoreDoc[], leaves) API doesn't track original positions, so we can't reassemble results in input order. We need ordinal tracking.
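To make the ordinal idea concrete, here is a minimal sketch of one way it could work. This is not the prototype's code: leaves are represented by a plain int[] of docBase values instead of List<LeafReaderContext>, partitionWithOrdinals and gather are hypothetical names, and the sketch assumes that the per-leaf arrays hold leaf-relative doc IDs and that ordinals is a flat array aligned with the leaf-by-leaf concatenation of docIdsByLeaf.

```java
import java.util.Arrays;

public class OrdinalPartitionSketch {

  /** Mirrors the shape of the PartitionedHits record proposed in this issue. */
  public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}

  /**
   * @param globalDocIds non-negative doc IDs in caller order (may be unsorted)
   * @param docBases docBase of each leaf in ascending order, with docBases[0] == 0
   */
  public static PartitionedHits partitionWithOrdinals(int[] globalDocIds, int[] docBases) {
    int n = globalDocIds.length;
    // Pack (docId, inputPosition) into one long so a primitive sort orders by doc ID
    // while carrying the original position along.
    long[] keys = new long[n];
    for (int i = 0; i < n; i++) {
      keys[i] = ((long) globalDocIds[i] << 32) | i;
    }
    Arrays.sort(keys);

    int[] ordinals = new int[n];
    int[] leafOf = new int[n];
    int[] counts = new int[docBases.length];
    int leaf = 0;
    for (int k = 0; k < n; k++) {
      int docId = (int) (keys[k] >>> 32);
      ordinals[k] = (int) keys[k]; // low 32 bits: position in the caller's input
      while (leaf + 1 < docBases.length && docId >= docBases[leaf + 1]) {
        leaf++;
      }
      leafOf[k] = leaf;
      counts[leaf]++;
    }

    int[][] docIdsByLeaf = new int[docBases.length][];
    for (int l = 0; l < docBases.length; l++) {
      docIdsByLeaf[l] = new int[counts[l]];
    }
    int[] fill = new int[docBases.length];
    for (int k = 0; k < n; k++) {
      int l = leafOf[k];
      docIdsByLeaf[l][fill[l]++] = (int) (keys[k] >>> 32) - docBases[l]; // leaf-relative
    }
    return new PartitionedHits(docIdsByLeaf, ordinals);
  }

  /** Gather: scatter per-leaf values back into the caller's input order. */
  public static double[] gather(double[][] valuesByLeaf, int[] ordinals) {
    double[] out = new double[ordinals.length];
    int k = 0;
    for (double[] leafValues : valuesByLeaf) {
      for (double v : leafValues) {
        out[ordinals[k++]] = v;
      }
    }
    return out;
  }
}
```

For example, with input {12, 3, 7, 25, 0} and docBases {0, 5, 10, 20}, the sketch yields ordinals {4, 1, 2, 0, 3}. The sketch does not validate doc IDs against maxDoc; anything at or beyond the last docBase is assigned to the last leaf.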
Design options:
- Two separate APIs (current prototype):
// Existing - no ordinals (for callers who don't need gather and only take in ScoreDoc[])
public static int[][] partitionByLeaf(ScoreDoc[] hits, List<LeafReaderContext> leaves)
// New - with ordinals for scatter/gather
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeafWithOrdinals(int[] globalDocIds, List<LeafReaderContext> leaves)
// Private - helper method for duplicate partition logic
private static int[][] partitionSortedDocIds(int[] sortedDocIds, List<LeafReaderContext> leaves)
Pros: Clear intent, no wasted compute, type-safe
Cons: Two entry points
- Single API with boolean flag:
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeaf(..., boolean includeOrdinals)
Pros: Single entry point
Cons: ordinals is null when not requested; the different input types (ScoreDoc[] vs int[]) still need to be handled
- Always return ordinals:
public static PartitionedHits partitionByLeaf(int[] globalDocIds, List<LeafReaderContext> leaves)
Pros: Simplest API
Cons: Wastes compute for callers who don't need ordinals
Questions for the community:
- Should we add an int[] input method? The merged PR only has ScoreDoc[] input (last PR comment), but the scatter/gather use case needs raw doc IDs.
- Which API design is preferred for ordinal tracking? Or any better ideas?