ReaderUtil.partitionByLeaf API design: ordinal tracking for scatter/gather pattern #15905

## Description:

Following up on the recently merged `ReaderUtil.partitionByLeaf` API (https://github.com/apache/lucene/pull/15507), we're planning a second task: a generic utility class that efficiently retrieves doc-values for an `int[]` of global doc IDs.

The doc-values retriever would:

- Take an `IndexReader`, an `int[] globalDocIds`, and a `DoubleValuesSource[]` (or an abstract `LeafVisitor` factory)

- Take an `Executor`/`TaskExecutor` for concurrent per-leaf retrieval

- Return values in 1:1 correspondence with the input doc ID order
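To make the concurrent per-leaf step concrete, here is a minimal sketch in plain Java. It is not Lucene code: `PerLeafRetrievalSketch`, `retrievePerLeaf`, and the use of `IntToDoubleFunction` as a stand-in for a per-leaf doc-values lookup are all hypothetical, and `java.util.concurrent.Executor` stands in for whichever executor type the real utility would accept.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;
import java.util.function.IntToDoubleFunction;

/** Hypothetical sketch, not part of Lucene: one retrieval task per leaf. */
final class PerLeafRetrievalSketch {

  /**
   * Submits one task per leaf to the executor and returns the per-leaf value
   * arrays, indexed the same way as docIdsByLeaf.
   */
  static double[][] retrievePerLeaf(
      int[][] docIdsByLeaf, List<IntToDoubleFunction> leafLookups, Executor executor) {
    List<CompletableFuture<double[]>> futures = new ArrayList<>();
    for (int leaf = 0; leaf < docIdsByLeaf.length; leaf++) {
      int[] docIds = docIdsByLeaf[leaf];
      // Assumption: each leaf exposes some docId -> value lookup.
      IntToDoubleFunction lookup = leafLookups.get(leaf);
      futures.add(
          CompletableFuture.supplyAsync(
              () -> {
                double[] values = new double[docIds.length];
                for (int i = 0; i < docIds.length; i++) {
                  values[i] = lookup.applyAsDouble(docIds[i]);
                }
                return values;
              },
              executor));
    }
    // Wait for every leaf; the results stay in leaf order.
    double[][] valuesByLeaf = new double[docIdsByLeaf.length][];
    for (int leaf = 0; leaf < valuesByLeaf.length; leaf++) {
      valuesByLeaf[leaf] = futures.get(leaf).join();
    }
    return valuesByLeaf;
  }
}
```

The per-leaf results come back grouped by leaf, not in the caller's input order, which is why the last bullet above needs extra bookkeeping.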

This requires a scatter/gather pattern:

- Scatter: Partition global doc IDs by leaf (using `partitionByLeaf`)

- Map: Retrieve doc-values per leaf (concurrent)

- Gather: Reassemble results back to original input order

The current `partitionByLeaf(ScoreDoc[], leaves)` API doesn't track original positions, so we can't reassemble results in input order. We need ordinal tracking.
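As a rough illustration of what ordinal tracking buys us, the sketch below sorts input positions by doc ID and then uses them to put values back in input order. It uses plain arrays only; `OrdinalGatherSketch`, `sortedOrdinals`, and `gather` are hypothetical names, and the convention that `ordinals[i]` is the input position of the i-th doc in sorted order is an assumption, not something the merged API defines.

```java
import java.util.Arrays;
import java.util.Comparator;

/** Hypothetical helper, not part of Lucene: shows why gather needs ordinals. */
final class OrdinalGatherSketch {

  /** Returns the input positions ordered by ascending doc ID (an "argsort"). */
  static int[] sortedOrdinals(int[] globalDocIds) {
    Integer[] positions = new Integer[globalDocIds.length];
    for (int i = 0; i < positions.length; i++) {
      positions[i] = i;
    }
    // Object sorts are stable, so duplicate doc IDs keep their input order.
    Arrays.sort(positions, Comparator.comparingInt(i -> globalDocIds[i]));
    int[] ordinals = new int[positions.length];
    for (int i = 0; i < positions.length; i++) {
      ordinals[i] = positions[i];
    }
    return ordinals;
  }

  /** Gather: valuesInSortedOrder[i] belongs at input position ordinals[i]. */
  static double[] gather(double[] valuesInSortedOrder, int[] ordinals) {
    double[] valuesInInputOrder = new double[valuesInSortedOrder.length];
    for (int i = 0; i < ordinals.length; i++) {
      valuesInInputOrder[ordinals[i]] = valuesInSortedOrder[i];
    }
    return valuesInInputOrder;
  }
}
```

For input doc IDs `{42, 7, 19}`, `sortedOrdinals` yields `{1, 2, 0}`, and values retrieved in sorted order `{v(7), v(19), v(42)}` are gathered back to `{v(42), v(7), v(19)}`.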

## Design options:

1. Two separate APIs (current prototype):
```
// Existing - no ordinals (for callers who don't need gather and only take in ScoreDoc[])
public static int[][] partitionByLeaf(ScoreDoc[] hits, List<LeafReaderContext> leaves)

// New - with ordinals for scatter/gather
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeafWithOrdinals(int[] globalDocIds, List<LeafReaderContext> leaves)

// Private - shared helper so the two public methods don't duplicate the partition logic
private static int[][] partitionSortedDocIds(int[] sortedDocIds, List<LeafReaderContext> leaves)
```
Pros: Clear intent, no wasted compute, type-safe

Cons: Two entry points
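For discussion, here is one way the ordinal-tracking variant from option 1 could behave. This is a sketch under stated assumptions, not the prototype: a plain `int[] leafDocBases` (ascending, starting at 0) stands in for `List<LeafReaderContext>`, the class name is hypothetical, `docIdsByLeaf` is assumed to hold leaf-local doc IDs, and `ordinals[i]` is assumed to be the input position of the i-th doc in sorted order.

```java
import java.util.Arrays;

/** Hypothetical stand-in for option 1; doc-base offsets replace LeafReaderContext. */
final class PartitionWithOrdinalsSketch {

  record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}

  static PartitionedHits partitionByLeafWithOrdinals(int[] globalDocIds, int[] leafDocBases) {
    int n = globalDocIds.length;
    // Pack (docId, ordinal) into one long so a primitive sort orders by doc ID,
    // then by input position. Assumes doc IDs are non-negative.
    long[] packed = new long[n];
    for (int i = 0; i < n; i++) {
      packed[i] = ((long) globalDocIds[i] << 32) | i;
    }
    Arrays.sort(packed);

    int[] ordinals = new int[n];
    int[] leafOf = new int[n];
    int[] counts = new int[leafDocBases.length];
    int leaf = 0;
    for (int i = 0; i < n; i++) {
      int docId = (int) (packed[i] >>> 32);
      ordinals[i] = (int) packed[i]; // low 32 bits hold the input position
      // Doc IDs are ascending, so the leaf pointer only moves forward.
      while (leaf + 1 < leafDocBases.length && docId >= leafDocBases[leaf + 1]) {
        leaf++;
      }
      leafOf[i] = leaf;
      counts[leaf]++;
    }

    int[][] docIdsByLeaf = new int[leafDocBases.length][];
    for (int l = 0; l < counts.length; l++) {
      docIdsByLeaf[l] = new int[counts[l]];
    }
    int[] next = new int[counts.length];
    for (int i = 0; i < n; i++) {
      int l = leafOf[i];
      // Assumption: store leaf-local IDs (global ID minus the leaf's doc base).
      docIdsByLeaf[l][next[l]++] = (int) (packed[i] >>> 32) - leafDocBases[l];
    }
    return new PartitionedHits(docIdsByLeaf, ordinals);
  }
}
```

With doc bases `{0, 10, 25}` and input `{27, 3, 12, 10}`, this returns `docIdsByLeaf = {{3}, {0, 2}, {2}}` and `ordinals = {1, 3, 2, 0}`. Concatenating the per-leaf arrays in leaf order gives the sorted order, so a single flat `ordinals` array is enough to map every per-leaf result back to its input position.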

2. Single API with boolean flag:
```
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeaf(..., boolean includeOrdinals)
```
Pros: Single entry point

Cons: `ordinals` is null when not requested; we still need to handle different input types (`ScoreDoc[]` vs `int[]`)

3. Always return ordinals:
```
public static PartitionedHits partitionByLeaf(int[] globalDocIds, List<LeafReaderContext> leaves)
```
Pros: Simplest API

Cons: Wastes compute for callers who don't need ordinals

## Questions for the community:

1. Should we add an `int[]` input method? The merged PR only has `ScoreDoc[]` input 
   ([last PR comment](https://github.com/apache/lucene/pull/15803#discussion_r2960596460)), but the scatter/gather use case needs raw doc IDs.

2. Which API design is preferred for ordinal tracking? Are there better alternatives?


