ReaderUtil.partitionByLeaf API design: ordinal tracking for scatter/gather pattern #15905

@zihanx

Description:

Following up on the recently merged ReaderUtil.partitionByLeaf API (#15507), we're planning a second task to build a generic utility class for efficiently retrieving doc-values for an int[] of global doc IDs.

The doc-values retriever would:

  • Take an IndexReader, int[] globalDocIds, and DoubleValuesSource[] (or abstract LeafVisitor factory)

  • Take an Executor/TaskExecutor for concurrent per-leaf retrieval

  • Return values in 1:1 correspondence with the input doc ID order

This requires a scatter/gather pattern:

  • Scatter: Partition global doc IDs by leaf (using partitionByLeaf)

  • Map: Retrieve doc-values per leaf (concurrent)

  • Gather: Reassemble results back to original input order

The current partitionByLeaf(ScoreDoc[], leaves) API doesn't track original positions, so we can't reassemble results in input order. We need ordinal tracking.
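To make the gather step concrete, here is a rough, self-contained sketch of the whole flow. It is not the proposed implementation: `ScatterGatherSketch`, `LeafLookup`, and `retrieve` are hypothetical names, a plain `int[] docBases` stands in for `List<LeafReaderContext>`, and `ordinals` is assumed to be a flat array holding the original input position of each doc in leaf-concatenated, doc-ID-sorted order.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executor;

final class ScatterGatherSketch {
  /** Assumed shape: leaf-local doc IDs per leaf, plus a flat array of input positions. */
  record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}

  /** Hypothetical stand-in for a per-leaf doc-values reader. */
  interface LeafLookup {
    double[] values(int leaf, int[] localDocIds);
  }

  /** Scatter: sort (docId, inputPosition) pairs, then split them at leaf boundaries. */
  static PartitionedHits partition(int[] globalDocIds, int[] docBases) {
    int n = globalDocIds.length;
    long[] keyed = new long[n];
    for (int i = 0; i < n; i++) {
      keyed[i] = ((long) globalDocIds[i] << 32) | i; // doc ID in high bits, position in low bits
    }
    Arrays.sort(keyed);
    int[][] byLeaf = new int[docBases.length][];
    int[] ordinals = new int[n];
    int pos = 0;
    for (int leaf = 0; leaf < docBases.length; leaf++) {
      int end = leaf + 1 < docBases.length ? docBases[leaf + 1] : Integer.MAX_VALUE;
      int start = pos;
      while (pos < n && (int) (keyed[pos] >>> 32) < end) {
        ordinals[pos] = (int) keyed[pos]; // low 32 bits: original input position
        pos++;
      }
      byLeaf[leaf] = new int[pos - start];
      for (int j = 0; j < byLeaf[leaf].length; j++) {
        byLeaf[leaf][j] = (int) (keyed[start + j] >>> 32) - docBases[leaf];
      }
    }
    return new PartitionedHits(byLeaf, ordinals);
  }

  /** Map per leaf on the executor, then gather results back into input order. */
  static double[] retrieve(int[] globalDocIds, int[] docBases, LeafLookup lookup, Executor executor) {
    PartitionedHits p = partition(globalDocIds, docBases);
    List<CompletableFuture<double[]>> futures = new ArrayList<>();
    for (int leaf = 0; leaf < docBases.length; leaf++) {
      final int l = leaf;
      futures.add(CompletableFuture.supplyAsync(() -> lookup.values(l, p.docIdsByLeaf()[l]), executor));
    }
    double[] out = new double[globalDocIds.length];
    int k = 0; // running index into the flat ordinals array
    for (CompletableFuture<double[]> future : futures) {
      for (double v : future.join()) {
        out[p.ordinals()[k++]] = v;
      }
    }
    return out;
  }
}
```

Without the `ordinals` array the final loop has no way to know which output slot each per-leaf value belongs to, which is the gap in the current API.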

Design options:

  1. Two separate APIs (current prototype):
// Existing - no ordinals (for callers that pass ScoreDoc[] and don't need the gather step)
public static int[][] partitionByLeaf(ScoreDoc[] hits, List<LeafReaderContext> leaves)

// New - with ordinals for scatter/gather
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeafWithOrdinals(int[] globalDocIds, List<LeafReaderContext> leaves)

// Private - helper holding the partition logic shared by both public methods
private static int[][] partitionSortedDocIds(int[] sortedDocIds, List<LeafReaderContext> leaves)

Pros: Clear intent, no wasted compute, type-safe

Cons: Two entry points

  2. Single API with boolean flag:
public record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}
public static PartitionedHits partitionByLeaf(..., boolean includeOrdinals)

Pros: Single entry point

Cons: ordinals is null when not requested, still need to handle different input types (ScoreDoc[] vs int[])

  3. Always return ordinals:
public static PartitionedHits partitionByLeaf(int[] globalDocIds, List<LeafReaderContext> leaves)

Pros: Simplest API

Cons: Wastes compute for callers who don't need ordinals
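For option 1, a minimal sketch of how the two public entry points might share the private helper. This is illustrative only: `Hit` is a hypothetical stand-in for `ScoreDoc`, `int[] docBases` stands in for `List<LeafReaderContext>`, and both the leaf-local return values and the flat layout of `ordinals` are assumptions rather than something the merged PR confirms.

```java
import java.util.Arrays;

final class PartitionApiSketch {
  record PartitionedHits(int[][] docIdsByLeaf, int[] ordinals) {}

  /** Hypothetical stand-in for ScoreDoc; only the doc ID matters here. */
  record Hit(int doc, float score) {}

  /** No ordinals: sort the doc IDs and delegate. */
  static int[][] partitionByLeaf(Hit[] hits, int[] docBases) {
    int[] sorted = Arrays.stream(hits).mapToInt(Hit::doc).sorted().toArray();
    return partitionSortedDocIds(sorted, docBases);
  }

  /** With ordinals: sort (docId, inputPosition) pairs together, then delegate. */
  static PartitionedHits partitionByLeafWithOrdinals(int[] globalDocIds, int[] docBases) {
    int n = globalDocIds.length;
    long[] keyed = new long[n];
    for (int i = 0; i < n; i++) {
      keyed[i] = ((long) globalDocIds[i] << 32) | i;
    }
    Arrays.sort(keyed);
    int[] sorted = new int[n];
    int[] ordinals = new int[n];
    for (int i = 0; i < n; i++) {
      sorted[i] = (int) (keyed[i] >>> 32); // doc ID
      ordinals[i] = (int) keyed[i];        // original input position
    }
    return new PartitionedHits(partitionSortedDocIds(sorted, docBases), ordinals);
  }

  /** Shared logic: split an ascending doc ID array at leaf boundaries, rebasing to leaf-local IDs. */
  private static int[][] partitionSortedDocIds(int[] sortedDocIds, int[] docBases) {
    int[][] byLeaf = new int[docBases.length][];
    int pos = 0;
    for (int leaf = 0; leaf < docBases.length; leaf++) {
      int end = leaf + 1 < docBases.length ? docBases[leaf + 1] : Integer.MAX_VALUE;
      int start = pos;
      while (pos < sortedDocIds.length && sortedDocIds[pos] < end) {
        pos++;
      }
      byLeaf[leaf] = new int[pos - start];
      for (int j = 0; j < byLeaf[leaf].length; j++) {
        byLeaf[leaf][j] = sortedDocIds[start + j] - docBases[leaf];
      }
    }
    return byLeaf;
  }
}
```

In this layout the only extra cost of the ordinal variant is carrying the input position through the sort, so callers of the plain `ScoreDoc[]` method pay nothing for it.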

Questions for the community:

  1. Should we add an int[] input method? The merged PR only accepts ScoreDoc[] input (see the last PR comment), but the scatter/gather use case needs raw doc IDs.

  2. Which API design is preferred for ordinal tracking? Or any better ideas?
