Skip to content

GH-3994: Improved TDB2 index access for single pattern queries#3995

Open
Aklakan wants to merge 1 commit into
apache:mainfrom
Aklakan:20260613-index-for-basic-distinct
Open

GH-3994: Improved TDB2 index access for single pattern queries#3995
Aklakan wants to merge 1 commit into
apache:mainfrom
Aklakan:20260613-index-for-basic-distinct

Conversation

@Aklakan

@Aklakan Aklakan commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

GitHub issue resolved #3994

Pull request Description: Adds skip scan evaluation for single-pattern queries to TDB2. Added special code paths to OpExecutorTDB2 for OpDistinct and OpGroupBy.

Execution time to find all distinct predicates on Wikidata Truthy (8B triples) becomes between 1 and 100 seconds (warm-cold caches)
Also works for simple group by patterns such as
SELECT ?p (COUNT(DISTINCT ?g) AS ?c) { GRAPH ?g { ?s ?p ?o } } GROUP BY ?p

Need yet to add tests - such as create permutations of patterns and compare results with a reference engine (eg. ARQ).


  • Tests are included.
  • [ ] Documentation change and updates are provided for the Apache Jena website
  • Commits have been squashed to remove intermediate development commit messages.
  • Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.


See the Apache Jena "Contributing" guide.

@Aklakan Aklakan marked this pull request as draft June 16, 2026 10:27
@Aklakan Aklakan force-pushed the 20260613-index-for-basic-distinct branch 8 times, most recently from 1f3a682 to adf134b Compare June 28, 2026 17:44
@Aklakan Aklakan marked this pull request as ready for review June 28, 2026 17:45
@Aklakan

Aklakan commented Jun 28, 2026

Copy link
Copy Markdown
Contributor Author

The code is ready for review.

  • The most critical changes to existing code are those in BPTreeDistinctKeyPrefixIterator which are meant to restrict skip scans / index scans to custom sub-ranges.

    When no custom range is specified, then the range used to be bptRootNode.{minRecord, maxRecord}, which indicates that maxRecord is inclusive for page-level iteration. However, when doing record-level interation using records.getRecordBuffer().iterator(minRecord, maxRecord) then maxRecord is exclusive. So this is something @rvesse and/or @afs might want to look at whether this is correct and how it should be.
  • The class TestOpExecutorTDB2SkipScan generates SELECT DISTINCT ... and SELECT ... (COUNT(DISTINCT ?v) AS ?c) { ... } GROUP BY ... queries with different combinations and permutations.
  • There is the SystemTDB.symSkipScan symbol to disable the optimization.

public static <T> Iterator<T> dropWhile(Iterator<T> iter, Predicate<T> predicate) {
PeekIterator<T> iter2 = new PeekIterator<>(iter);
for(;;) {
T elt = iter2.peek();

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Endless loop because iterator is never advanced.
This method was used during development before settling on
iter = records.getRecordBuffer().iterator(minRecord, maxRecordPlusOne); in BPTreeDistinctKeyPrefixIterator.

@rvesse rvesse left a comment

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hey @Aklakan this looks like a really great potential improvement. I had always thought what I'd done previously for prefix scanning had potential to be applied to other kinds of queries but never any chance to explore that further so thanks for taking this on

I have various comments across the PR, these covers three main things (plus the odd typo):

  1. Whether this can/should handle other aggregates like MAX(DISTNCT ?var) since it would seem applying the distinct in the scan first wouldn't change their outputs either.
  2. Improvements to testing, particularly around regression and scale
  3. Leftover TODOs which may/may not remain relevant


private static Var distinctVarOrNull(Aggregator agg) {
Var v = null;
if (agg instanceof AggCountVarDistinct acvd) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There are other aggregates, e.g. SUM(), AVG(), MAX(), MIN(), that have a single value expression so should probably cover those cases as well.

AFAICT having the distinct applied first in the scan shouldn't change the results of those DISTINCT aggregates

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For MIN and MAX this can be added as an incremental improvement. An even better solution would be if these aggregators would not scan all distinct values but just pick the lowest/highest value from the index by trying the NodeIdTypes (decimal, double, int, ptr) in the right order.

}

private static Aggregator convertToNonDistinct(Aggregator agg) {
if (agg instanceof AggCountVarDistinct acd) {

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other distinct aggregators should be handled as well

if (filterExpr != null) {
qIter = new QueryIterFilterExpr(qIter, filterExpr, execCxt);

// XXX QueryIterDistinguishedVars?

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Future TODO/optimisation point?

Comment thread jena-tdb2/src/main/java/org/apache/jena/tdb2/solver/OpExecutorTDB2.java Outdated
)
""");
return referenceDsg;
}

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a pretty trivial dataset for testing an optimisation that has the potential to break query execution semantics, IMO 6 quads is far too trivial to properly exercise and validate a feature like this.

There's several improvements I'd like to see with tests around this optimisation:

  1. The tests should actively compare executing the queries with the new optimisation on vs with the new optimisation off to verify that there aren't any regression cases. This is in line with how we test most other query execution optimisations across Jena
  2. There should be tests that generate a reasonable size test dataset with some known levels of distinctness that are used to regression test as well. This doesn't have to be massive, for example could be 10k quads with 1k unique subjects, 20 unique predicates and 100 unique objects over the dataset (basically pick some constant values that yield differing levels of distinctness for different queries)
  3. As noted in earlier comments it seems like this optimisation should also be valid for GROUP BY queries that use other aggregators, e.g. MIN(DISTINCT ?var), and I'd like to see that validated as well

@Aklakan Aklakan force-pushed the 20260613-index-for-basic-distinct branch from bd29358 to e3ba4d2 Compare June 29, 2026 12:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TDB2 Skip scan for single pattern queries.

2 participants