GH-3994: Improved TDB2 index access for single pattern queries by Aklakan · Pull Request #3995 · apache/jena

Aklakan · 2026-06-16T10:20:35Z

GitHub issue resolved #3994

Pull request Description: Adds skip scan evaluation for single-pattern queries to TDB2. Added special code paths to OpExecutorTDB2 for OpDistinct and OpGroupBy.

Execution time to find all distinct predicates on Wikidata Truthy (8B triples) is between 1 second (warm caches) and 100 seconds (cold caches).
Also works for simple group by patterns such as
SELECT ?p (COUNT(DISTINCT ?g) AS ?c) { GRAPH ?g { ?s ?p ?o } } GROUP BY ?p
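To illustrate the general idea behind a skip scan for a query like SELECT DISTINCT ?p, here is a minimal sketch. It is not the TDB2 implementation: a TreeSet ordered by (p, s, o) stands in for a B+Tree index, and the names SkipScanSketch, Key and distinctPredicates are hypothetical. Instead of visiting every entry, the loop reads one entry per predicate and then seeks directly to the first entry with a larger predicate.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class SkipScanSketch {
    /** One index entry in P-S-O order; the longs stand in for node ids (assumption). */
    record Key(long p, long s, long o) {}

    /** Sort order of the hypothetical POS-like index: predicate first. */
    static final Comparator<Key> PSO = Comparator.<Key>comparingLong(Key::p)
            .thenComparingLong(Key::s)
            .thenComparingLong(Key::o);

    /** Returns each distinct predicate once, using one seek per predicate. */
    static List<Long> distinctPredicates(NavigableSet<Key> index) {
        List<Long> result = new ArrayList<>();
        Key current = index.isEmpty() ? null : index.first();
        while (current != null) {
            result.add(current.p());
            if (current.p() == Long.MAX_VALUE) {
                break; // no larger predicate can exist
            }
            // Smallest possible key with the next larger predicate value.
            Key probe = new Key(current.p() + 1, Long.MIN_VALUE, Long.MIN_VALUE);
            // Seek past all remaining entries that share the current predicate.
            current = index.ceiling(probe);
        }
        return result;
    }

    public static void main(String[] args) {
        NavigableSet<Key> index = new TreeSet<>(PSO);
        index.add(new Key(1, 10, 100));
        index.add(new Key(1, 11, 100));
        index.add(new Key(5, 10, 100));
        System.out.println(distinctPredicates(index));
    }
}
```

The number of seeks grows with the number of distinct predicates rather than with the number of triples, which is why the approach pays off on datasets with few predicates and many triples.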

Tests still need to be added, e.g. generating permutations of patterns and comparing the results with a reference engine (e.g. ARQ).

Tests are included.
~~[ ] Documentation change and updates are provided for the Apache Jena website~~
Commits have been squashed to remove intermediate development commit messages.
Key commit messages start with the issue number (GH-xxxx)

By submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.

See the Apache Jena "Contributing" guide.

Aklakan · 2026-06-28T17:45:04Z

The code is ready for review.

The most critical changes to existing code are those in BPTreeDistinctKeyPrefixIterator which are meant to restrict skip scans / index scans to custom sub-ranges.

When no custom range is specified, the range used to be bptRootNode.{minRecord, maxRecord}, which indicates that maxRecord is inclusive for page-level iteration. However, record-level iteration using records.getRecordBuffer().iterator(minRecord, maxRecord) treats maxRecord as exclusive. @rvesse and/or @afs might want to check whether this is correct and how it should behave.
The class TestOpExecutorTDB2SkipScan generates SELECT DISTINCT ... and SELECT ... (COUNT(DISTINCT ?v) AS ?c) { ... } GROUP BY ... queries with different combinations and permutations.
There is the SystemTDB.symSkipScan symbol to disable the optimization.
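The inclusive/exclusive mismatch described above can be sketched with plain Java collections. This is only an analogy under stated assumptions: a NavigableSet of integers stands in for the records, and RangeBoundsSketch with its two methods is hypothetical, not part of TDB2. It shows how an exclusive upper bound silently drops the last key unless the bound is bumped by one, which is the role maxRecordPlusOne appears to play in the PR.

```java
import java.util.List;
import java.util.NavigableSet;
import java.util.TreeSet;

final class RangeBoundsSketch {
    /** Keys in [min, max): the upper bound is exclusive, like the record-level iterator. */
    static List<Integer> scanExclusive(NavigableSet<Integer> keys, int min, int max) {
        return List.copyOf(keys.subSet(min, true, max, false));
    }

    /** Keys in [min, max], expressed through the exclusive scan by using max + 1. */
    static List<Integer> scanInclusive(NavigableSet<Integer> keys, int min, int max) {
        if (max == Integer.MAX_VALUE) {
            // max + 1 would overflow, so fall back to an unbounded tail scan.
            return List.copyOf(keys.tailSet(min, true));
        }
        return scanExclusive(keys, min, max + 1);
    }

    public static void main(String[] args) {
        NavigableSet<Integer> keys = new TreeSet<>(List.of(1, 3, 5, 7));
        System.out.println(scanExclusive(keys, 1, 7)); // the key 7 is missing
        System.out.println(scanInclusive(keys, 1, 7)); // the key 7 is included
    }
}
```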

Aklakan · 2026-06-28T19:15:32Z

    public static <T> Iterator<T> dropWhile(Iterator<T> iter, Predicate<T> predicate) {
-        PeekIterator<T> iter2 = new PeekIterator<>(iter);
-        for(;;) {
-            T elt = iter2.peek();


Endless loop because iterator is never advanced.
This method was used during development before settling on
iter = records.getRecordBuffer().iterator(minRecord, maxRecordPlusOne); in BPTreeDistinctKeyPrefixIterator.
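For reference, a dropWhile that does advance the underlying iterator could look like the sketch below. It is written against java.util.Iterator only and is not the Jena code: the class name DropWhileSketch and the buffering approach are assumptions for illustration. The key point is that each tested element is consumed with next(), so the loop always makes progress.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.function.Predicate;

final class DropWhileSketch {
    /** Skips the leading elements matching the predicate, then yields the rest unchanged. */
    static <T> Iterator<T> dropWhile(Iterator<T> iter, Predicate<? super T> predicate) {
        return new Iterator<T>() {
            private boolean dropping = true;     // still inside the prefix to drop
            private boolean hasBuffered = false; // first non-matching element is held here
            private T buffered;

            private void skipPrefix() {
                while (dropping && iter.hasNext()) {
                    T elt = iter.next(); // advance on every test, so the loop terminates
                    if (!predicate.test(elt)) {
                        buffered = elt;
                        hasBuffered = true;
                        dropping = false;
                    }
                }
                dropping = false; // also covers an exhausted input
            }

            @Override
            public boolean hasNext() {
                if (dropping) skipPrefix();
                return hasBuffered || iter.hasNext();
            }

            @Override
            public T next() {
                if (dropping) skipPrefix();
                if (hasBuffered) {
                    T result = buffered;
                    buffered = null;
                    hasBuffered = false;
                    return result;
                }
                return iter.next(); // throws NoSuchElementException when exhausted
            }
        };
    }

    public static void main(String[] args) {
        List<Integer> out = new ArrayList<>();
        dropWhile(List.of(1, 2, 3, 1, 4).iterator(), x -> x < 3).forEachRemaining(out::add);
        System.out.println(out);
    }
}
```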

rvesse

Hey @Aklakan, this looks like a really great potential improvement. I had always thought that what I'd done previously for prefix scanning had the potential to be applied to other kinds of queries, but I never had any chance to explore that further, so thanks for taking this on.

I have various comments across the PR; these cover three main things (plus the odd typo):

Whether this can/should handle other aggregates like MAX(DISTINCT ?var), since it would seem that applying the distinct in the scan first wouldn't change their outputs either.
Improvements to testing, particularly around regression and scale
Leftover TODOs which may/may not remain relevant

rvesse · 2026-06-29T09:12:50Z

+
+    private static Var distinctVarOrNull(Aggregator agg) {
+        Var v = null;
+        if (agg instanceof AggCountVarDistinct acvd) {


There are other aggregates, e.g. SUM(), AVG(), MAX(), MIN(), that have a single value expression so should probably cover those cases as well.

AFAICT having the distinct applied first in the scan shouldn't change the results of those DISTINCT aggregates

For MIN and MAX this can be added as an incremental improvement. An even better solution would be if these aggregators would not scan all distinct values but just pick the lowest/highest value from the index by trying the NodeIdTypes (decimal, double, int, ptr) in the right order.
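The reasoning in this thread can be checked on a small example. The sketch below is an assumption-laden illustration, not Jena code: plain long values stand in for RDF terms, distinctScan stands in for a scan that already emits each value once in index order, and all names are hypothetical. Once the scan has removed duplicates, the non-distinct form of each aggregate gives the same answer as its DISTINCT form, and MIN and MAX only need the first and last entries. Real SPARQL ordering across value types is more involved, as the remark about trying the NodeIdTypes in the right order suggests.

```java
import java.util.List;
import java.util.TreeSet;

final class DistinctAggregateSketch {
    /** Stand-in for the skip scan: each value once, in sorted (index) order. */
    static List<Long> distinctScan(List<Long> values) {
        return List.copyOf(new TreeSet<>(values));
    }

    /** Plain (non-distinct) COUNT over whatever rows it is given. */
    static long count(List<Long> rows) {
        return rows.size();
    }

    /** Plain (non-distinct) SUM over whatever rows it is given. */
    static long sum(List<Long> rows) {
        return rows.stream().mapToLong(Long::longValue).sum();
    }

    /** MIN read from the first entry of a non-empty, sorted scan result. */
    static long min(List<Long> scanned) {
        return scanned.get(0);
    }

    /** MAX read from the last entry of a non-empty, sorted scan result. */
    static long max(List<Long> scanned) {
        return scanned.get(scanned.size() - 1);
    }

    public static void main(String[] args) {
        List<Long> raw = List.of(3L, 1L, 3L, 7L, 1L);
        List<Long> scanned = distinctScan(raw);
        // COUNT(DISTINCT ?v) and SUM(DISTINCT ?v) computed as plain COUNT and SUM after the scan.
        System.out.println(count(scanned) + " " + sum(scanned));
        // SUM without DISTINCT differs, so the rewrite is only valid on already-distinct input.
        System.out.println(sum(raw));
        System.out.println(min(scanned) + " " + max(scanned));
    }
}
```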

rvesse · 2026-06-29T09:13:27Z

+    }
+
+    private static Aggregator convertToNonDistinct(Aggregator agg) {
+        if (agg instanceof AggCountVarDistinct acd) {


Other distinct aggregators should be handled as well

rvesse · 2026-06-29T09:14:52Z

+        if (filterExpr != null) {
+            qIter = new QueryIterFilterExpr(qIter, filterExpr, execCxt);
+
+            // XXX QueryIterDistinguishedVars?


Future TODO/optimisation point?

rvesse · 2026-06-29T09:31:16Z

+                )
+                """);
+        return referenceDsg;
+    }


This is a pretty trivial dataset for testing an optimisation that has the potential to break query execution semantics. IMO 6 quads is far too small to properly exercise and validate a feature like this.

There are several improvements I'd like to see in the tests around this optimisation:

The tests should actively compare executing the queries with the new optimisation on vs with the new optimisation off to verify that there aren't any regression cases. This is in line with how we test most other query execution optimisations across Jena

There should be tests that generate a reasonable size test dataset with some known levels of distinctness that are used to regression test as well. This doesn't have to be massive, for example could be 10k quads with 1k unique subjects, 20 unique predicates and 100 unique objects over the dataset (basically pick some constant values that yield differing levels of distinctness for different queries)

As noted in earlier comments it seems like this optimisation should also be valid for GROUP BY queries that use other aggregators, e.g. MIN(DISTINCT ?var), and I'd like to see that validated as well
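A dataset with known levels of distinctness, as suggested above, can be generated deterministically. The following sketch is one possible construction, not code from the PR: the class SkipScanTestDataSketch, the Quad record, the integer ids and the choice of 4 graphs are all assumptions. It produces 10,000 unique quads with 1,000 distinct subjects, 20 distinct predicates and 100 distinct objects, so the expected answer of each DISTINCT query is known in advance.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.function.ToIntFunction;

final class SkipScanTestDataSketch {
    /** Integer ids stand in for RDF terms (assumption for the sketch). */
    record Quad(int g, int s, int p, int o) {}

    /** 10k unique quads: 4 graphs, 1k subjects, 20 predicates, 100 objects. */
    static List<Quad> generate() {
        List<Quad> quads = new ArrayList<>(10_000);
        for (int i = 0; i < 10_000; i++) {
            int k = i / 1_000; // 0..9, varies the predicate per repeated subject
            quads.add(new Quad(i % 4, i % 1_000, (i + k) % 20, (7 * i + k) % 100));
        }
        return quads;
    }

    /** Reference answer: number of distinct values of one quad component. */
    static int distinctCount(List<Quad> quads, ToIntFunction<Quad> component) {
        Set<Integer> seen = new HashSet<>();
        for (Quad q : quads) {
            seen.add(component.applyAsInt(q));
        }
        return seen.size();
    }

    public static void main(String[] args) {
        List<Quad> quads = generate();
        System.out.println(distinctCount(quads, Quad::s));
        System.out.println(distinctCount(quads, Quad::p));
        System.out.println(distinctCount(quads, Quad::o));
    }
}
```

In an actual Jena test, these reference counts would be compared against the same query executed twice, once with the optimisation enabled and once with it disabled via SystemTDB.symSkipScan.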

Aklakan marked this pull request as draft June 16, 2026 10:27

Aklakan force-pushed the 20260613-index-for-basic-distinct branch 8 times, most recently from 1f3a682 to adf134b, June 28, 2026 17:44

Aklakan marked this pull request as ready for review June 28, 2026 17:45


rvesse requested changes Jun 29, 2026


GH-3994: Improved TDB2 index access for single pattern queries

e3ba4d2

Aklakan force-pushed the 20260613-index-for-basic-distinct branch from bd29358 to e3ba4d2, June 29, 2026 12:40

Aklakan wants to merge 1 commit into apache:main from Aklakan:20260613-index-for-basic-distinct
