GH-3994: Improved TDB2 index access for single pattern queries#3995
GH-3994: Improved TDB2 index access for single pattern queries#3995Aklakan wants to merge 1 commit into
Conversation
1f3a682 to
adf134b
Compare
|
The code is ready for review.
|
| public static <T> Iterator<T> dropWhile(Iterator<T> iter, Predicate<T> predicate) { | ||
| PeekIterator<T> iter2 = new PeekIterator<>(iter); | ||
| for(;;) { | ||
| T elt = iter2.peek(); |
There was a problem hiding this comment.
Endless loop because iterator is never advanced.
This method was used during development before settling on
iter = records.getRecordBuffer().iterator(minRecord, maxRecordPlusOne); in BPTreeDistinctKeyPrefixIterator.
rvesse
left a comment
There was a problem hiding this comment.
Hey @Aklakan this looks like a really great potential improvement. I had always thought what I'd done previously for prefix scanning had potential to be applied to other kinds of queries but never any chance to explore that further so thanks for taking this on
I have various comments across the PR, these covers three main things (plus the odd typo):
- Whether this can/should handle other aggregates like
MAX(DISTNCT ?var)since it would seem applying the distinct in the scan first wouldn't change their outputs either. - Improvements to testing, particularly around regression and scale
- Leftover TODOs which may/may not remain relevant
|
|
||
| private static Var distinctVarOrNull(Aggregator agg) { | ||
| Var v = null; | ||
| if (agg instanceof AggCountVarDistinct acvd) { |
There was a problem hiding this comment.
There are other aggregates, e.g. SUM(), AVG(), MAX(), MIN(), that have a single value expression so should probably cover those cases as well.
AFAICT having the distinct applied first in the scan shouldn't change the results of those DISTINCT aggregates
There was a problem hiding this comment.
For MIN and MAX this can be added as an incremental improvement. An even better solution would be if these aggregators would not scan all distinct values but just pick the lowest/highest value from the index by trying the NodeIdTypes (decimal, double, int, ptr) in the right order.
| } | ||
|
|
||
| private static Aggregator convertToNonDistinct(Aggregator agg) { | ||
| if (agg instanceof AggCountVarDistinct acd) { |
There was a problem hiding this comment.
Other distinct aggregators should be handled as well
| if (filterExpr != null) { | ||
| qIter = new QueryIterFilterExpr(qIter, filterExpr, execCxt); | ||
|
|
||
| // XXX QueryIterDistinguishedVars? |
| ) | ||
| """); | ||
| return referenceDsg; | ||
| } |
There was a problem hiding this comment.
This is a pretty trivial dataset for testing an optimisation that has the potential to break query execution semantics, IMO 6 quads is far too trivial to properly exercise and validate a feature like this.
There's several improvements I'd like to see with tests around this optimisation:
- The tests should actively compare executing the queries with the new optimisation on vs with the new optimisation off to verify that there aren't any regression cases. This is in line with how we test most other query execution optimisations across Jena
- There should be tests that generate a reasonable size test dataset with some known levels of distinctness that are used to regression test as well. This doesn't have to be massive, for example could be 10k quads with 1k unique subjects, 20 unique predicates and 100 unique objects over the dataset (basically pick some constant values that yield differing levels of distinctness for different queries)
- As noted in earlier comments it seems like this optimisation should also be valid for
GROUP BYqueries that use other aggregators, e.g.MIN(DISTINCT ?var), and I'd like to see that validated as well
bd29358 to
e3ba4d2
Compare
GitHub issue resolved #3994
Pull request Description: Adds skip scan evaluation for single-pattern queries to TDB2. Added special code paths to
OpExecutorTDB2forOpDistinctandOpGroupBy.Execution time to find all distinct predicates on Wikidata Truthy (8B triples) becomes between 1 and 100 seconds (warm-cold caches)
Also works for simple group by patterns such as
SELECT ?p (COUNT(DISTINCT ?g) AS ?c) { GRAPH ?g { ?s ?p ?o } } GROUP BY ?pNeed yet to add tests - such as create permutations of patterns and compare results with a reference engine (eg. ARQ).
[ ] Documentation change and updates are provided for the Apache Jena websiteBy submitting this pull request, I acknowledge that I am making a contribution to the Apache Software Foundation under the terms and conditions of the Contributor's Agreement.
See the Apache Jena "Contributing" guide.