Fix: revive subquery decorrelator and fix failing test#19577
Draft
duongcongtoai wants to merge 241 commits intoapache:mainfrom
Draft
Fix: revive subquery decorrelator and fix failing test#19577duongcongtoai wants to merge 241 commits intoapache:mainfrom
duongcongtoai wants to merge 241 commits intoapache:mainfrom
Conversation
a4b660e to
5a0c64b
Compare
…-decorrelate-revive
…-decorrelate-revive
…-decorrelate-revive
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Which issue does this PR close?
#16059 has completed, but the result is not persisted into datafusion.
Some sqllogic test is also failing => This PR brings back all the changes inside the GSOC work and a complete POC.
I'll try to break it down into smaller components and bring them into datafusion:
Prerequiste
Major part
Rationale for this change
DependentJoin
As per paper 2
There is a need to detect non-trivial dependent join (i.e dependent joins where the RHS accesses columns provided by the LHS) and annotate metadata before decorrelation begins.
The paper suggest the usage of index algebra, however, for now we go forward without such data structure
"Using an indexed algebra is ideal for this phase because it can do every LCA computation in 𝑂(log 𝑛) without any additional data structures. If the DBMS does not support that functionality, the same information can be computed with worse asymptotic complexity by keeping track of the column sets that are available in the different parts of the tree."
Given this query
According to the paper, this query is constructed into this trees with some additional annotations

"we consider every column access and compute the lowest common ancestor (LCA) of the operator o1 that accesses the column and the operator o2 that provides the column. If the LCA is not o1 , it must be a dependent join 3 and we annotate 3 with the fact that o1 is accessing the left-hand side of 3"
Explanation:
T3.a=T1.a, with T1.a is a column provided by some operator/relation outside its current context (in Datafusion we call them OuterRefColumn). Now we need to annotate this access with extra information:then which descendant of this node "provides" the column T1.a. In this case Node2 (not Node3) is the provider of column T1.a
We introduce a new struct in Datafusion to contain these annotations
The goal of this traversal is to link between the accessor and the providers. There will be intermediate state persisted during the traversal
let's walk through the logical plan from the paper above. Execution sequences happens like below
Now pay attention to
f_down(6)andf_down(2).f_down(6)marks an appearance ofouter_ref(customer.c_custkey). The accessed stack will be [1,2,3,4,5,6].f_down(2)marks the first logical plan that knows about the expressioncustomer.c_custkeyand will resolve the previous column access, when this happens the traversal stack is [1,2,12]. The LCA (lowest common ancestor) of the two stacks according to the algorithm is [2] and thus 2 should be converted into a dependent join logical plan later on. The same can be applied for the couple off_down(10)andf_down(13).Correlated subqueries are rewritten into dependent join nodes as followed
[NOTE TO SELF]: looks like the provider of column c_custkey was not correctly detected (in paper it should be the filter node above table scan of customer), but in current implementation it is the table scan. This difference will yield significant performance on delim scan later on
The collections of nodes that provides the columns will be persisted and passed to the next round of decorrelation optimizor (to construct delim_get). More details on this [TBU]