Rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665
Rewrite SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665alamb wants to merge 2 commits intoapache:mainfrom
SUM(expr + scalar) --> SUM(expr) + scalar*COUNT(expr) #20665Conversation
|
I was curious about overflow behavior which I think is right, but I have noticed one difference (at least when going through SQL): statement ok
CREATE TABLE IF NOT EXISTS tbl (val INTEGER UNSIGNED);
statement ok
INSERT INTO tbl VALUES (4294967295);
statement ok
INSERT INTO tbl VALUES (4294967295);
query II
SELECT SUM(val + 1), SUM(val + 2) FROM tbl;
----
8589934592 8589934594
query TT
EXPLAIN SELECT SUM(val + 1), SUM(val + 2) FROM tbl;
----
logical_plan
01)Aggregate: groupBy=[[]], aggr=[[sum(__common_expr_1 AS tbl.val + Int64(1)), sum(__common_expr_1 AS tbl.val + Int64(2))]]
02)--Projection: CAST(tbl.val AS Int64) AS __common_expr_1
03)----TableScan: tbl projection=[val]
physical_plan
01)AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val + Int64(1)), sum(tbl.val + Int64(2))]
02)--ProjectionExec: expr=[CAST(val@0 AS Int64) as __common_expr_1]
03)----DataSourceExec: partitions=1, partition_sizes=[2]
query RR
SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val) FROM tbl;
----
8589934592 8589934594
query TT
EXPLAIN SELECT SUM(val) + 1 * COUNT(val), SUM(val) + 2 * COUNT(val) FROM tbl;
----
logical_plan
01)Projection: __common_expr_1 + CAST(count(tbl.val) AS Decimal128(20, 0)) AS sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1 AS sum(tbl.val) + CAST(Int64(2) * count(tbl.val) AS Decimal128(20, 0))
02)--Projection: CAST(sum(tbl.val) AS Decimal128(20, 0)) AS __common_expr_1, count(tbl.val)
03)----Aggregate: groupBy=[[]], aggr=[[sum(CAST(tbl.val AS UInt64)), count(tbl.val)]]
04)------TableScan: tbl projection=[val]
physical_plan
01)ProjectionExec: expr=[__common_expr_1@0 + CAST(count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(1) * count(tbl.val), __common_expr_1@0 + CAST(2 * count(tbl.val)@1 AS Decimal128(20, 0)) as sum(tbl.val) + Int64(2) * count(tbl.val)]
02)--ProjectionExec: expr=[CAST(sum(tbl.val)@0 AS Decimal128(20, 0)) as __common_expr_1, count(tbl.val)@1 as count(tbl.val)]
03)----AggregateExec: mode=Single, gby=[], aggr=[sum(tbl.val), count(tbl.val)]
04)------DataSourceExec: partitions=1, partition_sizes=[2]
statement ok
DROP TABLE IF EXISTS tbl;The "rewritten" form returns floats for some reason? not sure what that is about |
Thank you -- I will investigate |
## Which issue does this PR close? - Part of #18489 - Related to #20180 ## Rationale for this change This looks like a monster PR but I think it will be quite easy to review (it just adds some new `EXPLAIN` tests). If it would be helpful I can break it into smaller pieces I want to improve the plans for ClickBench Query 29 However, the plans for the ClickBench queries are not in our tests anywhere (so when I make the improvements in #20665 no explain plan tests change) So to start, let's start with adding the explain plans for all the queries in clickbench.slt to so it is clear what our current plans look like as well as to make it clear what the change of plans are ## What changes are included in this PR? Add explain plans to some .slt tests ## Are these changes tested? Only tests ## Are there any user-facing changes? No, this only adds tests
4ff7980 to
b4725d1
Compare
b4725d1 to
6f76ea8
Compare
I added these tests in |
4fca8f4 to
4d28038
Compare
4d28038 to
c048aa0
Compare
…Impl::simplify` (#20712) ## Rationale for this change While working on #20665 I was somewhat confused about how the AggregateUDF simplify worked. So I added some more clarifying comments ## What changes are included in this PR? 1. Improve documentation strings ## Are these changes tested? Yes by CI ## Are there any user-facing changes? Better API docs --------- Co-authored-by: Yongting You <2010youy01@gmail.com>
8509263 to
6245420
Compare
|
run benchmark clickbench_partitioned |
|
run benchmarks |
|
🤖 |
|
🤖: Benchmark completed Details
|
|
🤖 |
|
🤖: Benchmark completed Details
|
|
|
|
@alamb btw this was the query I was talking about...
When running locally, I don't have the problem... |
|
run benchmarks |
|
🤖 |
@Dandandan I think it is this one: Last tiem @adriangb looked into that I think he concluded that a bunch of the variability is related to if the TopK finds the top values quickly enough to skip some of the files entirely but that depends on a race to read the "right" rows. Q23 is also the one that gets stupidly faster (like 20x) when we turn on filter pushdown |
|
run benchmark clickbench_partitioned |
|
🤖: Benchmark completed Details
|
b031939 to
25350af
Compare
|
🤖 |
My hope is that after #20362 we can get #20304 across the line which should nuke the issue of a race with randomly ordered files. In theory the filter pushdown should only be essential when there is no stats sorting between files (i.e. the values being ordered by are randomly distributed). |
|
🤖: Benchmark completed Details
|
|
While I have the tests passing and I think this PR is functioanlly good now, I really don't like the API as it makes things more complicated and slow for what is basically a one-off optimization. I have an idea of how to restructure the APIs to be more isolated, which I think befits such a special optimization |
I agree - I saw also some huge variability in speedups for #20481 - I think with some great (row group or otherwise) ordering, some queries could have 3x+ speedups |
|
I have a better API in a new PR Closing this one for the time being |
Draft until:
AggregateUdfImpl::simplifyandWindowUDFImpl::simplify#20712Which issue does this PR close?
SUM(..)clauses #15524Rationale for this change
I want DataFusion to be the fastest paruqet engine on ClickBench. One of the queries where DataFusion is significantly slower is Query 29 which has a very strange pattern of many aggregate functions that are offset by a constant:
datafusion/benchmarks/queries/clickbench/queries/q29.sql
Line 4 in 0ca9d65
This is not a pattern I have ever seen in a real query, but it seems like the engine currently at the top of the ClickBench leaderboard has a special case for this pattern. See
SUM(..)clauses #15524Thus I reluctantly conclude that we should have one too.
What changes are included in this PR?
SUM(expr + scalar)-->SUM(expr) + scalar*COUNT(expr)This is implemented as a
AggregateUDF::simplifyrule as discussed on #20180 (comment) and suggested by @UBarneyNote there are quite a few other ideas to potentially make this more general on #15524 but I am going with the simple thing of making it work for the usecase we have in hand (ClickBench)
Are these changes tested?
Yes, new tests are added
Are there any user-facing changes?
Faster performance