Prune implicit FD group keys in SQL aggregates#23028
Conversation
|
Thank you for opening this pull request! Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch). Details |
neilconway
left a comment
There was a problem hiding this comment.
Thanks for reporting this!
I believe optimize_projections should already do this (which is why the newly added SLTs pass without the changes in this PR). From digging around a little bit, it looks like the TPC-DS query is running into a different bug: if there's an intermediate Projection, it seems like the FDs aren't being propagated correctly due to a bug in calc_func_dependencies_for_project when a projection list contains a mix of computed and passthrough columns. Claude found this test case:
#[test]
fn projection_with_leading_computed_column_preserves_pk() {
// input: [id (PK), name, amount]
let input_fds = FunctionalDependencies::new_from_constraints(
Some(&Constraints::new_unverified(vec![Constraint::PrimaryKey(vec![0])])),
3,
);
// output of a CSE-style projection: [__common_expr_1 (computed), id, name, amount]
// -> proj_indices must be [MAX, 0, 1, 2], NOT [0, 1, 2]
let proj_indices = vec![usize::MAX, 0, 1, 2];
let projected = input_fds.project_functional_dependencies(&proj_indices, 4);
// determinant must remap to output position 1 (`id`), not 0 (`__common_expr_1`)
assert_eq!(projected[0].source_indices, vec![1]);
}Would you like to take a look at fixing that bug? I think if we do that then the existing projection optimization rewrite should apply here to the TPC-DS query.
| # Functional-dependency targets should only be added to aggregate | ||
| # GROUP BY outputs when the SQL query actually needs them after | ||
| # aggregation. |
There was a problem hiding this comment.
These tests pass without the code changes in this PR.
Which issue does this PR close?
Rationale for this change
After primary key constraints were added to TPC-DS schemas, SQL aggregate planning could expand grouped primary key columns with all functionally dependent columns from the input schema.
For queries such as TPC-DS q39, many of those dependent columns are not needed after aggregation. Carrying them as aggregate group keys widens the grouping payload, forces extra scan/join projections, and can cause a large performance regression.
Functional dependencies should still allow selecting columns determined by grouped keys, but aggregate planning only needs to add dependent columns that are actually referenced after aggregation.
What changes are included in this PR?
Are these changes tested?
Yes.
New sqllogictest coverage was added in
group_by_fd_prune.sltfor:Verification run locally:
Are there any user-facing changes?
No SQL syntax or public API changes.
Users may observe narrower optimized plans and improved performance for affected aggregate queries that group by determinant keys with functional dependencies.