Prune implicit FD group keys in SQL aggregates by hhhizzz · Pull Request #23028 · apache/datafusion

hhhizzz · 2026-06-18T18:05:24Z

Which issue does this PR close?

Closes TPC-DS q39 regression after adding primary key constraints: aggregate GROUP BY includes many unreferenced FD columns #23027.

Rationale for this change

After primary key constraints were added to TPC-DS schemas, SQL aggregate planning could expand grouped primary key columns with all functionally dependent columns from the input schema.

For queries such as TPC-DS q39, many of those dependent columns are not needed after aggregation. Carrying them as aggregate group keys widens the grouping payload, forces extra scan/join projections, and can cause a large performance regression.

Functional dependencies should still allow selecting columns determined by grouped keys, but aggregate planning only needs to add dependent columns that are actually referenced after aggregation.

What changes are included in this PR?

Changes SQL aggregate planning to add functionally dependent group expressions only when they are required by post-aggregate expressions.
Tracks required columns from SELECT, HAVING, QUALIFY, ORDER BY, and DISTINCT ON expressions, ignoring columns referenced only inside aggregate functions.
Keeps referenced functionally dependent columns available for post-aggregate projection/filter/sort/distinct behavior.
Avoids adding unreferenced functionally dependent columns to aggregate group keys.
Adds focused sqllogictests for FD group key pruning and required-column retention.
Updates existing group-by error message expectations for the narrower aggregate output.

Are these changes tested?

Yes.

New sqllogictest coverage was added in group_by_fd_prune.slt for:

unreferenced functionally dependent columns are not appended to aggregate group keys
SELECT references keep required FD columns available
HAVING references keep required FD columns available
ORDER BY hidden references keep required FD columns available
DISTINCT ON references keep required FD columns available
ordinal GROUP BY resolution still works

Verification run locally:

cargo metadata --format-version 1 --locked
cargo fmt --all -- --check
cargo test --test sqllogictests -- group_by_fd_prune
cargo test -p datafusion --test tpcds_planning tpcds_logical_q39

Are there any user-facing changes?

No SQL syntax or public API changes.

Users may observe narrower optimized plans and improved performance for affected aggregate queries that group by determinant keys with functional dependencies.

github-actions · 2026-06-19T16:27:55Z

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details

     Cloning apache/main
    Building datafusion-sql v54.0.0 (current)
       Built [  43.223s] (current)
     Parsing datafusion-sql v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-sql v54.0.0 (baseline)
       Built [  42.253s] (baseline)
     Parsing datafusion-sql v54.0.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-sql v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.237s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  SqlToRel::take_warnings, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/sql/src/planner.rs:493

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  86.891s] datafusion-sql
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 178.647s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 178.664s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.093s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 359.795s] datafusion-sqllogictest

neilconway

Thanks for reporting this!

I believe optimize_projections should already do this (which is why the newly added SLTs pass without the changes in this PR). From digging around a little bit, it looks like the TPC-DS query is running into a different bug: if there's an intermediate Projection, it seems like the FDs aren't being propagated correctly due to a bug in calc_func_dependencies_for_project when a projection list contains a mix of computed and passthrough columns. Claude found this test case:

#[test]
fn projection_with_leading_computed_column_preserves_pk() {
    // input: [id (PK), name, amount]
    let input_fds = FunctionalDependencies::new_from_constraints(
        Some(&Constraints::new_unverified(vec![Constraint::PrimaryKey(vec![0])])),
        3,
    );
    // output of a CSE-style projection: [__common_expr_1 (computed), id, name, amount]
    // -> proj_indices must be [MAX, 0, 1, 2], NOT [0, 1, 2]
    let proj_indices = vec![usize::MAX, 0, 1, 2];
    let projected = input_fds.project_functional_dependencies(&proj_indices, 4);
    // determinant must remap to output position 1 (`id`), not 0 (`__common_expr_1`)
    assert_eq!(projected[0].source_indices, vec![1]);
}

Would you like to take a look at fixing that bug? I think if we do that then the existing projection optimization rewrite should apply here to the TPC-DS query.

neilconway · 2026-06-19T16:48:14Z

+# Functional-dependency targets should only be added to aggregate
+# GROUP BY outputs when the SQL query actually needs them after
+# aggregation.


These tests pass without the code changes in this PR.

Prune implicit FD group keys in SQL aggregates

442dc11

github-actions Bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 18, 2026

Merge branch 'main' into codex/datafusion-fd-groupby-prune-20260618

9decb7f

github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026

neilconway reviewed Jun 19, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prune implicit FD group keys in SQL aggregates#23028

Prune implicit FD group keys in SQL aggregates#23028
hhhizzz wants to merge 2 commits into
apache:mainfrom
hhhizzz:codex/datafusion-fd-groupby-prune-20260618

hhhizzz commented Jun 18, 2026

Uh oh!

github-actions Bot commented Jun 19, 2026

Uh oh!

neilconway left a comment

Uh oh!

neilconway Jun 19, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

hhhizzz commented Jun 18, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Are these changes tested?

Are there any user-facing changes?

Uh oh!

github-actions Bot commented Jun 19, 2026

Uh oh!

neilconway left a comment

Choose a reason for hiding this comment

Uh oh!

neilconway Jun 19, 2026

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants