Skip to content

Prune implicit FD group keys in SQL aggregates#23028

Open
hhhizzz wants to merge 2 commits into
apache:mainfrom
hhhizzz:codex/datafusion-fd-groupby-prune-20260618
Open

Prune implicit FD group keys in SQL aggregates#23028
hhhizzz wants to merge 2 commits into
apache:mainfrom
hhhizzz:codex/datafusion-fd-groupby-prune-20260618

Conversation

@hhhizzz

@hhhizzz hhhizzz commented Jun 18, 2026

Copy link
Copy Markdown

Which issue does this PR close?

Rationale for this change

After primary key constraints were added to TPC-DS schemas, SQL aggregate planning could expand grouped primary key columns with all functionally dependent columns from the input schema.

For queries such as TPC-DS q39, many of those dependent columns are not needed after aggregation. Carrying them as aggregate group keys widens the grouping payload, forces extra scan/join projections, and can cause a large performance regression.

Functional dependencies should still allow selecting columns determined by grouped keys, but aggregate planning only needs to add dependent columns that are actually referenced after aggregation.

What changes are included in this PR?

  • Changes SQL aggregate planning to add functionally dependent group expressions only when they are required by post-aggregate expressions.
  • Tracks required columns from SELECT, HAVING, QUALIFY, ORDER BY, and DISTINCT ON expressions, ignoring columns referenced only inside aggregate functions.
  • Keeps referenced functionally dependent columns available for post-aggregate projection/filter/sort/distinct behavior.
  • Avoids adding unreferenced functionally dependent columns to aggregate group keys.
  • Adds focused sqllogictests for FD group key pruning and required-column retention.
  • Updates existing group-by error message expectations for the narrower aggregate output.

Are these changes tested?

Yes.

New sqllogictest coverage was added in group_by_fd_prune.slt for:

  • unreferenced functionally dependent columns are not appended to aggregate group keys
  • SELECT references keep required FD columns available
  • HAVING references keep required FD columns available
  • ORDER BY hidden references keep required FD columns available
  • DISTINCT ON references keep required FD columns available
  • ordinal GROUP BY resolution still works

Verification run locally:

cargo metadata --format-version 1 --locked
cargo fmt --all -- --check
cargo test --test sqllogictests -- group_by_fd_prune
cargo test -p datafusion --test tpcds_planning tpcds_logical_q39

Are there any user-facing changes?

No SQL syntax or public API changes.

Users may observe narrower optimized plans and improved performance for affected aggregate queries that group by determinant keys with functional dependencies.

@github-actions github-actions Bot added sql SQL Planner sqllogictest SQL Logic Tests (.slt) labels Jun 18, 2026
@github-actions

Copy link
Copy Markdown

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning apache/main
    Building datafusion-sql v54.0.0 (current)
       Built [  43.223s] (current)
     Parsing datafusion-sql v54.0.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-sql v54.0.0 (baseline)
       Built [  42.253s] (baseline)
     Parsing datafusion-sql v54.0.0 (baseline)
      Parsed [   0.031s] (baseline)
    Checking datafusion-sql v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.237s] 223 checks: 222 pass, 1 fail, 0 warn, 30 skip

--- failure inherent_method_missing: pub method removed or renamed ---

Description:
A publicly-visible method or associated fn is no longer available under its prior name. It may have been renamed or removed entirely.
        ref: https://doc.rust-lang.org/cargo/reference/semver.html#item-remove
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.48.0/src/lints/inherent_method_missing.ron

Failed in:
  SqlToRel::take_warnings, previously in file /home/runner/work/datafusion/datafusion/target/semver-checks/git-apache_main/229699db0dd05312c530e37c144be4e87a6c9d34/datafusion/sql/src/planner.rs:493

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  86.891s] datafusion-sql
    Building datafusion-sqllogictest v54.0.0 (current)
       Built [ 178.647s] (current)
     Parsing datafusion-sqllogictest v54.0.0 (current)
      Parsed [   0.023s] (current)
    Building datafusion-sqllogictest v54.0.0 (baseline)
       Built [ 178.664s] (baseline)
     Parsing datafusion-sqllogictest v54.0.0 (baseline)
      Parsed [   0.024s] (baseline)
    Checking datafusion-sqllogictest v54.0.0 -> v54.0.0 (no change; assume patch)
     Checked [   0.093s] 223 checks: 223 pass, 30 skip
     Summary no semver update required
    Finished [ 359.795s] datafusion-sqllogictest

@github-actions github-actions Bot added the auto detected api change Auto detected API change label Jun 19, 2026

@neilconway neilconway left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for reporting this!

I believe optimize_projections should already do this (which is why the newly added SLTs pass without the changes in this PR). From digging around a little bit, it looks like the TPC-DS query is running into a different bug: if there's an intermediate Projection, it seems like the FDs aren't being propagated correctly due to a bug in calc_func_dependencies_for_project when a projection list contains a mix of computed and passthrough columns. Claude found this test case:

#[test]
fn projection_with_leading_computed_column_preserves_pk() {
    // input: [id (PK), name, amount]
    let input_fds = FunctionalDependencies::new_from_constraints(
        Some(&Constraints::new_unverified(vec![Constraint::PrimaryKey(vec![0])])),
        3,
    );
    // output of a CSE-style projection: [__common_expr_1 (computed), id, name, amount]
    // -> proj_indices must be [MAX, 0, 1, 2], NOT [0, 1, 2]
    let proj_indices = vec![usize::MAX, 0, 1, 2];
    let projected = input_fds.project_functional_dependencies(&proj_indices, 4);
    // determinant must remap to output position 1 (`id`), not 0 (`__common_expr_1`)
    assert_eq!(projected[0].source_indices, vec![1]);
}

Would you like to take a look at fixing that bug? I think if we do that then the existing projection optimization rewrite should apply here to the TPC-DS query.

Comment on lines +18 to +20
# Functional-dependency targets should only be added to aggregate
# GROUP BY outputs when the SQL query actually needs them after
# aggregation.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These tests pass without the code changes in this PR.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

auto detected api change Auto detected API change sql SQL Planner sqllogictest SQL Logic Tests (.slt)

Projects

None yet

Development

Successfully merging this pull request may close these issues.

TPC-DS q39 regression after adding primary key constraints: aggregate GROUP BY includes many unreferenced FD columns

2 participants