Skip to content

HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV#6389

Open
deniskuzZ wants to merge 4 commits intoapache:masterfrom
deniskuzZ:HIVE-25948
Open

HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV#6389
deniskuzZ wants to merge 4 commits intoapache:masterfrom
deniskuzZ:HIVE-25948

Conversation

@deniskuzZ
Copy link
Copy Markdown
Member

@deniskuzZ deniskuzZ commented Mar 24, 2026

What changes were proposed in this pull request?

Cost-based selection between Fanout and Clustered writers

Why are the changes needed?

Perf optimization

Does this PR introduce any user-facing change?

No

How was this patch tested?

mvn test -Dtest=TestIcebergCliDriver -Dqfile=dynamic_partition_writes.q

┌───────────────────────┬───────────────────────────┐
│       Scenario        │         Expected          │
├───────────────────────┼───────────────────────────┤
│ threshold=0 (default) │ no sort (NDV<MAX_WRITERS) │
├───────────────────────┼───────────────────────────┤
│ threshold=-1          │ no sort                   │
├───────────────────────┼───────────────────────────┤
│ threshold=1           │ sort                      │
├───────────────────────┼───────────────────────────┤
│ threshold=2           │ sort (NDV>2)              │
├───────────────────────┼───────────────────────────┤
│ threshold=100         │ no sort (NDV<=100)        │
├───────────────────────┼───────────────────────────┤
│ fanout=false          │ sort                      │
└───────────────────────┴───────────────────────────┘

@deniskuzZ deniskuzZ changed the title HIVE-25948: Enable cost-based selection between FanoutWriter and ClusteredWriter for Iceberg tables based on column stats NDV HIVE-25948: Iceberg: Enable cost-based selection between FanoutWriter and ClusteredWriter based on column stats NDV Mar 24, 2026
@deniskuzZ deniskuzZ changed the title HIVE-25948: Iceberg: Enable cost-based selection between FanoutWriter and ClusteredWriter based on column stats NDV HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV Mar 24, 2026
Copy link
Copy Markdown
Contributor

@okumin okumin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only minor suggestions


@Override
public StatEstimator getStatEstimator() {
return new BucketStatEstimator();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would simply use numBuckets as return new BucketStatEstimator(). But the current approach also works correctly. I would not mind if you would like to keep the current one.

if (numBuckets <= 0) {
return Optional.empty();
}
ColStatistics result = inputStats.clone();
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess we shouldn't inherit all the stats from the first argument. We can create an empty ColStatics and set countDistinct, numNulls, range, and isEstimated.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

other estimators also use clone.

Comment on lines +875 to +883
case -1:
return false;
case 0:
break;
case 1:
return true;
default:
MAX_WRITERS = threshold;
break;
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The original one is aligned with our checkstyle: https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6389&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true

Suggested change
case -1:
return false;
case 0:
break;
case 1:
return true;
default:
MAX_WRITERS = threshold;
break;
case -1:
return false;
case 0:
break;
case 1:
return true;
default:
MAX_WRITERS = threshold;
break;

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch and case had the same indentation, so i've added spaces

Copy link
Copy Markdown
Member Author

@deniskuzZ deniskuzZ Apr 7, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

switch and case had the same indentation, so i've added spaces

@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud bot commented Apr 7, 2026

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants