HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV#6389
HIVE-25948: Iceberg: Enable cost-based selection between Fanout and Clustered writers using column stats NDV#6389deniskuzZ wants to merge 4 commits intoapache:masterfrom
Conversation
…lustered writers using column stats NDV
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
Outdated
Show resolved
Hide resolved
ql/src/java/org/apache/hadoop/hive/ql/optimizer/SortedDynPartitionOptimizer.java
Outdated
Show resolved
Hide resolved
|
|
||
| @Override | ||
| public StatEstimator getStatEstimator() { | ||
| return new BucketStatEstimator(); |
There was a problem hiding this comment.
I would simply use numBuckets as return new BucketStatEstimator(). But the current approach also works correctly. I would not mind if you would like to keep the current one.
| if (numBuckets <= 0) { | ||
| return Optional.empty(); | ||
| } | ||
| ColStatistics result = inputStats.clone(); |
There was a problem hiding this comment.
I guess we shouldn't inherit all the stats from the first argument. We can create an empty ColStatics and set countDistinct, numNulls, range, and isEstimated.
There was a problem hiding this comment.
other estimators also use clone.
| case -1: | ||
| return false; | ||
| case 0: | ||
| break; | ||
| case 1: | ||
| return true; | ||
| default: | ||
| MAX_WRITERS = threshold; | ||
| break; |
There was a problem hiding this comment.
The original one is aligned with our checkstyle: https://sonarcloud.io/project/issues?id=apache_hive&pullRequest=6389&issueStatuses=OPEN,CONFIRMED&sinceLeakPeriod=true
| case -1: | |
| return false; | |
| case 0: | |
| break; | |
| case 1: | |
| return true; | |
| default: | |
| MAX_WRITERS = threshold; | |
| break; | |
| case -1: | |
| return false; | |
| case 0: | |
| break; | |
| case 1: | |
| return true; | |
| default: | |
| MAX_WRITERS = threshold; | |
| break; |
There was a problem hiding this comment.
switch and case had the same indentation, so i've added spaces
There was a problem hiding this comment.
switch and case had the same indentation, so i've added spaces
|



What changes were proposed in this pull request?
Cost-based selection between Fanout and Clustered writers
Why are the changes needed?
Perf optimization
Does this PR introduce any user-facing change?
No
How was this patch tested?
mvn test -Dtest=TestIcebergCliDriver -Dqfile=dynamic_partition_writes.q