ORC-2131: Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0#2580
QianyongY wants to merge 1 commit into apache:main
Conversation
dongjoon-hyun
left a comment
Thank you for the feedback, @QianyongY . I understand that you want to disable this feature completely. However, may I ask if that is a general observation instead of a corner case? In general, the default value is not a silver bullet and you can tune this value in your workload.
For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable storage cost increase.
When writing ORC data files with ORC-1986 enabled, we observed some tables grow from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower reads for downstream jobs, increasing execution time from 1 hour to 2 hours and 20 minutes. This may therefore be a regression. Setting the default to 0 avoids the problem, and users who need the feature can still enable it cluster-wide.
@dongjoon-hyun We rolled out this optimization to 10% of our tables. Among them, 5 tables had partitions grow by more than 15% in size. The worst case was a single partition growing from 140 GB to over 660 GB, which is concerning. It is not a universal issue, but some tables were indeed affected.
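The stripe-count observation above can be checked with the ORC project's bundled CLI. A hedged sketch (the jar version and file name below are placeholders; the `meta` command prints file metadata, including the list of stripes):

```
# Inspect stripe count and sizes of an ORC file using the orc-tools uber jar.
# Version number and file path are placeholders for illustration.
java -jar orc-tools-X.Y.Z-uber.jar meta part-00000.orc
```

Comparing the stripe count of files written before and after the change makes the regression easy to spot without measuring total table size.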
What changes were proposed in this pull request?
Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0
Why are the changes needed?
After enabling the optimizations related to orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grew to ~1.2 TB, causing a noticeable increase in storage cost.
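Independently of the defaults, users who hit this inflation can override both options per session today. A hedged sketch for a Hive/Spark SQL session (this assumes the engine forwards these keys to the ORC writer, which is engine-dependent):

```sql
-- Disable the stripe-size check and dictionary sizing features
-- for the current session (restores pre-ORC-1986 write behavior).
SET orc.stripe.size.check.ratio=0.0;
SET orc.dictionary.max.size.bytes=0;
```

Conversely, after this PR, workloads that benefit from the feature can re-enable it with non-zero values in the same way.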
How was this patch tested?
Local test:
With orc.dictionary.max.size.bytes=16777216 or orc.stripe.size.check.ratio=2.0, the written ORC data grows to 1.2 TB (data inflation).
With orc.dictionary.max.size.bytes=0 and orc.stripe.size.check.ratio=0.0, the data size remains at the expected 1.0 TB.
Was this patch authored or co-authored using generative AI tooling?
No