
ORC-2131: Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0#2580

Open
QianyongY wants to merge 1 commit into apache:main from QianyongY:features/ORC-2131

Conversation

@QianyongY
Contributor

@QianyongY commented Mar 17, 2026

What changes were proposed in this pull request?

Set default of orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes to 0

Why are the changes needed?

After enabling the optimizations related to orc.stripe.size.check.ratio and orc.dictionary.max.size.bytes, we observed that ORC files written with the current defaults are about 10%–20% larger than before. For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing a noticeable increase in storage cost.

How was this patch tested?

Local test

With orc.dictionary.max.size.bytes=16777216 or orc.stripe.size.check.ratio=2.0, the written ORC data grows to ~1.2 TB (data inflation):

           1         6665      1300347279057 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_2/d=2026-03-15

With orc.dictionary.max.size.bytes=0 and orc.stripe.size.check.ratio=0.0, the data size remains at the expected ~1.0 TB:

           1         6665      1143347882367 hdfs://ns/user/hive/warehouse/tmp_sandbox_xxx.db/tmp_test_123_1/d=2026-03-15
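The overrides used in the test above can also be applied cluster-wide. The fragment below is a sketch only: the property names and values are taken from this PR, but the exact mechanism for setting them (a hive-site.xml entry, a session-level SET, or table properties) depends on how the writer picks up its Hadoop configuration.

```xml
<!-- Hypothetical hive-site.xml fragment applying the values this PR proposes
     as defaults; 0 / 0.0 disable the two size checks discussed above. -->
<property>
  <name>orc.dictionary.max.size.bytes</name>
  <value>0</value>
</property>
<property>
  <name>orc.stripe.size.check.ratio</name>
  <value>0.0</value>
</property>
```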

Was this patch authored or co-authored using generative AI tooling?

No

Member

@dongjoon-hyun left a comment


Thank you for the feedback, @QianyongY . I understand that you want to disable this feature completely. However, may I ask if that is a general observation instead of a corner case? In general, the default value is not a silver bullet and you can tune this value in your workload.

For example, datasets that were previously ~1.0–1.1 TB grow to ~1.2 TB with the current defaults, causing noticeable storage cost increase.

cc @wankunde and @cxzl25 from the original PR.

@cxzl25
Contributor

cxzl25 commented Mar 18, 2026

When writing ORC data files using ORC-1986, we observed an increase in the size of some tables from 1.0 TB to 1.2 TB. A random inspection of one ORC file showed that the number of Stripes grew from the original 180 to 527. This resulted in a lower compression ratio and significantly slower read performance for downstream jobs, increasing the execution time from 1 hour to 2 hours and 20 minutes.

Therefore, this might be a regression. Setting the defaults to 0 avoids the problem, and users who need the optimization can still enable these parameters cluster-wide.

@QianyongY
Contributor Author

@dongjoon-hyun We rolled out this optimization to 10% of our tables. Among them, 5 tables had partitions grow by more than 15% in size. The worst case was a single partition growing from 140 GB to over 660 GB, which is concerning. It's not a universal issue, but some tables were clearly affected.
