Skip to content

[TASK-431] Byte-sized cap instead of hard one for ArrowWriter#443

Open
fresh-borzoni wants to merge 2 commits intoapache:mainfrom
fresh-borzoni:byte-sized-cap
Open

[TASK-431] Byte-sized cap instead of hard one for ArrowWriter#443
fresh-borzoni wants to merge 2 commits intoapache:mainfrom
fresh-borzoni:byte-sized-cap

Conversation

@fresh-borzoni
Copy link
Contributor

Summary

closes #431

Arrow log batches were previously capped at a hard 256-record limit regardless of actual data size, meaning a batch of 256 tiny ints (~1KB) was treated the same as 256 large rows (~10MB). This replaces that fixed cap with byte-size-based fullness matching Java's ArrowWriter, so batches now fill to the configured writer_batch_size (default 2MB).

The fullness check uses a threshold-based optimization to avoid computing sizes on every append, it estimates how many records should fit, skips checks until that count is reached, then recalculates. Size estimation reads buffer lengths directly from Arrow builders (O(num_columns), zero allocation), with a pre-computed IPC overhead constant that captures metadata and body framing for the schema.

Compression is accounted for via an adaptive ArrowCompressionRatioEstimator shared across batches for the same table. It starts at 1.0 (assume no compression) and adjusts asymmetrically after each batch is serialized, quick to increase when compression worsens, slow to decrease when it improves. This matches Java's conservative approach to avoid underestimating batch sizes.

Also aligns VARIABLE_WIDTH_AVG_BYTES (64 -> 8) and INITIAL_ROW_CAPACITY (256 -> 1024) with Java Arrow defaults.

Writer pooling (ArrowWriterPool) and DynamicWriteBatchSizeEstimator are out of scope -> TODOs added.

@fresh-borzoni
Copy link
Contributor Author

@fresh-borzoni
Copy link
Contributor Author

Created issues for TODOs: #444 #445

Copy link
Contributor

@charlesdong1991 charlesdong1991 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very nice PR! leave some minor comments/questions

@fresh-borzoni
Copy link
Contributor Author

@charlesdong1991 Ty for the review, addressed comments
PTAL 🙏

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Switch DEFAULT_MAX_RECORD hard cap to byte-size

2 participants