fix: accept UTC timestamps in parquet writer #2477
Open
xanderbailey wants to merge 3 commits into
xanderbailey
commented
May 20, 2026
    type R = ParquetWriter;

    async fn build(&self, output_file: OutputFile) -> Result<Self::R> {
        let arrow_schema: ArrowSchemaRef = Arc::new(self.schema.as_ref().try_into()?);
Contributor
Author
Avoids converting to arrow for each batch
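A minimal sketch of that idea, using only the standard library. All of the types here (`TableSchema`, `ConvertedSchema`, `WriterBuilder`, `Writer`) and the `convert` function are hypothetical stand-ins invented for illustration, not the iceberg-rust or arrow-rs API. The point is only that the fallible conversion runs once in `build`, and the per-batch path reuses the shared result through an `Arc`.

```rust
use std::sync::Arc;

/// Hypothetical stand-in for the table schema.
struct TableSchema {
    field_names: Vec<String>,
}

/// Hypothetical stand-in for the converted (Arrow-side) schema.
#[derive(Debug, PartialEq)]
struct ConvertedSchema {
    field_names: Vec<String>,
}

/// Stand-in for the fallible schema conversion.
fn convert(schema: &TableSchema) -> Result<ConvertedSchema, String> {
    if schema.field_names.is_empty() {
        return Err("schema has no fields".to_string());
    }
    Ok(ConvertedSchema {
        field_names: schema.field_names.clone(),
    })
}

struct WriterBuilder {
    schema: Arc<TableSchema>,
}

struct Writer {
    converted: Arc<ConvertedSchema>,
    batches_written: usize,
}

impl WriterBuilder {
    /// Converts the schema once; every later write reuses the shared result.
    fn build(&self) -> Result<Writer, String> {
        let converted = Arc::new(convert(&self.schema)?);
        Ok(Writer {
            converted,
            batches_written: 0,
        })
    }
}

impl Writer {
    /// Per-batch path: clones the Arc pointer only, no conversion happens here.
    fn write(&mut self) -> Arc<ConvertedSchema> {
        self.batches_written += 1;
        Arc::clone(&self.converted)
    }
}
```

Cloning an `Arc` is a reference-count increment, so the per-batch cost no longer depends on the size of the schema.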
blackmwk
reviewed
May 21, 2026
Contributor
blackmwk
left a comment
Thanks @xanderbailey for this PR! I think the feature is reasonable, but we should not maintain it in the Parquet writer module. I think the root cause is a misalignment between the Arrow data types and the Iceberg schema. We did something similar in here, and I think we should merge this with it.
Contributor
Author
Ah very nice, thanks for taking a look! That's very nice, I didn't see that code.
Which issue does this PR close?
When writing to an Iceberg table with an Arrow batch whose timestamp columns carry the UTC timezone, we fail with the following error.
+00:00 and UTC are semantically identical, so we should just do an Arrow cast (essentially a no-op) before we write. I've tried to make this as cheap as possible without introducing the invariant that all incoming Arrow batches share the same schema. If we did make that assumption, we could store the indices of the columns that need to be cast and make this a little cheaper, but in reality this is an O(n_cols) check per batch, which I think is okay?
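The per-batch check described above can be sketched without any Arrow dependency. This is a simplified, hypothetical model, not the code in this PR and not the arrow-rs API: `FieldType`, `is_utc_alias`, `differs_only_in_utc_spelling`, and `columns_needing_cast` are names invented for illustration, and the set of spellings treated as UTC is an assumption. It shows a single O(n_cols) pass that collects the columns whose timestamp timezone differs from the expected one only in spelling.

```rust
/// Simplified stand-in for an Arrow data type (hypothetical, for illustration only).
#[derive(Debug, Clone, PartialEq)]
enum FieldType {
    Int64,
    /// Timestamp with an optional timezone string such as "UTC" or "+00:00".
    Timestamp(Option<String>),
}

/// Returns true if `tz` is one of the spellings this sketch treats as UTC.
/// The accepted set is an assumption, not an exhaustive list.
fn is_utc_alias(tz: &str) -> bool {
    matches!(tz, "UTC" | "+00:00")
}

/// True when both types are timestamps whose timezone strings differ
/// but both denote UTC.
fn differs_only_in_utc_spelling(expected: &FieldType, actual: &FieldType) -> bool {
    match (expected, actual) {
        (FieldType::Timestamp(Some(a)), FieldType::Timestamp(Some(b))) => {
            a != b && is_utc_alias(a) && is_utc_alias(b)
        }
        _ => false,
    }
}

/// Scans the batch schema once, O(n_cols), and returns the indices of the
/// columns that would need a cast. Returns `None` if the schemas differ in
/// length or in any way other than the UTC spelling.
fn columns_needing_cast(expected: &[FieldType], actual: &[FieldType]) -> Option<Vec<usize>> {
    if expected.len() != actual.len() {
        return None;
    }
    let mut to_cast = Vec::new();
    for (i, (e, a)) in expected.iter().zip(actual).enumerate() {
        if e == a {
            continue; // exact match, nothing to do for this column
        }
        if differs_only_in_utc_spelling(e, a) {
            to_cast.push(i);
        } else {
            return None; // genuine type mismatch, not a spelling difference
        }
    }
    Some(to_cast)
}
```

An empty result means the batch can be written as is. Storing the returned indices across batches would be the cheaper variant mentioned above, at the cost of assuming every incoming batch has the same schema.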
DataFusion has a similar mechanism for semantically identical timestamps, but it does the cast at the planning phase, which is a little cheaper.
I opened a PR in arrow-rs to fix this lower in the stack, but it was suggested that this should be handled by application code, so I'm opening a PR here to start that discussion.
What changes are included in this PR?
Are these changes tested?
Yes