fix: accept UTC timestamps in parquet writer #2477

Open
xanderbailey wants to merge 3 commits into apache:main from xanderbailey:xb/parquet_writer_accepts_UTC

Contributor

@xanderbailey xanderbailey commented May 20, 2026

Which issue does this PR close?

When writing to an Iceberg table with an Arrow record batch whose timestamp column uses the `UTC` time zone, the write fails with the following error:

 ArrowError(
    "Incompatible type. Field 'timestamp' has type Timestamp(µs, \"+00:00\"), array has type Timestamp(µs, \"UTC\")",
 ),

`+00:00` and `UTC` are semantically identical, so we should just do an Arrow cast (essentially a no-op) before we write.
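To make the equivalence concrete, here is a minimal std-only sketch of the comparison. `TimestampType`, `is_utc_alias`, and `semantically_equal` are hypothetical names for illustration, not types from arrow-rs or iceberg-rust, and the alias list is an assumption rather than an exhaustive set.

```rust
/// Hypothetical stand-in for an Arrow timestamp type. The unit is assumed to
/// be microseconds, so only the optional time-zone string is modelled.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct TimestampType {
    pub tz: Option<String>,
}

/// Returns true when `tz` is one of the spellings this sketch treats as UTC.
/// The list is an assumption for illustration only.
pub fn is_utc_alias(tz: &str) -> bool {
    matches!(tz, "UTC" | "+00:00" | "Z" | "Etc/UTC")
}

/// Two timestamp types are interchangeable when their time zones are equal,
/// or when both time zones are spellings of UTC.
pub fn semantically_equal(a: &TimestampType, b: &TimestampType) -> bool {
    match (a.tz.as_deref(), b.tz.as_deref()) {
        (None, None) => true,
        (Some(x), Some(y)) => x == y || (is_utc_alias(x) && is_utc_alias(y)),
        // A zoned timestamp is never equivalent to a zone-less one.
        _ => false,
    }
}
```

With this model, `Timestamp(µs, "UTC")` and `Timestamp(µs, "+00:00")` compare as equivalent, while a zone-less timestamp or a `+01:00` timestamp does not match either of them.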

I've tried to make this as cheap as possible without introducing the invariant that all incoming Arrow batches share the same schema. If we did make that assumption, we could store the indices of the columns that need to be cast and make this a little cheaper, but in practice this is an O(n_cols) check per batch, which I think is okay?
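As a rough illustration of the per-batch cost, the check can be sketched as a single pass over the column types. This is a std-only model, not the code in this PR: `ColumnType`, `needs_utc_cast`, and `columns_to_cast` are hypothetical names, and a real writer would compare Arrow `DataType` values instead.

```rust
/// Hypothetical, simplified column type used only for this sketch.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum ColumnType {
    Int64,
    Utf8,
    /// Microsecond timestamp with an optional time-zone string.
    TimestampMicros(Option<String>),
}

/// Spellings treated as UTC in this sketch (an assumption, not exhaustive).
fn is_utc(tz: &str) -> bool {
    matches!(tz, "UTC" | "+00:00")
}

/// True when `actual` differs from `expected` only in how UTC is spelled.
fn needs_utc_cast(expected: &ColumnType, actual: &ColumnType) -> bool {
    match (expected, actual) {
        (ColumnType::TimestampMicros(Some(e)), ColumnType::TimestampMicros(Some(a))) => {
            e != a && is_utc(e) && is_utc(a)
        }
        _ => false,
    }
}

/// One pass over the columns: O(n_cols) type comparisons, no row data touched.
/// Returns the indices of the columns that would need a cast before writing.
pub fn columns_to_cast(expected: &[ColumnType], actual: &[ColumnType]) -> Vec<usize> {
    expected
        .iter()
        .zip(actual)
        .enumerate()
        .filter_map(|(i, (e, a))| needs_utc_cast(e, a).then_some(i))
        .collect()
}
```

If every batch were guaranteed to share one schema, the result of `columns_to_cast` could be computed once and reused. Running it per batch avoids that invariant at the price of one type comparison per column.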

DataFusion has a similar mechanism for semantically identical timestamps, but it does the cast at the planning phase, which is a little cheaper.

I opened a PR in arrow-rs to fix this lower in the stack, but it was suggested that this should be handled by application code, so I'm opening a PR here to start that discussion.

What changes are included in this PR?

Are these changes tested?

Yes

type R = ParquetWriter;

async fn build(&self, output_file: OutputFile) -> Result<Self::R> {
    let arrow_schema: ArrowSchemaRef = Arc::new(self.schema.as_ref().try_into()?);
Contributor Author

This avoids converting the Iceberg schema to an Arrow schema for each batch.

Contributor

@blackmwk blackmwk left a comment

Thanks @xanderbailey for this PR! I think the feature is reasonable, but we should not maintain it in the Parquet writer module. I think the root cause is a misalignment between the Arrow data types and the Iceberg schema. We did something similar here, and I think we should merge this with it.

@xanderbailey
Contributor Author

Ah, very nice, thanks for taking a look! I hadn't seen that code.

Development

Successfully merging this pull request may close these issues.

Writing UTC timestamp to an iceberg table fails

2 participants