Skip to content

Allow UTC & +00:00 to be written in parquet writer#9994

Open
xanderbailey wants to merge 7 commits into
apache:mainfrom
xanderbailey:xb/relax_tz
Open

Allow UTC & +00:00 to be written in parquet writer#9994
xanderbailey wants to merge 7 commits into
apache:mainfrom
xanderbailey:xb/relax_tz

Conversation

@xanderbailey
Copy link
Copy Markdown
Contributor

@xanderbailey xanderbailey commented May 18, 2026

Which issue does this PR close?

Working towards: #3199

  • Closes #NNN.

Rationale for this change

types_compatible will error today like so:

 ArrowError(
    "Incompatible type. Field 'timestamp' has type Timestamp(µs, \"+00:00\"), array has type Timestamp(µs, \"UTC\")",
 ),

When datafusion creates a UTC timestamp but iceberg-rust is expecting to see a "+00:00". These two are semantically the same so we should allow the parquet write.

What changes are included in this PR?

Slacken the types_compatible to allow UTC equivalent timestamps when writing parquet.

Are these changes tested?

Are there any user-facing changes?

@github-actions github-actions Bot added the parquet Changes to the parquet crate label May 18, 2026
/// native array
fn types_compatible(a: &DataType, b: &DataType) -> bool {
// if the Arrow data types are equal, the types are deemed compatible
if a.equals_datatype(b) {
Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Open question if this new logic belongs in equals_datatype

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Moving to equals_datatype gets us the recursive checks for free so I think this might make more sense?

@github-actions github-actions Bot added the arrow Changes to the arrow crate label May 18, 2026
Copy link
Copy Markdown
Contributor

@alamb alamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for this PR @xanderbailey

})
})
}
// Timestamps with matching unit whose timezones are both recognized UTC
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems to me that this cast / comparsion should be done at the application layer (aka the thing calling the parquet writer). My rationale is that there are many other places in the codbase that compare arrays based on exact equality of DataType, and don't do this timezone normalization.

Perhaps in the iceberg code you could add a call to cast for all columns before you pass it to the parquet writer . The cast is a noop for arrays of the already correct type

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah I started drafting that but when I realised DataFusion has to do something similar I thought it might be worth pushing it lower in the stack.

Is there a case where we wouldn't want UTC to equal +00:00?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave it a go in iceberg-rust here apache/iceberg-rust#2477 to see what it would look like

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

arrow Changes to the arrow crate parquet Changes to the parquet crate

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants