feat: Honor compression settings for metadata.json on write #1876
emkornfield wants to merge 62 commits into apache:main from
Conversation
Previously these properties were not honored when set as table properties.
- Adds table properties for these values.
- Plumbs them through for writers.
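For illustration, a minimal sketch of setting the property this PR honors; the key (`write.metadata.compression-codec`) is quoted from the PR description below, everything else here is hypothetical:

```rust
use std::collections::HashMap;

fn main() {
    // Hypothetical setup: "gzip" compresses metadata.json on write;
    // "none" leaves it uncompressed.
    let mut properties: HashMap<String, String> = HashMap::new();
    properties.insert(
        "write.metadata.compression-codec".to_string(),
        "gzip".to_string(),
    );
    assert_eq!(
        properties
            .get("write.metadata.compression-codec")
            .map(String::as_str),
        Some("gzip"),
    );
}
```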
```diff
     key: &str,
     default: T,
-) -> Result<T, anyhow::Error>
+) -> crate::error::Result<T>
```
```diff
-) -> crate::error::Result<T>
+) -> Result<T>
```
Unnecessary fully qualified name.
```rust
    /// The target file size for files.
    pub write_target_file_size_bytes: usize,
    /// Compression codec for metadata files (JSON)
    pub metadata_compression_codec: String,
```
This should be something like `Option<String>`; the "none" codec should be represented as `None` in memory.
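A minimal sketch of that suggestion; the surrounding struct name is hypothetical, only the field comes from the diff above:

```rust
/// Hypothetical struct holding the sketched field; only the field below
/// comes from the diff under discussion.
pub struct WriteConfig {
    /// Compression codec for metadata files (JSON).
    /// `None` represents the "none" codec, i.e. no compression,
    /// rather than a magic "none" string in memory.
    pub metadata_compression_codec: Option<String>,
}
```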
```rust
// Modify filename to add .gz before .metadata.json
let location = metadata_location.as_ref();
let new_location = if location.ends_with(".metadata.json") {
```
I don't think we should simply replace the suffix with this approach. Currently all metadata location generation logic lives in `MetadataLocation`; we should add a new method like the following and deprecate the old one:

```rust
impl MetadataLocation {
    #[deprecated(note = "Use new_with_table instead.")]
    pub fn new_with_table_location(table_location: impl ToString) -> Self {
        // ...
    }

    pub fn new_with_table(table: &Table) -> Self {
        // ...
    }
}
```
The table option seemed a little awkward with the control flow on how it is used; instead I added a method that takes table properties. I also added a similar method for incrementing the version number.
I modified this logic a bit and kept it. My concern here is that if we only deprecate `new_with_table_location`, then users can still write out metadata.json files in a format that would be unreadable by other readers.
kevinjqliu left a comment
Generally LGTM, a few nit comments.
would be good to get a second pair of 👀
```rust
    /// Creates a completely new metadata location starting at version 0,
    /// with compression settings from the table's properties.
    /// Only used for creating a new table. For updates, see `with_next_version`.
```
```diff
-/// Only used for creating a new table. For updates, see `with_next_version`.
+/// Only used for creating a new table. For updates, see `with_next_version_and_properties`.
```
Since `with_next_version` is also deprecated.
I'll wait until I understand @liurenjie1024's concerns about `with_next_version_and_properties` before updating this.
liurenjie1024 left a comment
Thanks @emkornfield for this PR, it looks much better now but still needs some refinements.
```rust
    /// Creates a new metadata location for an updated metadata file.
    /// Takes table properties to determine compression settings, which may have changed
    /// from the previous version.
    pub fn with_next_version_and_properties(&self, properties: &HashMap<String, String>) -> Self {
```
Can you clarify what you mean by "we don't need it"? I updated crates/iceberg/src/catalog/mod.rs to use it; if new properties change compression settings, I'm not sure how this would be reflected otherwise. I may have missed it, but I don't think there is anything in the spec that prevents users from changing compression properties.
If properties have been changed, we should build a new instance of the MetadataLocation struct.
> If properties have been changed, we should build a new instance of the MetadataLocation struct.
My main concern here is that I don't think clients of `MetadataLocation` would actually understand this and check for a new version of the metadata (i.e. the path of least resistance is to call `with_next_version` on whatever metadata they used for creation). As a solution, I've created a static factory `next_version` which combines the two steps for the only call site that exists for `with_next_version`. If you think it is better not to deprecate `with_next_version`, I can also revert this change.
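A rough sketch of what such a `next_version` factory could look like, assuming it simply composes the previous location with `with_next_version_and_properties` from the hunk above; the body is illustrative, not the PR's actual code:

```rust
use std::collections::HashMap;

impl MetadataLocation {
    /// Illustrative only: a static factory that combines "take the previous
    /// location" and "re-read the (possibly updated) compression properties"
    /// into one call, so callers cannot forget the second step.
    pub fn next_version(
        previous: &MetadataLocation,
        properties: &HashMap<String, String>,
    ) -> Self {
        previous.with_next_version_and_properties(properties)
    }
}
```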
Co-authored-by: Kevin Liu <kevinjqliu@users.noreply.github.com>
…nfield/iceberg-rust into feat/metadata-compression
```rust
/// Compression codec for metadata files (JSON).
#[derive(Debug, PartialEq)]
pub enum MetadataCompressionCodec {
```
We already have a `CompressionCodec`; we should move it to the spec module and reuse it.
Note that that compression codec doesn't actually include GZIP, which means extra validation for this code path. I'm not opposed to it, but given that it will take extra logic to validate the allowed options both for this PR and for the existing code in puffin (which must not allow writing GZIP), and that we also have a different concept for Avro files in #1851, it seems better to do the refactoring in a separate PR once everything is merged (conveniently reaching the rule of three?).
I have concerns with this approach. OSS development is somewhat different from enterprise software development: if everyone just moves forward with a lot of divergences and leaves the cleanup to others, the codebase will become a mess. I think a better approach would be to have another PR to move the compression codec enum into the spec module, and then we can move on with this PR after it gets merged.
OK, thank you for the guidance.
Updated to use that code.
## Which issue does this PR close?

As discussed in the PR for [allowing compressed metadata.json](#1876 (comment)) writes, we want `CompressionCodec` in a central place so it can be used in puffin and for other compression use-cases. Happy to move it to `spec/` as suggested in the original comment, but this seemed more natural.

## What changes are included in this PR?

This moves compression.rs to the top level (this seems like the best location for common code, but please let me know if there is a better place).
- It adds Gzip as a compression option and replaces the current direct use of the GzEncoder package.
- It adds validation in puffin that Gzip is not currently supported (per the spec).

## Are these changes tested?

Added unit tests to cover the additions; existing tests cover the rest.
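Roughly, the shape the shared enum could take after that follow-up; this is a sketch rather than the merged code, with the non-Gzip variants assumed from what puffin already supports:

```rust
/// Sketch of the shared codec after the follow-up PR; Lz4/Zstd reflect
/// the codecs puffin already uses, and Gzip is the addition described above.
#[derive(Debug, PartialEq)]
pub enum CompressionCodec {
    None,
    Gzip, // added for metadata.json; puffin validates against it per spec
    Lz4,
    Zstd,
}
```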
blackmwk left a comment
Thanks @emkornfield for this PR, we are quite close!
@blackmwk thanks for the review. I think we need to close on a direction for this: https://github.com/apache/iceberg-rust/pull/1876/files#r2752454670
If we can make breaking changes then:
Hi, @emkornfield
I think this is the right direction to go.
This is the place where we have to make a breaking change, so I think it's fine.
```rust
    #[test]
    fn test_parse_metadata_file_compression_valid() {
        use super::parse_metadata_file_compression;
```
Need to move these imports out and clean up this code a little bit.
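One way to read that suggestion, as a sketch: hoist the import to the test module instead of repeating it per test (the module layout here is assumed):

```rust
#[cfg(test)]
mod tests {
    // Hoisted once for the whole module instead of repeating inside each test.
    use super::parse_metadata_file_compression;

    #[test]
    fn test_parse_metadata_file_compression_valid() {
        // assertions elided; see the diff above
    }
}
```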
@blackmwk thanks for the review, I believe I addressed the comments.
## Which issue does this PR close?

Split off from #1851.

## What changes are included in this PR?

This change honors the compression setting for metadata.json files (`write.metadata.compression-codec`).

## Are these changes tested?

Added a unit test to verify files are gzipped when the flag is enabled.

BREAKING CHANGE: Make `write_to` take `MetadataLocation`.
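A hedged sketch of that breaking change, with both signatures illustrative rather than copied from the diff:

```rust
// Before (illustrative): the writer took a raw location string.
// pub fn write_to(&self, file_io: &FileIO, metadata_location: &str) -> Result<()>;

// After (illustrative): it takes a MetadataLocation, which already carries
// the version and the compression-dependent file-name suffix derived from
// the table properties.
// pub fn write_to(&self, file_io: &FileIO, metadata_location: &MetadataLocation) -> Result<()>;
```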