[Draft RFC] Update guidance on versioning to use parquet-format version#588
[Draft RFC] Update guidance on versioning to use parquet-format version#588emkornfield wants to merge 4 commits into
Conversation
| @@ -131,62 +138,67 @@ For the purposes of this discussion we classify features into the following buck | |||
|
|
|||
| 3. Forward incompatible. A file written under a newer version of the format with | |||
| the feature enabled cannot be read under an older version of the format (e.g. | |||
There was a problem hiding this comment.
"read under an older version of the format" is a bit confusing here. Presumably it means "by a library/app only written to work with earlier versions"
| | Field | Type | Offset | Description | | ||
| |----------------|---------|--------|-------------------------------------------------------------------| | ||
| | `metadata_len` | i32 LE | 0 | Byte length of the Thrift-encoded `FileMetaData` block | | ||
| | `flags` | u32 LE | 4 | Feature flags (see [Feature Flags](#feature-flags)) | |
There was a problem hiding this comment.
That's a permanent restriction to 32 flags; using a u64 would add minimal cost at this point and avoid complications if 33 flags were ever needed many years in the future
There was a problem hiding this comment.
I had the same thought, but I can also understand keeping this simple for now, as I think the intent is to use this bitfield sparingly. Most changes will be communicated with a major version bump, so it's only structural changes like a new footer format that will require a new bit position. I think any new footer we design should itself be extensible without requiring the use of additional bits here. As to clean-ups of the existing thrift structure, perhaps we should roll up the path_in_schema change with other changes that have been suggested before (remove encodings from ColumnMetaData, remove file_offset and file_path from ColumnChunk, I think @alkis had others).
If we are concerned about running out of bits, then we could instead say the footer isn't fixed in size, and use continuation bits to make the bitfield arbitrary in size. Or keep it fixed, and say bit 32 signifies an additional bitfield located in a TBD location in the footer.
There was a problem hiding this comment.
I don't feel too strongly. I think even the ones listed only gets us less then 10 bits. And if we end up going with a new footer hopefully that would take 1 bit and hopefully keep this pretty low.
I do think it is a good idea to update the wording that this might not be fixed size (could grow in the future, but that would be detectable via inspecting the bitmap). Another option, given we expect this field to not evolve too quickly past an initial set of changes is to have bit 31, effectively be a version bit, which indicates all structural changes as of a certain date are always true.
There was a problem hiding this comment.
One nice thing about 32 bits is it makes the current implementation a little easier since a valid parquet file today must always be at least 16 bytes (2*PAR1, 4 bytes required fields in thirft, 4 bytes metadata length).
|
|
||
| ## Integrity Check | ||
|
|
||
| The `crc32` field holds a CRC-32 (ISO 3309 / ITU-T V.42 polynomial, the same used for page level CRC values) |
There was a problem hiding this comment.
crc32s is where integrity checksums should be going, because x86 and ARM parts can do in a single instruction. java has a native method for this now.
There was a problem hiding this comment.
I went back and forth on this. I don't feel too strongly but having the same CRC used across the file ultimately seemed better, then having a one off difference.
etseidl
left a comment
There was a problem hiding this comment.
Thanks @emkornfield, I really like this idea.
| released. It is recommended that changing the default value for a forward | ||
| incompatible feature flag should be clearly advertised to consumers (e.g. via | ||
| 3. Major version upgrades should not be enabled by default | ||
| until 2 years after the parquet-java implementation for the specification has been |
There was a problem hiding this comment.
Given the structural and procedural changes here, I wonder if the 2 year suggestion can be softened. Maybe just say these changes should not be enabled as defaults until sufficient time has been given for downstream consumers to adapt.
There was a problem hiding this comment.
My main concern here is there we don't have full participation from everybody that has a parquet reader so knowing an exact time to release is still somewhat a guess (especially because some people never update there libraries). I think setting expectations that all readers have two years before the potentially start breaking change become widespread is a reasonable balance.
I'm open to other wording here or other views on a policy that actually enables us to upgrade in a reasonable time-frame but minimizes breakages.
|
|
||
| | Bit Index| Name | Description | | ||
| |----------|-------------------------|-------------------------------------------------------------------------------------------------------------| | ||
| | 0 | `ENCRYPTED_FOOTER` | The `FileMetaData` block is encrypted (equivalent to the `PARE` format). | |
There was a problem hiding this comment.
Should we add these to parquet.thrift as constants?
There was a problem hiding this comment.
Seems reasonable, I'm not clear on the support for constants in thrift, I'll do some research if we can align with this approach.
| @@ -0,0 +1,89 @@ | |||
| # PARX Parquet Format Specification | |||
There was a problem hiding this comment.
I think we should separate discussion of PARX and decouple it from parquet format versioning. The two are separate and I'm not sure we need to align on this in order to move forward with the format versioning.
There was a problem hiding this comment.
The conversation that spurred this thread and a bunch of the discussion involved reader/writer coordination.
In particular the original doc posed these questions:
Is a deliberately curated capability level — "V3" — the right vehicle for setting shared reader/writer expectations, building on issue #384?
How far can and should we go in aligning the magic number, footer version, and release version?
The PARX format is one means of solving this coordination, which I think satisfies requirements from the community, so I'm not sure it makes sense to decouple it completely?
There was a problem hiding this comment.
I think if we're going to design a new magic number it should be done independently and then we use that representation to effect a change like this. I don't think they should be bundled together.
There was a problem hiding this comment.
If we are talking specifically about the change to path_in_schema, I agree, we can decouple these once we get alignment on the overall direction. The reason for coupling them in this PR is to illustrate how it would be used in practice.
| /** Path in schema | ||
| * | ||
| * Made optional in parquet-format 3.0. If not written | ||
| * PARX magic number must be used (PATH_IN_SCHEMA_OMITTED, bit 1 in the | ||
| * feature flag bitmap, must be set). | ||
| **/ | ||
| 3: optional list<string> path_in_schema |
There was a problem hiding this comment.
If we're changing this now, does that mean that we can't make other changes without releasing 3 first?
It feels like this is something we would change prior to a v3 release not make the change, still continue to release v2 features (e.g. new types) and later do a v3 release.
There was a problem hiding this comment.
If we're changing this now, does that mean that we can't make other changes without releasing 3 first?
Yes, once merged the next release off of main would be V3, and I expect given community bandwidth this is the most likely outcome
It feels like this is something we would change prior to a v3 release not make the change, still continue to release v2 features (e.g. new types) and later do a v3 release.
My proposal here is we can backport forward compatible changes. I added the following the proposal:
If there is contributor bandwidth forwards-compatible features may be backported to previous major versions of the specification and released as a new minor version.
I think this gets us out of the trap of trying curate releases, and still allows writers that don't want to upgrade a path to back-porting if there is enough demand.
Concrete proposal for discussion: https://lists.apache.org/thread/8qjyqrbkwj14hrpk8f806bwy2k3m3ds1