Skip to content

[Draft RFC] Update guidance on versioning to use parquet-format version#588

Draft
emkornfield wants to merge 4 commits into
apache:masterfrom
emkornfield:versioning
Draft

[Draft RFC] Update guidance on versioning to use parquet-format version#588
emkornfield wants to merge 4 commits into
apache:masterfrom
emkornfield:versioning

Conversation

@emkornfield

Copy link
Copy Markdown
Contributor

@emkornfield emkornfield changed the title Update guidance on versioning to use parquet-format version [Draft RFC] Update guidance on versioning to use parquet-format version Jun 22, 2026
@emkornfield emkornfield marked this pull request as draft June 22, 2026 06:51

@steveloughran steveloughran left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

minor comments

Comment thread CONTRIBUTING.md
@@ -131,62 +138,67 @@ For the purposes of this discussion we classify features into the following buck

3. Forward incompatible. A file written under a newer version of the format with
the feature enabled cannot be read under an older version of the format (e.g.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

"read under an older version of the format" is a bit confusing here. Presumably it means "by a library/app only written to work with earlier versions"

Comment thread ParxMagicNumber.md Outdated
Comment thread ParxMagicNumber.md
| Field | Type | Offset | Description |
|----------------|---------|--------|-------------------------------------------------------------------|
| `metadata_len` | i32 LE | 0 | Byte length of the Thrift-encoded `FileMetaData` block |
| `flags` | u32 LE | 4 | Feature flags (see [Feature Flags](#feature-flags)) |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a permanent restriction to 32 flags; using a u64 would add minimal cost at this point and avoid complications if 33 flags were ever needed many years in the future

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I had the same thought, but I can also understand keeping this simple for now, as I think the intent is to use this bitfield sparingly. Most changes will be communicated with a major version bump, so it's only structural changes like a new footer format that will require a new bit position. I think any new footer we design should itself be extensible without requiring the use of additional bits here. As to clean-ups of the existing thrift structure, perhaps we should roll up the path_in_schema change with other changes that have been suggested before (remove encodings from ColumnMetaData, remove file_offset and file_path from ColumnChunk, I think @alkis had others).

If we are concerned about running out of bits, then we could instead say the footer isn't fixed in size, and use continuation bits to make the bitfield arbitrary in size. Or keep it fixed, and say bit 32 signifies an additional bitfield located in a TBD location in the footer.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel too strongly. I think even the ones listed only gets us less then 10 bits. And if we end up going with a new footer hopefully that would take 1 bit and hopefully keep this pretty low.

I do think it is a good idea to update the wording that this might not be fixed size (could grow in the future, but that would be detectable via inspecting the bitmap). Another option, given we expect this field to not evolve too quickly past an initial set of changes is to have bit 31, effectively be a version bit, which indicates all structural changes as of a certain date are always true.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One nice thing about 32 bits is it makes the current implementation a little easier since a valid parquet file today must always be at least 16 bytes (2*PAR1, 4 bytes required fields in thirft, 4 bytes metadata length).

Comment thread ParxMagicNumber.md

## Integrity Check

The `crc32` field holds a CRC-32 (ISO 3309 / ITU-T V.42 polynomial, the same used for page level CRC values)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

crc32s is where integrity checksums should be going, because x86 and ARM parts can do in a single instruction. java has a native method for this now.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went back and forth on this. I don't feel too strongly but having the same CRC used across the file ultimately seemed better, then having a one off difference.

@etseidl etseidl left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @emkornfield, I really like this idea.

Comment thread CONTRIBUTING.md
released. It is recommended that changing the default value for a forward
incompatible feature flag should be clearly advertised to consumers (e.g. via
3. Major version upgrades should not be enabled by default
until 2 years after the parquet-java implementation for the specification has been

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given the structural and procedural changes here, I wonder if the 2 year suggestion can be softened. Maybe just say these changes should not be enabled as defaults until sufficient time has been given for downstream consumers to adapt.

@emkornfield emkornfield Jun 22, 2026

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My main concern here is there we don't have full participation from everybody that has a parquet reader so knowing an exact time to release is still somewhat a guess (especially because some people never update there libraries). I think setting expectations that all readers have two years before the potentially start breaking change become widespread is a reasonable balance.

I'm open to other wording here or other views on a policy that actually enables us to upgrade in a reasonable time-frame but minimizes breakages.

Comment thread ParxMagicNumber.md

| Bit Index| Name | Description |
|----------|-------------------------|-------------------------------------------------------------------------------------------------------------|
| 0 | `ENCRYPTED_FOOTER` | The `FileMetaData` block is encrypted (equivalent to the `PARE` format). |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should we add these to parquet.thrift as constants?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Seems reasonable, I'm not clear on the support for constants in thrift, I'll do some research if we can align with this approach.

Comment thread ParxMagicNumber.md
@@ -0,0 +1,89 @@
# PARX Parquet Format Specification

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we should separate discussion of PARX and decouple it from parquet format versioning. The two are separate and I'm not sure we need to align on this in order to move forward with the format versioning.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The conversation that spurred this thread and a bunch of the discussion involved reader/writer coordination.

In particular the original doc posed these questions:

Is a deliberately curated capability level — "V3" — the right vehicle for setting shared reader/writer expectations, building on issue #384?

How far can and should we go in aligning the magic number, footer version, and release version?

The PARX format is one means of solving this coordination, which I think satisfies requirements from the community, so I'm not sure it makes sense to decouple it completely?

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think if we're going to design a new magic number it should be done independently and then we use that representation to effect a change like this. I don't think they should be bundled together.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we are talking specifically about the change to path_in_schema, I agree, we can decouple these once we get alignment on the overall direction. The reason for coupling them in this PR is to illustrate how it would be used in practice.

Comment on lines +883 to +889
/** Path in schema
*
* Made optional in parquet-format 3.0. If not written
* PARX magic number must be used (PATH_IN_SCHEMA_OMITTED, bit 1 in the
* feature flag bitmap, must be set).
**/
3: optional list<string> path_in_schema

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're changing this now, does that mean that we can't make other changes without releasing 3 first?

It feels like this is something we would change prior to a v3 release not make the change, still continue to release v2 features (e.g. new types) and later do a v3 release.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we're changing this now, does that mean that we can't make other changes without releasing 3 first?

Yes, once merged the next release off of main would be V3, and I expect given community bandwidth this is the most likely outcome

It feels like this is something we would change prior to a v3 release not make the change, still continue to release v2 features (e.g. new types) and later do a v3 release.

My proposal here is we can backport forward compatible changes. I added the following the proposal:

If there is contributor bandwidth forwards-compatible features may be backported to previous major versions of the specification and released as a new minor version.

I think this gets us out of the trap of trying curate releases, and still allows writers that don't want to upgrade a path to back-porting if there is enough demand.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants