[Parquet]: GH-563: Make path_in_schema optional#9678
Conversation
|
less bloat for the win! |
|
A lot of other parquet implementations require this field, due to their generated thrift parser, even if they do not actually use the field for anything. I would be totally in favor of deprecating and skipping the field, but maybe a more compatible alternative would be to write the field as an empty list instead. |
Well, parquet-java actually uses the field to populate its version of |
See also related mailing list discussion: |
alamb
left a comment
There was a problem hiding this comment.
TLDR is I think this is a great idea. I also think it woudl be ok to merge this into arrow-rs even if there is not a broader consensus on the mailing list that we should do it in the format itself
My thinking is that some usecases are basically using parquet with the same read/writer and compatibility with older java based implementations is not important. This is the same thing for some of the newer encodings too
Letting users choose between "better/faster/stronger" and "more compatible" I think is very much a good idea
| /// | ||
| /// The `path_in_schema` field in the Thrift metadata is redundant and wastes a great | ||
| /// deal of space. Parquet file footers can be made much smaller by omitting this field. | ||
| /// Because the field was originally a mandatory field, this property defaults to `true` |
There was a problem hiding this comment.
If we choose to go this way I think it would help to give some more context here on what types of readers would be affected (basically all the parquet-java based readers prior to late 2026)
We could also perhaps provide a link to the discussion: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
| /// Should the writer should emit the `path_in_schema` element of the | ||
| /// `ColumnMetaData` Thrift struct. |
There was a problem hiding this comment.
I think it is worth making it more apparently that this will cause incompatibilities with some older readers:
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. | |
| /// Should the writer should emit the `path_in_schema` element of the | |
| /// `ColumnMetaData` Thrift struct. WARNING: this will make the parquet | |
| /// file unreadable by some older parquet readers. See LINK HERE for details |
|
I added a list of related PRs to this issue's description: |
|
I wonder if we should just merge this into the Rust implementation of Parquet (with a caveat that this will make metadata smaller / more effiicent at the cost of incompatibility with older / java based readers)? |
|
I'm game. I don't see a reason not to since it's opt in |
| { | ||
| let buf = TrackedWrite::new(&mut buffer); | ||
| let writer = ParquetMetaDataWriter::new_with_tracked(buf, &metadata); | ||
| let writer = ParquetMetaDataWriter::new_with_tracked(buf, &metadata) |
There was a problem hiding this comment.
Should we perhaps make a new benchmark so we can evaluate the performance both in default mode and wihtout this part of the metadata?
There was a problem hiding this comment.
Yes, that's probably worthwhile. I had made this modification for the purposes of the parquet-format submission.
ac323f0 to
1d4eb04
Compare
|
run benchmark metadata |
|
🤖 Arrow criterion benchmark running (GKE) | trigger CPU Details (lscpu)Comparing deprecate_path_in_schema (feaae15) to f725bc9 (merge-base) diff File an issue against this benchmark runner |
|
🤖 Arrow criterion benchmark completed (GKE) | trigger Instance: CPU Details (lscpu)Details
Resource Usagebase (merge-base)
branch
File an issue against this benchmark runner |
alamb
left a comment
There was a problem hiding this comment.
I think this PR is ready except for:
- Don't change the existing writer benchmark (otherwise we are not benchmarking with default settings)
- Update the docs a bit more to explain the implications of not writing path_in_schema
I tried to address these in 2a06837. PTAL |
|
I had a few comments that github won't seem to let me post It might help to explain what files it helps the most for: It might help to be a little more specific about what readers this effects |
|
Thanks @alamb. I added some more motivation for it's use. Please take another look 🙏 |
I've been having quite a few github problems lately :( |
Perhaps related to this: https://github.blog/news-insights/company-news/an-update-on-github-availability/
(aka everyone is having them 😆 ) |
|
Thanks @etseidl |
@etseidl @alamb This is true, but if you follow this logic to is conclusion, then each implementation could come up with new encodings themselves without broader consensus in the ecosystem? Is that the direction the arrow-rs implementation aims to take? IIUC this would allow writing files that would fail parsing from other readers due to the field currently being required? If so, while it seems like there is general consensus on the parquet mailing list to transition to this, doing before it is adopted in parquet-format seems like a small risk of fragmenting the ecosystem? CC @wgtmac |
Well, some other readers based on older java would fail but some (e.g. DuckDB, polars) can read the files just fine.
This is true. though the same argument can be applied to V2 Data pages and other encodings like byte stream split that are not supported by other implementations. In my mind there are basically two benefits to using the (Rust) implementation of parquet:
I think there are a bunch of use cases where systems use parquet internally and either
Many of the early adopters of Vortex fall into this second category (as does InfluxData). My goal with this setting is to cater to the second category and allow people to take advantage of the all the engineering in the Rust implementation |
|
@emkornfield I understand your concern, and would not normally be in favor of putting the cart this far in front of the horse. But in this specific instance, we have a situation where we do not internally depend on the field, and for the last two releases have not even parsed it. We felt that the potential gains were worth making available to downstream users as an experimental feature, albeit with very prominent warnings that use of the feature will break compatibility. Call me cautiously optimistic that the spec will follow soon. |

Which issue does this PR close?
none
Rationale for this change
This is a proof of concept implementation for apache/parquet-format#563
What changes are included in this PR?
Since version 57.0.0, this crate has been tolerant of a missing
path_in_schema. This PR adds options to cease writing the field as well. The option defaults to continuing to write the field.See related discussion on parquet mailing list: https://lists.apache.org/thread/czm2bk45wwtkhhpqxqvmx9dk5wkwk1kt
Are these changes tested?
Yes
Are there any user-facing changes?
No, this only adds an optional behavior change that defaults to no change
Related PRs
path_in_schemaoptional parquet-format#563ColumnMetaData.path_in_schemaoptional parquet-format#564path_in_schemaoptional parquet-java#3470