MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550)#552
MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550)#552alamb merged 6 commits intoapache:masterfrom
Conversation
Encodings.md
Outdated
|
|
||
| ### Supported Encodings | ||
|
|
||
| | Encoding type | Encoding enum | Encoding Targets <br> (Parquet 2.0.0+) | Encoding Targets <br> (Parquet 1.0.0+) | |
There was a problem hiding this comment.
I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec
I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?
There was a problem hiding this comment.
@alamb
Thank you for the review!
I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec
I agree on focusing on current versions spec. At some point it would be great to make the parquet site able to see the previous versions easily. For the table I will remove the last column and rename the thrid one.
And just a question, would Data Page V2 (header?) would be a better term in this case?
I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?
For PLAIN_DICTIONARY and RLE_DICTIONARY, I will merge the rows and mark PLAIN_DICTIONARY enum as deprecated.
For BIT_PACKED, since the deprecated encodings are still explained in the document and it is linked by other encodings , I thought it should be in the table and linked to the details. I think there are few options.
- Remove BIT_PACKED encoding from the table (your suggestion)
- Remove BIT_PACKED encoding description from the page and from the table (this may break links).
- Seperate currently supported and deprecated encodings as seperate tables, and change the layout of the page.
- Layout A:
supported encodings table
deprecated encodings table (only BIT_PACKED)
supported + deprecated encodings descriptions (current order) - Layout B:
supported encodings table
supported encodings descriptions (current order with out BIT_PACKED)
deprecated encodings table (only BIT_PACKED)
deprecated encodings descriptions (only BIT_PACKED) - Layout C:
supported encodings table
deprecated encodings table (only BIT_PACKED)
supported encodings descriptions (current order with out BIT_PACKED)
deprecated encodings descriptions (only BIT_PACKED)
Also about Encoding Targets column should I just list the physical types? removing other encoding targets (e.g. Repetition and definition levels)
There was a problem hiding this comment.
I removed v1 columns and seperated the table. If the deprecated encodings table is not needed I will remove it.
Link to the rendered page: https://github.com/nkaki/parquet-format/blob/master/Encodings.md
There was a problem hiding this comment.
I agree on focusing on current versions spec. At some point it would be great to make the parquet site able to see the previous versions easily.
I do not think there is consensus on what constitutes a "version" of the spec -- so unfortunately I think adding versions will be blocked until we can agree on what they mean. There are a bunch of discussions on the parquet mailing list if you want more of the backstory.
…gs.md) (apache#550) - remove v1 related column, and seperate tables for supported and deprecated encodings
alamb
left a comment
There was a problem hiding this comment.
Thank you @nkaki -- this looks like a great improvement to the documentation to me
FYI @emkornfield @wgtmac @julienledem in case you would also like to review
Add Dictionary indices to encoding targets Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
|
Thanks @nkaki -- I plan to leave this open for another few days in case anyone else would like a chance to comment. I plan to merge it sometime next week |
| used with any page type. | ||
|
|
||
| ### Supported Encodings | ||
|
|
There was a problem hiding this comment.
it might be good to add a note/link to the implementation status page to understand current support for each.
There was a problem hiding this comment.
@emkornfield
Thank you for the review!
I added note/link to the implementation status page.
fix typo Co-authored-by: Gang Wu <ustcwg@gmail.com>
|
Thank you so much @nkaki and thanks to @emkornfield and @wgtmac for the review -- I think this makes the encodings page significantly easier to navigate. We can continue to improve the documentation as follow on PRs if needed |
|
I wanted to follow up here and point out this change is now live on the parquet website: https://parquet.apache.org/docs/file-format/data-pages/encodings/ I think it looks quite nice; Thanks again @nkaki |
Rationale for this change
To make it easier for readers to get an overall picture of Parquet encodings.
What changes are included in this PR?
Adds a summary table to Encodings.md that lists the encoding types (each linked to its description), enums and targets for different Parquet format versions.
See rendered format here: https://github.com/nkaki/parquet-format/blob/master/Encodings.md
Do these changes have PoC implementations?
No - Documentation change only