MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550) #552
Changes from 2 commits
bb33d64
2d23f22
5752d1d
d9c9c5e
7dfd068
cfd0c82
| Original file line number | Diff line number | Diff line change |
|---|---|---|
|
|
@@ -25,6 +25,20 @@ This file contains the specification of all supported encodings. | |
| Unless otherwise stated in page or encoding documentation, any encoding can be | ||
| used with any page type. | ||
|
|
||
| ### Supported Encodings | ||
|
|
||
| | Encoding type | Encoding enum | Encoding Targets <br> (Parquet 2.0.0+) | Encoding Targets <br> (Parquet 1.0.0+) | | ||
|
Contributor
I think we have been trying to avoid the nomenclature of "parquet 2.0", as its definition is not universally agreed upon. I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec. I am also not sure about the differences between encoding targets (e.g. PLAIN_DICTIONARY). Maybe we can simply leave that out of the table, since it has been deprecated?
Contributor
Author
@alamb
I agree on focusing on the current version of the spec. At some point it would be great to make previous versions easy to view on the Parquet site. For the table, I will remove the last column and rename the third one. One question: would "Data Page V2" (header?) be a better term in this case?
For PLAIN_DICTIONARY and RLE_DICTIONARY, I will merge the rows and mark the PLAIN_DICTIONARY enum as deprecated. For BIT_PACKED, since the deprecated encodings are still explained in the document and are linked from other encodings, I thought it should be in the table and linked to the details. I think there are a few options.
Also, for the Encoding Targets column, should I list only the physical types and remove the other encoding targets (e.g. repetition and definition levels)?
Contributor
Author
I removed the v1 columns and separated the table. If the deprecated encodings table is not needed, I will remove it. Link to the rendered page: https://github.com/nkaki/parquet-format/blob/master/Encodings.md
Contributor
I do not think there is consensus on what constitutes a "version" of the spec, so unfortunately I think adding versions will be blocked until we can agree on what they mean. There are a number of discussions on the Parquet mailing list if you want more of the backstory. |
||
| | ------------------------------------------------ | --------------------------------- | ----------------------------------------------------------------------------------- | -------------------------------------- | | ||
| | [Plain](#PLAIN) | PLAIN = 0 | All Physical Types <br> Dictionary entries in dictionary page | All Physical Types | | ||
| | [Dictionary Encoding (Plain)](#DICTIONARY) | PLAIN_DICTIONARY = 2 | Deprecated | All Physical Types | | ||
| | [Run Length Encoding / Bit-Packing Hybrid](#RLE) | RLE = 3 | BOOLEAN <br> Repetition and definition levels <br> Dictionary indices in data pages | Repetition and definition levels | | ||
| | [Bit-packed](#BITPACKED) | BIT_PACKED = 4 | Deprecated | Repetition and definition levels | | ||
| | [Delta Encoding](#DELTAENC) | DELTA_BINARY_PACKED = 5 | INT32, INT64 | N/A | | ||
| | [Delta-length byte array](#DELTALENGTH) | DELTA_LENGTH_BYTE_ARRAY = 6 | BYTE_ARRAY | N/A | | ||
| | [Delta Strings](#DELTASTRING) | DELTA_BYTE_ARRAY = 7 | BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY | N/A | | ||
| | [Dictionary Encoding (RLE)](#DICTIONARY) | RLE_DICTIONARY = 8 | All Physical Types | N/A | | ||
| | [Byte Stream Split](#BYTESTREAMSPLIT) | BYTE_STREAM_SPLIT = 9 | FLOAT, DOUBLE (2.8.0+) <br> INT32, INT64, FIXED_LEN_BYTE_ARRAY (2.11.0+) | N/A | | ||
|
|
||
| <a name="PLAIN"></a> | ||
| ### Plain: (PLAIN = 0) | ||
|
|
||
|
|
@@ -50,6 +64,7 @@ For native types, this outputs the data as little endian. Floating | |
| For the byte array type, it encodes the length as a 4 byte little | ||
| endian, followed by the bytes. | ||
|
|
||
| <a name="DICTIONARY"></a> | ||
| ### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8) | ||
| The dictionary encoding builds a dictionary of values encountered in a given column. The | ||
| dictionary will be stored in a dictionary page per column chunk. The values are stored as integers | ||
|
|
@@ -295,6 +310,7 @@ The encoded data is | |
| This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page. | ||
| The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header. | ||
|
|
||
| <a name="DELTALENGTH"></a> | ||
| ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6) | ||
|
|
||
| Supported Types: BYTE_ARRAY | ||
|
|
@@ -317,6 +333,7 @@ then the encoded data would be comprised of the following segments: | |
| - DeltaEncoding(5, 5, 6, 6) (the string lengths) | ||
| - "HelloWorldFoobarABCDEF" | ||
|
|
||
| <a name="DELTASTRING"></a> | ||
| ### Delta Strings: (DELTA_BYTE_ARRAY = 7) | ||
|
|
||
| Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY | ||
|
|
@@ -338,6 +355,7 @@ then the encoded data would be comprised of the following segments: | |
|
|
||
| Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the redundancy. | ||
|
|
||
| <a name="BYTESTREAMSPLIT"></a> | ||
| ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9) | ||
|
|
||
| Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY | ||
|
|
||
It might be good to add a note or link to the implementation status page so readers can see the current support for each encoding.
@emkornfield
Thank you for the review!
I added a note with a link to the implementation status page.
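
For readers following the diff context, the delta-length byte array example quoted in the hunk above (lengths 5, 5, 6, 6 followed by "HelloWorldFoobarABCDEF") can be sketched in Python. This is only an illustration of the layout: the function names are hypothetical, and the lengths are kept as a plain list rather than being run through the actual DELTA_BINARY_PACKED encoder described in the spec.

```python
from typing import List, Tuple


def split_lengths_and_data(values: List[bytes]) -> Tuple[List[int], bytes]:
    """Separate byte arrays into a list of lengths and one concatenated buffer.

    Hypothetical helper: a real writer would delta-encode the lengths
    before emitting them; here they stay a plain Python list.
    """
    lengths = [len(v) for v in values]
    data = b"".join(values)
    return lengths, data


def join_lengths_and_data(lengths: List[int], data: bytes) -> List[bytes]:
    """Rebuild the original byte arrays by slicing the buffer at each length."""
    values = []
    offset = 0
    for n in lengths:
        values.append(data[offset:offset + n])
        offset += n
    return values


lengths, data = split_lengths_and_data([b"Hello", b"World", b"Foobar", b"ABCDEF"])
print(lengths)  # [5, 5, 6, 6]
print(data)     # b'HelloWorldFoobarABCDEF'
```

Keeping the lengths apart from the bytes is what lets the lengths be compressed as a run of small integers while the string data stays contiguous.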