Merged
Changes from 2 commits
18 changes: 18 additions & 0 deletions Encodings.md
@@ -25,6 +25,20 @@ This file contains the specification of all supported encodings.
Unless otherwise stated in page or encoding documentation, any encoding can be
used with any page type.

### Supported Encodings

Contributor

It might be good to add a note/link to the implementation status page so readers can understand the current support for each encoding.

Contributor Author

@emkornfield
Thank you for the review!
I added a note/link to the implementation status page.

| Encoding type | Encoding enum | Encoding Targets <br> (Parquet 2.0.0+) | Encoding Targets <br> (Parquet 1.0.0+) |
Contributor

I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.

I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec

I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?

Contributor Author

@alamb
Thank you for the review!

> I think we have been trying to avoid the nomenclature of "parquet 2.0" as its definition is not universally agreed upon.
> I recommend we remove the separate columns and instead focus on helping people navigate the current version of the spec

I agree on focusing on the current version of the spec. At some point it would be great to make the parquet site able to show previous versions easily. For the table, I will remove the last column and rename the third one.

And just a question: would Data Page V2 (header?) be a better term in this case?

> I am also not sure about the differences in different encoding targets (e.g. PLAIN_DICTIONARY) --- maybe we can simply not include that in the table as it has been deprecated?

For PLAIN_DICTIONARY and RLE_DICTIONARY, I will merge the rows and mark PLAIN_DICTIONARY enum as deprecated.

For BIT_PACKED, since the deprecated encodings are still explained in the document and linked from other encodings, I thought it should be in the table and linked to the details. I think there are a few options.

  1. Remove BIT_PACKED encoding from the table (your suggestion)
  2. Remove BIT_PACKED encoding description from the page and from the table (this may break links).
  3. Separate currently supported and deprecated encodings into separate tables, and change the layout of the page.
  • Layout A:
    supported encodings table
    deprecated encodings table (only BIT_PACKED)
    supported + deprecated encodings descriptions (current order)
  • Layout B:
    supported encodings table
    supported encodings descriptions (current order without BIT_PACKED)
    deprecated encodings table (only BIT_PACKED)
    deprecated encodings descriptions (only BIT_PACKED)
  • Layout C:
    supported encodings table
    deprecated encodings table (only BIT_PACKED)
    supported encodings descriptions (current order without BIT_PACKED)
    deprecated encodings descriptions (only BIT_PACKED)

Also, for the Encoding Targets column, should I just list the physical types and remove the other encoding targets (e.g. repetition and definition levels)?

Contributor Author

I removed the v1 columns and separated the table. If the deprecated encodings table is not needed, I will remove it.

Link to the rendered page: https://github.com/nkaki/parquet-format/blob/master/Encodings.md

Contributor

> I agree on focusing on current versions spec. At some point it would be great to make the parquet site able to see the previous versions easily.

I do not think there is consensus on what constitutes a "version" of the spec -- so unfortunately I think adding versions will be blocked until we can agree on what they mean. There are a bunch of discussions on the parquet mailing list if you want more of the backstory.

| ------------------------------------------------ | --------------------------------- | ----------------------------------------------------------------------------------- | -------------------------------------- |
| [Plain](#PLAIN) | PLAIN = 0 | All Physical Types <br> Dictionary entries in dictionary page | All Physical Types |
| [Dictionary Encoding (Plain)](#DICTIONARY) | PLAIN_DICTIONARY = 2 | Deprecated | All Physical Types |
| [Run Length Encoding / Bit-Packing Hybrid](#RLE) | RLE = 3 | BOOLEAN <br> Repetition and definition levels <br> Dictionary indices in data pages | Repetition and definition levels |
| [Bit-packed](#BITPACKED) | BIT_PACKED = 4 | Deprecated | Repetition and definition levels |
| [Delta Encoding](#DELTAENC) | DELTA_BINARY_PACKED = 5 | INT32, INT64 | N/A |
| [Delta-length byte array](#DELTALENGTH) | DELTA_LENGTH_BYTE_ARRAY = 6 | BYTE_ARRAY | N/A |
| [Delta Strings](#DELTASTRING) | DELTA_BYTE_ARRAY = 7 | BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY | N/A |
| [Dictionary Encoding (RLE)](#DICTIONARY) | RLE_DICTIONARY = 8 | All Physical Types | N/A |
| [Byte Stream Split](#BYTESTREAMSPLIT) | BYTE_STREAM_SPLIT = 9 | FLOAT, DOUBLE (2.8.0+) <br> INT32, INT64, FIXED_LEN_BYTE_ARRAY (2.11.0+) | N/A |

<a name="PLAIN"></a>
### Plain: (PLAIN = 0)

@@ -50,6 +64,7 @@ For native types, this outputs the data as little endian. Floating
For the byte array type, it encodes the length as a 4 byte little
endian, followed by the bytes.
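
As a rough illustration of the byte array layout described above (a 4-byte little-endian length followed by the bytes), here is a minimal Python sketch. The function names are hypothetical and not part of any Parquet library. The decoder assumes the buffer contains only encoded values and reads until it is exhausted, which is a simplification made for this example.

```python
import struct


def plain_encode_byte_arrays(values):
    """Encode each value as a 4-byte little-endian length followed by its bytes."""
    out = bytearray()
    for value in values:
        out += struct.pack("<I", len(value))  # "<I" = little-endian unsigned 32-bit
        out += value
    return bytes(out)


def plain_decode_byte_arrays(buf):
    """Inverse of the sketch above; assumes buf holds nothing but encoded values."""
    values = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(buf[pos:pos + length])
        pos += length
    return values


encoded = plain_encode_byte_arrays([b"Hello", b"Hi"])
decoded = plain_decode_byte_arrays(encoded)
```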

<a name="DICTIONARY"></a>
### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
The dictionary encoding builds a dictionary of values encountered in a given column. The
dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
@@ -295,6 +310,7 @@ The encoded data is
This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page.
The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header.
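
To make the zero-bit-width remark concrete, the following Python sketch computes the bit width a group of values would need once deltas are stored relative to the smallest delta. It is a simplification: it treats the whole input as a single block with a single miniblock, and `miniblock_bit_width` is a hypothetical helper name, not a function from any Parquet implementation.

```python
def miniblock_bit_width(values):
    """Bit width needed for the deltas of `values`, stored relative to the minimum delta."""
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    min_delta = min(deltas)
    # Subtracting the minimum delta makes every stored value non-negative.
    adjusted = [d - min_delta for d in deltas]
    return max(adjusted).bit_length()


constant_width = miniblock_bit_width([7, 7, 7, 7, 7])
stride_width = miniblock_bit_width([1, 2, 3, 4, 5])
varying_width = miniblock_bit_width([7, 5, 3, 1, 2, 3, 4, 5])
```

In this sketch a run of identical values, and likewise a run with a constant stride, needs zero bits per delta, so only header information would remain.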

<a name="DELTALENGTH"></a>
### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)

Supported Types: BYTE_ARRAY
@@ -317,6 +333,7 @@ then the encoded data would be comprised of the following segments:
- DeltaEncoding(5, 5, 6, 6) (the string lengths)
- "HelloWorldFoobarABCDEF"
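
The two segments listed above can be reproduced with a short Python sketch. It assumes the input values are "Hello", "World", "Foobar" and "ABCDEF", which is what the lengths 5, 5, 6, 6 and the concatenated data imply. The function names are hypothetical, and the delta encoding of the lengths is not shown here.

```python
def split_lengths_and_data(values):
    """Separate byte arrays into a list of lengths and one concatenated buffer."""
    lengths = [len(value) for value in values]
    data = b"".join(values)
    return lengths, data


def join_lengths_and_data(lengths, data):
    """Rebuild the original byte arrays by slicing the buffer with the lengths."""
    values = []
    pos = 0
    for length in lengths:
        values.append(data[pos:pos + length])
        pos += length
    return values


strings = [b"Hello", b"World", b"Foobar", b"ABCDEF"]
lengths, data = split_lengths_and_data(strings)
restored = join_lengths_and_data(lengths, data)
```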

<a name="DELTASTRING"></a>
### Delta Strings: (DELTA_BYTE_ARRAY = 7)

Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
@@ -338,6 +355,7 @@ then the encoded data would be comprised of the following segments:

Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the redundancy.

<a name="BYTESTREAMSPLIT"></a>
### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)

Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY