Commit 38818fa

nkaki, alamb, and wgtmac authored
MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550) (#552)
* MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550)
* MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550);
* MINOR: Add summary table of encodings and supported types (in Encodings.md) (#550) - remove v1 related column, and separate tables for supported and deprecated encodings
* Update Encodings.md
  Add Dictionary indices to encoding targets
  Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
* Update Encodings.md
  fix typo
  Co-authored-by: Gang Wu <ustcwg@gmail.com>
* added note/link to the implementation status page

---------

Co-authored-by: Andrew Lamb <andrew@nerdnetworks.org>
Co-authored-by: Gang Wu <ustcwg@gmail.com>
1 parent 9621f8c commit 38818fa

1 file changed

Lines changed: 25 additions & 0 deletions

Encodings.md

@@ -25,6 +25,27 @@ This file contains the specification of all supported encodings.
 Unless otherwise stated in page or encoding documentation, any encoding can be
 used with any page type.

+### Supported Encodings
+
+For details on current implementation status, see the [Implementation Status](https://parquet.apache.org/docs/file-format/implementationstatus/#encodings) page.
+
+| Encoding type                                    | Encoding enum                                             | Supported Types                                   |
+| ------------------------------------------------ | --------------------------------------------------------- | ------------------------------------------------- |
+| [Plain](#PLAIN)                                  | PLAIN = 0                                                 | All Physical Types                                |
+| [Dictionary Encoding](#DICTIONARY)               | PLAIN_DICTIONARY = 2 (Deprecated) <br> RLE_DICTIONARY = 8 | All Physical Types                                |
+| [Run Length Encoding / Bit-Packing Hybrid](#RLE) | RLE = 3                                                   | BOOLEAN, Dictionary Indices                       |
+| [Delta Encoding](#DELTAENC)                      | DELTA_BINARY_PACKED = 5                                   | INT32, INT64                                      |
+| [Delta-length byte array](#DELTALENGTH)          | DELTA_LENGTH_BYTE_ARRAY = 6                               | BYTE_ARRAY                                        |
+| [Delta Strings](#DELTASTRING)                    | DELTA_BYTE_ARRAY = 7                                      | BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY                  |
+| [Byte Stream Split](#BYTESTREAMSPLIT)            | BYTE_STREAM_SPLIT = 9                                     | INT32, INT64, FLOAT, DOUBLE, FIXED_LEN_BYTE_ARRAY |
+
+### Deprecated Encodings
+
+| Encoding type                         | Encoding enum  |
+| ------------------------------------- | -------------- |
+| [Bit-packed (Deprecated)](#BITPACKED) | BIT_PACKED = 4 |
+
+
 <a name="PLAIN"></a>
 ### Plain: (PLAIN = 0)
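
The two tables added in this hunk can be read as a lookup from encoding enum to the physical types it accepts. The following is a minimal Python sketch of that lookup, not code from any Parquet library: the names `Encoding`, `SUPPORTED_TYPES`, and `is_supported` are hypothetical, `DICTIONARY_INDICES` is an illustrative label for the table's "Dictionary Indices" entry rather than a physical type, and the expansion of "All Physical Types" into eight names is an assumption taken from the wider Parquet format, since the diff does not list them.

```python
from enum import IntEnum


class Encoding(IntEnum):
    """Enum values copied from the two tables in the hunk above."""
    PLAIN = 0
    PLAIN_DICTIONARY = 2   # marked deprecated in the table
    RLE = 3
    BIT_PACKED = 4         # listed under "Deprecated Encodings"
    DELTA_BINARY_PACKED = 5
    DELTA_LENGTH_BYTE_ARRAY = 6
    DELTA_BYTE_ARRAY = 7
    RLE_DICTIONARY = 8
    BYTE_STREAM_SPLIT = 9


# Assumption: the physical types meant by "All Physical Types".
ALL_PHYSICAL_TYPES = frozenset({
    "BOOLEAN", "INT32", "INT64", "INT96", "FLOAT", "DOUBLE",
    "BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY",
})

SUPPORTED_TYPES = {
    Encoding.PLAIN: ALL_PHYSICAL_TYPES,
    Encoding.PLAIN_DICTIONARY: ALL_PHYSICAL_TYPES,
    Encoding.RLE_DICTIONARY: ALL_PHYSICAL_TYPES,
    Encoding.RLE: frozenset({"BOOLEAN", "DICTIONARY_INDICES"}),
    Encoding.DELTA_BINARY_PACKED: frozenset({"INT32", "INT64"}),
    Encoding.DELTA_LENGTH_BYTE_ARRAY: frozenset({"BYTE_ARRAY"}),
    Encoding.DELTA_BYTE_ARRAY: frozenset({"BYTE_ARRAY", "FIXED_LEN_BYTE_ARRAY"}),
    Encoding.BYTE_STREAM_SPLIT: frozenset(
        {"INT32", "INT64", "FLOAT", "DOUBLE", "FIXED_LEN_BYTE_ARRAY"}
    ),
    # BIT_PACKED is intentionally absent: the deprecated table lists no types.
}


def is_supported(encoding: Encoding, physical_type: str) -> bool:
    """Return True if the table lists physical_type for this encoding."""
    return physical_type in SUPPORTED_TYPES.get(encoding, frozenset())
```

For example, `is_supported(Encoding.DELTA_BINARY_PACKED, "FLOAT")` is false under this table, while the same call with `"INT64"` is true.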

@@ -50,6 +71,7 @@ For native types, this outputs the data as little endian. Floating
 For the byte array type, it encodes the length as a 4 byte little
 endian, followed by the bytes.

+<a name="DICTIONARY"></a>
 ### Dictionary Encoding (PLAIN_DICTIONARY = 2 and RLE_DICTIONARY = 8)
 The dictionary encoding builds a dictionary of values encountered in a given column. The
 dictionary will be stored in a dictionary page per column chunk. The values are stored as integers
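
The context lines at the top of this hunk describe the PLAIN layout for byte arrays: a 4-byte little-endian length followed by the bytes. A minimal Python sketch of that layout might look like the following. The function names are hypothetical, and treating the length as unsigned (`"<I"`) is an assumption, since the quoted text does not state signedness.

```python
import struct


def plain_encode_byte_arrays(values):
    """Concatenate each value as <4-byte little-endian length><bytes>."""
    out = bytearray()
    for value in values:
        # Assumption: the length is packed as an unsigned 32-bit integer.
        out += struct.pack("<I", len(value))
        out += value
    return bytes(out)


def plain_decode_byte_arrays(buf):
    """Inverse of plain_encode_byte_arrays for a buffer of whole values."""
    values = []
    pos = 0
    while pos < len(buf):
        (length,) = struct.unpack_from("<I", buf, pos)
        pos += 4
        values.append(bytes(buf[pos:pos + length]))
        pos += length
    return values
```

Encoding `[b"Hi"]` with this sketch yields the six bytes `02 00 00 00 48 69`: the length 2 in little-endian order, then the two payload bytes.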
@@ -295,6 +317,7 @@ The encoded data is
 This encoding is similar to the [RLE/bit-packing](#RLE) encoding. However the [RLE/bit-packing](#RLE) encoding is specifically used when the range of ints is small over the entire page, as is true of repetition and definition levels. It uses a single bit width for the whole page.
 The delta encoding algorithm described above stores a bit width per miniblock and is less sensitive to variations in the size of encoded integers. It is also somewhat doing RLE encoding as a block containing all the same values will be bit packed to a zero bit width thus being only a header.

+<a name="DELTALENGTH"></a>
 ### Delta-length byte array: (DELTA_LENGTH_BYTE_ARRAY = 6)

 Supported Types: BYTE_ARRAY
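
The delta encoding context lines at the top of this hunk note that a block of identical values is bit packed to a zero bit width. The sketch below illustrates why, under two stated simplifications: it treats the whole input as a single miniblock, and it assumes deltas are stored relative to the minimum delta, which comes from the fuller spec text and is not shown in this diff. The function name `miniblock_bit_width` is hypothetical.

```python
def miniblock_bit_width(values):
    """Bit width needed to pack the deltas of one run of integers.

    Simplified: the whole input is treated as one miniblock.
    """
    deltas = [b - a for a, b in zip(values, values[1:])]
    if not deltas:
        return 0
    min_delta = min(deltas)
    # Assumption: deltas are stored relative to the minimum delta,
    # so every stored value is non-negative.
    adjusted = [d - min_delta for d in deltas]
    return max(adjusted).bit_length()
```

With this model, any run whose deltas are all equal, including a run of identical values, needs zero bits per value, so only the header remains.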
@@ -317,6 +340,7 @@ then the encoded data would be comprised of the following segments:
 - DeltaEncoding(5, 5, 6, 6) (the string lengths)
 - "HelloWorldFoobarABCDEF"

+<a name="DELTASTRING"></a>
 ### Delta Strings: (DELTA_BYTE_ARRAY = 7)

 Supported Types: BYTE_ARRAY, FIXED_LEN_BYTE_ARRAY
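
The delta-length example in the context lines at the top of this hunk separates the data into a list of lengths (5, 5, 6, 6) and one concatenated byte string. A minimal Python sketch of that split, and of recovering the values from it, is shown below. The function names are hypothetical, and the delta encoding of the lengths themselves is left out.

```python
def split_lengths_and_data(values):
    """Return (lengths, concatenated bytes) for a list of byte strings."""
    lengths = [len(v) for v in values]
    return lengths, b"".join(values)


def join_from_lengths(lengths, data):
    """Slice the concatenated bytes back into the original values."""
    values = []
    pos = 0
    for length in lengths:
        values.append(data[pos:pos + length])
        pos += length
    return values
```

Applying the lengths 5, 5, 6, 6 to `b"HelloWorldFoobarABCDEF"` recovers `b"Hello"`, `b"World"`, `b"Foobar"`, and `b"ABCDEF"`.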
@@ -338,6 +362,7 @@ then the encoded data would be comprised of the following segments:

 Note that, even for FIXED_LEN_BYTE_ARRAY, all lengths are encoded despite the redundancy.

+<a name="BYTESTREAMSPLIT"></a>
 ### Byte Stream Split: (BYTE_STREAM_SPLIT = 9)

 Supported Types: FLOAT, DOUBLE, INT32, INT64, FIXED_LEN_BYTE_ARRAY
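
The diff shows only the heading and supported types for Byte Stream Split, not the algorithm. As an illustration, the sketch below assumes the layout described in the fuller spec: byte `i` of every value goes to stream `i`, and the streams are concatenated in order. It handles FLOAT values only, and the function names are hypothetical.

```python
import struct


def byte_stream_split_encode(values):
    """Scatter the 4 bytes of each little-endian float into 4 streams."""
    raw = [struct.pack("<f", v) for v in values]
    # Stream i holds byte i of every value; the streams are concatenated.
    return b"".join(bytes(r[i] for r in raw) for i in range(4))


def byte_stream_split_decode(buf):
    """Gather bytes back from the 4 streams and unpack each float."""
    count = len(buf) // 4
    values = []
    for j in range(count):
        raw = bytes(buf[i * count + j] for i in range(4))
        values.append(struct.unpack("<f", raw)[0])
    return values
```

Grouping bytes of the same significance together tends to help a later compression step, because neighbouring bytes in each stream are more alike than neighbouring bytes in the interleaved values.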

0 commit comments
