GH-430: Add vector logical type and a VECTOR repetition type #592
`VECTOR` is used to annotate fixed-size lists (vectors) where each non-repeated
parent occurrence has exactly the same number of element positions. `VECTOR` is
This wording is a bit confusing to me. What's the significance of "non-repeated" here? Does this mean that the vector repetition type cannot be used on a descendant of a node that is repeated? Or is this referring to the top-level group node in the 3-level schema being required or optional, and not repeated?
Can you explain somewhere why the new logical type is useful? The new repetition type might be sufficient, no?
This is required to support nullable vectors and nullable vector elements in a standardised way, although both of those require definition levels and so will come with some performance impact. It also makes vectors consistent with the existing list logical type. The new repetition type alone would be sufficient for the common use case of required vectors and required elements.
Rationale for this change
As discussed in #430, fixed-size list / vector data in Parquet is currently encoded as standard `LIST`, even when every value has the same number of elements. This forces readers and writers through variable-length list machinery and repetition-level handling for data whose shape is known from the schema. This is increasingly relevant for embeddings, tensors, images, and other fixed-shape arrays.
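To make the cost difference concrete, here is a minimal Python sketch. It is not taken from this PR or from any of the prototypes. It assumes a simplified model of a required list with required elements, where repetition level 0 starts a new list and 1 continues the current one. The function names `assemble_variable_lists` and `assemble_fixed_vectors` are hypothetical.

```python
from typing import List


def assemble_variable_lists(values: List[float], rep_levels: List[int]) -> List[List[float]]:
    """Rebuild lists from a flat value stream by scanning repetition levels.

    Simplified model (an assumption for illustration): level 0 starts a new
    list, level 1 appends to the current list. Nulls and empty lists, which
    would need definition levels, are not modelled.
    """
    if len(values) != len(rep_levels):
        raise ValueError("values and rep_levels must have the same length")
    rows: List[List[float]] = []
    for value, rep in zip(values, rep_levels):
        if rep == 0:
            rows.append([value])
        elif rows:
            rows[-1].append(value)
        else:
            raise ValueError("stream must start with repetition level 0")
    return rows


def assemble_fixed_vectors(values: List[float], vector_length: int) -> List[List[float]]:
    """Rebuild fixed-size vectors using only a length known from the schema.

    No per-value levels are read: row boundaries follow from the stride.
    """
    if vector_length <= 0:
        raise ValueError("vector_length must be positive")
    if len(values) % vector_length != 0:
        raise ValueError("value count is not a multiple of vector_length")
    return [values[i:i + vector_length] for i in range(0, len(values), vector_length)]


values = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
rep_levels = [0, 1, 1, 0, 1, 1]  # two lists of three elements each

variable_rows = assemble_variable_lists(values, rep_levels)
fixed_rows = assemble_fixed_vectors(values, 3)
```

Both calls yield the same two rows, but the second never touches a level stream. That is the saving the rationale points at when the shape is already known from the schema.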
The topic was discussed on the mailing list and in a design doc. Several approaches were prototyped (A, B, C) and benchmarked.
What changes are included in this PR?
This PR adds new `VECTOR` logical and repetition types for fixed-size lists.
Thrift additions:
- `FieldRepetitionType.VECTOR = 3`
- `struct VectorType {}`
- `LogicalType.VECTOR = 19`
- `SchemaElement.vector_length = 11`
The layout for vector data would look like so:
Key semantics:
`repetition_type = VECTOR` and carries `SchemaElement.vector_length`.
Do these changes have PoC implementations?
Closes #430