GH-430: Add vector logical type and a VECTOR repetition type #592

Draft
rok wants to merge 1 commit into apache:master from rok:parquet-vector-logical-type

@rok rok commented Jun 29, 2026

Member

Rationale for this change

As discussed in #430, fixed-size list / vector data in Parquet is currently encoded as standard LIST, even when every value has the same number of elements. This forces readers and writers through variable-length list machinery and repetition-level handling for data whose shape is known from the schema.
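To make the redundancy concrete, here is a minimal Python sketch of the repetition levels a standard 3-level LIST produces for a top-level list column. It assumes required, non-empty lists with required elements; `list_repetition_levels` is a hypothetical helper for illustration, not part of any Parquet library.

```python
def list_repetition_levels(lists):
    """Repetition levels for the leaf of a top-level LIST column.

    Assumes every list is present and non-empty. The first element of each
    list starts a new record (level 0); every later element continues the
    same list (level 1).
    """
    levels = []
    for values in lists:
        if not values:
            raise ValueError("empty lists are outside the scope of this sketch")
        for i in range(len(values)):
            levels.append(0 if i == 0 else 1)
    return levels


rows = [[0.1, 0.2, 0.3], [0.4, 0.5, 0.6]]
rep = list_repetition_levels(rows)

# When every list has the same length n, the stream is just [0] + [1] * (n - 1)
# repeated once per row, so it carries no information beyond n itself.
n = 3
predictable = ([0] + [1] * (n - 1)) * len(rows)
print(rep == predictable)
```

For fixed-size data the level stream is fully determined by the length, which is the information the schema could carry instead.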

This is increasingly relevant for embeddings, tensors, images, and other fixed-shape arrays.

The topic was discussed on the mailing list and in a design doc. Several approaches were prototyped (A, B, C) and benchmarked.

What changes are included in this PR?

This PR adds a new VECTOR logical type and a new VECTOR repetition type for fixed-size lists.

Thrift additions:

  • FieldRepetitionType.VECTOR = 3
  • struct VectorType {}
  • LogicalType.VECTOR = 19
  • SchemaElement.vector_length = 11

The layout for vector data would look like so:

<required|optional> group emb (VECTOR) {
  vector group list [N] {
    <required|optional> <element-type> element;
  }
}
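As a concrete instance of the template above, the following Python sketch fills in the placeholders. `vector_schema` is a hypothetical helper that only renders text in the notation used here; it is not an API of any Parquet implementation, and the positive-length check is an assumption rather than something this PR states.

```python
def vector_schema(name, length, element_type,
                  vector_optional=False, element_optional=False):
    """Render the 3-level VECTOR layout as schema text (illustration only)."""
    if length <= 0:
        # Assumption: a vector length must be positive.
        raise ValueError("vector length must be positive")
    outer = "optional" if vector_optional else "required"
    inner = "optional" if element_optional else "required"
    return (
        f"{outer} group {name} (VECTOR) {{\n"
        f"  vector group list [{length}] {{\n"
        f"    {inner} {element_type} element;\n"
        f"  }}\n"
        f"}}"
    )


# A nullable 4-element float embedding with non-nullable elements.
schema_text = vector_schema("emb", 4, "float", vector_optional=True)
print(schema_text)
```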

Key semantics:

  • LogicalType.VECTOR annotates the outer group.
  • The middle group has repetition_type = VECTOR and carries SchemaElement.vector_length.
  • The VECTOR repetition type contributes no definition or repetition level.
  • Nullable vectors are represented by an optional outer group.
  • Nullable elements are represented by an optional element field.
  • Each vector occurrence contributes exactly vector_length element slots / level entries.
  • For each primitive leaf in the vector element subtree, num_values counts vector element slots / level entries, not necessarily physical values.
  • Statistics remain element-level statistics.
  • Encodings remain element-level encodings.
  • Writers should not split one vector value across data pages.
  • Readers that do not support FieldRepetitionType.VECTOR should reject the file rather than interpreting the field as another repetition type.
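The level accounting above can be sketched in Python. This is an illustration under stated assumptions, not the reference implementation: all function names are hypothetical, the vector is assumed to be a top-level field, and null vectors are left out because their level layout is not spelled out here.

```python
def max_definition_level(vector_optional, element_optional):
    """Only optional nodes add a definition level; the VECTOR group adds none."""
    return int(vector_optional) + int(element_optional)


def element_definition_levels(vectors, vector_length, vector_optional=False):
    """Definition levels for non-null vectors whose elements are optional."""
    present = max_definition_level(vector_optional, element_optional=True)
    levels = []
    for vec in vectors:
        if len(vec) != vector_length:
            raise ValueError("every vector must have exactly vector_length slots")
        # A null element is defined one level short of a present element.
        levels.extend(present if v is not None else present - 1 for v in vec)
    return levels


def page_slot_counts(num_rows, vector_length, target_slots):
    """Slot counts per data page, keeping each vector inside a single page."""
    rows_per_page = max(1, target_slots // vector_length)
    counts = []
    for start in range(0, num_rows, rows_per_page):
        rows = min(rows_per_page, num_rows - start)
        counts.append(rows * vector_length)
    return counts


vectors = [[1.0, None, 3.0], [4.0, 5.0, 6.0]]
def_levels = element_definition_levels(vectors, vector_length=3)

num_values = len(def_levels)  # element slots / level entries
physical_values = sum(v is not None for vec in vectors for v in vec)
pages = page_slot_counts(num_rows=5, vector_length=3, target_slots=7)
print(def_levels, num_values, physical_values, pages)
```

Here `num_values` counts six slots while only five physical values are stored, which is the distinction the `num_values` bullet draws. Every page slot count is a multiple of `vector_length`, so no vector straddles a page boundary.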

Do these changes have PoC implementations?

Closes #430

rok force-pushed the parquet-vector-logical-type branch from 2bfd191 to 94d5ced on June 29, 2026 17:44
Comment thread LogicalTypes.md
Comment on lines +815 to +816
`VECTOR` is used to annotate fixed-size lists (vectors) where each non-repeated
parent occurrence has exactly the same number of element positions. `VECTOR` is

This wording is a bit confusing to me. What's the significance of "non-repeated" here? Does this mean that the vector repetition type cannot be used on a descendant of a node that is repeated? Or is this referring to the top-level group node in the 3-level schema being required or optional, and not repeated?

@pitrou

pitrou commented Jun 30, 2026

Member

Can you explain somewhere why the new logical type is useful? The new repetition type might be sufficient, no?

@adamreeve

> Can you explain somewhere why the new logical type is useful? The new repetition type might be sufficient, no?

This is required to support nullable vectors and nullable vector elements in a standardised way. Both of those require definition levels, though, so they will come with some performance impact. This also makes vectors consistent with the existing list logical type.

The new repetition type alone would be sufficient for the common use case of required vectors and required elements.

Development

Successfully merging this pull request may close these issues.

[Format] Specify FIXED_SIZE_LIST Logical type

3 participants