[python] Support VARIANT type in pypaimon#7635

Open
chenghuichen wants to merge 7 commits into apache:master from chenghuichen:python-variant

Conversation

@chenghuichen (Contributor) commented Apr 13, 2026

Purpose

Background: #7655

This PR adds VARIANT read/write support to pypaimon, with a particular focus on shredded VARIANT.

  • Write: when variant.shreddingSchema is configured on a table, VARIANT columns are written in shredded Parquet format according to the schema.
  • Read: shredded VARIANT columns are automatically reassembled back into standard struct<value: binary, metadata: binary> form, transparent to the caller.
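
The write/read round trip described in the two bullets above can be illustrated with a toy model. This is only a sketch under loud assumptions: `shred` and `reassemble` are hypothetical names, not pypaimon functions, JSON bytes stand in for the real Variant binary encoding, and the shredding schema is reduced to a set of top-level field names.

```python
import json
from typing import Any, Dict, Optional, Set, Tuple


def shred(obj: Dict[str, Any], shredded_fields: Set[str]) -> Tuple[Dict[str, Any], Optional[bytes]]:
    """Split an object into typed columns plus an overflow blob (toy model)."""
    # Fields named in the (hypothetical) schema become typed columns.
    typed = {k: v for k, v in obj.items() if k in shredded_fields}
    # Everything else stays in a single residual binary value.
    rest = {k: v for k, v in obj.items() if k not in shredded_fields}
    overflow = json.dumps(rest, sort_keys=True).encode("utf-8") if rest else None
    return typed, overflow


def reassemble(typed: Dict[str, Any], overflow: Optional[bytes]) -> Dict[str, Any]:
    """Merge typed columns and the overflow blob back into one object."""
    merged = dict(typed)
    if overflow is not None:
        merged.update(json.loads(overflow.decode("utf-8")))
    return merged


row = {"id": 7, "name": "a", "tags": ["x"]}
typed_cols, overflow_blob = shred(row, {"id", "name"})
restored = reassemble(typed_cols, overflow_blob)
```

In this model the caller only ever sees `restored`, which equals the original `row`; that is the sense in which reassembly is transparent.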

Shredded column pruning and predicate pushdown will be built on top of this PR.

Tests

  • Unit tests
    • pypaimon/tests/variant_test.py
  • E2E tests
    • run_java_variant_write_py_read_test
    • run_py_variant_write_java_read_test

@chenghuichen chenghuichen changed the title [python] Support VARIANT type in pypaimon [WIP][python] Support VARIANT type in pypaimon Apr 14, 2026
@chenghuichen chenghuichen changed the title [WIP][python] Support VARIANT type in pypaimon [python] Support VARIANT type in pypaimon Apr 15, 2026
@chenghuichen (Contributor, Author)

The PR is ready for review now.

@JingsongLi (Contributor)

Thanks @chenghuichen, let me check this.

# Constants (matching GenericVariantUtil.java)
# ---------------------------------------------------------------------------

_PRIMITIVE = 0

Duplicated binary constants and helpers across generic_variant.py and variant_shredding.py
Both files define _PRIMITIVE, _SHORT_STR, _OBJECT, _ARRAY, _U8_MAX, _U32_SIZE, _VERSION_MASK, _read_unsigned, _get_int_size, _object_header, _array_header, etc. This is a maintenance risk — if the spec changes, both files need updating. Consider extracting shared constants/helpers into a small _variant_binary.py module.

return bytes(buf)


def _extract_overflow_fields(overflow_bytes: bytes) -> List[Tuple[int, bytes]]:

The comment explains that data may be laid out in a different order than the id table. The sorting-by-offset logic is correct but intricate. Consider adding a small inline example or diagram showing the case where insertion-order data differs from sorted-id order, to help future maintainers.
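
A small example of the kind requested might look like the sketch below. It is a simplified model, not the PR's `_extract_overflow_fields`: the function name `slice_fields_by_offset` is hypothetical, and the inputs are plain Python lists instead of an encoded object header. It shows why each value's end must be found by sorting offsets when the id table is sorted but the data area is in insertion order.

```python
from typing import List, Optional, Tuple


def slice_fields_by_offset(ids: List[int], offsets: List[int], data: bytes) -> List[Tuple[int, bytes]]:
    """Return (field_id, value_bytes) pairs in id-table order.

    offsets[i] is where the value for ids[i] starts inside data. Values may
    be laid out in any order, so a value ends where the next-higher offset
    begins, not where offsets[i + 1] points.
    """
    # Visit entries by physical position in the data area.
    by_position = sorted(range(len(ids)), key=lambda i: offsets[i])
    result: List[Optional[Tuple[int, bytes]]] = [None] * len(ids)
    for rank, i in enumerate(by_position):
        start = offsets[i]
        if rank + 1 < len(by_position):
            end = offsets[by_position[rank + 1]]
        else:
            end = len(data)  # last value runs to the end of the data area
        result[i] = (ids[i], data[start:end])
    return [entry for entry in result if entry is not None]


# Id table is sorted (0, 1), but field 1 was written first:
#   data:    | Y Y | X X X |
#   offset:    0     2
#   ids:     [0, 1] -> offsets [2, 0]
fields = slice_fields_by_offset([0, 1], [2, 0], b"YYXXX")
```

Computing sizes as `offsets[i + 1] - offsets[i]` would give a negative length for field 0 here, which is the case the sort-by-offset step guards against.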
