[python][ray] Optimize merge into self-merge updates on data evolution table by XiaoHongbo-Hope · Pull Request #8141 · apache/paimon

XiaoHongbo-Hope · 2026-06-06T04:11:09Z

Purpose

When the source and target of merge_into are the same table (a self-merge), skip the join step and read the target table
directly with a _ROW_ID projection. This avoids a full-table join and significantly improves performance for self-merge
update scenarios on data-evolution tables.
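
The idea can be illustrated with a small, self-contained sketch. This is not the PR's code: the helper names (`is_self_merge_on_row_id`, `update_via_join`, `update_via_self_merge`) and the list-of-dicts table model are assumptions made for illustration. It shows that when both sides are the same table keyed on `_ROW_ID`, a single scan yields the same result as the join.

```python
from typing import Any, Callable, Dict, List, Sequence

ROW_ID = "_ROW_ID"
Row = Dict[str, Any]


def is_self_merge_on_row_id(source_name: str, target_name: str, on: Sequence[str]) -> bool:
    """Hypothetical check: same table on both sides, joined only on _ROW_ID."""
    return source_name == target_name and list(on) == [ROW_ID]


def update_via_join(target: List[Row], source: List[Row],
                    set_fn: Callable[[Row, Row], Row]) -> List[Row]:
    """General path: index the source by _ROW_ID, then probe it per target row."""
    source_by_id = {row[ROW_ID]: row for row in source}
    return [
        set_fn(source_by_id[t[ROW_ID]], t)
        for t in target
        if t[ROW_ID] in source_by_id
    ]


def update_via_self_merge(target: List[Row],
                          set_fn: Callable[[Row, Row], Row]) -> List[Row]:
    """Shortcut: every row matches itself, so one scan supplies both sides."""
    return [set_fn(row, row) for row in target]


def bump_score(s: Row, t: Row) -> Row:
    # Example SET expression: t.score = s.score + 1
    return {ROW_ID: t[ROW_ID], "score": s["score"] + 1}


table = [{ROW_ID: 0, "score": 10}, {ROW_ID: 1, "score": 20}]
joined = update_via_join(table, table, bump_score)
direct = update_via_self_merge(table, bump_score)
```

In this toy model `joined == direct`, while the direct path never builds the lookup index or reads the table a second time.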

Tests

test_self_merge_basic
test_self_merge_with_source_condition
test_self_merge_with_target_condition
test_self_merge_blob_source_condition: blob table self-merge (skipped; blocked by [python] Fix partial update for normal column on blob/vector tables #8147)
test_self_merge_blob_target_condition_rejected

When source == target with ON ['_ROW_ID'], skip the inner join and read the table only once. Aligned with Spark's isSelfMergeOnRowId detection logic.

…artifacts
- Remove incorrect len(clauses) guards reintroduced during rebase in build_matched_update_ds and build_not_matched_insert_ds
- Align build_self_merge_update_ds with multi-clause fall-through pattern used by build_matched_update_ds

… helper
- Allow source_col('_ROW_ID') and s/t._ROW_ID in self-merge conditions
- Extract _build_matched_transform to deduplicate clause preparation logic
- Remove unnecessary @skipIf from tests that don't need datafusion
- Add tests for source_col('_ROW_ID'), s._ROW_ID, t._ROW_ID conditions, and multi-clause fall-through
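
A minimal sketch of what multi-clause fall-through means, assuming the usual MERGE semantics in which clauses are tried in order and the first one whose condition holds is applied. The function `apply_matched_clauses` and the `(condition, update)` pair representation are hypothetical and are not taken from the PR.

```python
from typing import Any, Callable, Dict, List, Optional, Tuple

Row = Dict[str, Any]
# (condition, update) pairs; a condition of None means "always applies".
Clause = Tuple[Optional[Callable[[Row], bool]], Callable[[Row], Row]]


def apply_matched_clauses(rows: List[Row], clauses: List[Clause]) -> List[Row]:
    """Return only the rewritten rows; rows that satisfy no clause are left out."""
    updated = []
    for row in rows:
        for condition, update in clauses:
            if condition is None or condition(row):
                updated.append(update(row))
                break  # first matching clause wins; later clauses are skipped
    return updated


clauses: List[Clause] = [
    (lambda r: r["score"] >= 90, lambda r: {**r, "grade": "A"}),
    (lambda r: r["score"] >= 50, lambda r: {**r, "grade": "B"}),
]
rows = [
    {"_ROW_ID": 0, "score": 95},  # satisfies both conditions, takes the first
    {"_ROW_ID": 1, "score": 60},  # falls through to the second clause
    {"_ROW_ID": 2, "score": 10},  # matches nothing, so it is not rewritten
]
result = apply_matched_clauses(rows, clauses)
```

Under these assumptions, a guard on `len(clauses)` is unnecessary: the same loop handles one clause or many.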

…move extra_valid_cols
- Reject NOT MATCHED in _prepare() instead of _build_datasets() (fail fast, align with Spark)
- Skip _normalize_source for self-merge, derive source_col_names from target schema
- _validate_source_has_target_cols accepts source_col_names set directly
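
The fail-fast validation described above might look roughly like the following sketch. The `MergePlan` class, its fields, and the error message are hypothetical stand-ins rather than the PR's actual classes; the point is only that the check runs in a preparation step, before any dataset is built.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class MergePlan:
    """Hypothetical stand-in for a merge builder; not the real API."""
    source: str
    target: str
    on: List[str]
    matched_updates: List[str] = field(default_factory=list)
    not_matched_inserts: List[str] = field(default_factory=list)

    def is_self_merge(self) -> bool:
        return self.source == self.target and self.on == ["_ROW_ID"]

    def prepare(self) -> None:
        # Validate up front so the caller gets an error before any
        # (potentially expensive) dataset construction starts.
        if self.is_self_merge() and self.not_matched_inserts:
            raise ValueError(
                "self-merge on _ROW_ID does not support WHEN NOT MATCHED clauses"
            )


ok_plan = MergePlan("db.t", "db.t", ["_ROW_ID"], matched_updates=["score = s.score + 1"])
ok_plan.prepare()  # passes: only WHEN MATCHED clauses are present

bad_plan = MergePlan("db.t", "db.t", ["_ROW_ID"], not_matched_inserts=["INSERT *"])
try:
    bad_plan.prepare()
    rejected = False
except ValueError:
    rejected = True
```

One plausible rationale, stated here as an assumption: in a self-merge on `_ROW_ID` every source row matches itself, so a NOT MATCHED clause could never fire.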

Align source_col_names with the actual columns projected by the self-merge path, excluding blob columns that are not aliased.

Use full_target_field_names (not settable_field_names) for both source validation and self-merge projection so blob columns can be referenced in conditions. Update output schema remains settable columns only.
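
The split between columns that may be read in a condition and columns that may be written can be modelled as below. This is a toy sketch under stated assumptions: `validate_condition` and its regex-based scan for `s.<col>` and `t.<col>` references are invented for illustration, and the real implementation presumably parses conditions properly. The rule it encodes comes from the commit messages and tests: source-side references may use any target field, including blobs, while target-side references are limited to settable fields.

```python
import re
from typing import List


def validate_condition(condition: str,
                       full_target_field_names: List[str],
                       settable_field_names: List[str]) -> None:
    """Raise ValueError if the condition references a disallowed column."""
    # _ROW_ID is allowed on both aliases in self-merge conditions.
    source_cols = set(full_target_field_names) | {"_ROW_ID"}
    target_cols = set(settable_field_names) | {"_ROW_ID"}
    for alias, col in re.findall(r"\b([st])\.(\w+)", condition):
        allowed = source_cols if alias == "s" else target_cols
        if col not in allowed:
            raise ValueError(f"column {alias}.{col} is not allowed in a condition")


full_names = ["id", "score", "payload"]  # "payload" plays the blob column
settable = ["id", "score"]               # blob columns are not settable

# Source-side blob reference: accepted.
validate_condition("s.payload IS NOT NULL AND t.id > 1", full_names, settable)

# Target-side blob reference: rejected.
try:
    validate_condition("t.payload IS NOT NULL", full_names, settable)
    target_blob_rejected = False
except ValueError:
    target_blob_rejected = True
```

The output schema of the update is unaffected by this check; it still contains only the settable columns.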

- test_self_merge_blob_source_condition: verify s.blob_col in condition passes validation and works at runtime
- test_self_merge_blob_target_condition_rejected: verify t.blob_col in condition is rejected by _prepare

XiaoHongbo-Hope changed the title ~~Self merge optimize~~ [python][ray] Optimize MERGE INTO self-merge updates on dataEvolution table Jun 6, 2026

XiaoHongbo-Hope force-pushed the self_merge_optimize branch from 3e832fe to 9633e85 Compare June 6, 2026 08:50

XiaoHongbo-Hope changed the title ~~[python][ray] Optimize MERGE INTO self-merge updates on dataEvolution table~~ [python][ray] Optimize merge into self-merge updates on dataEvolution table Jun 6, 2026

[ray] Support self-merge optimization in merge_into

ad75046

XiaoHongbo-Hope force-pushed the self_merge_optimize branch from 9633e85 to ad75046 Compare June 6, 2026 08:59

XiaoHongbo-Hope changed the title ~~[python][ray] Optimize merge into self-merge updates on dataEvolution table~~ [python][ray] Optimize merge into self-merge updates on data evolution table Jun 6, 2026

XiaoHongbo-Hope added 3 commits June 6, 2026 17:04

XiaoHongbo-Hope marked this pull request as ready for review June 6, 2026 13:41

XiaoHongbo-Hope added 5 commits June 6, 2026 21:46

[ray] Use settable_field_names for self-merge source validation

9016687

[ray] Allow self-merge to read blob columns for conditions

9cdbc17

[ray] Add blob-related self-merge tests

0460392

[ray] Restore comment explaining update_pa_schema vs insert_pa_schema

a416471

Skip test_self_merge_blob_source_condition until PR apache#8147 merges

cb1886f

XiaoHongbo-Hope wants to merge 9 commits into apache:master from XiaoHongbo-Hope:self_merge_optimize
