[python][ray] Optimize merge into self-merge updates on data evolution table#8141

Open
XiaoHongbo-Hope wants to merge 9 commits into apache:master from XiaoHongbo-Hope:self_merge_optimize

XiaoHongbo-Hope (Contributor) commented Jun 6, 2026

Purpose

When the source and target of merge_into are the same table (a self-merge), skip the join step and read the target table
directly, projecting _ROW_ID. This avoids a full-table join and significantly improves performance for self-merge updates on
data-evolution tables.
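
To illustrate why the join is redundant, here is a minimal sketch in plain Python. It is not the pypaimon or Ray implementation: the function names `update_via_join` and `update_via_self_read`, the sample rows, and the `set_expr` callback are all hypothetical. The only assumption is that `_ROW_ID` uniquely identifies a row, so joining a table to itself on `_ROW_ID` pairs each row with itself.

```python
from typing import Callable, Dict, List

Row = Dict[str, object]
SetExpr = Callable[[Row, Row], Row]  # (source_row, target_row) -> updated columns


def update_via_join(target: List[Row], source: List[Row], set_expr: SetExpr) -> List[Row]:
    """General path: inner-join source to target on _ROW_ID, then apply the update."""
    source_by_id = {row["_ROW_ID"]: row for row in source}
    out = []
    for t in target:
        s = source_by_id.get(t["_ROW_ID"])
        if s is not None:  # inner join keeps matched rows only
            out.append({"_ROW_ID": t["_ROW_ID"], **set_expr(s, t)})
    return out


def update_via_self_read(target: List[Row], set_expr: SetExpr) -> List[Row]:
    """Self-merge path: each row matches itself, so a single scan is enough."""
    return [{"_ROW_ID": t["_ROW_ID"], **set_expr(t, t)} for t in target]


# Hypothetical sample data and update expression.
table = [{"_ROW_ID": 0, "price": 10}, {"_ROW_ID": 1, "price": 20}]
double_price: SetExpr = lambda s, t: {"price": s["price"] * 2}

joined = update_via_join(table, table, double_price)
direct = update_via_self_read(table, double_price)
```

Both paths produce the same rows here; the direct path just avoids building and probing the join side.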

Tests

XiaoHongbo-Hope changed the title from "Self merge optimize" to "[python][ray] Optimize MERGE INTO self-merge updates on dataEvolution table" Jun 6, 2026
XiaoHongbo-Hope changed the title from "[python][ray] Optimize MERGE INTO self-merge updates on dataEvolution table" to "[python][ray] Optimize merge into self-merge updates on dataEvolution table" Jun 6, 2026
When source == target with ON ['_ROW_ID'], skip the inner join and
read the table only once. Aligned with Spark's isSelfMergeOnRowId
detection logic.
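
A minimal sketch of what such a detection check might look like. The function name `is_self_merge_on_row_id` mirrors the Spark name mentioned above, but the signature and the choice to compare tables by identifier string are assumptions for illustration; the actual code in this PR may compare table objects or paths instead.

```python
from typing import Sequence

ROW_ID = "_ROW_ID"


def is_self_merge_on_row_id(source_table: str, target_table: str, on: Sequence[str]) -> bool:
    """Return True only when both sides name the same table and ON is exactly ['_ROW_ID']."""
    # Hypothetical: tables are compared by identifier string.
    return source_table == target_table and list(on) == [ROW_ID]
```

Any other combination (a different source table, a different key, or extra ON columns) would fall back to the regular join path.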
XiaoHongbo-Hope changed the title from "[python][ray] Optimize merge into self-merge updates on dataEvolution table" to "[python][ray] Optimize merge into self-merge updates on data evolution table" Jun 6, 2026
…artifacts

- Remove incorrect len(clauses) guards reintroduced during rebase
  in build_matched_update_ds and build_not_matched_insert_ds
- Align build_self_merge_update_ds with multi-clause fall-through
  pattern used by build_matched_update_ds
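
For readers unfamiliar with the fall-through pattern, the sketch below shows the usual MERGE semantics it refers to: clauses are tried in order and the first one whose condition holds is applied. This is an assumption about the intended behavior, written in plain Python; `apply_first_matching_clause` and the sample clauses are hypothetical and not taken from build_matched_update_ds.

```python
from typing import Callable, Dict, List, Optional, Tuple

Row = Dict[str, object]
# (condition or None for an unconditional clause, assignments)
Clause = Tuple[Optional[Callable[[Row], bool]], Callable[[Row], Row]]


def apply_first_matching_clause(row: Row, clauses: List[Clause]) -> Optional[Row]:
    """Try clauses in order; apply the first whose condition holds, else fall through."""
    for condition, assignments in clauses:
        if condition is None or condition(row):
            return {**row, **assignments(row)}
    return None  # no clause applied; the row is left unchanged


# Hypothetical clauses: a conditional update followed by an unconditional one.
clauses: List[Clause] = [
    (lambda r: r["qty"] > 100, lambda r: {"tier": "bulk"}),
    (None, lambda r: {"tier": "standard"}),
]
```

With these clauses, a row with `qty` above 100 takes the first update and every other row falls through to the second.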
… helper

- Allow source_col('_ROW_ID') and s/t._ROW_ID in self-merge conditions
- Extract _build_matched_transform to deduplicate clause preparation logic
- Remove unnecessary @skipIf from tests that don't need datafusion
- Add tests for source_col('_ROW_ID'), s._ROW_ID, t._ROW_ID conditions, and multi-clause fall-through
…move extra_valid_cols

- Reject NOT MATCHED in _prepare() instead of _build_datasets() (fail fast, align with Spark)
- Skip _normalize_source for self-merge, derive source_col_names from target schema
- _validate_source_has_target_cols accepts source_col_names set directly
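
A rough sketch of the fail-fast idea, under stated assumptions. In a self-merge on `_ROW_ID` every row matches itself, so a NOT MATCHED clause could never fire, and rejecting it during preparation surfaces the mistake before any dataset is built. The function `prepare_self_merge`, its parameters, and the error message are hypothetical and do not reflect the real `_prepare()` signature.

```python
from typing import List, Set


def prepare_self_merge(target_field_names: List[str], has_not_matched_clause: bool) -> Set[str]:
    """Validate a self-merge up front and derive source column names from the target schema."""
    if has_not_matched_clause:
        # Fail before any dataset is built.
        raise ValueError("self-merge on _ROW_ID does not support WHEN NOT MATCHED clauses")
    # Source and target are the same table, so no source normalization is needed:
    # the source columns are simply the target's columns.
    return set(target_field_names)
```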
XiaoHongbo-Hope marked this pull request as ready for review June 6, 2026 13:41
Align source_col_names with the actual columns projected by the
self-merge path, excluding blob columns that are not aliased.
Use full_target_field_names (not settable_field_names) for both
source validation and self-merge projection so blob columns can
be referenced in conditions. The update output schema still contains
settable columns only.
- test_self_merge_blob_source_condition: verify s.blob_col in
  condition passes validation and works at runtime
- test_self_merge_blob_target_condition_rejected: verify t.blob_col
  in condition is rejected by _prepare
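
The two tests above suggest an asymmetric rule, sketched below in plain Python. The rule itself is an assumption inferred from the test names: source-side references (`s.`) are checked against all target fields, including blob columns, while target-side references (`t.`) are checked against settable fields only. `validate_condition_refs` and the sample field lists are hypothetical.

```python
from typing import Iterable, List, Tuple


def validate_condition_refs(
    refs: Iterable[Tuple[str, str]],
    full_target_field_names: List[str],
    settable_field_names: List[str],
) -> None:
    """Raise ValueError if a condition references a column its side does not allow."""
    full = set(full_target_field_names)
    settable = set(settable_field_names)
    for alias, column in refs:
        # Assumed rule: 's' may see every target field, 't' only settable ones.
        allowed = full if alias == "s" else settable
        if column not in allowed:
            raise ValueError(f"column {alias}.{column} cannot be used in a merge condition")


# Hypothetical schema: blob_col exists in the table but is not settable.
full_fields = ["id", "name", "blob_col"]
settable_fields = ["id", "name"]
```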