[python][ray] Optimize merge into self-merge updates on data evolution table#8141
Open
XiaoHongbo-Hope wants to merge 9 commits into
3e832fe to
9633e85
When source == target with ON ['_ROW_ID'], skip the inner join and read the table only once. Aligned with Spark's isSelfMergeOnRowId detection logic.
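The detection described in this commit can be sketched roughly as follows. This is only an illustration: `is_self_merge_on_row_id` is a hypothetical name modeled on the Spark helper mentioned above, and it assumes source and target are given as plain table identifier strings, which may not match how the actual implementation represents them.

```python
ROW_ID = "_ROW_ID"


def is_self_merge_on_row_id(source, target, on):
    """Hypothetical check: same table on both sides, joined only on _ROW_ID."""
    # Assumption: a self-merge is recognized when the source is a table
    # identifier (not an arbitrary dataset) equal to the target identifier.
    same_table = isinstance(source, str) and source == target
    # The ON list must be exactly ['_ROW_ID']; any extra key disables the fast path.
    return same_table and list(on) == [ROW_ID]


if __name__ == "__main__":
    print(is_self_merge_on_row_id("db.t", "db.t", ["_ROW_ID"]))
    print(is_self_merge_on_row_id("db.s", "db.t", ["_ROW_ID"]))
```

When the check returns False, the merge would fall back to the regular join-based path.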
9633e85 to
ad75046
…artifacts
- Remove incorrect len(clauses) guards reintroduced during rebase in build_matched_update_ds and build_not_matched_insert_ds
- Align build_self_merge_update_ds with multi-clause fall-through pattern used by build_matched_update_ds
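As a rough illustration of the multi-clause fall-through pattern, the sketch below assumes that a row is handled by the first clause whose condition holds and falls through to the next clause otherwise. `apply_matched_clauses` and the clause representation (a condition callable paired with per-column assignment callables) are hypothetical and are not taken from the PR's code.

```python
def apply_matched_clauses(row, clauses):
    """Apply the first clause whose condition accepts the row.

    Hypothetical representation: each clause is (condition, assignments),
    where condition is a callable or None (unconditional) and assignments
    maps a column name to a callable computing the new value from the row.
    """
    for condition, assignments in clauses:
        if condition is None or condition(row):
            updated = dict(row)
            updated.update({col: fn(row) for col, fn in assignments.items()})
            return updated
        # Condition did not hold: fall through to the next clause.
    # No clause matched: the row is left unchanged.
    return row


clauses = [
    (lambda r: r["age"] >= 65, {"tier": lambda r: "senior"}),
    (None, {"tier": lambda r: "standard"}),
]

if __name__ == "__main__":
    print(apply_matched_clauses({"_ROW_ID": 0, "age": 70, "tier": ""}, clauses))
    print(apply_matched_clauses({"_ROW_ID": 1, "age": 30, "tier": ""}, clauses))
```

Under this assumption, guarding on len(clauses) is unnecessary because a single clause is just the one-element case of the same loop.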
… helper
- Allow source_col('_ROW_ID') and s/t._ROW_ID in self-merge conditions
- Extract _build_matched_transform to deduplicate clause preparation logic
- Remove unnecessary @skipIf from tests that don't need datafusion
- Add tests for source_col('_ROW_ID'), s._ROW_ID, t._ROW_ID conditions, and multi-clause fall-through
…move extra_valid_cols
- Reject NOT MATCHED in _prepare() instead of _build_datasets() (fail fast, align with Spark)
- Skip _normalize_source for self-merge, derive source_col_names from target schema
- _validate_source_has_target_cols accepts source_col_names set directly
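The fail-fast idea can be sketched as below. `MergePlan` and `prepare` are hypothetical stand-ins, not the PR's actual classes; the sketch only shows the validation happening before any dataset is built. The rationale given in the comment is an assumption: in a self-merge on _ROW_ID every source row has a matching target row, so a NOT MATCHED clause could never apply.

```python
from dataclasses import dataclass, field


@dataclass
class MergePlan:
    """Hypothetical container for a merge request."""
    is_self_merge: bool
    matched_clauses: list = field(default_factory=list)
    not_matched_clauses: list = field(default_factory=list)


def prepare(plan):
    """Validate the plan up front, before any dataset is built."""
    # Assumption: NOT MATCHED can never fire in a self-merge on _ROW_ID,
    # so it is rejected here rather than later during dataset construction.
    if plan.is_self_merge and plan.not_matched_clauses:
        raise ValueError("NOT MATCHED clauses are not supported for self-merge")
    return plan


if __name__ == "__main__":
    prepare(MergePlan(is_self_merge=True, matched_clauses=["update"]))
    try:
        prepare(MergePlan(True, ["update"], ["insert"]))
    except ValueError as exc:
        print(f"rejected early: {exc}")
```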
Align source_col_names with the actual columns projected by the self-merge path, excluding blob columns that are not aliased.
Use full_target_field_names (not settable_field_names) for both source validation and self-merge projection so blob columns can be referenced in conditions. Update output schema remains settable columns only.
- test_self_merge_blob_source_condition: verify s.blob_col in condition passes validation and works at runtime
- test_self_merge_blob_target_condition_rejected: verify t.blob_col in condition is rejected by _prepare
Purpose
When source and target are the same table in merge_into (self-merge), skip the join step and read the target table directly with a _ROW_ID projection. This avoids a full table join and significantly improves performance for self-merge update scenarios on data-evolution tables.
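To see why the join can be skipped, the following sketch uses plain Python lists of dicts rather than Ray datasets or the pypaimon API. It assumes _ROW_ID is unique per row; the helper names are hypothetical. Under that assumption, an inner join of a table with itself on _ROW_ID pairs every row with itself, which is exactly what a single scan produces.

```python
def inner_join_on_row_id(source_rows, target_rows):
    """Naive inner join on _ROW_ID, returning (source_row, target_row) pairs."""
    # Assumption: _ROW_ID is unique within target_rows.
    target_by_id = {r["_ROW_ID"]: r for r in target_rows}
    return [
        (s, target_by_id[s["_ROW_ID"]])
        for s in source_rows
        if s["_ROW_ID"] in target_by_id
    ]


def self_merge_pairs(table_rows):
    """Single scan: each row serves as both the source and the target side."""
    return [(r, r) for r in table_rows]


table = [
    {"_ROW_ID": 0, "name": "a", "score": 1},
    {"_ROW_ID": 1, "name": "b", "score": 2},
]

if __name__ == "__main__":
    # Both paths yield the same pairs, but the second reads the table once.
    print(inner_join_on_row_id(table, table) == self_merge_pairs(table))
```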
Tests
- test_self_merge_basic
- test_self_merge_with_source_condition
- test_self_merge_with_target_condition
- test_self_merge_blob_source_condition: blob table self-merge (skipped, blocked by [python] Fix partial update for normal column on blob/vector tables #8147)
- test_self_merge_blob_target_condition_rejected