refactor: name build-row and matchable-map presence checks in hash join#23024
refactor: name build-row and matchable-map presence checks in hash join#23024Phoenix500526 wants to merge 2 commits into
Conversation
|
The failing I have fixed it separately in #23044; once that merges, rebasing this branch should turn the check green. |
neilconway
left a comment
There was a problem hiding this comment.
Thanks! Useful cleanup.
There was a problem hiding this comment.
How about using has_matchable_build_rows here?
d231c4d to
2f42220
Compare
|
hi @Phoenix500526 |
Sure, I opened a new PR #23053 to fix it. |
`state_after_build_ready` distinguishes two facts about the collected build side: whether it physically has rows, and whether its hash map has any matchable entries. These differ under `NullEquality::NullEqualsNothing`, where build rows with a NULL join key are omitted from the map -- the distinction that was the source of the bug fixed in apache#22893. Expose it as named methods on `JoinLeftData` (`has_build_rows` / `has_matchable_build_rows`) instead of repeating raw `batch().num_rows()` / `map().is_empty()` checks at the call site, so the invariant is visible at the API surface and future uses are harder to get wrong. No behavior change. Closes apache#23008 Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Drop the explicit intra-doc link reference on `has_matchable_build_rows`: the inline `[`NullEquality::NullEqualsNothing`]` link resolves on its own since the type is in scope, so spelling out the path again is redundant. Also use `has_matchable_build_rows()` for the empty-build-side check in `HashJoinStream` instead of a raw `map().is_empty()`, keeping the presence checks consistent. No behavior change. Signed-off-by: Jiawei Zhao <Phoenix500526@163.com>
Head branch was pushed to by a user without write access
2f42220 to
b39f8f1
Compare
Which issue does this PR close?
Rationale for this change
HashJoinStream::state_after_build_readyrelies on two different facts about the collected build side:batch().num_rows())map().is_empty())These diverge under
NullEquality::NullEqualsNothing: build rows whose join key is NULL are omitted from the map, so the map can be empty while the build side still has rows. Conflating the two was the source of the bug fixed in #22893.Today both checks are written inline as raw
batch().num_rows() == 0/map().is_empty(). Giving each a name onJoinLeftDatamakes the distinction explicit at the API surface, so future call sites are harder to get wrong.What changes are included in this PR?
JoinLeftData:has_build_rows()— build-side row presencehas_matchable_build_rows()— matchable hash-map entry presencestate_after_build_readyto use them instead of the raw checks.Pure readability refactor — no behavior change.
Are these changes tested?
Covered by the existing hash join tests;
cargo test -p datafusion-physical-plan hash_joinpasses (804 tests). Since this is a readability-only refactor with no behavior change, no new tests are added.Are there any user-facing changes?
No.