feat: Support Spark array_contains builtin function#20685
feat: Support Spark array_contains builtin function#20685comphead merged 4 commits intoapache:mainfrom
array_contains builtin function#20685Conversation
mbutrovich
left a comment
There was a problem hiding this comment.
I don't love the needle and haystack naming. I'm not sure if that idiom generalizes to all contributors.
| 2 NULL NULL | ||
| 3 false false | ||
| 4 NULL NULL | ||
| 5 true true |
There was a problem hiding this comment.
Is that what Spark does? It short circuits and returns true even if there are NULLs in the array? What if the match is after the NULL? Would it return NULL? I thought if any element of the array was NULL the outcome of the expression is NULL.
There was a problem hiding this comment.
scala> spark.sql("select array_contains(array(1, null, 2), 2)").show(false)
+------------------------------------+
|array_contains(array(1, NULL, 2), 2)|
+------------------------------------+
|true |
+------------------------------------+
scala> spark.sql("select array_contains(array(1, 2, null), 2)").show(false)
+------------------------------------+
|array_contains(array(1, 2, NULL), 2)|
+------------------------------------+
|true |
+------------------------------------+
scala> spark.sql("select array_contains(array(1, null), 2)").show(false)
+---------------------------------+
|array_contains(array(1, NULL), 2)|
+---------------------------------+
|null |
+---------------------------------+
I think explanation in https://issues.apache.org/jira/browse/SPARK-55749 is very accurately explaining this behavior
There was a problem hiding this comment.
on this data Spark run
=== Test Data ===
+----+----------------+----+
|col1|col2 |col3|
+----+----------------+----+
|1 |[1, 2, 3] |10 |
|2 |[4, null, 6] |5 |
|3 |[7, 8, 9] |10 |
|4 |null |1 |
|5 |[10, null, null]|10 |
+----+----------------+----+
=== array_contains(col2, col3) and array_contains(col2, 10) ===
+----+--------------------------+------------------------+
|col1|array_contains(col2, col3)|array_contains(col2, 10)|
+----+--------------------------+------------------------+
|1 |false |false |
|2 |null |null |
|3 |false |false |
|4 |null |null |
|5 |true |true |
+----+--------------------------+------------------------+
it is what currently used in |
| let haystack = match haystack_arg { | ||
| ColumnarValue::Array(arr) => Arc::clone(arr), | ||
| ColumnarValue::Scalar(s) => s.to_array_of_size(result.len())?, | ||
| }; |
There was a problem hiding this comment.
| let haystack = match haystack_arg { | |
| ColumnarValue::Array(arr) => Arc::clone(arr), | |
| ColumnarValue::Scalar(s) => s.to_array_of_size(result.len())?, | |
| }; | |
| let haystack = haystack_arg.to_array_of_size(result.len())?; |
| let old_validity = match result.nulls() { | ||
| Some(n) => n.inner().clone(), | ||
| None => BooleanBuffer::new_set(result.len()), | ||
| }; | ||
| let new_validity = &old_validity & &(!&nullify_mask); |
There was a problem hiding this comment.
| let old_validity = match result.nulls() { | |
| Some(n) => n.inner().clone(), | |
| None => BooleanBuffer::new_set(result.len()), | |
| }; | |
| let new_validity = &old_validity & &(!&nullify_mask); | |
| let new_validity = match result.nulls() { | |
| Some(n) => n.inner() & &(!&nullify_mask), | |
| None => !&nullify_mask, | |
| }; |
| let buf = builder.finish(); | ||
| Some(mask_with_list_nulls(buf, list.nulls())) | ||
| } | ||
| _ => None, |
There was a problem hiding this comment.
Probably better to just error/panic here to make it clear this None path is unreachable?
|
Thanks @Jefffrey and @mbutrovich for the review |
## Which issue does this PR close? <!-- We generally require a GitHub issue to be filed for all bug fixes and enhancements and this helps us generate change logs for our releases. You can link an issue to this PR using the GitHub syntax. For example `Closes #123` indicates that this PR will close issue #123. --> - Closes apache#20611 . ## Rationale for this change <!-- Why are you proposing this change? If this is already explained clearly in the issue then this section is not needed. Explaining clearly why changes are proposed helps reviewers understand your changes and offer better suggestions for fixes. --> ## What changes are included in this PR? The Spark function is actual wrapper on top of `array_has` function. After result is being produced the nulls mask is set respectively for the output indices which correspond to input rows having nulls <!-- There is no need to duplicate the description in the issue here but it is sometimes worth providing a summary of the individual changes in this PR. --> ## Are these changes tested? <!-- We typically require tests for all PRs in order to: 1. Prevent the code from being accidentally broken by subsequent changes 2. Serve as another way to document the expected behavior of the code If tests are not included in your PR, please explain why (for example, are they covered by existing tests)? --> ## Are there any user-facing changes? <!-- If there are user-facing changes then we may require documentation to be updated before approving the PR. --> <!-- If there are any breaking changes to public APIs, please add the `api change` label. --> (cherry picked from commit 953bdf4)
Which issue does this PR close?
array_contains#20611 .Rationale for this change
What changes are included in this PR?
The Spark function is actual wrapper on top of
array_hasfunction. After result is being produced the nulls mask is set respectively for the output indices which correspond to input rows having nullsAre these changes tested?
Are there any user-facing changes?