feat: Support Spark array_contains builtin function#20685

Merged
comphead merged 4 commits into apache:main from comphead:spark
Mar 5, 2026

Conversation

@comphead (Contributor) commented Mar 3, 2026

Which issue does this PR close?

- Closes #20611.

Rationale for this change

What changes are included in this PR?

The Spark function is a thin wrapper on top of the `array_has` function. After the result is produced, the null mask is set for the output indices that correspond to input rows containing nulls.
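
The null-masking step described above can be sketched in plain Rust (a hypothetical helper over plain Vecs, not the actual arrow-rs buffer types used in the PR):

```rust
// Hypothetical sketch of the null-mask post-processing: `result` is the
// boolean output of array_has, and `input_is_null[i]` is true when input
// row i must be nullified (e.g. its array argument was NULL).
fn apply_null_mask(result: Vec<bool>, input_is_null: &[bool]) -> Vec<Option<bool>> {
    result
        .into_iter()
        .zip(input_is_null)
        .map(|(v, &is_null)| if is_null { None } else { Some(v) })
        .collect()
}
```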

Are these changes tested?

Are there any user-facing changes?

@github-actions bot added the documentation, development-process, sqllogictest (SQL Logic Tests (.slt)), and spark labels on Mar 3, 2026
@github-actions bot removed the documentation and development-process labels on Mar 3, 2026

@mbutrovich (Contributor) left a comment:

I don't love the needle and haystack naming. I'm not sure if that idiom generalizes to all contributors.

Excerpt of the expected test output the comment is attached to:

2 NULL NULL
3 false false
4 NULL NULL
5 true true
Contributor commented:

Is that what Spark does? It short circuits and returns true even if there are NULLs in the array? What if the match is after the NULL? Would it return NULL? I thought if any element of the array was NULL the outcome of the expression is NULL.

comphead (Contributor Author) replied:

scala> spark.sql("select array_contains(array(1, null, 2), 2)").show(false)
+------------------------------------+
|array_contains(array(1, NULL, 2), 2)|
+------------------------------------+
|true                                |
+------------------------------------+


scala> spark.sql("select array_contains(array(1, 2, null), 2)").show(false)
+------------------------------------+
|array_contains(array(1, 2, NULL), 2)|
+------------------------------------+
|true                                |
+------------------------------------+


scala> spark.sql("select array_contains(array(1, null), 2)").show(false)
+---------------------------------+
|array_contains(array(1, NULL), 2)|
+---------------------------------+
|null                             |
+---------------------------------+

I think the explanation in https://issues.apache.org/jira/browse/SPARK-55749 describes this behavior very accurately.
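
The spark-shell output above amounts to three-valued logic: a match wins outright, otherwise a NULL element makes the result NULL. A small sketch modeling that rule (plain Rust over `Option` values, an illustration rather than the PR's actual implementation):

```rust
// Sketch of Spark's array_contains semantics as shown above:
// true if the value is present, NULL if it is absent but the array
// contains a NULL element, false otherwise.
fn array_contains(arr: &[Option<i32>], value: i32) -> Option<bool> {
    if arr.iter().any(|e| *e == Some(value)) {
        Some(true) // a match short-circuits even when NULLs exist
    } else if arr.iter().any(|e| e.is_none()) {
        None // absent, but a NULL element means "unknown"
    } else {
        Some(false)
    }
}
```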

comphead (Contributor Author) added:

Spark produces the following on this data:

=== Test Data ===
+----+----------------+----+                                                    
|col1|col2            |col3|
+----+----------------+----+
|1   |[1, 2, 3]       |10  |
|2   |[4, null, 6]    |5   |
|3   |[7, 8, 9]       |10  |
|4   |null            |1   |
|5   |[10, null, null]|10  |
+----+----------------+----+

=== array_contains(col2, col3) and array_contains(col2, 10) ===
+----+--------------------------+------------------------+
|col1|array_contains(col2, col3)|array_contains(col2, 10)|
+----+--------------------------+------------------------+
|1   |false                     |false                   |
|2   |null                      |null                    |
|3   |false                     |false                   |
|4   |null                      |null                    |
|5   |true                      |true                    |
+----+--------------------------+------------------------+

@comphead requested a review from mbutrovich, March 4, 2026 23:35
@comphead (Contributor Author) commented Mar 4, 2026

I don't love the needle and haystack naming. I'm not sure if that idiom generalizes to all contributors.

It is what is currently used in `array_has*`, `position`, and some other functions. But I agree, I'm not sure this is the most idiomatic approach.

@mbutrovich (Contributor) left a comment:

Thanks @comphead!

Comment on lines +91 to +94
let haystack = match haystack_arg {
    ColumnarValue::Array(arr) => Arc::clone(arr),
    ColumnarValue::Scalar(s) => s.to_array_of_size(result.len())?,
};
Contributor commented:

Suggested change
-let haystack = match haystack_arg {
-    ColumnarValue::Array(arr) => Arc::clone(arr),
-    ColumnarValue::Scalar(s) => s.to_array_of_size(result.len())?,
-};
+let haystack = haystack_arg.to_array_of_size(result.len())?;
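
The suggested one-liner works because broadcasting a scalar to a full-length array normalizes both cases into a single code path. A plain-Rust sketch of that scalar-vs-array normalization (hypothetical types, not the actual arrow-rs/DataFusion API):

```rust
// Hypothetical model of a columnar value: either a materialized column
// or a single scalar to be broadcast.
enum Columnar {
    Array(Vec<i32>),
    Scalar(i32),
}

// Normalize to a column of `len` rows, mirroring the idea behind
// ColumnarValue::to_array_of_size in the suggestion above.
fn to_array_of_size(v: Columnar, len: usize) -> Vec<i32> {
    match v {
        Columnar::Array(a) => a,             // already materialized
        Columnar::Scalar(s) => vec![s; len], // broadcast scalar to len rows
    }
}
```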

Comment on lines +108 to +112
let old_validity = match result.nulls() {
    Some(n) => n.inner().clone(),
    None => BooleanBuffer::new_set(result.len()),
};
let new_validity = &old_validity & &(!&nullify_mask);
Contributor commented:

Suggested change
-let old_validity = match result.nulls() {
-    Some(n) => n.inner().clone(),
-    None => BooleanBuffer::new_set(result.len()),
-};
-let new_validity = &old_validity & &(!&nullify_mask);
+let new_validity = match result.nulls() {
+    Some(n) => n.inner() & &(!&nullify_mask),
+    None => !&nullify_mask,
+};
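
Both variants of the suggestion compute the same thing: a slot stays valid only if it was valid before and is not being nullified. A hypothetical plain-Rust sketch over boolean slices (the PR operates on arrow-rs BooleanBuffers instead):

```rust
// Combine an existing validity mask with a nullify mask: valid AND NOT nullify.
fn combine_validity(old_validity: &[bool], nullify_mask: &[bool]) -> Vec<bool> {
    old_validity
        .iter()
        .zip(nullify_mask)
        .map(|(&valid, &nullify)| valid && !nullify)
        .collect()
}
```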

        let buf = builder.finish();
        Some(mask_with_list_nulls(buf, list.nulls()))
    }
    _ => None,
Contributor commented:

Probably better to just error/panic here to make it clear this None path is unreachable?

@comphead (Contributor Author) commented Mar 5, 2026

Thanks @Jefffrey and @mbutrovich for the review

@comphead added this pull request to the merge queue Mar 5, 2026
Merged via the queue into apache:main with commit 953bdf4 Mar 5, 2026
30 checks passed
comphead added a commit to comphead/arrow-datafusion that referenced this pull request Mar 5, 2026
## Which issue does this PR close?

- Closes apache#20611.

## Rationale for this change

## What changes are included in this PR?

The Spark function is a thin wrapper on top of the `array_has` function.
After the result is produced, the null mask is set for the output indices
that correspond to input rows containing nulls.

## Are these changes tested?

## Are there any user-facing changes?

(cherry picked from commit 953bdf4)

Labels

spark, sqllogictest (SQL Logic Tests (.slt))

Development

Successfully merging this pull request may close these issues.

Create Spark builtin function array_contains

4 participants