feat: add support for array_position expression by andygrove · Pull Request #3172 · apache/datafusion-comet

andygrove · 2026-01-15T02:27:49Z

Which issue does this PR close?

Closes #3157.
Closes #3153.

Rationale for this change

Spark's array_position function is not currently accelerated by Comet. Adding native support allows this expression to run on the native execution engine.

What changes are included in this PR?

Adds native Comet support for Spark's array_position function, which returns the 1-based position of an element in an array, or 0 if not found.

This required a custom Rust implementation because DataFusion's array_position returns UInt64 and null when not found, while Spark returns Int64 (LongType) and 0.

Key implementation details:

Returns Int64 to match Spark's LongType
Returns 0 when element is not found (Spark behavior)
Returns null when array is null or search element is null
Supports both List and LargeList array types
Handles NaN equality to match Spark's ordering.equiv() semantics (NaN == NaN)
Uses flat values buffer with direct offset access for efficient element comparison

Benchmark Results

OpenJDK 64-Bit Server VM 17.0.17+10 on Mac OS X 26.2
Apple M3 Ultra
array_position - int array:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                88             92           3         11.9          83.8       1.0X
Comet (Scan)                                        128            138           7          8.2         122.0       0.7X
Comet (Scan + Exec)                                  73             76           3         14.4          69.4       1.2X

array_position - string array:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               157            163           3          6.7         149.5       1.0X
Comet (Scan)                                        269            275           4          3.9         256.9       0.6X
Comet (Scan + Exec)                                 147            150           2          7.1         140.4       1.1X

Comet native execution is 1.1-1.2X faster than Spark for this expression when using the native DataFusion scan.

How are these changes tested?

SQL file-based tests in spark/src/test/resources/sql-tests/expressions/array/array_position.sql covering:

Integer, string, boolean, tinyint, smallint, bigint, float, double, decimal, date, and timestamp arrays
Null handling (null array, null search element, null elements within arrays)
Column vs literal argument combinations
Empty arrays and duplicate elements
NaN edge cases for float and double arrays
Microbenchmark in CometArrayExpressionBenchmark

Implements Spark's array_position function which returns the 1-based position of an element in an array, returning 0 if not found. This required a custom Rust implementation because DataFusion's array_position returns UInt64 and null when not found, while Spark returns Int64 (LongType) and 0. Key implementation details: - Returns Int64 to match Spark's LongType - Returns 0 when element is not found (Spark behavior) - Returns null when array is null or search element is null - Supports both List and LargeList array types Closes apache#3153 Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>

codecov-commenter · 2026-01-15T15:40:27Z

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.96%. Comparing base (f09f8af) to head (36c3320).
⚠️ Report is 907 commits behind head on main.

Files with missing lines	Patch %	Lines
...src/main/scala/org/apache/comet/serde/arrays.scala	75.00%	2 Missing and 1 partial ⚠️

Additional details and impacted files

@@             Coverage Diff              @@
##               main    #3172      +/-   ##
============================================
+ Coverage     56.12%   59.96%   +3.84%     
- Complexity      976     1462     +486     
============================================
  Files           119      175      +56     
  Lines         11743    16180    +4437     
  Branches       2251     2684     +433     
============================================
+ Hits           6591     9703    +3112     
- Misses         4012     5128    +1116     
- Partials       1140     1349     +209

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

# Conflicts: # docs/source/user-guide/latest/configs.md # native/spark-expr/src/comet_scalar_funcs.rs

andygrove · 2026-01-30T15:03:31Z

Moving this to draft until #3328 is merged

Move array_position tests from CometArrayExpressionSuite to a SQL file test and fall back to Spark when all arguments are literals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

…o feature/array-position # Conflicts: # native/spark-expr/src/array_funcs/mod.rs # native/spark-expr/src/comet_scalar_funcs.rs

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Remove stray array_repeat references from merge conflict resolution. Add NULL val row to test data and add tests for all supported array element types: boolean, tinyint, smallint, bigint, float, double, decimal, date, and timestamp. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

comphead · 2026-02-26T16:23:00Z

Just wondering can we reuse DF? the builtin function gets optimized in apache/datafusion#20532

andygrove · 2026-03-07T15:23:07Z

Just wondering can we reuse DF? the builtin function gets optimized in apache/datafusion#20532

The DF implementation isn't compatible with Spark though.

… review feedback - Use typed array downcasting instead of ScalarValue for element comparison, improving performance from 0.4X to 0.7-0.8X of Spark - Add getSupportLevel override marking as Incompatible (NaN equality) - Add NaN edge case tests for float/double arrays - Add CometArrayExpressionBenchmark microbenchmark - Make spark_array_position function private - Update docs to mark array_position as supported

Treat NaN == NaN in float/double comparisons, matching Spark's ordering.equiv() behavior. This makes array_position Compatible rather than Incompatible.

Avoid per-row subarray allocation from list_array.value(row_index). Instead, downcast the flat values buffer once and iterate using offset ranges directly. Improves from 0.7-0.8X to 0.9X of Spark.

Switches benchmark to use SCAN_NATIVE_DATAFUSION for the Comet cases, avoiding JVM parquet reader overhead. Results now show Comet is 1.1-1.2X faster than Spark.

andygrove marked this pull request as draft January 15, 2026 02:28

update docs

8cb27ec

andygrove marked this pull request as ready for review January 15, 2026 23:36

Merge remote-tracking branch 'origin/main' into feature/array-position

36c3320

# Conflicts: # docs/source/user-guide/latest/configs.md # native/spark-expr/src/comet_scalar_funcs.rs

andygrove marked this pull request as draft January 30, 2026 01:48

andygrove and others added 2 commits February 10, 2026 11:07

Merge remote-tracking branch 'apache/main' into feature/array-position

4864e9a

Migrate array_position tests to SQL file-based approach

258442e

Move array_position tests from CometArrayExpressionSuite to a SQL file test and fall back to Spark when all arguments are literals. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

andygrove marked this pull request as ready for review February 11, 2026 00:08

andygrove added 2 commits February 11, 2026 10:08

upmerge

3f7d69d

Merge branch 'main' of https://github.com/apache/datafusion-comet int…

554dcf8

…o feature/array-position # Conflicts: # native/spark-expr/src/array_funcs/mod.rs # native/spark-expr/src/comet_scalar_funcs.rs

andygrove requested review from comphead, mbutrovich and parthchandra February 18, 2026 14:41

andygrove and others added 4 commits February 18, 2026 07:44

revert whitespace change in CometArrayExpressionSuite.scala

b716584

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

rustfmt

cd3a577

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

Merge branch 'main' into feature/array-position

d9d424f

andygrove added 3 commits March 7, 2026 08:26

upmerge

c66024e

Merge remote-tracking branch 'apache/main' into feature/array-position

6172a0c

fix: remove stray merge conflict marker

e4620e9

andygrove marked this pull request as draft March 16, 2026 17:43

andygrove added 5 commits March 16, 2026 12:58

fix: explain NaN incompatibility in ArrayPosition getSupportLevel

fad07b9

fix: handle NaN equality in array_position to match Spark semantics

24318d3

Treat NaN == NaN in float/double comparisons, matching Spark's ordering.equiv() behavior. This makes array_position Compatible rather than Incompatible.

perf: use flat values buffer and offsets for array_position

b7ddad1

Avoid per-row subarray allocation from list_array.value(row_index). Instead, downcast the flat values buffer once and iterate using offset ranges directly. Improves from 0.7-0.8X to 0.9X of Spark.

perf: use native_datafusion scan in array_position benchmark

79b348f

Switches benchmark to use SCAN_NATIVE_DATAFUSION for the Comet cases, avoiding JVM parquet reader overhead. Results now show Comet is 1.1-1.2X faster than Spark.

format

41b5495

andygrove marked this pull request as ready for review March 18, 2026 12:08

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add support for array_position expression#3172

feat: add support for array_position expression#3172
andygrove wants to merge 20 commits intoapache:mainfrom
andygrove:feature/array-position

andygrove commented Jan 15, 2026 •

edited

Loading

Uh oh!

codecov-commenter commented Jan 15, 2026 •

edited

Loading

Uh oh!

andygrove commented Jan 30, 2026

Uh oh!

comphead commented Feb 26, 2026

Uh oh!

andygrove commented Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

Conversation

andygrove commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

Benchmark Results

How are these changes tested?

Uh oh!

codecov-commenter commented Jan 15, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

andygrove commented Jan 30, 2026

Uh oh!

comphead commented Feb 26, 2026

Uh oh!

andygrove commented Mar 7, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants

andygrove commented Jan 15, 2026 •

edited

Loading

codecov-commenter commented Jan 15, 2026 •

edited

Loading