Skip to content

feat: add support for array_position expression#3172

Open
andygrove wants to merge 20 commits intoapache:mainfrom
andygrove:feature/array-position
Open

feat: add support for array_position expression#3172
andygrove wants to merge 20 commits intoapache:mainfrom
andygrove:feature/array-position

Conversation

@andygrove
Copy link
Member

@andygrove andygrove commented Jan 15, 2026

Which issue does this PR close?

Closes #3157.
Closes #3153.

Rationale for this change

Spark's array_position function is not currently accelerated by Comet. Adding native support allows this expression to run on the native execution engine.

What changes are included in this PR?

Adds native Comet support for Spark's array_position function, which returns the 1-based position of an element in an array, or 0 if not found.

This required a custom Rust implementation because DataFusion's array_position returns UInt64 and null when not found, while Spark returns Int64 (LongType) and 0.

Key implementation details:

  • Returns Int64 to match Spark's LongType
  • Returns 0 when element is not found (Spark behavior)
  • Returns null when array is null or search element is null
  • Supports both List and LargeList array types
  • Handles NaN equality to match Spark's ordering.equiv() semantics (NaN == NaN)
  • Uses flat values buffer with direct offset access for efficient element comparison

Benchmark Results

OpenJDK 64-Bit Server VM 17.0.17+10 on Mac OS X 26.2
Apple M3 Ultra
array_position - int array:               Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                                88             92           3         11.9          83.8       1.0X
Comet (Scan)                                        128            138           7          8.2         122.0       0.7X
Comet (Scan + Exec)                                  73             76           3         14.4          69.4       1.2X

array_position - string array:            Best Time(ms)   Avg Time(ms)   Stdev(ms)    Rate(M/s)   Per Row(ns)   Relative
------------------------------------------------------------------------------------------------------------------------
Spark                                               157            163           3          6.7         149.5       1.0X
Comet (Scan)                                        269            275           4          3.9         256.9       0.6X
Comet (Scan + Exec)                                 147            150           2          7.1         140.4       1.1X

Comet native execution is 1.1-1.2X faster than Spark for this expression when using the native DataFusion scan.

How are these changes tested?

SQL file-based tests in spark/src/test/resources/sql-tests/expressions/array/array_position.sql covering:

  • Integer, string, boolean, tinyint, smallint, bigint, float, double, decimal, date, and timestamp arrays
  • Null handling (null array, null search element, null elements within arrays)
  • Column vs literal argument combinations
  • Empty arrays and duplicate elements
  • NaN edge cases for float and double arrays
  • Microbenchmark in CometArrayExpressionBenchmark

Implements Spark's array_position function which returns the 1-based
position of an element in an array, returning 0 if not found.

This required a custom Rust implementation because DataFusion's
array_position returns UInt64 and null when not found, while Spark
returns Int64 (LongType) and 0.

Key implementation details:
- Returns Int64 to match Spark's LongType
- Returns 0 when element is not found (Spark behavior)
- Returns null when array is null or search element is null
- Supports both List and LargeList array types

Closes apache#3153

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as draft January 15, 2026 02:28
@codecov-commenter
Copy link

codecov-commenter commented Jan 15, 2026

Codecov Report

❌ Patch coverage is 76.92308% with 3 lines in your changes missing coverage. Please review.
✅ Project coverage is 59.96%. Comparing base (f09f8af) to head (36c3320).
⚠️ Report is 907 commits behind head on main.

Files with missing lines Patch % Lines
...src/main/scala/org/apache/comet/serde/arrays.scala 75.00% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@             Coverage Diff              @@
##               main    #3172      +/-   ##
============================================
+ Coverage     56.12%   59.96%   +3.84%     
- Complexity      976     1462     +486     
============================================
  Files           119      175      +56     
  Lines         11743    16180    +4437     
  Branches       2251     2684     +433     
============================================
+ Hits           6591     9703    +3112     
- Misses         4012     5128    +1116     
- Partials       1140     1349     +209     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@andygrove andygrove marked this pull request as ready for review January 15, 2026 23:36
# Conflicts:
#	docs/source/user-guide/latest/configs.md
#	native/spark-expr/src/comet_scalar_funcs.rs
@andygrove andygrove marked this pull request as draft January 30, 2026 01:48
@andygrove
Copy link
Member Author

Moving this to draft until #3328 is merged

andygrove and others added 2 commits February 10, 2026 11:07
Move array_position tests from CometArrayExpressionSuite to a SQL file
test and fall back to Spark when all arguments are literals.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@andygrove andygrove marked this pull request as ready for review February 11, 2026 00:08
…o feature/array-position

# Conflicts:
#	native/spark-expr/src/array_funcs/mod.rs
#	native/spark-expr/src/comet_scalar_funcs.rs
andygrove and others added 4 commits February 18, 2026 07:44
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove stray array_repeat references from merge conflict resolution.
Add NULL val row to test data and add tests for all supported array
element types: boolean, tinyint, smallint, bigint, float, double,
decimal, date, and timestamp.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@comphead
Copy link
Contributor

Just wondering can we reuse DF? the builtin function gets optimized in apache/datafusion#20532

@andygrove
Copy link
Member Author

Just wondering can we reuse DF? the builtin function gets optimized in apache/datafusion#20532

The DF implementation isn't compatible with Spark though.

@andygrove andygrove marked this pull request as draft March 16, 2026 17:43
… review feedback

- Use typed array downcasting instead of ScalarValue for element
  comparison, improving performance from 0.4X to 0.7-0.8X of Spark
- Add getSupportLevel override marking as Incompatible (NaN equality)
- Add NaN edge case tests for float/double arrays
- Add CometArrayExpressionBenchmark microbenchmark
- Make spark_array_position function private
- Update docs to mark array_position as supported
Treat NaN == NaN in float/double comparisons, matching Spark's
ordering.equiv() behavior. This makes array_position Compatible
rather than Incompatible.
Avoid per-row subarray allocation from list_array.value(row_index).
Instead, downcast the flat values buffer once and iterate using
offset ranges directly. Improves from 0.7-0.8X to 0.9X of Spark.
Switches benchmark to use SCAN_NATIVE_DATAFUSION for the Comet cases,
avoiding JVM parquet reader overhead. Results now show Comet is
1.1-1.2X faster than Spark.
@andygrove andygrove marked this pull request as ready for review March 18, 2026 12:08
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

[Feature] Support Spark expression: array_position [Feature] Support Spark expression: length_of_json_array

3 participants