feat: support StringSplitSQL for split_part by michaelmitchell-bit · Pull Request #4592 · apache/datafusion-comet

michaelmitchell-bit · 2026-06-04T15:09:54Z

Which issue does this PR close?

Closes #4561.

Rationale for this change

split_part lowers to element_at(StringSplitSQL(str, delimiter), partNum). Comet already supports element_at, but the inner StringSplitSQL expression was unsupported, so split_part fell back to Spark.

What changes are included in this PR?

Adds native support for Spark 4.x StringSplitSQL via a literal-delimiter split_sql function. Unlike split / StringSplit, this splits on the delimiter literally rather than as a regex and keeps trailing empty parts.

This also fixes ListExtract default-value typing so defaults are cast to the array element type rather than the list type. That is needed for split_part, which uses an empty-string default through element_at.

Updates expression support docs and adds SQL-file coverage for split_part, including literal delimiters, dynamic delimiters, nulls, out-of-range defaults, negative positions, and invalid zero index behavior.

How are these changes tested?

Ran:

cargo fmt -p datafusion-comet-spark-expr
cargo test -p datafusion-comet-spark-expr string_funcs::split
cargo test -p datafusion-comet-spark-expr list_extract
cargo build
./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite split_part" -Dscalastyle.skip=true
./mvnw compile -Pspark-4.0 -DskipTests -Dscalastyle.skip=true
./mvnw compile -Pspark-4.2 -DskipTests -Dscalastyle.skip=true
git diff --check

andygrove · 2026-06-04T21:26:10Z

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

michaelmitchell-bit · 2026-06-04T23:13:55Z

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

awesome, ty. added a fallback for non-utf8_binary collations and covered it in the split_part SQL file test.

andygrove · 2026-06-11T20:55:24Z

Thanks @michaelmitchell-bit could you fix conflicts?

michaelmitchell-bit · 2026-06-15T16:35:27Z

Thanks @michaelmitchell-bit could you fix conflicts?

just resolved & updated w/ latest main

andygrove · 2026-06-18T08:40:26Z

A few observations from a review pass. Acknowledging up front that this comment was drafted with help from AI (Claude Code).

native/spark-expr/src/string_funcs/split.rs — spark_split_sql matches (Array, Scalar), (Array, Array), and (Scalar, Scalar), but (Scalar, Array) falls through to the exec_err! arm. Could that shape ever come through, e.g. split_part('abc', d, 1) where the delimiter is a column? spark_split has the same gap so this is consistent with what's there today, but if it's reachable it's probably worth treating the same as (Array, Array) after broadcasting.
native/spark-expr/src/array_funcs/list_extract.rs — want to make sure I'm reading the cast change correctly. data_type(&schema) returns the child field's element type, so the previous code was casting the default value to the outer list type instead of the element type, which would either no-op into a typed null or fail for split_part's empty-string default. Switching to element_type lines up with how Spark's ElementAt(... Some(Literal.create("", str.dataType)) ...) wants the default typed. Is that the right reading?
spark/src/test/resources/sql-tests/expressions/string/split_part.sql — coverage looks solid. Two nice-to-haves if you feel like adding them. A column-data row with a multi-character delimiter would exercise the path where str::split advances by more than one byte per match. And a fixture where the string contains regex metacharacters like * or . paired with a literal delimiter would assert at the SQL level what test_split_sql_keeps_regex_chars_literal asserts in Rust.

feat: support StringSplitSQL for split_part

1b86767

fix: fall back for StringSplitSQL collations

a66ad75

Merge branch 'main' into support-stringsplitsql-split-part-4561

271be54

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: support StringSplitSQL for split_part#4592

feat: support StringSplitSQL for split_part#4592
michaelmitchell-bit wants to merge 3 commits into
apache:mainfrom
michaelmitchell-bit:support-stringsplitsql-split-part-4561

michaelmitchell-bit commented Jun 4, 2026

Uh oh!

andygrove commented Jun 4, 2026

Uh oh!

michaelmitchell-bit commented Jun 4, 2026

Uh oh!

andygrove commented Jun 11, 2026

Uh oh!

michaelmitchell-bit commented Jun 15, 2026

Uh oh!

andygrove commented Jun 18, 2026 •

edited

Loading

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

michaelmitchell-bit commented Jun 4, 2026

Which issue does this PR close?

Rationale for this change

What changes are included in this PR?

How are these changes tested?

Uh oh!

andygrove commented Jun 4, 2026

Uh oh!

michaelmitchell-bit commented Jun 4, 2026

Uh oh!

andygrove commented Jun 11, 2026

Uh oh!

michaelmitchell-bit commented Jun 15, 2026

Uh oh!

andygrove commented Jun 18, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

andygrove commented Jun 18, 2026 •

edited

Loading