Skip to content

feat: support StringSplitSQL for split_part#4592

Open
michaelmitchell-bit wants to merge 3 commits into
apache:mainfrom
michaelmitchell-bit:support-stringsplitsql-split-part-4561
Open

feat: support StringSplitSQL for split_part#4592
michaelmitchell-bit wants to merge 3 commits into
apache:mainfrom
michaelmitchell-bit:support-stringsplitsql-split-part-4561

Conversation

@michaelmitchell-bit

Copy link
Copy Markdown
Contributor

Which issue does this PR close?

Closes #4561.

Rationale for this change

split_part lowers to element_at(StringSplitSQL(str, delimiter), partNum). Comet already supports element_at, but the inner StringSplitSQL expression was unsupported, so split_part fell back to Spark.

What changes are included in this PR?

Adds native support for Spark 4.x StringSplitSQL via a literal-delimiter split_sql function. Unlike split / StringSplit, this splits on the delimiter literally rather than as a regex and keeps trailing empty parts.

This also fixes ListExtract default-value typing so defaults are cast to the array element type rather than the list type. That is needed for split_part, which uses an empty-string default through element_at.

Updates expression support docs and adds SQL-file coverage for split_part, including literal delimiters, dynamic delimiters, nulls, out-of-range defaults, negative positions, and invalid zero index behavior.

How are these changes tested?

Ran:

  • cargo fmt -p datafusion-comet-spark-expr
  • cargo test -p datafusion-comet-spark-expr string_funcs::split
  • cargo test -p datafusion-comet-spark-expr list_extract
  • cargo build
  • ./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite split_part" -Dscalastyle.skip=true
  • ./mvnw compile -Pspark-4.0 -DskipTests -Dscalastyle.skip=true
  • ./mvnw compile -Pspark-4.2 -DskipTests -Dscalastyle.skip=true
  • git diff --check

@andygrove

Copy link
Copy Markdown
Member

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

@michaelmitchell-bit

Copy link
Copy Markdown
Contributor Author

Thanks @michaelmitchell-bit. This looks good overall.

StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See concat and reverse for examples.

awesome, ty. added a fallback for non-utf8_binary collations and covered it in the split_part SQL file test.

@andygrove

Copy link
Copy Markdown
Member

Thanks @michaelmitchell-bit could you fix conflicts?

@michaelmitchell-bit

Copy link
Copy Markdown
Contributor Author

Thanks @michaelmitchell-bit could you fix conflicts?

just resolved & updated w/ latest main

@andygrove

andygrove commented Jun 18, 2026

Copy link
Copy Markdown
Member

A few observations from a review pass. Acknowledging up front that this comment was drafted with help from AI (Claude Code).

  1. native/spark-expr/src/string_funcs/split.rsspark_split_sql matches (Array, Scalar), (Array, Array), and (Scalar, Scalar), but (Scalar, Array) falls through to the exec_err! arm. Could that shape ever come through, e.g. split_part('abc', d, 1) where the delimiter is a column? spark_split has the same gap so this is consistent with what's there today, but if it's reachable it's probably worth treating the same as (Array, Array) after broadcasting.

  2. native/spark-expr/src/array_funcs/list_extract.rs — want to make sure I'm reading the cast change correctly. data_type(&schema) returns the child field's element type, so the previous code was casting the default value to the outer list type instead of the element type, which would either no-op into a typed null or fail for split_part's empty-string default. Switching to element_type lines up with how Spark's ElementAt(... Some(Literal.create("", str.dataType)) ...) wants the default typed. Is that the right reading?

  3. spark/src/test/resources/sql-tests/expressions/string/split_part.sql — coverage looks solid. Two nice-to-haves if you feel like adding them. A column-data row with a multi-character delimiter would exercise the path where str::split advances by more than one byte per match. And a fixture where the string contains regex metacharacters like * or . paired with a literal delimiter would assert at the SQL level what test_split_sql_keeps_regex_chars_literal asserts in Rust.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Support StringSplitSQL to enable split_part

2 participants