feat: support StringSplitSQL for split_part#4592
Conversation
|
Thanks @michaelmitchell-bit. This looks good overall. StringSplitSQL is collation-aware in Spark and Comet does not support different collations yet, so it would be good to add a fallback for that case. See |
awesome, ty. added a fallback for non-utf8_binary collations and covered it in the split_part SQL file test. |
|
Thanks @michaelmitchell-bit could you fix conflicts? |
just resolved & updated w/ latest main |
|
A few observations from a review pass. Acknowledging up front that this comment was drafted with help from AI (Claude Code).
|
Which issue does this PR close?
Closes #4561.
Rationale for this change
split_partlowers toelement_at(StringSplitSQL(str, delimiter), partNum). Comet already supportselement_at, but the innerStringSplitSQLexpression was unsupported, sosplit_partfell back to Spark.What changes are included in this PR?
Adds native support for Spark 4.x
StringSplitSQLvia a literal-delimitersplit_sqlfunction. Unlikesplit/StringSplit, this splits on the delimiter literally rather than as a regex and keeps trailing empty parts.This also fixes
ListExtractdefault-value typing so defaults are cast to the array element type rather than the list type. That is needed forsplit_part, which uses an empty-string default throughelement_at.Updates expression support docs and adds SQL-file coverage for
split_part, including literal delimiters, dynamic delimiters, nulls, out-of-range defaults, negative positions, and invalid zero index behavior.How are these changes tested?
Ran:
cargo fmt -p datafusion-comet-spark-exprcargo test -p datafusion-comet-spark-expr string_funcs::splitcargo test -p datafusion-comet-spark-expr list_extractcargo build./mvnw test -Dtest=none -Dsuites="org.apache.comet.CometSqlFileTestSuite split_part" -Dscalastyle.skip=true./mvnw compile -Pspark-4.0 -DskipTests -Dscalastyle.skip=true./mvnw compile -Pspark-4.2 -DskipTests -Dscalastyle.skip=truegit diff --check