[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to split string by regular expression pattern#27577

Open
Myracle wants to merge 2 commits into apache:master from Myracle:FLINK-39064-REGEXP_SPLIT

@Myracle Myracle commented Feb 11, 2026

What is the purpose of the change

This pull request adds a new built-in function REGEXP_SPLIT to Flink SQL and Table API, which splits a string by a regular expression pattern and returns an array of substrings. This function is commonly available in other SQL engines (e.g., Spark, Presto, Hive) and provides users with more powerful string manipulation capabilities using regex patterns.

Brief change log

  • Added REGEXP_SPLIT function definition in BuiltInFunctionDefinitions with proper input/output type strategies
  • Implemented RegexpSplitFunction as a scalar function with regex pattern caching for performance optimization
  • Added regexpSplit() method to BaseExpressions for Table API support
  • Added comprehensive test cases in RegexpFunctionsITCase covering various scenarios including null handling, empty regex, invalid regex patterns, and edge cases

Verifying this change

This change added tests and can be verified as follows:

  • Added integration tests in RegexpFunctionsITCase that cover:
    • Basic regex split functionality (e.g., splitting by digit patterns [0-9]+)
    • Null input handling (both null string and null pattern)
    • Empty regex pattern (split by each character)
    • Multi-character delimiter regex patterns (e.g., [,;|])
    • Whitespace regex patterns (e.g., \\s+)
    • No match scenarios (returns original string as single-element array)
    • Invalid regex pattern handling (returns null)
    • Input validation errors for non-string type inputs
    • SQL signature validation errors

Does this pull request potentially affect one of the following parts:

  • Dependencies (does it add or upgrade a dependency): no
  • The public API, i.e., is any changed class annotated with @Public(Evolving): yes (BaseExpressions is @PublicEvolving, added regexpSplit() method)
  • The serializers: no
  • The runtime per-record code paths (performance sensitive): no (new function only, with pattern caching for optimization)
  • Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Kubernetes/Yarn, ZooKeeper: no
  • The S3 file system connector: no

Documentation

  • Does this pull request introduce a new feature? yes
  • If yes, how is the feature documented? JavaDocs (function usage examples are documented in RegexpSplitFunction class JavaDoc)

flinkbot commented Feb 11, 2026

CI report:

Bot commands
The @flinkbot bot supports the following commands:
  • @flinkbot run azure re-run the last Azure build

}

try {
// Cache the compiled pattern to improve performance
Contributor

Every other REGEXP_* function uses SqlFunctionUtils.getRegexpMatcher() which delegates to the shared REGEXP_PATTERN_CACHE (a ThreadLocalCache)

see for example https://github.com/apache/flink/blob/master/flink-table/flink-table-runtime/src/main/java/org/apache/flink/table/runtime/functions/scalar/RegexpSubstrFunction.java#L42
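
As a rough illustration of the idea behind a shared per-thread pattern cache, here is a minimal standalone sketch. It is not Flink's `ThreadLocalCache` or `REGEXP_PATTERN_CACHE`; the class name `PatternCacheSketch`, the method `get`, and the cache size of 64 are all assumptions made for this example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.regex.Pattern;

/** Hypothetical stand-in for a per-thread regex cache; not Flink's implementation. */
final class PatternCacheSketch {
    // Assumed capacity, chosen only for illustration.
    private static final int MAX_ENTRIES = 64;

    // Each thread owns its own access-ordered map, so lookups need no locking.
    private static final ThreadLocal<Map<String, Pattern>> CACHE =
            ThreadLocal.withInitial(
                    () ->
                            new LinkedHashMap<String, Pattern>(16, 0.75f, true) {
                                @Override
                                protected boolean removeEldestEntry(
                                        Map.Entry<String, Pattern> eldest) {
                                    // Evict the least recently used entry once the map is full.
                                    return size() > MAX_ENTRIES;
                                }
                            });

    /** Compiles the regex at most once per thread; an invalid regex throws PatternSyntaxException. */
    static Pattern get(String regex) {
        return CACHE.get().computeIfAbsent(regex, Pattern::compile);
    }

    public static void main(String[] args) {
        // The second lookup on the same thread returns the cached instance.
        System.out.println(get("[0-9]+") == get("[0-9]+"));
    }
}
```

Sharing one cache across all REGEXP_* functions means a pattern used by several functions in the same query is compiled only once per thread.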

Contributor Author

@nateab Thanks for taking the time to review. The suggestions are valuable, and I have updated the code accordingly.

import java.util.regex.PatternSyntaxException;

/**
* Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}.
Contributor

Also see https://issues.apache.org/jira/browse/FLINK-6810 for general instructions on what else you need to add when contributing built-in functions, for example which docs to add and what other considerations to make.

$("f0").regexpSplit("("),
"REGEXP_SPLIT(f0, '(')",
null,
DataTypes.ARRAY(DataTypes.STRING()).notNull())
Contributor

This seems inconsistent: the expected return value is null, but the expected type is declared as notNull()?

@github-actions github-actions Bot added the community-reviewed label (PR has been reviewed by the community.) Feb 11, 2026
@Myracle Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch from 7588cee to f3f463b Compare February 12, 2026 07:19
Contributor

@nateab nateab left a comment

Thanks for the fixes. Almost LGTM, just one comment.

return new GenericArrayData(result);
}

Pattern pattern = getRegexpPattern(regexStr);
Contributor

Nice, thanks for using SqlFunctionUtils. Is there a reason you added getRegexpPattern instead of just using the existing getRegexpMatcher?

Contributor Author

Thanks for the review!

The reason I added getRegexpPattern() instead of using getRegexpMatcher() is that REGEXP_SPLIT needs to call Pattern.split(str, -1), and the split() method is on the Pattern class, not the Matcher class.

The existing getRegexpMatcher() returns a Matcher object which is designed for matching operations like find(), group(), etc. - this works perfectly for other REGEXP_* functions like REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR that need to iterate through matches.

However, REGEXP_SPLIT doesn't need to iterate through matches - it needs to split the input string by the pattern, which requires direct access to the Pattern object.
That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:

Pattern pattern;
try {
    pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
} catch (PatternSyntaxException e) {
    return null;
}

Please let me know which approach you'd prefer:

  1. Keep getRegexpPattern() as a reusable utility (current approach) - could be useful for future functions that need direct Pattern access
  2. Inline the cache access directly in RegexpSplitFunction
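
To make the point about `Pattern.split(str, -1)` concrete, here is a minimal standalone sketch using only `java.util.regex`. The class name `SplitLimitSketch` and its helper `split` are hypothetical and not part of the PR. It shows how the limit argument controls whether trailing empty strings are kept.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

/** Hypothetical demo class; only java.util.regex behavior is shown here. */
final class SplitLimitSketch {
    /** Thin wrapper around Pattern.split so both limits can be compared side by side. */
    static String[] split(String input, String regex, int limit) {
        return Pattern.compile(regex).split(input, limit);
    }

    public static void main(String[] args) {
        // A negative limit keeps trailing empty strings.
        System.out.println(Arrays.toString(split("Hello123World456", "[0-9]+", -1)));
        // prints [Hello, World, ]

        // A limit of 0 discards trailing empty strings.
        System.out.println(Arrays.toString(split("Hello123World456", "[0-9]+", 0)));
        // prints [Hello, World]
    }
}
```

The first result matches the `['Hello', 'World', '']` example given in the function's documentation, which ends with an empty string because the input ends with a match.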

Contributor

Thanks for the thorough explanation! I agree that keeping it as a reusable utility could be helpful, so I'm good with that approach.

Contributor

@nateab nateab left a comment

LGTM!

@Myracle Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch from f3f463b to abfa6d6 Compare March 11, 2026 01:31
@Myracle Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch 2 times, most recently from 5cc157b to 8d75684 Compare March 19, 2026 11:02

Myracle commented Mar 20, 2026

@dylanhz Hello, can you help review this feature? Thanks very much.

Myracle commented Mar 25, 2026

@lincoln-lil Hello, can you help review this feature? Thanks very much.

Myracle commented Apr 9, 2026

@snuyanzin Hello, can you help review this feature? Thanks very much.

@Myracle Myracle force-pushed the FLINK-39064-REGEXP_SPLIT branch from 8d75684 to d84cb22 Compare April 9, 2026 11:13
Contributor

@raminqaf raminqaf left a comment

@Myracle Thanks for the contribution. Left some comments.

Comment on lines +423 to +436
>>>>>>> 8d75684590d (hotfix)
=======
Returns an `STRING` representation of the first matched substring. `NULL` if any of the arguments are `NULL` or regex is invalid or pattern is not found.
- sql: REGEXP_SPLIT(str, regex)
table: str.regexpSplit(regex)
description: |
Splits str by the regular expression regex and returns an array of strings.

E.g., REGEXP_SPLIT('Hello123World456', '[0-9]+') returns ['Hello', 'World', ''].

`str <CHAR | VARCHAR>, regex <CHAR | VARCHAR>`

Returns an `ARRAY<STRING>` of split substrings. `NULL` if any of the arguments are `NULL` or regex is invalid.
=======
Contributor

Can you please fix the docs? It seems the merge conflicts are not resolved correctly

Comment on lines +453 to +456
.inputTypeStrategy(
sequence(
logical(LogicalTypeFamily.CHARACTER_STRING),
logical(LogicalTypeFamily.CHARACTER_STRING)))
Contributor

@raminqaf raminqaf Jun 1, 2026

Please refer to REGEXP_EXTRACT (#28140) and REGEXP_REPLACE (#28189), which use the new validation logic at planning time. We can catch invalid regex patterns beforehand.
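
The general idea of rejecting an invalid literal pattern before any records are processed can be sketched as follows. This is only an assumption-laden illustration: it does not use Flink's planner API, and how #28140 and #28189 wire the check into type inference is not shown here. The class `RegexLiteralCheckSketch` and method `validate` are hypothetical names.

```java
import java.util.Optional;
import java.util.regex.Pattern;
import java.util.regex.PatternSyntaxException;

/** Hypothetical helper; not part of Flink. */
final class RegexLiteralCheckSketch {
    /** Returns an error message if the literal cannot be compiled, otherwise an empty Optional. */
    static Optional<String> validate(String regexLiteral) {
        try {
            Pattern.compile(regexLiteral);
            return Optional.empty();
        } catch (PatternSyntaxException e) {
            return Optional.of(
                    "Invalid regular expression '" + regexLiteral + "': " + e.getDescription());
        }
    }

    public static void main(String[] args) {
        // A valid pattern yields no error; an unbalanced parenthesis yields a message.
        System.out.println(validate("[0-9]+").isPresent());
        System.out.println(validate("(").isPresent());
    }
}
```

A check like this can only run early when the pattern is a literal; a pattern that comes from a column still has to be handled at runtime.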

Comment on lines +1366 to +1374
"""
Splits the string by the regular expression regex and returns an array of strings.
null if any of the arguments are null or regex is invalid.

E.g., regexp_split('Hello123World456', '[0-9]+') returns ['Hello', 'World', ''].

:param regex: A STRING expression with a matching pattern.
:return: An ARRAY<STRING> of split substrings.
"""
Contributor

Ideally this should mirror the JavaDocs in BaseExpressions

"REGEXP_SPLIT(f5, '[a-z]+')",
new String[] {"12345"},
DataTypes.ARRAY(DataTypes.STRING()).notNull())
// Invalid regex - return null
Contributor

Please add literal/non-literal invalid input tests.

String strValue = str.toString();
StringData[] result = new StringData[strValue.length()];
for (int i = 0; i < strValue.length(); i++) {
result[i] = StringData.fromString(String.valueOf(strValue.charAt(i)));
Contributor

Please have a look at this PR: #28264

That way, characters in the Supplementary Multilingual Plane (SMP) are split correctly. Maybe we can even extract this logic into a util.
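
The concern with a `charAt` loop is that a supplementary character is stored as two `char` values (a surrogate pair), so splitting per `char` cuts it in half. A minimal sketch of splitting per code point instead is shown below; it may differ from the approach taken in #28264, and the class `CodePointSplitSketch` and method `splitByCodePoint` are hypothetical names.

```java
/** Hypothetical helper illustrating a per-code-point split; not taken from the PR. */
final class CodePointSplitSketch {
    /** Splits a string into one element per Unicode code point, keeping surrogate pairs intact. */
    static String[] splitByCodePoint(String s) {
        return s.codePoints()
                .mapToObj(cp -> new String(Character.toChars(cp)))
                .toArray(String[]::new);
    }

    public static void main(String[] args) {
        // "a", U+1F600 (written as a surrogate pair), "b": four chars but three code points.
        String input = "a\uD83D\uDE00b";
        System.out.println(input.length());
        System.out.println(splitByCodePoint(input).length);
    }
}
```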

Comment on lines +501 to +504
public static @Nullable Pattern getRegexpPattern(@Nullable String regex) {
if (regex == null) {
return null;
}
Contributor

We can also make this non-null

Suggested change
public static @Nullable Pattern getRegexpPattern(@Nullable String regex) {
if (regex == null) {
return null;
}
public static @Nullable Pattern getRegexpPattern(String regex) {
