[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to split string by regular expression pattern#27577
[FLINK-39064][Table SQL / API] Add built-in REGEXP_SPLIT function to split string by regular expression pattern#27577Myracle wants to merge 2 commits into
Conversation
| } | ||
|
|
||
| try { | ||
| // Cache the compiled pattern to improve performance |
There was a problem hiding this comment.
Every other REGEXP_* function uses SqlFunctionUtils.getRegexpMatcher() which delegates to the shared REGEXP_PATTERN_CACHE (a ThreadLocalCache)
There was a problem hiding this comment.
@nateab Thanks for your time to review. The suggestions are valuable and I have modified the code.
| import java.util.regex.PatternSyntaxException; | ||
|
|
||
| /** | ||
| * Implementation of {@link BuiltInFunctionDefinitions#REGEXP_SPLIT}. |
There was a problem hiding this comment.
also see https://issues.apache.org/jira/browse/FLINK-6810 for general instructions on what else you need to add in order to contribute builtin functions, for example which docs to add, what other considerations to make
| $("f0").regexpSplit("("), | ||
| "REGEXP_SPLIT(f0, '(')", | ||
| null, | ||
| DataTypes.ARRAY(DataTypes.STRING()).notNull()) |
There was a problem hiding this comment.
this seems inconsistent, since we expect the return value to be null?
7588cee to
f3f463b
Compare
nateab
left a comment
There was a problem hiding this comment.
Thanks for the fixes, almost lgtm just one comment
| return new GenericArrayData(result); | ||
| } | ||
|
|
||
| Pattern pattern = getRegexpPattern(regexStr); |
There was a problem hiding this comment.
nice thanks for using the SqlFunctionUttils, but is there a reason you added getRegexpPattern instead of just using the existing getRegexpMatcher?
There was a problem hiding this comment.
Thanks for the review!
The reason I added getRegexpPattern() instead of using getRegexpMatcher() is that REGEXP_SPLIT needs to call Pattern.split(str, -1), and the split() method is on the Pattern class, not the Matcher class.
The existing getRegexpMatcher() returns a Matcher object which is designed for matching operations like find(), group(), etc. - this works perfectly for other REGEXP_* functions like REGEXP_SUBSTR, REGEXP_COUNT, REGEXP_INSTR that need to iterate through matches.
However, REGEXP_SPLIT doesn't need to iterate through matches - it needs to split the input string by the pattern, which requires direct access to the Pattern object.
That said, if you prefer, I could inline the cache access directly in RegexpSplitFunction to avoid adding a new utility method:
Pattern pattern;
try {
pattern = SqlFunctionUtils.REGEXP_PATTERN_CACHE.get(regexStr);
} catch (PatternSyntaxException e) {
return null;
}
Please let me know which approach you'd prefer:
- Keep getRegexpPattern() as a reusable utility (current approach) - could be useful for future functions that need direct Pattern access
- Inline the cache access directly in RegexpSplitFunction
There was a problem hiding this comment.
thanks for the thorough explanation! I agree that keeping it is as a reusable utility could be helpful so im good with that approach
f3f463b to
abfa6d6
Compare
5cc157b to
8d75684
Compare
|
@dylanhz Hello, can you help review this feature? Thanks very much. |
|
@lincoln-lil Hello, can you help review this feature? Thanks very much. |
|
@snuyanzin Hello, can you help review this feature? Thanks very much. |
…split string by regular expression pattern
8d75684 to
d84cb22
Compare
| >>>>>>> 8d75684590d (hotfix) | ||
| ======= | ||
| Returns an `STRING` representation of the first matched substring. `NULL` if any of the arguments are `NULL` or regex is invalid or pattern is not found. | ||
| - sql: REGEXP_SPLIT(str, regex) | ||
| table: str.regexpSplit(regex) | ||
| description: | | ||
| Splits str by the regular expression regex and returns an array of strings. | ||
|
|
||
| E.g., REGEXP_SPLIT('Hello123World456', '[0-9]+') returns ['Hello', 'World', '']. | ||
|
|
||
| `str <CHAR | VARCHAR>, regex <CHAR | VARCHAR>` | ||
|
|
||
| Returns an `ARRAY<STRING>` of split substrings. `NULL` if any of the arguments are `NULL` or regex is invalid. | ||
| ======= |
There was a problem hiding this comment.
Can you please fix the docs? It seems the merge conflicts are not resolved correctly
| .inputTypeStrategy( | ||
| sequence( | ||
| logical(LogicalTypeFamily.CHARACTER_STRING), | ||
| logical(LogicalTypeFamily.CHARACTER_STRING))) |
| """ | ||
| Splits the string by the regular expression regex and returns an array of strings. | ||
| null if any of the arguments are null or regex is invalid. | ||
|
|
||
| E.g., regexp_split('Hello123World456', '[0-9]+') returns ['Hello', 'World', '']. | ||
|
|
||
| :param regex: A STRING expression with a matching pattern. | ||
| :return: An ARRAY<STRING> of split substrings. | ||
| """ |
There was a problem hiding this comment.
Ideally this should mirror the JavaDocs in BaseExpressions
| "REGEXP_SPLIT(f5, '[a-z]+')", | ||
| new String[] {"12345"}, | ||
| DataTypes.ARRAY(DataTypes.STRING()).notNull()) | ||
| // Invalid regex - return null |
There was a problem hiding this comment.
Pleas add literal/non-literal invalid input tests
| String strValue = str.toString(); | ||
| StringData[] result = new StringData[strValue.length()]; | ||
| for (int i = 0; i < strValue.length(); i++) { | ||
| result[i] = StringData.fromString(String.valueOf(strValue.charAt(i))); |
There was a problem hiding this comment.
Please have a look at this PR: #28264
So you split the SMP correctly. Maybe we can even extract this logic into a util.
| public static @Nullable Pattern getRegexpPattern(@Nullable String regex) { | ||
| if (regex == null) { | ||
| return null; | ||
| } |
There was a problem hiding this comment.
We can also make this non-null
| public static @Nullable Pattern getRegexpPattern(@Nullable String regex) { | |
| if (regex == null) { | |
| return null; | |
| } | |
| public static @Nullable Pattern getRegexpPattern(String regex) { |
What is the purpose of the change
This pull request adds a new built-in function
REGEXP_SPLITto Flink SQL and Table API, which splits a string by a regular expression pattern and returns an array of substrings. This function is commonly available in other SQL engines (e.g., Spark, Presto, Hive) and provides users with more powerful string manipulation capabilities using regex patterns.Brief change log
REGEXP_SPLITfunction definition inBuiltInFunctionDefinitionswith proper input/output type strategiesRegexpSplitFunctionas a scalar function with regex pattern caching for performance optimizationregexpSplit()method toBaseExpressionsfor Table API supportRegexpFunctionsITCasecovering various scenarios including null handling, empty regex, invalid regex patterns, and edge casesVerifying this change
This change added tests and can be verified as follows:
RegexpFunctionsITCasethat cover:[0-9]+)[,;|])\\s+)Does this pull request potentially affect one of the following parts:
@Public(Evolving): yes (BaseExpressionsis@PublicEvolving, addedregexpSplit()method)Documentation
RegexpSplitFunctionclass JavaDoc)