From 7ea31648791de9c6322af8c8ae40cbe4e713c2bc Mon Sep 17 00:00:00 2001
From: linzhenqi
Date: Fri, 13 Mar 2026 11:17:38 +0800
Subject: [PATCH] [Feature](func) Support REGEXP_EXTRACT_ALL_ARRAY

---
 .../regexp-extract-all-array.md            | 266 +++++++++++++++++
 .../string-functions/regexp-extract-all.md |   8 +-
 .../murmur-hash3-64-v2.md                  |   4 +-
 .../murmur-hash3-64.md                     |   2 +
 .../encrypt-digest-functions/xxhash-64.md  |   4 +
 .../regexp-extract-all-array.md            | 265 +++++++++++++++++
 .../string-functions/regexp-extract-all.md |  10 +-
 .../regexp/regexp-extract-all.md           |   2 +-
 .../regexp/regexp-extract-all.md           |   2 +-
 .../string-functions/regexp-extract-all.md |   8 +-
 .../murmur-hash3-64.md                     |   2 -
 .../encrypt-digest-functions/xxhash-64.md  |   4 -
 .../string-functions/regexp-extract-all.md |   8 +-
 .../regexp-extract-all-array.md            | 269 ++++++++++++++++++
 .../string-functions/regexp-extract-all.md |  10 +-
 .../regexp/regexp-extract-all.md           |   2 +-
 .../regexp/regexp-extract-all.md           |   2 +-
 .../string-functions/regexp-extract-all.md |   6 +-
 .../string-functions/regexp-extract-all.md |   6 +-
 .../regexp-extract-all-array.md            | 266 +++++++++++++++++
 .../string-functions/regexp-extract-all.md |   8 +-
 21 files changed, 1114 insertions(+), 40 deletions(-)
 create mode 100644 docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md
 rename i18n/zh-CN/docusaurus-plugin-content-docs/{version-3.x => current}/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md (91%)
 create mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md
 create mode 100644 i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md
 create mode 100644 versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md

diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md
new file mode 100644
index 0000000000000..12da8930c6567
--- /dev/null
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md
@@ -0,0 +1,266 @@
+---
+{
+    "title": "REGEXP_EXTRACT_ALL_ARRAY",
+    "language": "en",
+    "description": "The REGEXP_EXTRACT_ALL_ARRAY function performs regular expression matching and returns all values captured by the first sub-pattern as an array."
+}
+---
+
+## Description
+
+The `REGEXP_EXTRACT_ALL_ARRAY` function performs regular expression matching on a given string `str` and returns all values captured by the first sub-pattern of `pattern` as an array.
+
+If there is no match, or if the pattern has no sub-pattern, an empty array is returned.
+
+Default supported character match classes: https://github.com/google/re2/wiki/Syntax
+
+Doris supports enabling advanced regex features (such as look-around assertions) via session variable `enable_extended_regex` (default: `false`).
+
+When `enable_extended_regex=true`, supported syntax follows Boost.Regex: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html
+
+## Syntax
+
+```sql
+REGEXP_EXTRACT_ALL_ARRAY(<str>, <pattern>)
+```
+
+## Parameters
+
+| Parameter | Description |
+| -- | -- |
+| `<str>` | Input string for regex matching. |
+| `<pattern>` | Regex pattern. The first capturing group is used for extraction. |
+
+## Return value
+
+Returns `ARRAY`.
+
+If no matches are found, returns `[]`.
+
+If any parameter is NULL, returns NULL.
+
+**Default Behavior**:
+
+| Default Setting | Behavior |
+| ------------------------------------ | ----------------------------------------------------------------------------------------- |
+| `.` matches newline | `.` can match `\n` (newline) by default. |
+| Case-sensitive | Matching is case-sensitive. |
+| `^`/`$` match full string boundaries | `^` matches only the start of the string, `$` matches only the end, not line starts/ends. |
+| Greedy quantifiers | `*`, `+`, etc. match as much as possible by default. |
+| UTF-8 | Strings are processed as UTF-8. |
+
+**Pattern Modifiers**:
+
+You can override the default behavior by prefixing the `pattern` with `(?flags)`. Multiple modifiers can be combined, e.g., `(?im)`; a `-` prefix disables the corresponding option, e.g., `(?-s)`.
+
+Pattern modifiers only take effect when using the default regex engine. If `enable_extended_regex=true` is set and the pattern uses zero-width assertions (e.g., `(?<=...)`, `(?=...)`), the query is handled by the Boost.Regex engine, and modifiers may not behave as expected. Avoid mixing the two.
+
+| Flag | Meaning |
+| ------- | ---------------------------------------------------------------------------- |
+| `(?i)` | Case-insensitive matching |
+| `(?-i)` | Case-sensitive (default) |
+| `(?s)` | `.` matches newline (enabled by default) |
+| `(?-s)` | `.` does **not** match newline |
+| `(?m)` | Multiline mode: `^` matches start of each line, `$` matches end of each line |
+| `(?-m)` | Single-line mode: `^`/`$` match full string boundaries (default) |
+| `(?U)` | Non-greedy quantifiers: `*`, `+`, etc. match as little as possible |
+| `(?-U)` | Greedy quantifiers (default): `*`, `+`, etc. match as much as possible |
+
+
+## Example
+
+Basic matching of lowercase letters around 'C'.
+
+```sql
+SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') AS res;
++-------+
+| res   |
++-------+
+| ["b"] |
++-------+
+```
+
+```sql
+SELECT
+    array_size(
+        regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)')
+    ) AS res_size;
++----------+
+| res_size |
++----------+
+|        1 |
++----------+
+```
+
+Multiple matches in a string.
+
+```sql
+SELECT regexp_extract_all_array('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)') AS res;
++------------+
+| res        |
++------------+
+| ["b", "f"] |
++------------+
+```
+
+Extracting keys from key-value pairs.
+
+```sql
+SELECT regexp_extract_all_array('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS res;
++-----------------------+
+| res                   |
++-----------------------+
+| ["abc", "def", "ghi"] |
++-----------------------+
+```
+
+Matching Chinese characters.
+
+```sql
+SELECT regexp_extract_all_array('这是一段中文 This is a passage in English 1234567', '(\\p{Han}+)(.+)') AS res;
++------------------------+
+| res                    |
++------------------------+
+| ["这是一段中文"] |
++------------------------+
+```
+
+Inserting data and using REGEXP_EXTRACT_ALL_ARRAY.
+
+```sql
+CREATE TABLE test_regexp_extract_all_array (
+    id INT,
+    text_content VARCHAR(255),
+    pattern VARCHAR(255)
+) PROPERTIES ("replication_num"="1");
+
+INSERT INTO test_regexp_extract_all_array VALUES
+(1, 'apple1, banana2, cherry3', '([a-zA-Z]+)\\d'),
+(2, 'red#123, blue#456, green#789', '([a-zA-Z]+)#\\d+'),
+(3, 'hello@example.com, world@test.net', '([a-zA-Z]+)@');
+
+SELECT id, regexp_extract_all_array(text_content, pattern) AS extracted_data
+FROM test_regexp_extract_all_array;
++------+-------------------------------+
+| id   | extracted_data                |
++------+-------------------------------+
+|    1 | ["apple", "banana", "cherry"] |
+|    2 | ["red", "blue", "green"]      |
+|    3 | ["hello", "world"]            |
++------+-------------------------------+
+```
+
+No match: returns an empty array.
+
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)');
++-------------------------------------------+
+| REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)') |
++-------------------------------------------+
+| []                                        |
++-------------------------------------------+
+```
+
+Emoji match.
+
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('👩‍💻,👨‍🚀', '(💻|🚀)') AS res;
++------------------+
+| res              |
++------------------+
+| ["💻", "🚀"] |
++------------------+
+```
+
+'str' is NULL: returns NULL.
+
+```sql
+SELECT regexp_extract_all_array(NULL, '([a-z]+)') AS res;
++------+
+| res  |
++------+
+| NULL |
++------+
+```
+
+'pattern' is NULL: returns NULL.
+
+```sql
+SELECT regexp_extract_all_array('Hello World', NULL) AS res;
++------+
+| res  |
++------+
+| NULL |
++------+
+```
+
+All parameters are NULL: returns NULL.
+
+```sql
+SELECT regexp_extract_all_array(NULL, NULL) AS res;
++------+
+| res  |
++------+
+| NULL |
++------+
+```
+
+If `pattern` is not a valid regular expression, an error is thrown.
+
+```sql
+SELECT regexp_extract_all_array('hello (world) 123', '([[:alpha:]+') AS res;
+-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: ([[:alpha:]+. Error: missing ]: [[:alpha:]+. If you need advanced regex features, try setting enable_extended_regex=true
+```
+
+Advanced regex syntax (e.g., look-behind) is rejected by the default engine.
+
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)');
+-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: (?<=ID:)([A-Z]{2}-\d). Error: invalid perl operator: (?<. If you need advanced regex features, try setting enable_extended_regex=true
+```
+
+```sql
+SET enable_extended_regex = true;
+SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)') AS res;
++--------------------------+
+| res                      |
++--------------------------+
+| ["AA-1", "BB-2", "CC-3"] |
++--------------------------+
+```
+
+Pattern Modifiers
+
+Case-insensitive matching: `(?i)` makes the match ignore case
+
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(hello)') AS case_sensitive,
+       REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(?i)(hello)') AS case_insensitive;
++----------------+-----------------------------+
+| case_sensitive | case_insensitive            |
++----------------+-----------------------------+
+| ["hello"]      | ["Hello", "hello", "HELLO"] |
++----------------+-----------------------------+
+```
+
+Multiline mode: `(?m)` makes `^` and `$` match start/end of each line
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '^([a-z]+)') AS single_line,
+       REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '(?m)^([a-z]+)') AS multi_line;
++-------------+-----------------------+
+| single_line | multi_line            |
++-------------+-----------------------+
+| ["foo"]     | ["foo", "bar", "baz"] |
++-------------+-----------------------+
+```
+
+Greedy vs non-greedy: `(?U)` makes quantifiers match as little as possible
+```sql
+SELECT REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(a.*X)') AS greedy,
+       REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(?U)(a.*X)') AS non_greedy;
++------------+------------+
+| greedy     | non_greedy |
++------------+------------+
+| ["aXbXcX"] | ["aX"]     |
++------------+------------+
+```
diff --git a/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md
index f383cfa0966be..e3e7ffcfded38 100644
--- a/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md
+++ b/docs/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md
@@ -8,7 +8,7 @@
 
 ## Description
 
-The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. For the function to return an array of strings representing the matched parts of the pattern, the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned.
+The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub-pattern of the specified `pattern`. The function returns a string representing the matched part of the pattern, and the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub-patterns, an empty string is returned.
 
 It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages.
 
@@ -35,7 +35,9 @@ REGEXP_EXTRACT_ALL(<str>, <pattern>)
 
 ## Return value
 
-The function returns an array of strings that represent the parts of the input string that match the first sub - pattern of the specified regular expression. The return type is an array of String values. If no matches are found, or if the pattern has no sub - patterns, an empty array is returned.
+The function returns a string that represents the part of the input string that matches the first sub-pattern of the specified regular expression. The return type is String. If no matches are found, or if the pattern has no sub-patterns, an empty string is returned.
+
+For the array-returning variant, see [REGEXP_EXTRACT_ALL_ARRAY](./regexp-extract-all-array.md).
**Default Behavior**: @@ -88,7 +90,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is an array of the keys ['abc', 'def', 'ghi']. +Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is ['abc', 'def', 'ghi']. ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md similarity index 91% rename from i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md rename to i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md index 39f53b6f48344..589599a6f7497 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64-v2.md @@ -12,7 +12,7 @@ 与`MURMUR_HASH3_64`的区别是:此版本复用 MurmurHash3 的 128 位处理函数,仅输出第一个 64 位哈希值,与[标准库](https://mmh3.readthedocs.io/en/latest/api.html#mmh3.hash64)的行为保持一致。 --注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64_v2 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。 +-注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64_v2 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。如需更优的 
64 位 MurmurHash3 性能,可考虑使用 `murmur_hash3_64`。 ## 语法 @@ -44,4 +44,4 @@ select murmur_hash3_64_v2(null), murmur_hash3_64_v2("hello"), murmur_hash3_64_v2 +-----------------------+--------------------------+-----------------------------------+ | NULL | -3215607508166160593 | 3583109472027628045 | +-----------------------+--------------------------+-----------------------------------+ -``` +``` \ No newline at end of file diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md index 471c394e1121e..e15592047258e 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md @@ -10,6 +10,8 @@ 计算 64 位 murmur3 hash 值 +与`MURMUR_HASH3_64_V2`的区别是:此版本专门为 64 位输出优化,性能略优于 v2 版本, 但与[标准库](https://mmh3.readthedocs.io/en/latest/api.html#mmh3.hash64)实现不一致。 + -注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。 ## 语法 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md index e5aad81a8ac26..b1cf7e791ee2a 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md @@ -12,6 +12,10 @@ -注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 
值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。 +## 别名 + +- `XXHASH3_64` + ## 语法 ```sql diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md new file mode 100644 index 0000000000000..d4f8b9d3c2a55 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md @@ -0,0 +1,265 @@ +--- +{ + "title": "REGEXP_EXTRACT_ALL_ARRAY", + "language": "zh-CN", + "description": "REGEXP_EXTRACT_ALL_ARRAY 函数用于正则匹配并返回第一个子模式捕获到的全部结果数组。" +} +--- + +## 描述 + +`REGEXP_EXTRACT_ALL_ARRAY` 函数用于对字符串 `str` 执行正则匹配,并返回 `pattern` 中第一个子模式捕获到的全部结果数组。 + +如果没有匹配,或者模式中没有子模式,则返回空数组。 + +默认支持的字符匹配语法: https://github.com/google/re2/wiki/Syntax + +Doris 支持通过会话变量 `enable_extended_regex`(默认 `false`)启用高级正则能力(如 look-around 零宽断言)。 + +当 `enable_extended_regex=true` 时,支持的语法参见 Boost.Regex: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html + +## 语法 + +```sql +REGEXP_EXTRACT_ALL_ARRAY(, ) +``` + +## 参数 + +| 参数 | 描述 | +| -- | -- | +| `` | 待匹配的输入字符串。 | +| `` | 正则表达式。函数提取第一个捕获组的所有匹配结果。 | + +## 返回值 + +返回 `ARRAY`。 + +若未匹配到结果,返回 `[]`。 + +任一参数为 null, 返回 null + +**默认行为**: + +| 默认配置 | 行为说明 | +| -------------------------- | ----------------------------------------------------------------- | +| `.` 匹配换行符 | `.` 默认可以匹配 `\n`(换行符)。 | +| 大小写敏感 | 匹配时区分大小写。 | +| `^`/`$` 匹配整个字符串边界 | `^` 仅匹配字符串开头,`$` 仅匹配字符串结尾,而非每行的行首/行尾。 | +| 量词贪婪 | `*`、`+` 等量词默认尽可能多地匹配。 | +| UTF-8 | 字符串按 UTF-8 处理。 | + +**模式修饰符**: + +可通过在 `pattern` 前缀写入 `(?flags)` 来覆盖默认行为。多个修饰符可组合,如 `(?im)`;`-` 前缀表示关闭对应选项,如 `(?-s)`。 + +模式修饰符仅在使用默认正则引擎时生效。若启用了 `enable_extended_regex=true` 同时使用零宽断言(如 `(?<=...)`、`(?=...)`),查询将由 Boost.Regex 引擎处理,此时修饰符行为可能与预期不符,建议不要混合使用。 + +| 标志 | 含义 | +| ------- | 
-------------------------------------------- | +| `(?i)` | 大小写不敏感匹配 | +| `(?-i)` | 大小写敏感(默认) | +| `(?s)` | `.` 匹配换行符(默认已开启) | +| `(?-s)` | `.` 不匹配换行符 | +| `(?m)` | 多行模式:`^` 匹配每行行首,`$` 匹配每行行尾 | +| `(?-m)` | 单行模式:`^`/`$` 匹配整个字符串首尾(默认) | +| `(?U)` | 量词非贪婪:`*`、`+` 等尽可能少地匹配 | +| `(?-U)` | 量词贪婪(默认):`*`、`+` 等尽可能多地匹配 | + +## 示例 + +围绕 'C' 的小写字母基本匹配,在这个示例中,模式([[:lower:]]+)C([[:lower:]]+)匹配字符串中一个或多个小写字母后跟 'C' 再跟一个或多个小写字母的部分。'C' 之前的第一个子模式([[:lower:]]+)匹配 'b',因此结果为['b']。 + +```sql +SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++-------+ +| res | ++-------+ +| ["b"] | ++-------+ +``` + +返回类型为 Array +```sql +SELECT + array_size( + regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') + ) AS res_size; ++----------+ +| res_size | ++----------+ +| 1 | ++----------+ +``` + +字符串中的多个匹配项,在这里,模式在字符串中匹配两个部分。第一个匹配的第一个子模式匹配 'b',第二个匹配的第一个子模式匹配 'f'。因此结果为['b', 'f']。 + +```sql +SELECT regexp_extract_all_array('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++------------+ +| res | ++------------+ +| ["b", "f"] | ++------------+ +``` + +从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ['abc', 'def', 'ghi']。 + +```sql +SELECT regexp_extract_all_array('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS res; ++-----------------------+ +| res | ++-----------------------+ +| ["abc", "def", "ghi"] | ++-----------------------+ +``` + +匹配汉字, 模式(\p{Han}+)(.+)首先通过第一个子模式(\p{Han}+)匹配一个或多个汉字,因此结果为['这是一段中文']。 + +```sql +SELECT regexp_extract_all_array('这是一段中文 This is a passage in English 1234567', '(\\p{Han}+)(.+)') AS res; ++------------------------+ +| res | ++------------------------+ +| ["这是一段中文"] | ++------------------------+ +``` + +插入数据并使用 REGEXP_EXTRACT_ALL_ARRAY + +```sql +CREATE TABLE test_regexp_extract_all_array ( + id INT, + text_content VARCHAR(255), + pattern VARCHAR(255) +) PROPERTIES ("replication_num"="1"); + +INSERT INTO test_regexp_extract_all_array VALUES +(1, 'apple1, banana2, cherry3', '([a-zA-Z]+)\\d'), +(2, 'red#123, blue#456, 
green#789', '([a-zA-Z]+)#\\d+'), +(3, 'hello@example.com, world@test.net', '([a-zA-Z]+)@'); + +SELECT id, regexp_extract_all_array(text_content, pattern) AS extracted_data +FROM test_regexp_extract_all_array; ++------+-------------------------------+ +| id | extracted_data | ++------+-------------------------------+ +| 1 | ["apple", "banana", "cherry"] | +| 2 | ["red", "blue", "green"] | +| 3 | ["hello", "world"] | ++------+-------------------------------+ +``` + +没有匹配到,返回空数组 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)'); ++-------------------------------------------+ +| REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)') | ++-------------------------------------------+ +| [] | ++-------------------------------------------+ +``` +emoji 字符匹配 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('👩‍💻,👨‍🚀', '(💻|🚀)') AS res; ++------------------+ +| res | ++------------------+ +| ["💻", "🚀"] | ++------------------+ +``` + +'str' 是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array(NULL, '([a-z]+)') AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +'pattern' 是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array('Hello World', NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +全部参数都是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array(NULL, NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +如果 'pattern' 不是合法的正则表达式,则抛出错误 + +```sql +SELECT regexp_extract_all_array('hello (world) 123', '([[:alpha:]+') AS res; +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: ([[:alpha:]+. Error: missing ]: [[:alpha:]+. If you need advanced regex features, try setting enable_extended_regex=true +``` + +高级的正则表达式 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)'); +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: (?<=ID:)([A-Z]{2}-\d). Error: invalid perl operator: (?<. 
If you need advanced regex features, try setting enable_extended_regex=true +``` + +```sql +SET enable_extended_regex = true; +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)') AS res; ++--------------------------+ +| res | ++--------------------------+ +| ["AA-1", "BB-2", "CC-3"] | ++--------------------------+ +``` + +模式修饰符 + +大小写不敏感:`(?i)` 使匹配忽略大小写 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(hello)') AS case_sensitive, + REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(?i)(hello)') AS case_insensitive; ++----------------+-----------------------------+ +| case_sensitive | case_insensitive | ++----------------+-----------------------------+ +| ["hello"] | ["Hello", "hello", "HELLO"] | ++----------------+-----------------------------+ +``` + +多行模式:`(?m)` 使 `^` 和 `$` 匹配每行行首/行尾 +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '^([a-z]+)') AS single_line, + REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '(?m)^([a-z]+)') AS multi_line; ++-------------+-----------------------+ +| single_line | multi_line | ++-------------+-----------------------+ +| ["foo"] | ["foo", "bar", "baz"] | ++-------------+-----------------------+ +``` + +贪婪与非贪婪:`(?U)` 使量词尽可能少地匹配 +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(a.*X)') AS greedy, + REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(?U)(a.*X)') AS non_greedy; ++------------+------------+ +| greedy | non_greedy | ++------------+------------+ +| ["aXbXcX"] | ["aX"] | ++------------+------------+ +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index 1c679966a7daa..8916f695300d2 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/current/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -2,7 +2,7 @@ { "title": "REGEXP_EXTRACT_ALL", "language": "zh-CN", - "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," + "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," } --- @@ -27,7 +27,7 @@ under the License. ## 描述 -REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 +REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 需要注意的是,在处理字符集匹配时,应使用 Utf-8 标准字符类。这确保函数能够正确识别和处理来自不同语言的各种字符。 @@ -55,7 +55,9 @@ REGEXP_EXTRACT_ALL(, ) ## 返回值 -函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串数组。返回类型为 String 值数组。如果未找到匹配项,或模式没有子模式,则返回空数组。 +函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串。返回类型为 String。如果未找到匹配项,或模式没有子模式,则返回空字符串。 + +如需返回数组版本,请参见 [REGEXP_EXTRACT_ALL_ARRAY](./regexp-extract-all-array.md)。 **默认行为**: @@ -107,7 +109,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为键的数组['abc', 'def', 'ghi']。 +从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ['abc', 'def', 'ghi']。 ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md index ff7043162957a..dc1145287946e 100644 --- 
a/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md @@ -7,7 +7,7 @@ ## 描述 -对字符串 str 进行正则匹配,抽取符合 pattern 的所有子模式匹配部分。需要 pattern 完全匹配 str 中的某部分,这样才能返回 pattern 部分中需匹配部分的字符串数组。如果没有匹配或者pattern没有子模式,返回空字符串。 +对字符串 str 进行正则匹配,抽取符合 pattern 的所有子模式匹配部分。需要 pattern 完全匹配 str 中的某部分,这样才能返回 pattern 部分中需匹配部分的字符串。如果没有匹配或者pattern没有子模式,返回空字符串。 字符集匹配需要使用 Unicode 标准字符类型。例如,匹配中文请使用 `\p{Han}`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md index 39133cb58bce3..e08ae4b0e6e2b 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md @@ -7,7 +7,7 @@ ## 描述 -对字符串 str 进行正则匹配,抽取符合 pattern 的所有子模式匹配部分。需要 pattern 完全匹配 str 中的某部分,这样才能返回 pattern 部分中需匹配部分的字符串数组。如果没有匹配或者pattern没有子模式,返回空字符串。 +对字符串 str 进行正则匹配,抽取符合 pattern 的所有子模式匹配部分。需要 pattern 完全匹配 str 中的某部分,这样才能返回 pattern 部分中需匹配部分的字符串。如果没有匹配或者pattern没有子模式,返回空字符串。 字符集匹配需要使用 Unicode 标准字符类型。例如,匹配中文请使用 `\p{Han}`。 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index 13e6ae559ac9e..530bf772fc497 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -2,13 +2,13 @@ { "title": "REGEXP_EXTRACT_ALL", "language": "zh-CN", - "description": "REGEXP_EXTRACT_ALL 是用于字符串正则匹配的 SQL 函数,返回所有匹配结果中第一个子模式的字符串数组,支持 UTF-8 与 RE2 语法。" + "description": "REGEXP_EXTRACT_ALL 是用于字符串正则匹配的 SQL 函数,返回所有匹配结果中第一个子模式的字符串,支持 UTF-8 与 RE2 语法。" } --- ## 描述 -REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 +REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 需要注意的是,在处理字符集匹配时,应使用 Utf-8 标准字符类。这确保函数能够正确识别和处理来自不同语言的各种字符。 @@ -31,7 +31,7 @@ REGEXP_EXTRACT_ALL(, ) ## 返回值 -函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串数组。返回类型为 String 值数组。如果未找到匹配项,或模式没有子模式,则返回空数组。 +函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串。返回类型为 String。如果未找到匹配项,或模式没有子模式,则返回空字符串。 ## 例子 @@ -56,7 +56,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为键的数组['abc', 'def', 'ghi']。 +从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ['abc', 'def', 'ghi']。 ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md index e15592047258e..471c394e1121e 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md +++ 
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/murmur-hash3-64.md @@ -10,8 +10,6 @@ 计算 64 位 murmur3 hash 值 -与`MURMUR_HASH3_64_V2`的区别是:此版本专门为 64 位输出优化,性能略优于 v2 版本, 但与[标准库](https://mmh3.readthedocs.io/en/latest/api.html#mmh3.hash64)实现不一致。 - -注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。 ## 语法 diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md index 20da9de837455..3523391ba3e60 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/encrypt-digest-functions/xxhash-64.md @@ -12,10 +12,6 @@ -注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。 -## 别名 - -- `XXHASH3_64` - ## 语法 ```sql diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index a9b302024c96f..dbb5662903d63 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -2,7 +2,7 @@ { "title": "REGEXP_EXTRACT_ALL", "language": "zh-CN", - "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 
的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," + "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," } --- @@ -27,7 +27,7 @@ under the License. ## 描述 -REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 +REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 需要注意的是,在处理字符集匹配时,应使用 Utf-8 标准字符类。这确保函数能够正确识别和处理来自不同语言的各种字符。 @@ -50,7 +50,7 @@ REGEXP_EXTRACT_ALL(, ) ## 返回值 -函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串数组。返回类型为 String 值数组。如果未找到匹配项,或模式没有子模式,则返回空数组。 +函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串。返回类型为 String。如果未找到匹配项,或模式没有子模式,则返回空字符串。 **默认行为**: @@ -100,7 +100,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为键的数组['abc', 'def', 'ghi']。 +从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ['abc', 'def', 'ghi']。 ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md new file mode 100644 index 0000000000000..9fdd94ddcd5e3 --- /dev/null +++ b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md @@ -0,0 +1,269 @@ +--- +{ + "title": "REGEXP_EXTRACT_ALL_ARRAY", + "language": "zh-CN", + "description": "REGEXP_EXTRACT_ALL_ARRAY 函数用于正则匹配并返回第一个子模式捕获到的全部结果数组。" +} +--- + +## 描述 + +:::note +自4.1.0起支持 +::: + +`REGEXP_EXTRACT_ALL_ARRAY` 函数用于对字符串 `str` 
执行正则匹配,并返回 `pattern` 中第一个子模式捕获到的全部结果数组。 + +如果没有匹配,或者模式中没有子模式,则返回空数组。 + +默认支持的字符匹配语法: https://github.com/google/re2/wiki/Syntax + +Doris 支持通过会话变量 `enable_extended_regex`(默认 `false`)启用高级正则能力(如 look-around 零宽断言)。 + +当 `enable_extended_regex=true` 时,支持的语法参见 Boost.Regex: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html + +## 语法 + +```sql +REGEXP_EXTRACT_ALL_ARRAY(<str>, <pattern>) +``` + +## 参数 + +| 参数 | 描述 | +| -- | -- | +| `<str>` | 待匹配的输入字符串。 | +| `<pattern>` | 正则表达式。函数提取第一个捕获组的所有匹配结果。 | + +## 返回值 + +返回 `ARRAY`。 + +若未匹配到结果,返回 `[]`。 + +任一参数为 NULL,返回 NULL。 + +**默认行为**: + +| 默认配置 | 行为说明 | +| -------------------------- | ----------------------------------------------------------------- | +| `.` 匹配换行符 | `.` 默认可以匹配 `\n`(换行符)。 | +| 大小写敏感 | 匹配时区分大小写。 | +| `^`/`$` 匹配整个字符串边界 | `^` 仅匹配字符串开头,`$` 仅匹配字符串结尾,而非每行的行首/行尾。 | +| 量词贪婪 | `*`、`+` 等量词默认尽可能多地匹配。 | +| UTF-8 | 字符串按 UTF-8 处理。 | + +**模式修饰符**: + +可通过在 `pattern` 前缀写入 `(?flags)` 来覆盖默认行为。多个修饰符可组合,如 `(?im)`;`-` 前缀表示关闭对应选项,如 `(?-s)`。 + +模式修饰符仅在使用默认正则引擎时生效。若启用了 `enable_extended_regex=true` 同时使用零宽断言(如 `(?<=...)`、`(?=...)`),查询将由 Boost.Regex 引擎处理,此时修饰符行为可能与预期不符,建议不要混合使用。 + +| 标志 | 含义 | +| ------- | -------------------------------------------- | +| `(?i)` | 大小写不敏感匹配 | +| `(?-i)` | 大小写敏感(默认) | +| `(?s)` | `.` 匹配换行符(默认已开启) | +| `(?-s)` | `.` 不匹配换行符 | +| `(?m)` | 多行模式:`^` 匹配每行行首,`$` 匹配每行行尾 | +| `(?-m)` | 单行模式:`^`/`$` 匹配整个字符串首尾(默认) | +| `(?U)` | 量词非贪婪:`*`、`+` 等尽可能少地匹配 | +| `(?-U)` | 量词贪婪(默认):`*`、`+` 等尽可能多地匹配 | + +## 示例 + +围绕 'C' 的小写字母基本匹配,在这个示例中,模式([[:lower:]]+)C([[:lower:]]+)匹配字符串中一个或多个小写字母后跟 'C' 再跟一个或多个小写字母的部分。'C' 之前的第一个子模式([[:lower:]]+)匹配 'b',因此结果为 ["b"]。 + +```sql +SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++-------+ +| res | ++-------+ +| ["b"] | ++-------+ +``` + +返回类型为 Array +```sql +SELECT + array_size( + regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') + )AS res_size; ++----------+ +| res_size | ++----------+ +| 1 | ++----------+ +``` +
+字符串中的多个匹配项,在这里,模式在字符串中匹配两个部分。第一个匹配的第一个子模式匹配 'b',第二个匹配的第一个子模式匹配 'f'。因此结果为 ["b", "f"]。 + +```sql +SELECT regexp_extract_all_array('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++------------+ +| res | ++------------+ +| ["b", "f"] | ++------------+ +``` + +从键值对中提取键,该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ["abc", "def", "ghi"]。 + +```sql +SELECT regexp_extract_all_array('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS res; ++-----------------------+ +| res | ++-----------------------+ +| ["abc", "def", "ghi"] | ++-----------------------+ +``` + +匹配汉字,模式(\p{Han}+)(.+)首先通过第一个子模式(\p{Han}+)匹配一个或多个汉字,因此结果为 ["这是一段中文"]。 + +```sql +SELECT regexp_extract_all_array('这是一段中文 This is a passage in English 1234567', '(\\p{Han}+)(.+)') AS res; ++------------------------+ +| res | ++------------------------+ +| ["这是一段中文"] | ++------------------------+ +``` + +插入数据并使用 REGEXP_EXTRACT_ALL_ARRAY + +```sql +CREATE TABLE test_regexp_extract_all_array ( + id INT, + text_content VARCHAR(255), + pattern VARCHAR(255) +) PROPERTIES ("replication_num"="1"); + +INSERT INTO test_regexp_extract_all_array VALUES +(1, 'apple1, banana2, cherry3', '([a-zA-Z]+)\\d'), +(2, 'red#123, blue#456, green#789', '([a-zA-Z]+)#\\d+'), +(3, 'hello@example.com, world@test.net', '([a-zA-Z]+)@'); + +SELECT id, regexp_extract_all_array(text_content, pattern) AS extracted_data +FROM test_regexp_extract_all_array; ++------+-------------------------------+ +| id | extracted_data | ++------+-------------------------------+ +| 1 | ["apple", "banana", "cherry"] | +| 2 | ["red", "blue", "green"] | +| 3 | ["hello", "world"] | ++------+-------------------------------+ +``` + +没有匹配到,返回空数组 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)'); ++-------------------------------------------+ +| REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)') | ++-------------------------------------------+ +| [] | ++-------------------------------------------+ +``` +emoji 字符匹配 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('👩‍💻,👨‍🚀', '(💻|🚀)') AS
res; ++------------------+ +| res | ++------------------+ +| ["💻", "🚀"] | ++------------------+ +``` + +'str' 是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array(NULL, '([a-z]+)') AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +'pattern' 是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array('Hello World', NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +全部参数都是 NULL,返回 NULL + +```sql +SELECT regexp_extract_all_array(NULL, NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +如果 'pattern' 不是合法的正则表达式,则抛出错误 + +```sql +SELECT regexp_extract_all_array('hello (world) 123', '([[:alpha:]+') AS res; +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: ([[:alpha:]+. Error: missing ]: [[:alpha:]+. If you need advanced regex features, try setting enable_extended_regex=true +``` + +高级正则表达式:默认引擎不支持零宽断言,需要启用 `enable_extended_regex` + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)'); +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: (?<=ID:)([A-Z]{2}-\d). Error: invalid perl operator: (?<.
If you need advanced regex features, try setting enable_extended_regex=true +``` + +```sql +SET enable_extended_regex = true; +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)') AS res; ++--------------------------+ +| res | ++--------------------------+ +| ["AA-1", "BB-2", "CC-3"] | ++--------------------------+ +``` + +模式修饰符 + +大小写不敏感:`(?i)` 使匹配忽略大小写 + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(hello)') AS case_sensitive, + REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(?i)(hello)') AS case_insensitive; ++----------------+-----------------------------+ +| case_sensitive | case_insensitive | ++----------------+-----------------------------+ +| ["hello"] | ["Hello", "hello", "HELLO"] | ++----------------+-----------------------------+ +``` + +多行模式:`(?m)` 使 `^` 和 `$` 匹配每行行首/行尾 +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '^([a-z]+)') AS single_line, + REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '(?m)^([a-z]+)') AS multi_line; ++-------------+-----------------------+ +| single_line | multi_line | ++-------------+-----------------------+ +| ["foo"] | ["foo", "bar", "baz"] | ++-------------+-----------------------+ +``` + +贪婪与非贪婪:`(?U)` 使量词尽可能少地匹配 +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(a.*X)') AS greedy, + REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(?U)(a.*X)') AS non_greedy; ++------------+------------+ +| greedy | non_greedy | ++------------+------------+ +| ["aXbXcX"] | ["aX"] | ++------------+------------+ +``` diff --git a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index a9b302024c96f..e47cf839f976c 100644 --- a/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++
b/i18n/zh-CN/docusaurus-plugin-content-docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -2,7 +2,7 @@ { "title": "REGEXP_EXTRACT_ALL", "language": "zh-CN", - "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," + "description": "REGEXPEXTRACTALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式," } --- @@ -27,7 +27,7 @@ under the License. ## 描述 -REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。为了使函数返回表示模式匹配部分的字符串数组,该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 +REGEXP_EXTRACT_ALL 函数用于对给定字符串str执行正则表达式匹配,所有与指定 pattern 匹配的文本串当中的与第一个子模式匹配的部分。函数返回表示模式匹配部分的字符串,且该模式必须与输入字符串 str 的一部分完全匹配。如果没有匹配项,或模式不包含任何子模式,则返回空字符串。 需要注意的是,在处理字符集匹配时,应使用 Utf-8 标准字符类。这确保函数能够正确识别和处理来自不同语言的各种字符。 @@ -50,7 +50,9 @@ REGEXP_EXTRACT_ALL(, ) ## 返回值 -函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串数组。返回类型为 String 值数组。如果未找到匹配项,或模式没有子模式,则返回空数组。 +函数返回表示输入字符串中与指定正则表达式的第一个子模式匹配部分的字符串。返回类型为 String。如果未找到匹配项,或模式没有子模式,则返回空字符串。 + +如需返回数组版本,请参见 [REGEXP_EXTRACT_ALL_ARRAY](./regexp-extract-all-array.md)。 **默认行为**: @@ -100,7 +102,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为键的数组['abc', 'def', 'ghi']。 +从键值对中提取键, 该模式匹配字符串中的键值对。第一个子模式捕获键,因此结果为 ['abc', 'def', 'ghi']。 ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md index 81dbac78bcd5c..2df2edd4aea03 100644 --- a/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md 
+++ b/versioned_docs/version-1.2/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md @@ -7,7 +7,7 @@ ## Description -Regularly matches a string `str` and extracts the first sub-pattern matching part of `pattern`. The pattern needs to exactly match a part of `str` in order to return an array of strings for the part of the pattern that needs to be matched. If there is no match or the pattern has no sub-pattern, the empty string is returned. +Regularly matches a string `str` and extracts the first sub-pattern matching part of `pattern`. The pattern needs to exactly match a part of `str` in order to return a string for the part of the pattern that needs to be matched. If there is no match or the pattern has no sub-pattern, the empty string is returned. - Character set matching requires the use of Unicode standard character classes. For example, to match Chinese, use `\p{Han}`. diff --git a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md index 81dbac78bcd5c..2df2edd4aea03 100644 --- a/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md +++ b/versioned_docs/version-2.0/sql-manual/sql-functions/string-functions/regexp/regexp-extract-all.md @@ -7,7 +7,7 @@ ## Description -Regularly matches a string `str` and extracts the first sub-pattern matching part of `pattern`. The pattern needs to exactly match a part of `str` in order to return an array of strings for the part of the pattern that needs to be matched. If there is no match or the pattern has no sub-pattern, the empty string is returned. +Regularly matches a string `str` and extracts the first sub-pattern matching part of `pattern`. The pattern needs to exactly match a part of `str` in order to return a string for the part of the pattern that needs to be matched. 
If there is no match or the pattern has no sub-pattern, the empty string is returned. - Character set matching requires the use of Unicode standard character classes. For example, to match Chinese, use `\p{Han}`. diff --git a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index 204e3756e054b..7998bd890c188 100644 --- a/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ b/versioned_docs/version-2.1/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -8,7 +8,7 @@ ## Description -The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. For the function to return an array of strings representing the matched parts of the pattern, the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. +The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. The function returns a string representing the matched part of the pattern, and the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages. 
@@ -31,7 +31,7 @@ REGEXP_EXTRACT_ALL(, ) ## Return value -The function returns an array of strings that represent the parts of the input string that match the first sub - pattern of the specified regular expression. The return type is an array of String values. If no matches are found, or if the pattern has no sub - patterns, an empty array is returned. +The function returns a string that represents the part of the input string that matches the first sub - pattern of the specified regular expression. The return type is String. If no matches are found, or if the pattern has no sub - patterns, an empty string is returned. ## Example @@ -57,7 +57,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is an array of the keys ['abc', 'def', 'ghi']. +Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is ['abc', 'def', 'ghi']. 
```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index b14d3f0dee046..b535f334d67ff 100644 --- a/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ b/versioned_docs/version-3.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -8,7 +8,7 @@ ## Description -The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. For the function to return an array of strings representing the matched parts of the pattern, the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. +The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. The function returns a string representing the matched part of the pattern, and the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages. @@ -31,7 +31,7 @@ REGEXP_EXTRACT_ALL(, ) ## Return value -The function returns an array of strings that represent the parts of the input string that match the first sub - pattern of the specified regular expression. 
The return type is an array of String values. If no matches are found, or if the pattern has no sub - patterns, an empty array is returned. +The function returns a string that represents the part of the input string that matches the first sub - pattern of the specified regular expression. The return type is String. If no matches are found, or if the pattern has no sub - patterns, an empty string is returned. **Default Behavior**: @@ -82,7 +82,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is an array of the keys ['abc', 'def', 'ghi']. +Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is ['abc', 'def', 'ghi']. ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)'); diff --git a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md new file mode 100644 index 0000000000000..12da8930c6567 --- /dev/null +++ b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all-array.md @@ -0,0 +1,266 @@ +--- +{ + "title": "REGEXP_EXTRACT_ALL_ARRAY", + "language": "en", + "description": "The REGEXP_EXTRACT_ALL_ARRAY function performs regular expression matching and returns all values captured by the first sub-pattern as an array." +} +--- + +## Description + +The `REGEXP_EXTRACT_ALL_ARRAY` function performs regular expression matching on a given string `str` and returns all values captured by the first sub-pattern of `pattern` as an array. 
+ +If there is no match, or if the pattern has no sub-pattern, an empty array is returned. + +By default, the supported regex syntax follows RE2: https://github.com/google/re2/wiki/Syntax + +Doris supports enabling advanced regex features (such as look-around assertions) via the session variable `enable_extended_regex` (default: `false`). + +When `enable_extended_regex=true`, supported syntax follows Boost.Regex: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html + +## Syntax + +```sql +REGEXP_EXTRACT_ALL_ARRAY(<str>, <pattern>) +``` + +## Parameters + +| Parameter | Description | +| -- | -- | +| `<str>` | Input string for regex matching. | +| `<pattern>` | Regex pattern. The first capturing group is used for extraction. | + +## Return value + +Returns `ARRAY`. + +If no matches are found, returns `[]`. + +If any parameter is NULL, returns NULL. + +**Default Behavior**: + +| Default Setting | Behavior | +| ------------------------------------ | ----------------------------------------------------------------------------------------- | +| `.` matches newline | `.` can match `\n` (newline) by default. | +| Case-sensitive | Matching is case-sensitive. | +| `^`/`$` match full string boundaries | `^` matches only the start of the string, `$` matches only the end, not line starts/ends. | +| Greedy quantifiers | `*`, `+`, etc. match as much as possible by default. | +| UTF-8 | Strings are processed as UTF-8. | + +**Pattern Modifiers**: + +You can override the default behavior by prefixing the `pattern` with `(?flags)`. Multiple modifiers can be combined, e.g., `(?im)`; a `-` prefix disables the corresponding option, e.g., `(?-s)`. + +Pattern modifiers only take effect when using the default regex engine. If `enable_extended_regex=true` is set and the pattern uses zero-width assertions (e.g., `(?<=...)`, `(?=...)`), the query is handled by the Boost.Regex engine, and modifiers may not behave as expected. Mixing the two is not recommended.
+ +| Flag | Meaning | +| ------- | ---------------------------------------------------------------------------- | +| `(?i)` | Case-insensitive matching | +| `(?-i)` | Case-sensitive (default) | +| `(?s)` | `.` matches newline (enabled by default) | +| `(?-s)` | `.` does **not** match newline | +| `(?m)` | Multiline mode: `^` matches start of each line, `$` matches end of each line | +| `(?-m)` | Single-line mode: `^`/`$` match full string boundaries (default) | +| `(?U)` | Non-greedy quantifiers: `*`, `+`, etc. match as little as possible | +| `(?-U)` | Greedy quantifiers (default): `*`, `+`, etc. match as much as possible | + + +## Example + +Basic matching of lowercase letters around 'C'. The first sub-pattern `([[:lower:]]+)` before 'C' captures 'b', so the result is `["b"]`. + +```sql +SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++-------+ +| res | ++-------+ +| ["b"] | ++-------+ +``` + +```sql +SELECT + array_size( + regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') + )AS res_size; ++----------+ +| res_size | ++----------+ +| 1 | ++----------+ +``` + +Multiple matches in a string. The pattern matches two parts of the string; the first sub-pattern captures 'b' in the first match and 'f' in the second. + +```sql +SELECT regexp_extract_all_array('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)') AS res; ++------------+ +| res | ++------------+ +| ["b", "f"] | ++------------+ +``` + +Extracting keys from key-value pairs. The first sub-pattern captures the key of each pair. + +```sql +SELECT regexp_extract_all_array('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS res; ++-----------------------+ +| res | ++-----------------------+ +| ["abc", "def", "ghi"] | ++-----------------------+ +``` + +Matching Chinese characters. The first sub-pattern `(\p{Han}+)` captures one or more Han characters. 
+ +```sql +SELECT regexp_extract_all_array('这是一段中文 This is a passage in English 1234567', '(\\p{Han}+)(.+)') AS res; ++------------------------+ +| res | ++------------------------+ +| ["这是一段中文"] | ++------------------------+ +``` + +Inserting data and using REGEXP_EXTRACT_ALL_ARRAY.
+ +```sql +CREATE TABLE test_regexp_extract_all_array ( + id INT, + text_content VARCHAR(255), + pattern VARCHAR(255) +) PROPERTIES ("replication_num"="1"); + +INSERT INTO test_regexp_extract_all_array VALUES +(1, 'apple1, banana2, cherry3', '([a-zA-Z]+)\\d'), +(2, 'red#123, blue#456, green#789', '([a-zA-Z]+)#\\d+'), +(3, 'hello@example.com, world@test.net', '([a-zA-Z]+)@'); + +SELECT id, regexp_extract_all_array(text_content, pattern) AS extracted_data +FROM test_regexp_extract_all_array; ++------+-------------------------------+ +| id | extracted_data | ++------+-------------------------------+ +| 1 | ["apple", "banana", "cherry"] | +| 2 | ["red", "blue", "green"] | +| 3 | ["hello", "world"] | ++------+-------------------------------+ +``` + +No match found; an empty array is returned. + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)'); ++-------------------------------------------+ +| REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)') | ++-------------------------------------------+ +| [] | ++-------------------------------------------+ +``` + +Matching emoji characters. + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('👩‍💻,👨‍🚀', '(💻|🚀)') AS res; ++------------------+ +| res | ++------------------+ +| ["💻", "🚀"] | ++------------------+ +``` + +If `str` is NULL, NULL is returned. + +```sql +SELECT regexp_extract_all_array(NULL, '([a-z]+)') AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +If `pattern` is NULL, NULL is returned. + +```sql +SELECT regexp_extract_all_array('Hello World', NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +If all parameters are NULL, NULL is returned. + +```sql +SELECT regexp_extract_all_array(NULL, NULL) AS res; ++------+ +| res | ++------+ +| NULL | ++------+ +``` + +If `pattern` is not a valid regular expression, an error is thrown. 
+ +```sql +SELECT regexp_extract_all_array('hello (world) 123', '([[:alpha:]+') AS res; +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: ([[:alpha:]+.
Error: missing ]: [[:alpha:]+. If you need advanced regex features, try setting enable_extended_regex=true +``` + +Advanced regex features such as look-behind are rejected by the default engine and require `enable_extended_regex=true`. + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)'); +-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: (?<=ID:)([A-Z]{2}-\d). Error: invalid perl operator: (?<. If you need advanced regex features, try setting enable_extended_regex=true +``` + +```sql +SET enable_extended_regex = true; +SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)') AS res; ++--------------------------+ +| res | ++--------------------------+ +| ["AA-1", "BB-2", "CC-3"] | ++--------------------------+ +``` + +Pattern Modifiers + +Case-insensitive matching: `(?i)` makes the match ignore case. + +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(hello)') AS case_sensitive, + REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(?i)(hello)') AS case_insensitive; ++----------------+-----------------------------+ +| case_sensitive | case_insensitive | ++----------------+-----------------------------+ +| ["hello"] | ["Hello", "hello", "HELLO"] | ++----------------+-----------------------------+ +``` + +Multiline mode: `(?m)` makes `^` and `$` match the start/end of each line. +```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '^([a-z]+)') AS single_line, + REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '(?m)^([a-z]+)') AS multi_line; ++-------------+-----------------------+ +| single_line | multi_line | ++-------------+-----------------------+ +| ["foo"] | ["foo", "bar", "baz"] | ++-------------+-----------------------+ +``` + +Greedy vs non-greedy: `(?U)` makes quantifiers match as little as possible. 
+```sql +SELECT REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(a.*X)') AS greedy, + REGEXP_EXTRACT_ALL_ARRAY('aXbXcXd', '(?U)(a.*X)') AS non_greedy; ++----------+------------+ +| greedy | non_greedy | ++----------+------------+ +| ["aXbXcX"] | ["aX"] |
++----------+------------+ +``` diff --git a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md index b14d3f0dee046..6f639669a511e 100644 --- a/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md +++ b/versioned_docs/version-4.x/sql-manual/sql-functions/scalar-functions/string-functions/regexp-extract-all.md @@ -8,7 +8,7 @@ ## Description -The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. For the function to return an array of strings representing the matched parts of the pattern, the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. +The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. The function returns a string representing the matched part of the pattern, and the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned. It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages. @@ -31,7 +31,9 @@ REGEXP_EXTRACT_ALL(, ) ## Return value -The function returns an array of strings that represent the parts of the input string that match the first sub - pattern of the specified regular expression. The return type is an array of String values. 
If no matches are found, or if the pattern has no sub - patterns, an empty array is returned. +The function returns a string that represents the part of the input string that matches the first sub - pattern of the specified regular expression. The return type is String. If no matches are found, or if the pattern has no sub - patterns, an empty string is returned. + +For the array-returning variant, see [REGEXP_EXTRACT_ALL_ARRAY](./regexp-extract-all-array.md). **Default Behavior**: @@ -82,7 +84,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)'); +-----------------------------------------------------------------+ ``` -Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is an array of the keys ['abc', 'def', 'ghi']. +Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is ['abc', 'def', 'ghi']. ```sql mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)');