Skip to content
Open
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
Original file line number Diff line number Diff line change
@@ -0,0 +1,266 @@
---
{
"title": "REGEXP_EXTRACT_ALL_ARRAY",
"language": "en",
"description": "The REGEXP_EXTRACT_ALL_ARRAY function performs regular expression matching and returns all values captured by the first sub-pattern as an array."
}
---

## Description

The `REGEXP_EXTRACT_ALL_ARRAY` function performs regular expression matching on a given string `str` and returns all values captured by the first sub-pattern of `pattern` as an array.

If there is no match, or if the pattern has no sub-pattern, an empty array is returned.

Default supported character match classes: https://github.com/google/re2/wiki/Syntax

Doris supports enabling advanced regex features (such as look-around assertions) via session variable `enable_extended_regex` (default: `false`).

When `enable_extended_regex=true`, supported syntax follows Boost.Regex: https://www.boost.org/doc/libs/latest/libs/regex/doc/html/boost_regex/syntax/perl_syntax.html

## Syntax

```sql
REGEXP_EXTRACT_ALL_ARRAY(<str>, <pattern>)
```

## Parameters

| Parameter | Description |
| -- | -- |
| `<str>` | Input string for regex matching. |
| `<pattern>` | Regex pattern. The first capturing group is used for extraction. |

## Return value

Returns `ARRAY<STRING>`.

If no matches are found, returns `[]`.

If any parameter is NULL, return NULL

**Default Behavior**:

| Default Setting | Behavior |
| ------------------------------------ | ----------------------------------------------------------------------------------------- |
| `.` matches newline | `.` can match `\n` (newline) by default. |
| Case-sensitive | Matching is case-sensitive. |
| `^`/`$` match full string boundaries | `^` matches only the start of the string, `$` matches only the end, not line starts/ends. |
| Greedy quantifiers | `*`, `+`, etc. match as much as possible by default. |
| UTF-8 | Strings are processed as UTF-8. |

**Pattern Modifiers**:

You can override the default behavior by prefixing the `pattern` with `(?flags)`. Multiple modifiers can be combined, e.g., `(?im)`; a `-` prefix disables the corresponding option, e.g., `(?-s)`.

Pattern modifiers only take effect when using the default regex engine. If `enable_extended_regex=true` is enabled while using zero-width assertions (e.g., `(?<=...)`, `(?=...)`), the query will be handled by the Boost.Regex engine, and modifier behavior may not work as expected. It is recommended not to mix them.

| Flag | Meaning |
| ------- | ---------------------------------------------------------------------------- |
| `(?i)` | Case-insensitive matching |
| `(?-i)` | Case-sensitive (default) |
| `(?s)` | `.` matches newline (enabled by default) |
| `(?-s)` | `.` does **not** match newline |
| `(?m)` | Multiline mode: `^` matches start of each line, `$` matches end of each line |
| `(?-m)` | Single-line mode: `^`/`$` match full string boundaries (default) |
| `(?U)` | Non-greedy quantifiers: `*`, `+`, etc. match as little as possible |
| `(?-U)` | Greedy quantifiers (default): `*`, `+`, etc. match as much as possible |


## Example

Basic matching of lowercase letters around 'C'.

```sql
SELECT regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)') AS res;
+-------+
| res |
+-------+
| ["b"] |
+-------+
```

```sql
SELECT
array_size(
regexp_extract_all_array('AbCdE', '([[:lower:]]+)C([[:lower:]]+)')
)AS res_size;
+----------+
| res_size |
+----------+
| 1 |
+----------+
```

Multiple matches in a string.

```sql
SELECT regexp_extract_all_array('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)') AS res;
+------------+
| res |
+------------+
| ["b", "f"] |
+------------+
```

Extracting keys from key - value pairs.

```sql
SELECT regexp_extract_all_array('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)') AS res;
+-----------------------+
| res |
+-----------------------+
| ["abc", "def", "ghi"] |
+-----------------------+
```

Matching Chinese characters.

```sql
SELECT regexp_extract_all_array('这是一段中文 This is a passage in English 1234567', '(\\p{Han}+)(.+)') AS res;
+------------------------+
| res |
+------------------------+
| ["这是一段中文"] |
+------------------------+
```

Inserting data and using REGEXP_EXTRACT_ALL_ARRAY.

```sql
CREATE TABLE test_regexp_extract_all_array (
id INT,
text_content VARCHAR(255),
pattern VARCHAR(255)
) PROPERTIES ("replication_num"="1");

INSERT INTO test_regexp_extract_all_array VALUES
(1, 'apple1, banana2, cherry3', '([a-zA-Z]+)\\d'),
(2, 'red#123, blue#456, green#789', '([a-zA-Z]+)#\\d+'),
(3, 'hello@example.com, world@test.net', '([a-zA-Z]+)@');

SELECT id, regexp_extract_all_array(text_content, pattern) AS extracted_data
FROM test_regexp_extract_all_array;
+------+-------------------------------+
| id | extracted_data |
+------+-------------------------------+
| 1 | ["apple", "banana", "cherry"] |
| 2 | ["red", "blue", "green"] |
| 3 | ["hello", "world"] |
+------+-------------------------------+
```

No matched, return empty array.

```sql
SELECT REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)');
+-------------------------------------------+
| REGEXP_EXTRACT_ALL_ARRAY('ABC', '(\\d+)') |
+-------------------------------------------+
| [] |
+-------------------------------------------+
```

Emoji match.

```sql
SELECT REGEXP_EXTRACT_ALL_ARRAY('👩‍💻,👨‍🚀', '(💻|🚀)') AS res;
+------------------+
| res |
+------------------+
| ["💻", "🚀"] |
+------------------+
```

'str' is NULL, return NULL.

```sql
SELECT regexp_extract_all_array(NULL, '([a-z]+)') AS res;
+------+
| res |
+------+
| NULL |
+------+
```

'pattern' is NULL, return NULL.

```sql
SELECT regexp_extract_all_array('Hello World', NULL) AS res;
+------+
| res |
+------+
| NULL |
+------+
```

All parameters are NULL, return NULL.

```sql
SELECT regexp_extract_all_array(NULL, NULL) AS res;
+------+
| res |
+------+
| NULL |
+------+
```

If the `pattern` is not allowed regexp regular, throw error.

```sql
SELECT regexp_extract_all_array('hello (world) 123', '([[:alpha:]+') AS res;
-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: ([[:alpha:]+. Error: missing ]: [[:alpha:]+. If you need advanced regex features, try setting enable_extended_regex=true
```

Advanced regexp.

```sql
SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)');
-- ERROR 1105 (HY000): errCode = 2, detailMessage = (127.0.0.1)[INVALID_ARGUMENT]Invalid regex pattern: (?<=ID:)([A-Z]{2}-\d). Error: invalid perl operator: (?<. If you need advanced regex features, try setting enable_extended_regex=true
```

```sql
SET enable_extended_regex = true;
SELECT REGEXP_EXTRACT_ALL_ARRAY('ID:AA-1,ID:BB-2,ID:CC-3', '(?<=ID:)([A-Z]{2}-\\d)') AS res;
+--------------------------+
| res |
+--------------------------+
| ["AA-1", "BB-2", "CC-3"] |
+--------------------------+
```

Pattern Modifiers

Case-insensitive matching: `(?i)` makes the match ignore case

```sql
SELECT REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(hello)') AS case_sensitive,
REGEXP_EXTRACT_ALL_ARRAY('Hello hello HELLO', '(?i)(hello)') AS case_insensitive;
+----------------+-----------------------------+
| case_sensitive | case_insensitive |
+----------------+-----------------------------+
| ["hello"] | ["Hello", "hello", "HELLO"] |
+----------------+-----------------------------+
```

Multiline mode: `(?m)` makes `^` and `$` match start/end of each line
```sql
SELECT REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '^([a-z]+)') AS single_line,
REGEXP_EXTRACT_ALL_ARRAY('foo\nbar\nbaz', '(?m)^([a-z]+)') AS multi_line;
+-------------+---------------------+
| single_line | multi_line |
+-------------+---------------------+
| ['foo'] | ['foo','bar','baz'] |
+-------------+---------------------+
```

Greedy vs non-greedy: `(?U)` makes quantifiers match as little as possible
```sql
SELECT REGEXP_EXTRACT_ALL('aXbXcXd', '(a.*X)') AS greedy,
REGEXP_EXTRACT_ALL('aXbXcXd', '(?U)(a.*X)') AS non_greedy;
+----------+------------+
| greedy | non_greedy |
+----------+------------+
| ['aXbXcX'] | ['aX'] |
+----------+------------+
```
Original file line number Diff line number Diff line change
Expand Up @@ -8,7 +8,7 @@

## Description

The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. For the function to return an array of strings representing the matched parts of the pattern, the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned.
The `REGEXP_EXTRACT_ALL` function is used to perform a regular expression match on a given string `str` and extract all the parts that match the first sub - pattern of the specified `pattern`. The function returns a string representing the matched part of the pattern, and the pattern must exactly match a portion of the input string `str`. If there is no match, or if the pattern does not contain any sub - patterns, an empty string is returned.

It should be noted that when handling character set matching, Utf-8 standard character classes should be used. This ensures that functions can correctly identify and process various characters from different languages.

Expand All @@ -35,7 +35,9 @@ REGEXP_EXTRACT_ALL(<str>, <pattern>)

## Return value

The function returns an array of strings that represent the parts of the input string that match the first sub - pattern of the specified regular expression. The return type is an array of String values. If no matches are found, or if the pattern has no sub - patterns, an empty array is returned.
The function returns a string that represents the part of the input string that matches the first sub - pattern of the specified regular expression. The return type is String. If no matches are found, or if the pattern has no sub - patterns, an empty string is returned.

For the array-returning variant, see [REGEXP_EXTRACT_ALL_ARRAY](./regexp-extract-all-array.md).

**Default Behavior**:

Expand Down Expand Up @@ -88,7 +90,7 @@ mysql> SELECT regexp_extract_all('AbCdEfCg', '([[:lower:]]+)C([[:lower:]]+)');
+-----------------------------------------------------------------+
```

Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is an array of the keys ['abc', 'def', 'ghi'].
Extracting keys from key - value pairs.The pattern matches key - value pairs in the string. The first sub - pattern captures the keys, so the result is ['abc', 'def', 'ghi'].

```sql
mysql> SELECT regexp_extract_all('abc=111, def=222, ghi=333','("[^"]+"|\\w+)=("[^"]+"|\\w+)');
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,7 +12,7 @@

与`MURMUR_HASH3_64`的区别是:此版本复用 MurmurHash3 的 128 位处理函数,仅输出第一个 64 位哈希值,与[标准库](https://mmh3.readthedocs.io/en/latest/api.html#mmh3.hash64)的行为保持一致。

-注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64_v2 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。
-注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64_v2 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。如需更优的 64 位 MurmurHash3 性能,可考虑使用 `murmur_hash3_64`。

## 语法

Expand Down Expand Up @@ -44,4 +44,4 @@ select murmur_hash3_64_v2(null), murmur_hash3_64_v2("hello"), murmur_hash3_64_v2
+-----------------------+--------------------------+-----------------------------------+
| NULL | -3215607508166160593 | 3583109472027628045 |
+-----------------------+--------------------------+-----------------------------------+
```
```
Original file line number Diff line number Diff line change
Expand Up @@ -10,6 +10,8 @@

计算 64 位 murmur3 hash 值

与`MURMUR_HASH3_64_V2`的区别是:此版本专门为 64 位输出优化,性能略优于 v2 版本, 但与[标准库](https://mmh3.readthedocs.io/en/latest/api.html#mmh3.hash64)实现不一致。

-注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。

## 语法
Expand Down
Original file line number Diff line number Diff line change
Expand Up @@ -12,6 +12,10 @@

-注:经过测试 xxhash_64 的性能大约是 murmur_hash3_64 的 2 倍,所以在计算 hash 值时,更推荐使用`xxhash_64`,而不是`murmur_hash3_64`。

## 别名

- `XXHASH3_64`

## 语法

```sql
Expand Down
Loading