Support Unicode property classes (\p{…} / \P{…}) #7

@DaliVana

Summary

Regex.compile rejects all Unicode property escapes (\p{…} and \P{…}). This includes general categories (\p{L}, \p{N}, \p{Lu}, \p{Nd}, …), scripts (\p{Greek}, \p{Han}, …), blocks, and binary properties (\p{White_Space}, \p{Alphabetic}, …). These are standardized in UTS #18 (Unicode Regular Expressions) and are required to express tokenizer pre-tokenization patterns faithfully.

Version

  • zig-regex: 0.1.1
  • Zig: 0.16.0

Reproduction

const std = @import("std");
const Regex = @import("regex").Regex;

test "unicode property classes" {
    // Every pattern below currently fails inside Regex.compile.
    inline for (.{ "\\p{L}+", "\\p{N}+", "\\p{Lu}", "\\P{L}", "[^\\p{L}\\p{N}]" }) |pat| {
        var re = try Regex.compile(std.testing.allocator, pat);
        re.deinit();
    }
}

Expected
All Unicode property forms compile and match per the Unicode Character Database (UCD), mirroring the UTS #18 / PCRE / fancy-regex semantics:

  • General categories: one- and two-letter (\p{L}, \p{Lu}, \p{N}, \p{Nd}, \p{P}, \p{Z}, …); see UAX #44 §5.7.1 General_Category
  • Negation: \P{…} and \p{^…}
  • Scripts and script extensions: \p{Greek}, \p{Script=Han}, \p{scx=Latin}, …; see UAX #24 (Script property)
  • Binary properties: \p{White_Space}, \p{Alphabetic}, \p{Emoji}, …
  • Usable both standalone and inside character classes, e.g. [^\p{L}\p{N}], with property and value names matched loosely per UAX44-LM3 (ignoring case, underscores, hyphens, and spaces)

Actual
Compilation fails. The lexer has no case for p/P after a backslash, so it falls through to else => RegexError.InvalidEscapeSequence (src/parser.zig:107). Inside [...] the class parser likewise rejects it (parseCharClass → InvalidCharacterClass, src/parser.zig:722-723).
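To make the missing case concrete, here is a rough sketch of what a p/P branch could do once the backslash has been consumed. Everything in it is hypothetical: PropertyEscape and parsePropertyEscape are names invented for illustration and are not part of zig-regex, and treating \P{^…} as a double negation is an assumption borrowed from PCRE-style behaviour. It only splits the escape into a name and a negation flag; resolving the name to code points is a separate step.

```zig
const std = @import("std");

/// Hypothetical result type for illustration; not a zig-regex type.
const PropertyEscape = struct {
    name: []const u8, // text between the braces, without a leading '^'
    negated: bool,
    len: usize, // bytes consumed, counted from the 'p' or 'P'
};

const ParseError = error{InvalidEscapeSequence};

/// Parses `p{Name}`, `P{Name}` or `p{^Name}`.
/// `input` is assumed to start right after the backslash.
fn parsePropertyEscape(input: []const u8) ParseError!PropertyEscape {
    if (input.len < 2) return error.InvalidEscapeSequence;
    if (input[0] != 'p' and input[0] != 'P') return error.InvalidEscapeSequence;
    if (input[1] != '{') return error.InvalidEscapeSequence;

    // Find the closing brace with a plain scan.
    var close: usize = 2;
    while (close < input.len and input[close] != '}') : (close += 1) {}
    if (close == input.len) return error.InvalidEscapeSequence;

    var name = input[2..close];
    var negated = input[0] == 'P';
    if (name.len > 0 and name[0] == '^') {
        negated = !negated; // assumption: \P{^L} cancels out to \p{L}
        name = name[1..];
    }
    if (name.len == 0) return error.InvalidEscapeSequence;

    return .{ .name = name, .negated = negated, .len = close + 1 };
}

test "parsePropertyEscape splits name and negation" {
    const a = try parsePropertyEscape("p{L}+");
    try std.testing.expectEqualStrings("L", a.name);
    try std.testing.expect(!a.negated);
    try std.testing.expectEqual(@as(usize, 4), a.len);

    const b = try parsePropertyEscape("p{^Lu}");
    try std.testing.expectEqualStrings("Lu", b.name);
    try std.testing.expect(b.negated);

    try std.testing.expectError(error.InvalidEscapeSequence, parsePropertyEscape("p{"));
}
```

The same helper could in principle be called from both the top-level escape handling and the character-class parser, so the two paths cannot drift apart.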

Why this matters
The canonical GPT-4 BPE split pattern is:

'(?i:[sdmt]|ll|ve|re)|[^\r\n\p{L}\p{N}]?+\p{L}+|\p{N}{1,3}| ?[^\s\p{L}\p{N}]++[\r\n]*|\s*[\r\n]|\s+(?!\S)|\s+

Without Unicode property classes this can only be approximated with ASCII [A-Za-z]/[0-9], which silently mis-tokenizes any non-ASCII text (accented Latin, CJK, Cyrillic, emoji, etc.). zig-regex already supports the harder parts of this pattern (lookahead), so property classes are the main remaining blocker for a faithful port, and they're broadly useful well beyond tokenization.

Proposed scope
Full \p{…}/\P{…} support backed by generated UCD tables, targeting UTS #18 "Basic Unicode Support" (RL1.2):

  • General categories: both abbreviated (L, Lu, Nd, P, Z, …) and the super-categories (L, M, N, P, S, Z, C).
  • Negation: \P{…} and the \p{^…} form.
  • Scripts / script extensions: \p{Script=…} / \p{sc=…} and \p{Script_Extensions=…} / \p{scx=…}, plus the bare \p{Greek} shorthand.
  • Binary properties: at minimum White_Space, Alphabetic, Uppercase, Lowercase, Noncharacter_Code_Point; ideally the full set.
  • Loose name matching per UAX44-LM3.
  • Works identically standalone and inside [...], and composes with negated classes ([^\p{L}\p{N}]) and existing quantifiers.
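For the loose name matching item, a minimal sketch of the comparison might look like the following. looseEql is a hypothetical helper, not an existing zig-regex function. It ignores ASCII case, spaces, underscores, and hyphens as described above, and it deliberately leaves out the optional "is" prefix rule from UAX44-LM3.

```zig
const std = @import("std");

/// Characters that loose matching skips entirely.
fn isIgnorable(c: u8) bool {
    return c == '_' or c == '-' or c == ' ';
}

/// Hypothetical helper: compares two property or value names while
/// ignoring ASCII case, spaces, underscores and hyphens.
fn looseEql(a: []const u8, b: []const u8) bool {
    var i: usize = 0;
    var j: usize = 0;
    while (true) {
        // Skip ignorable characters on both sides before comparing.
        while (i < a.len and isIgnorable(a[i])) i += 1;
        while (j < b.len and isIgnorable(b[j])) j += 1;
        // Equal only if both names are exhausted at the same time.
        if (i == a.len or j == b.len) return i == a.len and j == b.len;
        if (std.ascii.toLower(a[i]) != std.ascii.toLower(b[j])) return false;
        i += 1;
        j += 1;
    }
}

test "looseEql follows the loose matching rules sketched above" {
    try std.testing.expect(looseEql("White_Space", "whitespace"));
    try std.testing.expect(looseEql("script-extensions", "Script_Extensions"));
    try std.testing.expect(!looseEql("Lu", "Ll"));
    try std.testing.expect(!looseEql("L", "Lu"));
}
```

Comparing in place like this avoids allocating a normalized copy of each name; normalizing once at table-generation time would be an equally reasonable choice.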
Implementation suggestion: vendor a small code-generator that emits codepoint range tables from the UCD data files (DerivedGeneralCategory.txt, Scripts.txt, ScriptExtensions.txt, PropList.txt, PropertyAliases.txt, PropertyValueAliases.txt) at build time, pin the Unicode version, and resolve \p{…} to range sets the existing matcher already understands. Phasing is fine: general categories + negation first (covers the tokenizer use case), scripts/binary properties next.
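As an illustration of the range-table idea, here is a small sketch of a sorted table of inclusive code point ranges with a binary-search membership test and negation applied on top. Range, nd_excerpt, contains, and matches are all hypothetical names, and the table is a tiny hand-written excerpt of decimal digits rather than generated UCD data. How zig-regex represents range sets internally is not something I have checked, so treat this as one possible shape, not a description of the existing matcher.

```zig
const std = @import("std");

/// Inclusive code point range; hypothetical type for illustration.
const Range = struct { lo: u21, hi: u21 };

// Hand-written excerpt, sorted and non-overlapping. A real table would be
// generated from the UCD files and would be much longer.
const nd_excerpt = [_]Range{
    .{ .lo = 0x0030, .hi = 0x0039 }, // ASCII digits
    .{ .lo = 0x0660, .hi = 0x0669 }, // Arabic-Indic digits
    .{ .lo = 0x06F0, .hi = 0x06F9 }, // Extended Arabic-Indic digits
};

/// Binary search over a sorted, non-overlapping range table.
fn contains(table: []const Range, cp: u21) bool {
    var lo: usize = 0;
    var hi: usize = table.len;
    while (lo < hi) {
        const mid = lo + (hi - lo) / 2;
        const r = table[mid];
        if (cp < r.lo) {
            hi = mid;
        } else if (cp > r.hi) {
            lo = mid + 1;
        } else {
            return true;
        }
    }
    return false;
}

/// Applies \P{…} / \p{^…} style negation on top of the table lookup.
fn matches(table: []const Range, negated: bool, cp: u21) bool {
    return contains(table, cp) != negated;
}

test "range table lookup with negation" {
    try std.testing.expect(contains(&nd_excerpt, '7'));
    try std.testing.expect(contains(&nd_excerpt, 0x0663));
    try std.testing.expect(!contains(&nd_excerpt, 'a'));
    try std.testing.expect(matches(&nd_excerpt, true, 'a'));
    try std.testing.expect(!matches(&nd_excerpt, true, '7'));
}
```

Keeping negation as a flag rather than materializing the complement keeps the generated tables small; inside a negated class such as [^\p{L}\p{N}] the complement may still need to be computed once at compile time.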

I'm happy to test a branch against a real BPE tokenizer port (ASCII and non-ASCII corpora) if that helps validate correctness.

References
  • UTS #18: Unicode Regular Expressions (property syntax & conformance levels)
  • UAX #44: Unicode Character Database (General_Category, loose matching UAX44-LM3)
  • UAX #24: Unicode Script Property
  • UCD data files (latest)
  • Prior art: Rust regex crate Unicode support, PCRE2 \p docs
