Replace ICU with a bundled Unicode case table #38
Pull request overview
This PR removes the ICU dependency by bundling a generated Unicode case-mapping table and registering Unicode-aware lower()/upper() SQLite SQL functions via an auto-extension, ensuring SQL case conversion matches JavaScript (String.prototype.toLowerCase()/toUpperCase()) across platforms (including Windows).
Changes:
- Added a generated Unicode case table and a UTF-8 case-transform implementation to provide Unicode lower()/upper() without ICU.
- Registered the new lower()/upper() as a SQLite auto-extension during addon initialization.
- Replaced ICU-specific build/test wiring with new JS-parity tests for Unicode casing (including 1:many mappings and Greek final-sigma behavior).
Reviewed changes
Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| test/52.unicode-case.js | Adds new tests validating SQL lower()/upper() parity with JavaScript across scripts and edge cases. |
| test/52.icu.js | Removes ICU-dependent tests that no longer apply after dropping ICU. |
| src/util/unicode_case.cpp | Introduces the Unicode-aware lower()/upper() implementation backed by the generated tables and final-sigma handling. |
| src/better_sqlite3.cpp | Registers the Unicode case functions as a SQLite auto-extension during module initialization. |
| deps/sqlite3.gyp | Removes conditional SQLITE_ENABLE_ICU build configuration and ICU include paths. |
| deps/icu.js | Deletes ICU discovery/link helper script (no longer needed). |
| deps/gen-unicode-case.mjs | Adds generator script to produce the embedded Unicode case-mapping data from Node. |
| deps/download.sh | Removes ICU-related commentary tied to prior amalgamation/define behavior. |
| binding.gyp | Removes ICU link/define wiring from addon build configuration. |
ICU's value for Zero was only Unicode lower()/upper() (zqlite compiles ILIKE to lower(col) LIKE lower(pattern)), but linking it has been a continual source of pain: no -fPIC distro static archives, soname coupling that breaks glibc consumers (libicui18n.so.67), QEMU arm builds, ~28MB, and a Windows exception.
Drop the ICU dependency entirely. Register Unicode-aware lower()/upper() on every connection from an embedded case-mapping table generated from Node's toLowerCase/toUpperCase (so the SQL functions match the client-side IVM matcher). The result is self-contained on every platform (incl. Windows), with no runtime ICU, no -fPIC/soname/QEMU problems, and ~120KB of generated data.
- deps/gen-unicode-case.mjs: generator -> src/util/unicode_case_data.h.
- src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered via sqlite3_auto_extension in the addon init.
- Remove SQLITE_ENABLE_ICU, deps/icu.js, and the ICU gyp linking.
- Rename test/52.icu.js -> test/52.unicode-case.js; assert lower()/upper() match JavaScript across scripts and 1:many mappings (ß->SS, İ->i̇).
Scope: context-free case mapping only (no folding, no Greek final-sigma context).
Follow-ups: the now-dead CI ICU installs can be removed (pairs with the native arm64 PR), and the zero_sqlite3 shell still uses ASCII lower()/upper().
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
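A minimal sketch of how a context-free case table could be derived from the runtime's toLowerCase/toUpperCase, as the commit message above describes. The function name buildCaseTable and the [codePoint, mappedCodePoints] entry shape are assumptions made for illustration; the actual deps/gen-unicode-case.mjs and the layout of unicode_case_data.h are not shown in this PR text.

```javascript
// Hypothetical generator pass: collect every code point whose case mapping
// differs from itself. Each entry is [codePoint, mappedCodePoints], where the
// mapped side may hold more than one code point (1:many mappings).
function buildCaseTable(transform) {
  const entries = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    // Surrogates are not scalar values and cannot appear in well-formed UTF-8.
    if (cp >= 0xd800 && cp <= 0xdfff) continue;
    const src = String.fromCodePoint(cp);
    const dst = transform(src);
    if (dst !== src) {
      // Array.from iterates a string by code point, not by UTF-16 unit.
      entries.push([cp, Array.from(dst, (ch) => ch.codePointAt(0))]);
    }
  }
  return entries;
}

const lower = buildCaseTable((s) => s.toLowerCase());
const upper = buildCaseTable((s) => s.toUpperCase());
```

Because each code point is transformed in isolation, the resulting table is context-free by construction: a lone Σ always maps to σ here, so any context rule has to be layered on top of the table lookup.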
Adds the one context-sensitive rule that JS toLowerCase applies in the default
(locale-independent) algorithm: Σ lowercases to ς when preceded by a cased
letter (skipping case-ignorable chars) and not followed by one, else σ.
- Generator emits Cased / Case_Ignorable as code-point ranges (from the
\p{Cased} / \p{Case_Ignorable} escapes, so they track V8's Unicode version).
- lower() tracks prevCased and, for Σ, scans forward for a following cased
letter (LowerSigma). upper() is unchanged.
- Tests: add final-sigma cases and an exhaustive guard that lower()/upper()
equal JS for every code point (verified: 0 mismatches across all 1,112,064).
With this there is no remaining context divergence from JS toLowerCase/
toUpperCase within a Unicode version.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
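The final-sigma rule described in this commit can be sketched in JavaScript using the same \p{Cased} / \p{Case_Ignorable} property escapes the generator is said to read. The helper names (lowerSigma, lowerWithFinalSigma) are hypothetical, and this backward-and-forward scan is only an illustration of the rule, not the C++ code in src/util/unicode_case.cpp, which tracks prevCased incrementally.

```javascript
// Sketch of the final-sigma rule; helper names are illustrative only.
const CASED = /\p{Cased}/u;
const CASE_IGNORABLE = /\p{Case_Ignorable}/u;
const CAPITAL_SIGMA = '\u03A3'; // Σ
const FINAL_SIGMA = '\u03C2';   // ς
const SMALL_SIGMA = '\u03C3';   // σ

// Decide how the capital sigma at index i lowercases, given the string as an
// array of single-code-point strings.
function lowerSigma(codePoints, i) {
  // Scan backward past case-ignorable characters, looking for a cased letter.
  let j = i - 1;
  while (j >= 0 && CASE_IGNORABLE.test(codePoints[j])) j--;
  const precededByCased = j >= 0 && CASED.test(codePoints[j]);

  // Scan forward the same way.
  let k = i + 1;
  while (k < codePoints.length && CASE_IGNORABLE.test(codePoints[k])) k++;
  const followedByCased = k < codePoints.length && CASED.test(codePoints[k]);

  return precededByCased && !followedByCased ? FINAL_SIGMA : SMALL_SIGMA;
}

function lowerWithFinalSigma(s) {
  const cps = Array.from(s);
  return cps
    .map((c, i) => (c === CAPITAL_SIGMA ? lowerSigma(cps, i) : c.toLowerCase()))
    .join('');
}
```

Every other character is lowercased on its own, which mirrors the split the commit describes: a context-free table for everything, plus one context check for Σ.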
The "matches JavaScript across every code point" test compares the generated table against the *runtime* Node's toLowerCase/toUpperCase, which only holds when the runtime's Unicode version matches the table's. On other Node versions (e.g. Node 23 ships an older Unicode than the table's 17.0) a few newly-added mappings differ, failing the sweep. Skip it unless versions match; the curated cases use stable mappings and continue to run everywhere.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
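One way such a gate could look, as a sketch only: Node exposes the Unicode version of its bundled ICU as process.versions.unicode, and the sweep runs only when that matches the table's version. The constant TABLE_UNICODE_VERSION and the helper shouldRunExhaustiveSweep are hypothetical names; how the real test detects the version is not shown in this PR text.

```javascript
// Assumed value, taken from the "17.0" mentioned in the commit message.
const TABLE_UNICODE_VERSION = '17.0';

// Hypothetical gate: run the every-code-point sweep only when the runtime's
// Unicode version matches the table's on major.minor.
function shouldRunExhaustiveSweep(runtimeUnicode, tableUnicode = TABLE_UNICODE_VERSION) {
  if (typeof runtimeUnicode !== 'string') return false; // e.g. a runtime without this field
  const majorMinor = (v) => v.split('.').slice(0, 2).join('.');
  return majorMinor(runtimeUnicode) === majorMinor(tableUnicode);
}

// process.versions.unicode is present on Node builds that include ICU.
const runtimeUnicode =
  typeof process !== 'undefined' && process.versions ? process.versions.unicode : undefined;
const runSweep = shouldRunExhaustiveSweep(runtimeUnicode);
```

Returning false when the version is unknown keeps the sweep from failing on runtimes that do not report one, while the curated cases can still run unconditionally.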
Make Node 24 LTS (Unicode 17) the explicit anchor for the case table, and guarantee verification on a matching runtime:
- Add a test that regenerates the table and asserts it equals the committed header (drift guard), gated to a runtime whose Unicode matches the table.
- Document that the generator should be run with the active Node LTS.
On the Node 24 matrix cell, CI now both (a) exhaustively checks lower()/upper() match JS for every code point and (b) confirms the committed table is current. On other Node versions these skip; the stable curated cases still run.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "committed case table is up to date" test regenerated the header and byte-compared it, which fails across runtimes for reasons unrelated to the data: the `// Source: Node X.Y.Z` provenance comment varies by generating Node version, line endings differ on Windows, and Bun spawns a different binary. The exhaustive "matches JavaScript across every code point" test already guards that the table is correct/current on any runtime whose Unicode matches, so drop the drift check.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
584410f to
1b1c1a7
- Utf8Decode: strict UTF-8 decoding. Reject bad continuation bytes, overlong encodings, surrogates, and values > U+10FFFF; emit U+FFFD (consuming one byte) on any malformed sequence, so output is always well-formed.
- lower()/upper(): a NULL from sqlite3_value_text() after a non-NULL value is an allocation/conversion failure; surface it as OOM instead of a NULL result.
- unicode_case.cpp: include <sqlite3.h> directly (don't rely on the unity build).
- better_sqlite3.cpp: guard sqlite3_auto_extension with std::call_once so repeated NODE_MODULE_INIT (e.g. worker threads) registers it once.
- test: add a malformed-UTF-8 case.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
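The strict decoding policy in the first bullet can be illustrated with a small JavaScript sketch. This is not the C++ Utf8Decode from the PR; the function names and the { cp, len } return shape are assumptions chosen to show the rejection rules and the "emit U+FFFD, consume one byte" recovery.

```javascript
// Hypothetical sketch: decode one code point from `bytes` starting at index i.
// Returns { cp, len }. Any malformed sequence yields U+FFFD with len 1.
function utf8Decode(bytes, i) {
  const bad = { cp: 0xfffd, len: 1 };
  const b0 = bytes[i];
  if (b0 < 0x80) return { cp: b0, len: 1 };

  let need; // number of continuation bytes expected
  let min;  // smallest code point that legitimately needs this length
  let cp;
  if (b0 >= 0xc2 && b0 <= 0xdf) { need = 1; min = 0x80; cp = b0 & 0x1f; }
  else if (b0 >= 0xe0 && b0 <= 0xef) { need = 2; min = 0x800; cp = b0 & 0x0f; }
  else if (b0 >= 0xf0 && b0 <= 0xf4) { need = 3; min = 0x10000; cp = b0 & 0x07; }
  else return bad; // stray continuation byte, 0xC0/0xC1, or 0xF5..0xFF

  if (i + need >= bytes.length) return bad; // truncated sequence
  for (let k = 1; k <= need; k++) {
    const b = bytes[i + k];
    if ((b & 0xc0) !== 0x80) return bad; // bad continuation byte
    cp = (cp << 6) | (b & 0x3f);
  }
  if (cp < min) return bad;                     // overlong encoding
  if (cp >= 0xd800 && cp <= 0xdfff) return bad; // UTF-16 surrogate
  if (cp > 0x10ffff) return bad;                // beyond the Unicode range
  return { cp, len: need + 1 };
}

// Decode a whole buffer; each malformed byte becomes one U+FFFD.
function decodeAll(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; ) {
    const { cp, len } = utf8Decode(bytes, i);
    out.push(cp);
    i += len;
  }
  return out;
}
```

Consuming exactly one byte on failure means the decoder resynchronizes at the next byte, so a valid sequence that follows garbage is still decoded rather than swallowed.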
I ended up adding support for final sigma. It was the only case where JS and SQLite were different, so I thought it worth fixing.
This is Plan C: drop the ICU dependency entirely and provide Unicode lower()/upper() from a small embedded case table. Up for judging.

Why
ICU's only value for Zero is Unicode lower()/upper() (zqlite compiles ILIKE → lower(col) LIKE lower(pattern)). Linking it has been a continual source of pain:
- no -fPIC distro static archives (Alpine + glibc both failed to link),
- libicui18n.so.67: cannot open shared object file broke every glibc consumer not on ICU 67,

What
Register Unicode-aware lower()/upper() on every connection from an embedded case-mapping table generated from Node's toLowerCase/toUpperCase, so the SQL functions match the client-side IVM matcher by construction.
- deps/gen-unicode-case.mjs → src/util/unicode_case_data.h (~120 KB, 1488 lower / 1580 upper entries; regenerate on a deliberate Unicode bump).
- src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered via sqlite3_auto_extension in the addon init.
- Remove SQLITE_ENABLE_ICU, deps/icu.js, and all ICU gyp linking.
- 52.icu.js → 52.unicode-case.js: asserts the functions match JavaScript across scripts and 1:many mappings (ß→SS, İ→i̇).

Result
- Self-contained on every platform (incl. Windows): no runtime ICU, no -fPIC/soname/QEMU problems, no 28 MB.
- otool -L shows no ICU.

Scope / honesty
- Context-free case mapping only: no folding (ß↔ss) and no context rules (Greek word-final sigma). JS toLowerCase does final-sigma; a word-final Σ can therefore differ. Zero's ILIKE parity test guards what we actually rely on.
- The zero_sqlite3 shell still uses ASCII lower()/upper() (it doesn't load the addon); follow-up if we want it.

Follow-ups once this is judged good
- icu-libs and just bump the version.
- Remove the now-dead CI ICU installs (libicu-dev/icu4c/icu-dev); pairs with "ci: build arm64 prebuilds on native ARM runners (drop QEMU + 32-bit arm)" #37 (native arm64).

To maintain going forward: re-run the generator on a Unicode bump (optional), and the ~150 lines of C. That's the whole surface.
🤖 Generated with Claude Code