Replace ICU with a bundled Unicode case table by arv · Pull Request #38 · rocicorp/zero-sqlite3

arv · 2026-06-03T14:45:27Z

This is Plan C — drop the ICU dependency entirely and provide Unicode lower()/upper() from a small embedded case table. Up for judging.

Why

ICU's only value for Zero is Unicode lower()/upper() (zqlite compiles ILIKE → lower(col) LIKE lower(pattern)). Linking it has been a continual source of pain:

distro static archives aren't -fPIC (Alpine + glibc both failed to link),
dynamic linking couples to the build image's ICU soname → libicui18n.so.67: cannot open shared object file broke every glibc consumer not on ICU 67,
QEMU arm builds, ~28 MB of ICU data, and a Windows ASCII-only exception.

What

Register Unicode-aware lower()/upper() on every connection from an embedded case-mapping table generated from Node's toLowerCase/toUpperCase — so the SQL functions match the client-side IVM matcher by construction.

deps/gen-unicode-case.mjs → src/util/unicode_case_data.h (~120 KB, 1488 lower / 1580 upper entries; regenerate on a deliberate Unicode bump).
src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered via sqlite3_auto_extension in the addon init.
Removed SQLITE_ENABLE_ICU, deps/icu.js, and all ICU gyp linking.
Test 52.icu.js → 52.unicode-case.js: asserts the functions match JavaScript across scripts and 1:many mappings (ß→SS, İ→i̇).
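
For readers who want a feel for how such a table can be derived from Node, here is a minimal sketch. It is not the contents of deps/gen-unicode-case.mjs; the function name `buildCaseTable` and the `[codePoint, [mappedCodePoints]]` entry shape are assumptions made for illustration. It walks every Unicode scalar value and keeps only the code points whose JavaScript case mapping differs from the input, which is also where the 1:many mappings (ß→SS, İ→i̇) come from.

```javascript
// Hypothetical sketch, not the real generator: derive case-mapping entries
// from the running JavaScript engine's toLowerCase/toUpperCase.
function buildCaseTable(transform) {
  const entries = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    // Surrogates are not Unicode scalar values, so they carry no mapping.
    if (cp >= 0xd800 && cp <= 0xdfff) continue;
    const src = String.fromCodePoint(cp);
    const dst = transform(src);
    if (dst !== src) {
      // One source code point may map to several (for example ß -> SS).
      entries.push([cp, Array.from(dst, (ch) => ch.codePointAt(0))]);
    }
  }
  return entries;
}

const lowerTable = buildCaseTable((s) => s.toLowerCase());
const upperTable = buildCaseTable((s) => s.toUpperCase());

// Small helper to look up one entry (linear scan is fine for a sketch).
const findEntry = (table, cp) => table.find(([from]) => from === cp);
```

The entry counts depend on the Unicode version of the engine that runs the sketch, which is why the table is tied to a specific Node release.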

Result

Self-contained on every platform, including Windows. No runtime ICU, no -fPIC/soname/QEMU problems, no 28 MB. otool -L shows no ICU.
Full suite: 335 passing, incl. the JS-parity samples.

Scope / honesty

Context-free case mapping only. No case folding (ß↔ss) and no context rules (Greek word-final sigma). JS toLowerCase does final-sigma; a word-final Σ can therefore differ. Zero's ILIKE parity test guards what we actually rely on.
The zero_sqlite3 shell still uses ASCII lower()/upper() (it doesn't load the addon) — follow-up if we want it.

Follow-ups once this is judged good

Cut 1.1.1; the soname problem is gone, so the in-flight mono PR can drop icu-libs and just bump the version.
Remove the now-dead CI ICU installs (libicu-dev/icu4c/icu-dev) — pairs with ci: build arm64 prebuilds on native ARM runners (drop QEMU + 32-bit arm) #37 (native arm64).

To maintain going forward: re-run the generator on a Unicode bump (optional), and the ~150 lines of C. That's the whole surface.

🤖 Generated with Claude Code

Copilot

Pull request overview

This PR removes the ICU dependency by bundling a generated Unicode case-mapping table and registering Unicode-aware lower()/upper() SQLite SQL functions via an auto-extension, ensuring SQL case conversion matches JavaScript (String.prototype.toLowerCase()/toUpperCase()) across platforms (including Windows).

Changes:

Added a generated Unicode case table and a UTF-8 case-transform implementation to provide Unicode lower()/upper() without ICU.
Registered the new lower()/upper() as a SQLite auto-extension during addon initialization.
Replaced ICU-specific build/test wiring with new JS-parity tests for Unicode casing (including 1:many mappings and Greek final-sigma behavior).

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/52.unicode-case.js | Adds new tests validating SQL `lower()`/`upper()` parity with JavaScript across scripts and edge cases. |
| test/52.icu.js | Removes ICU-dependent tests that no longer apply after dropping ICU. |
| src/util/unicode_case.cpp | Introduces the Unicode-aware `lower()`/`upper()` implementation backed by the generated tables and final-sigma handling. |
| src/better_sqlite3.cpp | Registers the Unicode case functions as a SQLite auto-extension during module initialization. |
| deps/sqlite3.gyp | Removes conditional `SQLITE_ENABLE_ICU` build configuration and ICU include paths. |
| deps/icu.js | Deletes ICU discovery/link helper script (no longer needed). |
| deps/gen-unicode-case.mjs | Adds generator script to produce the embedded Unicode case-mapping data from Node. |
| deps/download.sh | Removes ICU-related commentary tied to prior amalgamation/define behavior. |
| binding.gyp | Removes ICU link/define wiring from addon build configuration. |

ICU's value for Zero was only Unicode lower()/upper() (zqlite compiles ILIKE to lower(col) LIKE lower(pattern)), but linking it has been a continual source of pain: no -fPIC distro static archives, soname coupling that breaks glibc consumers (libicui18n.so.67), QEMU arm builds, ~28MB, and a Windows exception.

Drop the ICU dependency entirely. Register Unicode-aware lower()/upper() on every connection from an embedded case-mapping table generated from Node's toLowerCase/toUpperCase (so the SQL functions match the client-side IVM matcher). The result is self-contained on every platform (incl. Windows), with no runtime ICU, no -fPIC/soname/QEMU problems, and ~120KB of generated data.

- deps/gen-unicode-case.mjs: generator -> src/util/unicode_case_data.h.
- src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered via sqlite3_auto_extension in the addon init.
- Remove SQLITE_ENABLE_ICU, deps/icu.js, and the ICU gyp linking.
- Rename test/52.icu.js -> test/52.unicode-case.js; assert lower()/upper() match JavaScript across scripts and 1:many mappings (ß->SS, İ->i̇).

Scope: context-free case mapping only (no folding, no Greek final-sigma context).

Follow-ups: the now-dead CI ICU installs can be removed (pairs with the native arm64 PR), and the zero_sqlite3 shell still uses ASCII lower()/upper().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds the one context-sensitive rule that JS toLowerCase applies in the default (locale-independent) algorithm: Σ lowercases to ς when preceded by a cased letter (skipping case-ignorable chars) and not followed by one, else σ.

- Generator emits Cased / Case_Ignorable as code-point ranges (from the \p{Cased} / \p{Case_Ignorable} escapes, so they track V8's Unicode version).
- lower() tracks prevCased and, for Σ, scans forward for a following cased letter (LowerSigma). upper() is unchanged.
- Tests: add final-sigma cases and an exhaustive guard that lower()/upper() equal JS for every code point (verified: 0 mismatches across all 1,112,064).

With this there is no remaining context divergence from JS toLowerCase/toUpperCase within a Unicode version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
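
The final-sigma rule described in this commit can be sketched in JavaScript as follows. This is an illustration of the rule, not the C++ in src/util/unicode_case.cpp: the names `lowerSigma` and `lowerWithFinalSigma` are hypothetical here, the sketch scans backward instead of tracking a running prevCased flag, and every other code point is simply lowercased through the engine one at a time.

```javascript
const SIGMA = 0x03a3;       // Σ GREEK CAPITAL LETTER SIGMA
const FINAL_SIGMA = 0x03c2; // ς GREEK SMALL LETTER FINAL SIGMA
const SMALL_SIGMA = 0x03c3; // σ GREEK SMALL LETTER SIGMA

// Property tests via the same \p{...} escapes the commit mentions.
const isCased = (cp) => /\p{Cased}/u.test(String.fromCodePoint(cp));
const isCaseIgnorable = (cp) =>
  /\p{Case_Ignorable}/u.test(String.fromCodePoint(cp));

// Decide how the Σ at index i lowercases, given the whole code-point array.
function lowerSigma(cps, i) {
  // Look backward past case-ignorable characters for a cased letter.
  let j = i - 1;
  while (j >= 0 && isCaseIgnorable(cps[j])) j--;
  const precededByCased = j >= 0 && isCased(cps[j]);

  // Look forward past case-ignorable characters for a cased letter.
  let k = i + 1;
  while (k < cps.length && isCaseIgnorable(cps[k])) k++;
  const followedByCased = k < cps.length && isCased(cps[k]);

  return precededByCased && !followedByCased ? FINAL_SIGMA : SMALL_SIGMA;
}

// Lowercase a string code point by code point, special-casing Σ only.
function lowerWithFinalSigma(str) {
  const cps = Array.from(str, (ch) => ch.codePointAt(0));
  let out = '';
  cps.forEach((cp, i) => {
    out += cp === SIGMA
      ? String.fromCodePoint(lowerSigma(cps, i))
      : String.fromCodePoint(cp).toLowerCase();
  });
  return out;
}
```

For example, `lowerWithFinalSigma('ΟΔΟΣ')` yields `'οδος'`, while a standalone `'Σ'` yields `'σ'` because no cased letter precedes it.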

The "matches JavaScript across every code point" test compares the generated table against the *runtime* Node's toLowerCase/toUpperCase, which only holds when the runtime's Unicode version matches the table's. On other Node versions (e.g. Node 23 ships an older Unicode than the table's 17.0) a few newly-added mappings differ, failing the sweep. Skip it unless versions match; the curated cases use stable mappings and continue to run everywhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
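
A version-gated sweep of this kind might look like the sketch below. The constant `TABLE_UNICODE_VERSION`, the helper names, and the `sqlLower` callback (a stand-in for a call into SQLite's lower()) are assumptions for illustration; the real test in test/52.unicode-case.js may be structured differently. The sketch relies on `process.versions.unicode`, which Node reports when built with ICU support.

```javascript
// Assumption: the committed table targets Unicode 17.0, as the commit states.
const TABLE_UNICODE_VERSION = '17.0';

// Compare only major.minor so "17", "17.0" and "17.0.0" are treated alike.
function sameUnicodeVersion(a, b) {
  const norm = (v) => {
    const parts = String(v).split('.');
    return `${parts[0]}.${parts[1] ?? '0'}`;
  };
  return norm(a) === norm(b);
}

// Exhaustive sweep over every Unicode scalar value. `sqlLower` is a
// hypothetical callback that lowercases a string the way the SQL function does.
function sweepLower(sqlLower) {
  let checked = 0;
  let mismatches = 0;
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // skip surrogates
    const s = String.fromCodePoint(cp);
    checked++;
    if (sqlLower(s) !== s.toLowerCase()) mismatches++;
  }
  return { checked, mismatches };
}

// Only run the sweep when the runtime's Unicode matches the table's.
const runtimeUnicode =
  typeof process !== 'undefined' ? process.versions.unicode : undefined;
const shouldRunSweep =
  runtimeUnicode !== undefined &&
  sameUnicodeVersion(runtimeUnicode, TABLE_UNICODE_VERSION);
```

In a test file, `shouldRunSweep` would decide whether the exhaustive case runs or is skipped, while the curated cases run unconditionally.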

Make Node 24 LTS (Unicode 17) the explicit anchor for the case table, and guarantee verification on a matching runtime:

- Add a test that regenerates the table and asserts it equals the committed header (drift guard), gated to a runtime whose Unicode matches the table.
- Document that the generator should be run with the active Node LTS.

On the Node 24 matrix cell, CI now both (a) exhaustively checks lower()/upper() match JS for every code point and (b) confirms the committed table is current. On other Node versions these skip; the stable curated cases still run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The "committed case table is up to date" test regenerated the header and byte-compared it, which fails across runtimes for reasons unrelated to the data: the `// Source: Node X.Y.Z` provenance comment varies by generating Node version, line endings differ on Windows, and Bun spawns a different binary. The exhaustive "matches JavaScript across every code point" test already guards that the table is correct/current on any runtime whose Unicode matches, so drop the drift check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- Utf8Decode: strict UTF-8 decoding (reject bad continuation bytes, overlong encodings, surrogates, and values > U+10FFFF); emit U+FFFD (consuming one byte) on any malformed sequence, so output is always well-formed.
- lower()/upper(): a NULL from sqlite3_value_text() after a non-NULL value is an allocation/conversion failure; surface it as OOM instead of a NULL result.
- unicode_case.cpp: include <sqlite3.h> directly (don't rely on the unity build).
- better_sqlite3.cpp: guard sqlite3_auto_extension with std::call_once so repeated NODE_MODULE_INIT (e.g. worker threads) registers it once.
- test: add a malformed-UTF-8 case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
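
The strict decoding policy in the first bullet can be illustrated with a short JavaScript sketch. It mirrors the stated rules (reject bad continuation bytes, overlong encodings, surrogates, and values above U+10FFFF; emit U+FFFD and consume exactly one byte on failure) but is not the C++ Utf8Decode itself; the names `utf8DecodeOne` and `utf8DecodeAll` and the `{ cp, len }` return shape are assumptions for illustration.

```javascript
const REPLACEMENT = 0xfffd;

// Decode one scalar value starting at bytes[i]; returns { cp, len }.
function utf8DecodeOne(bytes, i) {
  const bad = { cp: REPLACEMENT, len: 1 }; // malformed: consume one byte only
  const b0 = bytes[i];
  if (b0 < 0x80) return { cp: b0, len: 1 }; // ASCII

  let extra, min, cp;
  if ((b0 & 0xe0) === 0xc0) { extra = 1; min = 0x80; cp = b0 & 0x1f; }
  else if ((b0 & 0xf0) === 0xe0) { extra = 2; min = 0x800; cp = b0 & 0x0f; }
  else if ((b0 & 0xf8) === 0xf0) { extra = 3; min = 0x10000; cp = b0 & 0x07; }
  else return bad; // stray continuation byte or invalid lead byte

  if (i + extra >= bytes.length) return bad; // truncated sequence
  for (let k = 1; k <= extra; k++) {
    const c = bytes[i + k];
    if ((c & 0xc0) !== 0x80) return bad; // bad continuation byte
    cp = (cp << 6) | (c & 0x3f);
  }
  if (cp < min) return bad;                     // overlong encoding
  if (cp >= 0xd800 && cp <= 0xdfff) return bad; // encoded surrogate
  if (cp > 0x10ffff) return bad;                // beyond the Unicode range
  return { cp, len: extra + 1 };
}

// Decode a whole byte array into an array of code points.
function utf8DecodeAll(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; ) {
    const { cp, len } = utf8DecodeOne(bytes, i);
    out.push(cp);
    i += len;
  }
  return out;
}
```

Because a failure consumes only one byte, the overlong pair `C0 80` decodes to two U+FFFD code points, not one. That differs from decoders that replace a maximal invalid subsequence, so this sketch is not expected to match `TextDecoder` on malformed input.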

arv · 2026-06-04T07:07:29Z

> context rules (Greek word-final sigma). JS toLowerCase does final-sigma; a word-final Σ can therefore differ. Zero's ILIKE parity test guards what we actually rely on.

I ended up adding support for final sigma. It was the only case where JS and SQLite differed, so I thought it was worth fixing.

arv changed the title ~~Replace ICU with a bundled Unicode case table (option C)~~ Replace ICU with a bundled Unicode case table Jun 3, 2026

arv requested review from Copilot and tantaman June 3, 2026 15:00

Copilot started reviewing on behalf of arv June 3, 2026 15:00

Copilot AI reviewed Jun 3, 2026

Comment thread src/util/unicode_case.cpp

Comment thread src/util/unicode_case.cpp

Comment thread src/util/unicode_case.cpp

Comment thread src/better_sqlite3.cpp Outdated

arv mentioned this pull request Jun 3, 2026

ci: test only the maintained LTS Node lines (22, 24) #39

Merged

arv and others added 5 commits June 3, 2026 17:22

arv force-pushed the arv/unicode-case-no-icu branch from 584410f to 1b1c1a7 Compare June 3, 2026 15:23

arv merged commit 7b9bfb5 into main Jun 3, 2026
25 checks passed

arv deleted the arv/unicode-case-no-icu branch June 3, 2026 15:34

arv merged 6 commits into main from arv/unicode-case-no-icu
