Replace ICU with a bundled Unicode case table #38

Merged
arv merged 6 commits into main from arv/unicode-case-no-icu on Jun 3, 2026

@arv arv commented Jun 3, 2026

This is Plan C — drop the ICU dependency entirely and provide Unicode lower()/upper() from a small embedded case table. Up for judging.

Why

ICU's only value for Zero is Unicode lower()/upper() (zqlite compiles ILIKE to lower(col) LIKE lower(pattern)). Linking it has been a continual source of pain:

  • distro static archives aren't -fPIC (Alpine + glibc both failed to link),
  • dynamic linking couples to the build image's ICU soname → libicui18n.so.67: cannot open shared object file broke every glibc consumer not on ICU 67,
  • QEMU arm builds, ~28 MB of ICU data, and a Windows ASCII-only exception.
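
To make concrete why lower() is the only piece of ICU that matters here, the ILIKE rewrite described above can be sketched in JavaScript. This is a minimal illustration, not Zero's actual matcher: `likeToRegExp` and `ilike` are hypothetical names, and the LIKE semantics are simplified (no ESCAPE clause).

```javascript
// Hypothetical helper: turn a SQL LIKE pattern into an anchored RegExp.
// '%' matches any run of characters, '_' matches exactly one code point.
function likeToRegExp(pattern) {
  let source = '';
  for (const ch of pattern) {
    if (ch === '%') source += '.*';
    else if (ch === '_') source += '.';
    // Escape RegExp syntax characters so they match literally.
    else source += ch.replace(/[\\^$.*+?()[\]{}|]/g, '\\$&');
  }
  // 's' lets '.' match newlines; 'u' makes '.' match whole code points.
  return new RegExp(`^${source}$`, 'su');
}

// Sketch of ILIKE as lower(value) LIKE lower(pattern): both sides are
// lowercased with the same function, then compared case-sensitively.
function ilike(value, pattern) {
  return likeToRegExp(pattern.toLowerCase()).test(value.toLowerCase());
}
```

Because both sides go through the same lowercasing function, the SQL side and the client side agree as long as their lower() implementations agree. Note that this is case mapping, not case folding: `ilike('STRASSE', 'straß%')` is false in this sketch.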

What

Register Unicode-aware lower()/upper() on every connection from an embedded case-mapping table generated from Node's toLowerCase/toUpperCase — so the SQL functions match the client-side IVM matcher by construction.

  • deps/gen-unicode-case.mjs → src/util/unicode_case_data.h (~120 KB, 1488 lower / 1580 upper entries; regenerate on a deliberate Unicode bump).
  • src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered via sqlite3_auto_extension in the addon init.
  • Removed SQLITE_ENABLE_ICU, deps/icu.js, and all ICU gyp linking.
  • Test 52.icu.js → 52.unicode-case.js: asserts the functions match JavaScript across scripts and 1:many mappings (ß→SS, İ→i̇).
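
As a rough illustration of how a table like this could be derived from Node, the sketch below walks every Unicode scalar value and records the code points whose mapping is not the identity. It is an assumption about the general approach, not the contents of deps/gen-unicode-case.mjs: `buildCaseTable` and `lookup` are hypothetical names, and the real generator's entry format and counts may differ.

```javascript
// Hypothetical sketch: collect [codePoint, mappedCodePoints] pairs for every
// scalar value whose context-free mapping differs from the input.
function buildCaseTable(convert) {
  const entries = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue; // surrogates are not scalar values
    const mapped = Array.from(convert(String.fromCodePoint(cp)), (c) => c.codePointAt(0));
    // Keep 1:many mappings and 1:1 mappings that change the code point.
    if (mapped.length !== 1 || mapped[0] !== cp) entries.push([cp, mapped]);
  }
  return entries; // sorted by code point because the loop is ascending
}

// Binary search over the sorted entries; returns null for unmapped code points.
function lookup(table, cp) {
  let lo = 0;
  let hi = table.length - 1;
  while (lo <= hi) {
    const mid = (lo + hi) >> 1;
    if (table[mid][0] === cp) return table[mid][1];
    if (table[mid][0] < cp) lo = mid + 1;
    else hi = mid - 1;
  }
  return null;
}

const lowerTable = buildCaseTable((s) => s.toLowerCase());
const upperTable = buildCaseTable((s) => s.toUpperCase());
```

A sorted table with binary search is one plausible lookup strategy for a few thousand entries; the committed header may well use a different layout.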

Result

  • Self-contained on every platform, including Windows. No runtime ICU, no -fPIC/soname/QEMU problems, no 28 MB. otool -L shows no ICU.
  • Full suite: 335 passing, incl. the JS-parity samples.

Scope / honesty

  • Context-free case mapping only. No case folding (ß↔ss) and no context rules (Greek word-final sigma). JS toLowerCase does final-sigma; a word-final Σ can therefore differ. Zero's ILIKE parity test guards what we actually rely on.
  • The zero_sqlite3 shell still uses ASCII lower()/upper() (it doesn't load the addon) — follow-up if we want it.

Follow-ups once this is judged good

Ongoing maintenance: re-run the generator on a Unicode bump (optional) and keep the ~150 lines of C working. That's the whole surface.

🤖 Generated with Claude Code

@arv arv changed the title from "Replace ICU with a bundled Unicode case table (option C)" to "Replace ICU with a bundled Unicode case table" Jun 3, 2026
@arv arv requested review from Copilot and tantaman June 3, 2026 15:00

Copilot AI left a comment

Pull request overview

This PR removes the ICU dependency by bundling a generated Unicode case-mapping table and registering Unicode-aware lower()/upper() SQLite SQL functions via an auto-extension, ensuring SQL case conversion matches JavaScript (String.prototype.toLowerCase()/toUpperCase()) across platforms (including Windows).

Changes:

  • Added a generated Unicode case table and a UTF-8 case-transform implementation to provide Unicode lower()/upper() without ICU.
  • Registered the new lower()/upper() as a SQLite auto-extension during addon initialization.
  • Replaced ICU-specific build/test wiring with new JS-parity tests for Unicode casing (including 1:many mappings and Greek final-sigma behavior).

Reviewed changes

Copilot reviewed 9 out of 10 changed files in this pull request and generated 4 comments.

| File | Description |
| --- | --- |
| test/52.unicode-case.js | Adds new tests validating SQL lower()/upper() parity with JavaScript across scripts and edge cases. |
| test/52.icu.js | Removes ICU-dependent tests that no longer apply after dropping ICU. |
| src/util/unicode_case.cpp | Introduces the Unicode-aware lower()/upper() implementation backed by the generated tables and final-sigma handling. |
| src/better_sqlite3.cpp | Registers the Unicode case functions as a SQLite auto-extension during module initialization. |
| deps/sqlite3.gyp | Removes conditional SQLITE_ENABLE_ICU build configuration and ICU include paths. |
| deps/icu.js | Deletes ICU discovery/link helper script (no longer needed). |
| deps/gen-unicode-case.mjs | Adds generator script to produce the embedded Unicode case-mapping data from Node. |
| deps/download.sh | Removes ICU-related commentary tied to prior amalgamation/define behavior. |
| binding.gyp | Removes ICU link/define wiring from addon build configuration. |

arv and others added 5 commits June 3, 2026 17:22
ICU's value for Zero was only Unicode lower()/upper() (zqlite compiles ILIKE to
lower(col) LIKE lower(pattern)), but linking it has been a continual source of
pain: no -fPIC distro static archives, soname coupling that breaks glibc
consumers (libicui18n.so.67), QEMU arm builds, ~28MB, and a Windows exception.

Drop the ICU dependency entirely. Register Unicode-aware lower()/upper() on
every connection from an embedded case-mapping table generated from Node's
toLowerCase/toUpperCase (so the SQL functions match the client-side IVM
matcher). The result is self-contained on every platform (incl. Windows), with
no runtime ICU, no -fPIC/soname/QEMU problems, and ~120KB of generated data.

- deps/gen-unicode-case.mjs: generator -> src/util/unicode_case_data.h.
- src/util/unicode_case.cpp: UTF-8 case transform + lower()/upper(), registered
  via sqlite3_auto_extension in the addon init.
- Remove SQLITE_ENABLE_ICU, deps/icu.js, and the ICU gyp linking.
- Rename test/52.icu.js -> test/52.unicode-case.js; assert lower()/upper() match
  JavaScript across scripts and 1:many mappings (ß->SS, İ->i̇).

Scope: context-free case mapping only (no folding, no Greek final-sigma context).
Follow-ups: the now-dead CI ICU installs can be removed (pairs with the native
arm64 PR), and the zero_sqlite3 shell still uses ASCII lower()/upper().

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds the one context-sensitive rule that JS toLowerCase applies in the default
(locale-independent) algorithm: Σ lowercases to ς when preceded by a cased
letter (skipping case-ignorable chars) and not followed by one, else σ.

- Generator emits Cased / Case_Ignorable as code-point ranges (from the
  \p{Cased} / \p{Case_Ignorable} escapes, so they track V8's Unicode version).
- lower() tracks prevCased and, for Σ, scans forward for a following cased
  letter (LowerSigma). upper() is unchanged.
- Tests: add final-sigma cases and an exhaustive guard that lower()/upper()
  equal JS for every code point (verified: 0 mismatches across all 1,112,064).

With this there is no remaining context divergence from JS toLowerCase/
toUpperCase within a Unicode version.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
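
The final-sigma rule in the commit message above can be sketched in JavaScript using the same `\p{Cased}` / `\p{Case_Ignorable}` escapes the generator is said to use. This is an illustrative sketch, not the C++ in unicode_case.cpp: `hasCasedNeighbor` and `lowerWithFinalSigma` are hypothetical names, and it scans backward for the preceding cased letter instead of tracking a prevCased flag.

```javascript
const CASED = /^\p{Cased}$/u;
const CASE_IGNORABLE = /^\p{Case_Ignorable}$/u;
const CAPITAL_SIGMA = 0x03a3;

// Walk from `start` in direction `step` (-1 or +1). A cased letter is found if
// we reach one while skipping only case-ignorable characters. Cased is checked
// first because a character can be both cased and case-ignorable.
function hasCasedNeighbor(chars, start, step) {
  for (let i = start; i >= 0 && i < chars.length; i += step) {
    if (CASED.test(chars[i])) return true;
    if (!CASE_IGNORABLE.test(chars[i])) return false;
  }
  return false;
}

// Lowercase code point by code point (context-free), except for Σ, which
// becomes ς when preceded by a cased letter and not followed by one, else σ.
function lowerWithFinalSigma(text) {
  const chars = Array.from(text); // split into code points, not UTF-16 units
  let out = '';
  for (let i = 0; i < chars.length; i++) {
    if (chars[i].codePointAt(0) === CAPITAL_SIGMA) {
      const isFinal =
        hasCasedNeighbor(chars, i - 1, -1) && !hasCasedNeighbor(chars, i + 1, 1);
      out += isFinal ? 'ς' : 'σ';
    } else {
      out += chars[i].toLowerCase();
    }
  }
  return out;
}
```

For example, `lowerWithFinalSigma('ΟΔΟΣ')` yields `'οδος'`, while a lone `'Σ'` yields `'σ'` because nothing cased precedes it.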
The "matches JavaScript across every code point" test compares the generated
table against the *runtime* Node's toLowerCase/toUpperCase, which only holds
when the runtime's Unicode version matches the table's. On other Node versions
(e.g. Node 23 ships an older Unicode than the table's 17.0) a few newly-added
mappings differ, failing the sweep. Skip it unless versions match; the curated
cases use stable mappings and continue to run everywhere.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
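
A minimal sketch of the gated exhaustive sweep described above, under stated assumptions: `TABLE_UNICODE_VERSION`, `runtimeMatchesTable`, and `sweep` are hypothetical names, and reading the runtime's Unicode version from `process.versions.unicode` is an assumption about how the gate could be implemented, not a description of the actual test.

```javascript
// The commit message gives the table's Unicode version as 17.0.
const TABLE_UNICODE_VERSION = '17.0';

// Only run the sweep when the runtime's Unicode version equals the table's.
// A missing version (for example on a runtime that does not report one)
// counts as a mismatch, so the sweep is skipped.
function runtimeMatchesTable(runtimeVersion, tableVersion = TABLE_UNICODE_VERSION) {
  return typeof runtimeVersion === 'string' && runtimeVersion === tableVersion;
}

// Compare candidate lower()/upper() implementations against JavaScript for
// every Unicode scalar value (all code points except surrogates).
function sweep(lower, upper) {
  let checked = 0;
  const mismatches = [];
  for (let cp = 0; cp <= 0x10ffff; cp++) {
    if (cp >= 0xd800 && cp <= 0xdfff) continue;
    const s = String.fromCodePoint(cp);
    if (lower(s) !== s.toLowerCase() || upper(s) !== s.toUpperCase()) {
      mismatches.push(cp);
    }
    checked++;
  }
  return { checked, mismatches };
}

// Assumption: Node exposes its bundled Unicode version here.
const runtimeUnicode =
  typeof process !== 'undefined' ? process.versions.unicode : undefined;
const shouldRunSweep = runtimeMatchesTable(runtimeUnicode);
```

In a real test the `lower` and `upper` arguments would call the SQL functions through a database connection; here they are plain callbacks so the sketch stays self-contained.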
Make Node 24 LTS (Unicode 17) the explicit anchor for the case table, and
guarantee verification on a matching runtime:

- Add a test that regenerates the table and asserts it equals the committed
  header (drift guard), gated to a runtime whose Unicode matches the table.
- Document that the generator should be run with the active Node LTS.

On the Node 24 matrix cell, CI now both (a) exhaustively checks lower()/upper()
match JS for every code point and (b) confirms the committed table is current.
On other Node versions these skip; the stable curated cases still run.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The "committed case table is up to date" test regenerated the header and
byte-compared it, which fails across runtimes for reasons unrelated to the
data: the `// Source: Node X.Y.Z` provenance comment varies by generating Node
version, line endings differ on Windows, and Bun spawns a different binary. The
exhaustive "matches JavaScript across every code point" test already guards that
the table is correct/current on any runtime whose Unicode matches, so drop the
drift check.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@arv arv force-pushed the arv/unicode-case-no-icu branch from 584410f to 1b1c1a7 Compare June 3, 2026 15:23
- Utf8Decode: strict UTF-8 decoding — reject bad continuation bytes, overlong
  encodings, surrogates, and values > U+10FFFF; emit U+FFFD (consuming one byte)
  on any malformed sequence, so output is always well-formed.
- lower()/upper(): a NULL from sqlite3_value_text() after a non-NULL value is an
  allocation/conversion failure — surface it as OOM instead of a NULL result.
- unicode_case.cpp: include <sqlite3.h> directly (don't rely on the unity build).
- better_sqlite3.cpp: guard sqlite3_auto_extension with std::call_once so
  repeated NODE_MODULE_INIT (e.g. worker threads) registers it once.
- test: add a malformed-UTF-8 case.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
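
The strict decoding rules listed for Utf8Decode can be illustrated with a small JavaScript sketch. It is not the C++ implementation: `utf8DecodeAt` and `decodeAll` are hypothetical names, and the only behavior taken from the commit message is the set of rejected inputs and the "emit U+FFFD, consume one byte" recovery.

```javascript
const REPLACEMENT = 0xfffd;

// Decode one code point starting at bytes[i]. On any malformed sequence,
// return U+FFFD and consume exactly one byte.
function utf8DecodeAt(bytes, i) {
  const b0 = bytes[i];
  if (b0 < 0x80) return { codePoint: b0, length: 1 };

  let need; // number of continuation bytes expected
  let min;  // smallest code point allowed for this length (rejects overlongs)
  let cp;
  if ((b0 & 0xe0) === 0xc0) { need = 1; min = 0x80; cp = b0 & 0x1f; }
  else if ((b0 & 0xf0) === 0xe0) { need = 2; min = 0x800; cp = b0 & 0x0f; }
  else if ((b0 & 0xf8) === 0xf0) { need = 3; min = 0x10000; cp = b0 & 0x07; }
  else return { codePoint: REPLACEMENT, length: 1 }; // stray continuation or invalid lead

  if (i + need >= bytes.length) return { codePoint: REPLACEMENT, length: 1 }; // truncated
  for (let k = 1; k <= need; k++) {
    const b = bytes[i + k];
    if ((b & 0xc0) !== 0x80) return { codePoint: REPLACEMENT, length: 1 }; // bad continuation
    cp = (cp << 6) | (b & 0x3f);
  }
  const isSurrogate = cp >= 0xd800 && cp <= 0xdfff;
  if (cp < min || isSurrogate || cp > 0x10ffff) {
    return { codePoint: REPLACEMENT, length: 1 };
  }
  return { codePoint: cp, length: need + 1 };
}

// Decode a whole byte array into code points; the output is always well-formed.
function decodeAll(bytes) {
  const out = [];
  for (let i = 0; i < bytes.length; ) {
    const { codePoint, length } = utf8DecodeAt(bytes, i);
    out.push(codePoint);
    i += length;
  }
  return out;
}
```

Consuming a single byte on error means a malformed multi-byte sequence produces one U+FFFD per offending byte; for instance, the overlong pair `C0 80` decodes to two replacement characters in this sketch.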
@arv arv merged commit 7b9bfb5 into main Jun 3, 2026
25 checks passed
@arv arv deleted the arv/unicode-case-no-icu branch June 3, 2026 15:34

arv commented Jun 4, 2026

> context rules (Greek word-final sigma). JS toLowerCase does final-sigma; a word-final Σ can therefore differ. Zero's ILIKE parity test guards what we actually rely on.

I ended up adding support for final sigma. It was the only case where JS and SQLite were different, so I thought it worth fixing.
