Compose canonical decompositions (NFC) in the default shaper #380

Open
lehni wants to merge 1 commit into foliojs:master from lineto:feature/unicode-normalization

@lehni lehni commented Jun 15, 2026

Problem

HarfBuzz normalizes each base + combining-mark cluster before GSUB/GPOS: the marks are reordered by canonical combining class and composed onto the base when the font has a precomposed glyph. The default shaper has no such step, so decomposed input shapes as separate glyphs instead of the precomposed glyph that HarfBuzz and browsers produce. For example, "i" + U+0300 → [i, gravecomb] instead of [igrave], and Arabic alef + fathatan + hamza-above → three glyphs instead of [alef-with-hamza, fathatan] (the hamza composes onto the alef across the lower-class fathatan).
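The two examples can be checked at the string level with the built-in `String.prototype.normalize`. This is only a sketch of what canonical composition does to the code points, independent of any font; `toCodePoints` is a hypothetical helper introduced here for readability.

```javascript
// Hypothetical helper: format each code point of a string as "U+XXXX".
const toCodePoints = (s) =>
  Array.from(s, (ch) =>
    'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'));

// "i" + COMBINING GRAVE ACCENT
const latin = 'i\u0300';
// ALEF + FATHATAN (combining class 27) + HAMZA ABOVE (combining class 230)
const arabic = '\u0627\u064B\u0654';

// NFC composes i + grave into the precomposed U+00EC.
const latinNFC = toCodePoints(latin.normalize('NFC'));

// NFC composes the hamza onto the alef across the fathatan,
// because the fathatan has a lower combining class and does not block it.
const arabicNFC = toCodePoints(arabic.normalize('NFC'));

console.log(latinNFC);  // [ 'U+00EC' ]
console.log(arabicNFC); // [ 'U+0623', 'U+064B' ]
```

Whether a shaper should use the composed form still depends on the font having a glyph for U+00EC or U+0623, which plain `normalize('NFC')` knows nothing about.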

Fix

Apply font-aware NFC per cluster at the start of the default shaper's feature assignment, before GSUB:

  • Reorder + compose via normalize('NFC'), but only when the font has a glyph for the precomposed codepoint; otherwise leave the marks decomposed for GPOS mark positioning.
  • Only apply the result when it changes the glyph count (composition or its decompose fallback). Pure canonical reordering is left alone so it can't disturb the order downstream GSUB expects (e.g. Arabic shadda + vowel calt).

Scoped to the default shaper and the Arabic/Hebrew/Thai shapers that inherit its assignFeatures; the Indic/Hangul/Universal shapers keep their own composition.
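The two rules above can be sketched for a single cluster as follows. This is an illustration under stated assumptions, not the code in this PR: `composeCluster` and `codePointsOf` are hypothetical names, and `hasGlyph` is an assumed callback standing in for whatever glyph lookup the font object provides.

```javascript
// Hypothetical helper: string -> array of numeric code points.
const codePointsOf = (s) => Array.from(s, (ch) => ch.codePointAt(0));

/**
 * Font-aware NFC for one base + combining-mark cluster (sketch).
 * `cluster` is an array of code points; `hasGlyph(cp)` is an assumed
 * callback returning true when the font maps `cp` to a glyph.
 */
function composeCluster(cluster, hasGlyph) {
  // Canonically reorder and compose the whole cluster.
  const composed = codePointsOf(
    String.fromCodePoint(...cluster).normalize('NFC'));

  // Decompose back any code point the font has no glyph for,
  // so its marks stay available for GPOS mark positioning.
  const result = [];
  for (const cp of composed) {
    if (hasGlyph(cp)) {
      result.push(cp);
    } else {
      result.push(...codePointsOf(String.fromCodePoint(cp).normalize('NFD')));
    }
  }

  // Keep the original order unless the code point count changed:
  // pure canonical reordering is deliberately not applied.
  return result.length !== cluster.length ? result : cluster;
}

const anyGlyph = () => true;
const noIgrave = (cp) => cp !== 0x00ec;

// Composes when the font has the precomposed glyph.
const composedLatin = composeCluster([0x0069, 0x0300], anyGlyph);
// Stays decomposed when the font lacks U+00EC.
const fallbackLatin = composeCluster([0x0069, 0x0300], noIgrave);
// Hamza composes onto the alef across the fathatan.
const composedArabic = composeCluster([0x0627, 0x064b, 0x0654], anyGlyph);
// Beh + shadda + fatha: NFC would only swap the marks, so input order is kept.
const untouchedArabic = composeCluster([0x0628, 0x0651, 0x064e], anyGlyph);
```

Comparing lengths is a cheap proxy for "something composed or decomposed". It is what keeps the shadda + vowel sequence in its input order, since NFC on that cluster only reorders the marks.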

Tests

Adds 6 shaping tests (FiraSans Latin + Amiri Arabic), including a regression test that a non-composing mark cluster is left unreordered.

- Compose each base + combining-mark cluster into the font's precomposed glyph via font-aware NFC before GSUB, matching HarfBuzz: decomposed input (i + U+0300, or Arabic alef + fathatan + hamza-above) shapes to the precomposed glyph instead of separate marks
- Only apply the result when it changes the glyph count (composition or its decompose fallback); leave pure canonical reordering alone so it can't disturb downstream GSUB (e.g. Arabic shadda + vowel calt)
- Decompose back any composed codepoint the font has no glyph for, so its marks stay available for GPOS mark positioning
- Scope to the default shaper and the Arabic/Hebrew/Thai shapers that inherit it; Indic/Hangul/Universal keep their own composition
- Add 6 canonical-composition shaping tests (FiraSans Latin, Amiri Arabic incl. the reorder regression)