Word counting with \w+ is inaccurate for non-English languages #2

@mdroidian

Description

The current word-counting approach, a regex like \w+, is not sufficient for non-English text. It miscounts in many scripts, e.g. CJK languages, other languages without clear word boundaries, and text that uses accented or combining characters.

This is not necessarily a request to change the current behavior, but rather a note for future reference in case users report inaccuracies.

Context / Rationale

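  • \w+ is heavily biased toward ASCII/Latin scripts. In JavaScript, \w matches only ASCII letters, digits, and underscore, so even accented Latin letters such as é split a word.
  • Even Unicode-aware regex approaches (e.g. \p{L}+ with the u flag) still break down for languages that do not separate words with spaces, such as Chinese, Japanese, or Thai.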
  • The closest modern standard approach in JavaScript is Intl.Segmenter, which performs locale-aware text segmentation.

Reference:
https://developer.mozilla.org/en-US/docs/Web/JavaScript/Reference/Global_Objects/Intl/Segmenter
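
To make the difference concrete, here is a minimal sketch comparing the two approaches. The helper names countWordsRegex and countWordsSegmenter and the sample strings are hypothetical, chosen only for illustration; they are not taken from this project's code. The sketch assumes a runtime that ships Intl.Segmenter with full ICU data.

```javascript
// Hypothetical helper: mirrors the \w+ approach described in this issue.
function countWordsRegex(text) {
  const matches = text.match(/\w+/g);
  return matches ? matches.length : 0;
}

// Hypothetical helper: counts word-like segments with Intl.Segmenter.
// `locale` is a BCP 47 tag such as "en" or "zh"; passing undefined
// falls back to the runtime's default locale.
function countWordsSegmenter(text, locale) {
  const segmenter = new Intl.Segmenter(locale, { granularity: "word" });
  let count = 0;
  for (const segment of segmenter.segment(text)) {
    // isWordLike is false for whitespace and punctuation segments.
    if (segment.isWordLike) count += 1;
  }
  return count;
}

console.log(countWordsRegex("naïve café"));           // 3 ("na", "ve", "caf")
console.log(countWordsSegmenter("naïve café", "fr")); // 2
console.log(countWordsRegex("你好世界"));              // 0, no ASCII word characters
console.log(countWordsSegmenter("你好世界", "zh"));    // non-zero; exact value depends on the runtime's dictionary
```

The count for the Chinese sample is left unspecified on purpose: segmentation there is dictionary-based and may differ between engines and ICU versions, which is one of the tradeoffs noted below.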

Notes / Caveats

  • Intl.Segmenter is better, but still not universally “correct” in all linguistic contexts.
  • Any word-counting solution will involve tradeoffs depending on language, locale, and definition of “word.”

Suggested Action

  • No immediate change required.
  • Keep this issue as documentation / justification if word-count accuracy for non-English text becomes a concern in the future.
