Skip to content

Add CJK (Chinese, Japanese, Korean) search support#2337

Open
Maklu wants to merge 5 commits intobasecamp:mainfrom
Maklu:fix-cjk-search
Open

Add CJK (Chinese, Japanese, Korean) search support#2337
Maklu wants to merge 5 commits intobasecamp:mainfrom
Maklu:fix-cjk-search

Conversation

@Maklu
Copy link
Copy Markdown
Contributor

@Maklu Maklu commented Jan 10, 2026

Summary

This PR re-implements CJK search support, fixing the SQLite FTS5 display issue that caused the previous PR (#2299) to be reverted.

Problem

The original implementation had a bug where SQLite's highlight() and snippet() functions returned stemmed content instead of original text. For example:

  • "Card to delete" displayed as "card to delet"
  • "日本語" displayed as "日 本 語"

Solution

Key fix: Separate storage from display.

  1. FTS table: Stores stemmed content for search matching
  2. Main table: Stores original content for display
  3. Highlighting: Use Search::Highlighter instead of FTS5's highlight() function

This ensures search works correctly while displaying original text to users.

Changes

File Change
search.rb Add shared CJK_PATTERN constant
highlighter.rb Add CJK-aware highlight() and snippet() with cjk_dominant? detection
query.rb Use \p{L}\p{N} for Unicode character support
stemmer.rb Add character-level tokenization for CJK
record/sqlite.rb Use Search::Highlighter instead of FTS5 functions

Test plan

  • All 919 tests pass
  • Added tests for Chinese, Japanese, and Korean text
  • Added tests for mixed CJK/English content
  • Added tests for CJK snippet truncation
  • Verified search results display original text (not stemmed)

Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: 13e2a7efc1

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/models/search/record/sqlite.rb Outdated
@Maklu Maklu marked this pull request as draft February 23, 2026 11:41
Fix search functionality for CJK languages by addressing three issues:

1. Query sanitization: Switch from ASCII-only \w to Unicode-aware \p{L}\p{N}
2. Stemmer: Add character-level tokenization for CJK (e.g., "日本語" → "日 本 語")
3. Highlighter: Add CJK-aware highlighting without word boundaries

For SQLite FTS5: Store original content in main table, stemmed content in
FTS table. Use Search::Highlighter instead of FTS5's highlight() function
to display original text in search results.
@Maklu Maklu marked this pull request as ready for review February 23, 2026 11:46
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: fd83c659ba

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/models/search/record/sqlite.rb Outdated
Comment thread app/models/search/highlighter.rb Outdated
… CJK boundary matching

- Stem highlighter terms so searches like "running" highlight "run" in results
- Add stem_query to preserve quoted phrases in SQLite MATCH queries
- Fix word boundary regex failing at CJK/Latin transitions (Ruby \b issue)
- Use case-insensitive matching for CJK snippet positioning
Copy link
Copy Markdown

@chatgpt-codex-connector chatgpt-codex-connector Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

💡 Codex Review

Here are some automated review suggestions for this pull request.

Reviewed commit: aa20528f9e

ℹ️ About Codex in GitHub

Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you

  • Open a pull request for review
  • Mark a draft as ready
  • Comment "@codex review".

If Codex has suggestions, it will comment; otherwise it will react with 👍.

Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".

Comment thread app/models/search/stemmer.rb Outdated
Comment thread app/models/search/highlighter.rb Outdated
…hting

- Unquoted CJK terms like 中文 now become "中 文" (phrase) in MATCH queries
  to preserve character order
- CJK highlight regex is now case-insensitive for mixed CJK/Latin terms
@chatgpt-codex-connector
Copy link
Copy Markdown

You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard.

@Maklu
Copy link
Copy Markdown
Contributor Author

Maklu commented Feb 23, 2026

@jorgemanrubia This is a re-implementation of CJK search support, addressing the issues from the previous attempt (#2321). Would appreciate your review when you have a chance. Thanks!

Copy link
Copy Markdown
Member

@jorgemanrubia jorgemanrubia left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great work here @Maklu.

This looks better to me and works well in my tests 👍. I would love to get @djmb input here too.

Comment thread app/models/search/record/sqlite.rb Outdated
Comment thread app/models/search/highlighter.rb Outdated
Comment thread app/models/search/highlighter.rb Outdated
Copy link
Copy Markdown
Contributor

@djmb djmb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few thoughts:

This won't work with MySQL because by default it requires tokens to be 3 characters or more. We should check that it doesn't break anything though.

We are now going to double stem data, once with mittens in Ruby and then a second time with the porter stemmer in the FTS5 SQLite implementation. This maybe works fine, but it seems a bit iffy, for example if the two stemming algorithms do not work the same way.

Ideally we should change the table to tokenize='unicode61' to avoid the SQLite tokenization.

I think we also will need to do a full reindex in a migration so the correctly stemmed data is stored.

Finally we should add some tests with CJK data to searches_controller_test.rb for a more complete testing. The setup is quite slow for those tests because they are not transactional so we might want a single test that tests a bunch of different queries.

Comment thread app/models/search/highlighter.rb Outdated
Copilot AI review requested due to automatic review settings March 4, 2026 04:16
@Maklu
Copy link
Copy Markdown
Contributor Author

Maklu commented Mar 4, 2026

Thanks for the detailed review @djmb, and the feedback @jorgemanrubia!

Updated in 973ff66:

  • FTS5 tokenize='unicode61' — this avoids the double-stemming (Mittens + porter). Added a migration that recreates the table and reindexes.
  • Unified snippet — single character-based method, no more cjk_dominant? branching. Also resolves the x3 and bytesize questions since both are gone now.
  • highlight_snippet / highlight_full — applied to both SQLite and Trilogy modules.
  • CJK controller test — single test covering Chinese, Japanese, Korean, and mixed queries.
  • MySQL — CJK single-char tokens fall below innodb_ft_min_token_size, so CJK search on MySQL would need a different approach (ngram tokenizer etc.), but that's a separate concern. Existing non-CJK behavior is unaffected.

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR re-implements CJK (Chinese/Japanese/Korean) search support by separating stemmed storage/indexing from user-facing display, and replacing SQLite FTS5 highlight()/snippet() usage with an application-level Search::Highlighter to avoid displaying stemmed text.

Changes:

  • Add CJK-aware stemming/tokenization and query stemming to support CJK matching in SQLite FTS.
  • Update highlighting/snippet generation to be CJK-aware and to operate on original (display) text rather than FTS-returned stemmed text.
  • Switch SQLite FTS tokenizer to unicode61 and add/adjust tests covering CJK search/highlighting/snippet behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.

Show a summary per file
File Description
app/models/search.rb Introduces shared CJK_PATTERN constant for cross-component CJK detection.
app/models/search/highlighter.rb Adds CJK-aware highlight/snippet logic and improves term handling (stemming-aware).
app/models/search/query.rb Updates sanitization to preserve Unicode letters/numbers (enables CJK queries).
app/models/search/stemmer.rb Implements character-level tokenization for CJK plus stem_query to produce FTS5-friendly queries.
app/models/search/record/sqlite.rb Uses Search::Highlighter for display and stores stemmed content in the FTS table; stems queries for matching.
app/models/search/record/trilogy.rb Refactors highlight helpers to highlight_full/highlight_snippet for consistency with SQLite path.
db/migrate/20260304120000_change_search_fts_tokenizer_to_unicode61.rb Migrates SQLite FTS table to unicode61 and reindexes.
db/schema_sqlite.rb Updates schema version and FTS virtual table definition to unicode61.
db/cable_schema.rb Contains unrelated schema-dump changes not mentioned in the PR description.
test/models/search/stemmer_test.rb Adds coverage for CJK stemming, mixed content, and stem_query behavior.
test/models/search/highlighter_test.rb Adds CJK highlight/snippet tests and switches snippet tests to max_chars.
test/models/card/searchable_test.rb Updates SQLite FTS expectation to reflect stemmed content storage.
test/controllers/searches_controller_test.rb Adds integration coverage for CJK search results and highlight rendering.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread db/cable_schema.rb Outdated
Copilot AI review requested due to automatic review settings March 4, 2026 04:29
@Maklu Maklu requested a review from djmb March 4, 2026 04:33
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread app/models/search/highlighter.rb
Comment thread app/models/search/stemmer.rb
- Switch FTS5 tokenizer from porter to unicode61 to eliminate
  double-stemming (Ruby Mittens + SQLite porter). Added migration
  with full reindex.
- Unify snippet into a single character-based method, removing the
  separate CJK/western branches and cjk_dominant? check.
- Split highlight(text, show:) into highlight_snippet/highlight_full
  in both SQLite and Trilogy modules.
- Add CJK integration test to searches_controller_test (SQLite only,
  MySQL fulltext requires 3+ character tokens).
- Revert unrelated cable_schema.rb changes.
Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.


💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread app/models/search/record/sqlite.rb
Comment thread app/models/search/highlighter.rb
@BANG88
Copy link
Copy Markdown

BANG88 commented Mar 26, 2026

@Maklu Any updates ? I need this too. Thank you.

Resolve conflict in SQLite matching scope: adopt the explicit
INNER JOIN from main while keeping stemmed query from this branch.
@Maklu
Copy link
Copy Markdown
Contributor Author

Maklu commented Mar 27, 2026

@BANG88 Just rebased to resolve conflicts. Still waiting on maintainer review — hopefully it gets picked up soon.

@BANG88
Copy link
Copy Markdown

BANG88 commented Mar 27, 2026

@BANG88 Just rebased to resolve conflicts. Still waiting on maintainer review — hopefully it gets picked up soon.

Bro. Thank you very much.

BANG88 added a commit to PuffinStudio/fizzy that referenced this pull request Mar 29, 2026
* pr-2337-temp:
  Address review feedback: switch FTS5 to unicode61, unify snippet logic
  Wrap CJK tokens as phrases in FTS5 queries and fix mixed-term highlighting
  Address review feedback: fix stemmed highlighting, phrase search, and CJK boundary matching
  Add CJK (Chinese, Japanese, Korean) search support
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

5 participants