Add CJK (Chinese, Japanese, Korean) search support#2337
Add CJK (Chinese, Japanese, Korean) search support#2337Maklu wants to merge 5 commits intobasecamp:mainfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 13e2a7efc1
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
Fix search functionality for CJK languages by addressing three issues:
1. Query sanitization: Switch from ASCII-only \w to Unicode-aware \p{L}\p{N}
2. Stemmer: Add character-level tokenization for CJK (e.g., "日本語" → "日 本 語")
3. Highlighter: Add CJK-aware highlighting without word boundaries
For SQLite FTS5: Store original content in main table, stemmed content in
FTS table. Use Search::Highlighter instead of FTS5's highlight() function
to display original text in search results.
13e2a7e to
fd83c65
Compare
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: fd83c659ba
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
… CJK boundary matching - Stem highlighter terms so searches like "running" highlight "run" in results - Add stem_query to preserve quoted phrases in SQLite MATCH queries - Fix word boundary regex failing at CJK/Latin transitions (Ruby \b issue) - Use case-insensitive matching for CJK snippet positioning
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: aa20528f9e
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
…hting - Unquoted CJK terms like 中文 now become "中 文" (phrase) in MATCH queries to preserve character order - CJK highlight regex is now case-insensitive for mixed CJK/Latin terms
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
@jorgemanrubia This is a re-implementation of CJK search support, addressing the issues from the previous attempt (#2321). Would appreciate your review when you have a chance. Thanks! |
djmb
left a comment
There was a problem hiding this comment.
A few thoughts:
This won't work with MySQL because by default it requires tokens to be 3 characters or more. We should check that it doesn't break anything though.
We are now going to double stem data, once with mittens in Ruby and then a second time with the porter stemmer in the FTS5 SQLite implementation. This maybe works fine, but it seems a bit iffy, for example if the two stemming algorithms do not work the same way.
Ideally we should change the table to tokenize='unicode61' to avoid the SQLite tokenization.
I think we also will need to do a full reindex in a migration so the correctly stemmed data is stored.
Finally we should add some tests with CJK data to searches_controller_test.rb for a more complete testing. The setup is quite slow for those tests because they are not transactional so we might want a single test that tests a bunch of different queries.
|
Thanks for the detailed review @djmb, and the feedback @jorgemanrubia! Updated in 973ff66:
|
There was a problem hiding this comment.
Pull request overview
This PR re-implements CJK (Chinese/Japanese/Korean) search support by separating stemmed storage/indexing from user-facing display, and replacing SQLite FTS5 highlight()/snippet() usage with an application-level Search::Highlighter to avoid displaying stemmed text.
Changes:
- Add CJK-aware stemming/tokenization and query stemming to support CJK matching in SQLite FTS.
- Update highlighting/snippet generation to be CJK-aware and to operate on original (display) text rather than FTS-returned stemmed text.
- Switch SQLite FTS tokenizer to
unicode61and add/adjust tests covering CJK search/highlighting/snippet behavior.
Reviewed changes
Copilot reviewed 14 out of 14 changed files in this pull request and generated 2 comments.
Show a summary per file
| File | Description |
|---|---|
app/models/search.rb |
Introduces shared CJK_PATTERN constant for cross-component CJK detection. |
app/models/search/highlighter.rb |
Adds CJK-aware highlight/snippet logic and improves term handling (stemming-aware). |
app/models/search/query.rb |
Updates sanitization to preserve Unicode letters/numbers (enables CJK queries). |
app/models/search/stemmer.rb |
Implements character-level tokenization for CJK plus stem_query to produce FTS5-friendly queries. |
app/models/search/record/sqlite.rb |
Uses Search::Highlighter for display and stores stemmed content in the FTS table; stems queries for matching. |
app/models/search/record/trilogy.rb |
Refactors highlight helpers to highlight_full/highlight_snippet for consistency with SQLite path. |
db/migrate/20260304120000_change_search_fts_tokenizer_to_unicode61.rb |
Migrates SQLite FTS table to unicode61 and reindexes. |
db/schema_sqlite.rb |
Updates schema version and FTS virtual table definition to unicode61. |
db/cable_schema.rb |
Contains unrelated schema-dump changes not mentioned in the PR description. |
test/models/search/stemmer_test.rb |
Adds coverage for CJK stemming, mixed content, and stem_query behavior. |
test/models/search/highlighter_test.rb |
Adds CJK highlight/snippet tests and switches snippet tests to max_chars. |
test/models/card/searchable_test.rb |
Updates SQLite FTS expectation to reflect stemmed content storage. |
test/controllers/searches_controller_test.rb |
Adds integration coverage for CJK search results and highlight rendering. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
- Switch FTS5 tokenizer from porter to unicode61 to eliminate double-stemming (Ruby Mittens + SQLite porter). Added migration with full reindex. - Unify snippet into a single character-based method, removing the separate CJK/western branches and cjk_dominant? check. - Split highlight(text, show:) into highlight_snippet/highlight_full in both SQLite and Trilogy modules. - Add CJK integration test to searches_controller_test (SQLite only, MySQL fulltext requires 3+ character tokens). - Revert unrelated cable_schema.rb changes.
There was a problem hiding this comment.
Pull request overview
Copilot reviewed 13 out of 13 changed files in this pull request and generated 2 comments.
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
|
@Maklu Any updates ? I need this too. Thank you. |
Resolve conflict in SQLite matching scope: adopt the explicit INNER JOIN from main while keeping stemmed query from this branch.
|
@BANG88 Just rebased to resolve conflicts. Still waiting on maintainer review — hopefully it gets picked up soon. |
Bro. Thank you very much. |
* pr-2337-temp: Address review feedback: switch FTS5 to unicode61, unify snippet logic Wrap CJK tokens as phrases in FTS5 queries and fix mixed-term highlighting Address review feedback: fix stemmed highlighting, phrase search, and CJK boundary matching Add CJK (Chinese, Japanese, Korean) search support
Summary
This PR re-implements CJK search support, fixing the SQLite FTS5 display issue that caused the previous PR (#2299) to be reverted.
Problem
The original implementation had a bug where SQLite's
highlight()andsnippet()functions returned stemmed content instead of original text. For example:Solution
Key fix: Separate storage from display.
Search::Highlighterinstead of FTS5'shighlight()functionThis ensures search works correctly while displaying original text to users.
Changes
search.rbCJK_PATTERNconstanthighlighter.rbhighlight()andsnippet()withcjk_dominant?detectionquery.rb\p{L}\p{N}for Unicode character supportstemmer.rbrecord/sqlite.rbSearch::Highlighterinstead of FTS5 functionsTest plan