feat(extras): add history + culture domains from corpus mining #246
Merged
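Mined the full Wikipedia dump (4.6 GB, 2.4M articles) with `dictool candidates corpus`, classified the candidates by category, and added two new domains:

- history.tsv (17 entries): era names / court ranks / classical book titles / historical figures
- culture.tsv (18 entries): publishers / musicians / work titles

The three entries that collide with Mozc's top-1 result (寛保 / 曹操 / 阿久悠) are given a cost of 7000-8000 so that the everyday words (漢方 / 早々 / 悪友) stay at top-1. Candidates that the existing Mozc lattice already converts correctly (徳川家康 / 三島由紀夫 / 平家物語, etc.) were dropped.

Added 13 cases under the extras category in accuracy-corpus.toml (including the 3 cost-conflict cases).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>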
Pull request overview
Adds two new curated "extras" dictionary domains—history (eras, court ranks, classical works, historical figures) and culture (publishers, music artists, works)—sourced from Wikipedia corpus mining via dictool candidates corpus. Cost overrides (7000–8000) are used for entries that would otherwise displace common-word top-1 results (寛保 vs 漢方, 曹操 vs 早々, 阿久悠 vs 悪友). Accuracy corpus test cases and a unit test in extras.rs verify both the new entries and the cost-conflict coexistence.
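The cost-override idea can be illustrated with a minimal sketch. It assumes that, among candidates sharing one reading, the candidate with the lowest word cost is offered first; a real Mozc-style lattice also adds connection costs, which are ignored here. The `Candidate` type, the `rank` function, and the concrete numbers are hypothetical: only the 7000-8000 range for 寛保 comes from this PR, and the cost given to 漢方 is a placeholder.

```rust
// Hypothetical candidate type; the real lattice structures are not shown in this PR.
#[derive(Debug, Clone, PartialEq)]
struct Candidate {
    surface: &'static str,
    cost: u32,
}

/// Order candidates by ascending cost, so index 0 is the top-1 suggestion.
/// Assumption: lower cost means more preferred, connection costs ignored.
fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    candidates.sort_by_key(|c| c.cost); // stable sort keeps ties in input order
    candidates
}

/// Both surfaces share the reading かんぽう.
fn candidates_for_kanpou() -> Vec<Candidate> {
    vec![
        // Cost inside the 7000-8000 range described in the PR.
        Candidate { surface: "寛保", cost: 7500 },
        // Placeholder cost for the common word; the actual Mozc value is not given.
        Candidate { surface: "漢方", cost: 5000 },
    ]
}

fn main() {
    for (i, c) in rank(candidates_for_kanpou()).iter().enumerate() {
        println!("{}: {} (cost {})", i + 1, c.surface, c.cost);
    }
}
```

Under these assumptions the era name stays reachable as a lower-ranked candidate while the common word remains the first suggestion, which is the coexistence the cost-conflict test cases check.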
Changes:
- Add `extras/history.tsv` (17 entries) and `extras/culture.tsv` (18 entries), registered in the `DOMAINS` list in `extras.rs`.
- Extend the `dict_source::extras::tests` smoke test to assert presence of representative history/culture entries and round-tripping of the 寛保 cost override.
- Add 14 accuracy-corpus cases under `extras` (history/culture/cost-conflict tags) for regression coverage.
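A smoke test of the kind described above might look roughly like the following sketch. The TSV layout (`reading<TAB>surface[<TAB>cost]`), the `Entry` type, and the `parse_line` / `parse_domain` helpers are assumptions for illustration, not the actual `extras.rs` code, and `SAMPLE` is invented data rather than the contents of `history.tsv`.

```rust
// Hypothetical entry shape; the real TSV columns are not shown in this PR.
#[derive(Debug, PartialEq)]
struct Entry {
    reading: String,
    surface: String,
    cost: Option<u32>, // None means "use the default cost"
}

/// Parse one TSV line, skipping blanks, comments, and malformed rows.
fn parse_line(line: &str) -> Option<Entry> {
    let line = line.trim_end();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let mut cols = line.split('\t');
    let reading = cols.next()?.to_string();
    let surface = cols.next()?.to_string();
    // The third column is optional; a non-numeric value rejects the row.
    let cost = match cols.next() {
        Some(raw) => Some(raw.parse::<u32>().ok()?),
        None => None,
    };
    Some(Entry { reading, surface, cost })
}

/// Parse a whole domain file into entries.
fn parse_domain(tsv: &str) -> Vec<Entry> {
    tsv.lines().filter_map(parse_line).collect()
}

// Invented sample data; the 7500 costs are placeholders in the 7000-8000 range.
const SAMPLE: &str = "# hypothetical sample\nかんぽう\t寛保\t7500\nそうそう\t曹操\t7500\nへいあん\t平安\n";

fn main() {
    for e in parse_domain(SAMPLE) {
        println!("{} -> {} (cost override: {:?})", e.reading, e.surface, e.cost);
    }
}
```

Keeping the cost column optional lets most curated entries fall back to a default, so an explicit override only appears on the few rows with homophone conflicts.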
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.
| File | Description |
|---|---|
| engine/crates/lex-cli/src/dict_source/extras.rs | Register the two new TSVs in DOMAINS and assert representative entries (+ cost override) in tests. |
| engine/crates/lex-cli/src/dict_source/extras/history.tsv | New curated history vocabulary with selective cost bumps for homophone conflicts. |
| engine/crates/lex-cli/src/dict_source/extras/culture.tsv | New curated culture/media vocabulary with cost bump for 阿久悠/悪友 conflict. |
| engine/testcorpus/accuracy-corpus.toml | Add history/culture accuracy cases including cost-conflict coexistence checks. |
Summary
Mined with `dictool candidates corpus` and classified by category to add 2 new domains (history 17 entries + culture 18 entries = 35 entries)
Test plan
🤖 Generated with Claude Code