feat(extras): add history + culture domains from corpus mining #246

Merged: send merged 1 commit into main from feat/extras-corpus-history on May 15, 2026
@send send commented May 15, 2026

Summary

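  • Mined the full Wikipedia dump (4.6GB, 2.4M articles) with dictool candidates corpus and, after classifying the candidates by category, added 2 new domains (history: 17 entries + culture: 18 entries = 35 entries)
  • Candidates that collide with Mozc's top-1 get a cost of 7000-8000 so that the common word stays top-1 (寛保→漢方 / 曹操→早々 / 阿久悠→悪友)
  • Candidates that the existing Mozc lattice already converts correctly (徳川家康 / 三島由紀夫 / 平家物語 / 帝国議会 / 連合国軍最高司令官総司令部, etc.) are dropped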
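
The cost-conflict handling above can be illustrated with a minimal sketch. It assumes that a lower cost ranks higher and looks only at per-word cost, ignoring the connection costs a real lattice search would add. The `Candidate` type, the `rank` function, and the cost of 5000 for 漢方 are hypothetical placeholders, not taken from the repository; 7500 is just one value inside the 7000-8000 range named in the summary.

```rust
/// A conversion candidate for a single reading (simplified: word cost only).
#[derive(Debug, Clone, PartialEq, Eq)]
struct Candidate {
    surface: &'static str,
    cost: u32,
}

/// Returns the candidates ordered best-first, assuming lower cost ranks higher.
fn rank(mut candidates: Vec<Candidate>) -> Vec<Candidate> {
    // `sort_by_key` is stable, so equal costs keep their insertion order.
    candidates.sort_by_key(|c| c.cost);
    candidates
}

/// Hypothetical candidates for the shared reading "かんぽう".
fn kanpou_candidates() -> Vec<Candidate> {
    vec![
        // New extras entry with a bumped cost (illustrative value in the 7000-8000 range).
        Candidate { surface: "寛保", cost: 7500 },
        // Common word; this cost is an assumed placeholder, not a real Mozc value.
        Candidate { surface: "漢方", cost: 5000 },
    ]
}
```

Under these assumptions, `rank(kanpou_candidates())` puts 漢方 first and 寛保 second, so the era name remains selectable without displacing the common word.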

Test plan

  • cargo fmt --all --check
  • cargo clippy --workspace --all-features -- -D warnings
  • cargo test --workspace --all-features
  • mise run dict (Merged 'extras': +31 readings, +44 entries, 0 replaced)
  • mise run accuracy: 93/93 pass (1 pre-existing skip)
  • mise run accuracy-history: 6/6 pass
  • lextool explain: confirmed the expected top-1 / cost-conflict result for all 35 entries

🤖 Generated with Claude Code

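Mined the full Wikipedia dump (4.6GB, 2.4M articles) with dictool candidates
corpus, classified the candidates by category, and added 2 new domains:

- history.tsv (17 entries): era names / court ranks / classical book titles / historical figures
- culture.tsv (18 entries): publishers / musicians / work titles

The 3 entries that collide with Mozc's top-1 (寛保/曹操/阿久悠) get a cost of
7000-8000 so that the common words (漢方/早々/悪友) stay top-1. Candidates that
the existing Mozc lattice already converts correctly (徳川家康/三島由紀夫/平家物語,
etc.) are dropped.

Added 13 cases under the extras category to accuracy-corpus.toml (including
the 3 cost-conflict cases).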

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 15, 2026 08:52

Copilot AI left a comment

Pull request overview

Adds two new curated "extras" dictionary domains—history (eras, court ranks, classical works, historical figures) and culture (publishers, music artists, works)—sourced from Wikipedia corpus mining via dictool candidates corpus. Cost overrides (7000–8000) are used for entries that would otherwise displace common-word top-1 results (寛保 vs 漢方, 曹操 vs 早々, 阿久悠 vs 悪友). Accuracy corpus test cases and a unit test in extras.rs verify both the new entries and the cost-conflict coexistence.

Changes:

  • Add extras/history.tsv (17 entries) and extras/culture.tsv (18 entries), registered in the DOMAINS list in extras.rs.
  • Extend the dict_source::extras::tests smoke test to assert presence of representative history/culture entries and round-tripping of the 寛保 cost override.
  • Add 14 accuracy-corpus cases under extras (history/culture/cost-conflict tags) for regression coverage.
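
As a rough illustration of what round-tripping a cost override could involve, here is a minimal sketch of parsing one TSV row. The column layout (`reading<TAB>surface[<TAB>cost]`), the `ExtraEntry` type, and the `parse_line` function are assumptions made for this example; the actual format of history.tsv and the parser in extras.rs are not shown in this PR and may differ.

```rust
/// One parsed row of a hypothetical extras TSV: reading, surface, optional cost.
#[derive(Debug, PartialEq, Eq)]
struct ExtraEntry {
    reading: String,
    surface: String,
    cost: Option<u32>,
}

/// Parses `reading<TAB>surface[<TAB>cost]` (assumed layout).
/// Returns `None` for blank lines, `#` comments, rows with fewer than
/// two columns, and rows whose cost column is not a valid integer.
fn parse_line(line: &str) -> Option<ExtraEntry> {
    let line = line.trim_end();
    if line.is_empty() || line.starts_with('#') {
        return None;
    }
    let mut cols = line.split('\t');
    let reading = cols.next()?.to_string();
    let surface = cols.next()?.to_string();
    // A missing third column means "no override"; an unparsable one rejects the row.
    let cost = match cols.next() {
        Some(raw) => Some(raw.parse::<u32>().ok()?),
        None => None,
    };
    Some(ExtraEntry { reading, surface, cost })
}
```

With this sketch, a row such as `"かんぽう\t寛保\t7500"` parses to an entry whose `cost` is `Some(7500)`, while a two-column row yields `cost: None`, leaving the default cost to whatever the dictionary builder assigns.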

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated no comments.

| File | Description |
| --- | --- |
| engine/crates/lex-cli/src/dict_source/extras.rs | Register the two new TSVs in DOMAINS and assert representative entries (+ cost override) in tests. |
| engine/crates/lex-cli/src/dict_source/extras/history.tsv | New curated history vocabulary with selective cost bumps for homophone conflicts. |
| engine/crates/lex-cli/src/dict_source/extras/culture.tsv | New curated culture/media vocabulary with cost bump for 阿久悠/悪友 conflict. |
| engine/testcorpus/accuracy-corpus.toml | Add history/culture accuracy cases including cost-conflict coexistence checks. |

Copilot AI left a comment

Pull request overview

Copilot reviewed 4 out of 4 changed files in this pull request and generated no new comments.

@send send merged commit ecc2dda into main May 15, 2026
18 checks passed
@send send deleted the feat/extras-corpus-history branch May 15, 2026 18:26