feat(candidates): add corpus mining (Wikipedia dump scan) by send · Pull Request #244 · send/lexime

send · 2026-05-09T18:28:46Z

Summary

Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:

streams the bz2 directly (no full decompress to disk)
filters to article namespace (`0`)
skips wikitext templates `{{...}}` and `<ref>...</ref>` blocks
extracts maximal kanji runs (`[一-龥々]+`, length 2-20)
frequency-counts and diffs against the build dict's surface set
outputs `wikipedia.tsv` (gitignored) with `surface\tfreq`
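The kanji-run extraction step can be sketched roughly as follows. This is a minimal illustration, not the PR's `scan_kanji_runs`: the names `is_kanji`, `count_kanji_runs`, `MIN_RUN_CHARS` and `MAX_RUN_CHARS` are hypothetical, and only the character class `[一-龥々]` and the 2-20 length bounds come from the list above.

```rust
use std::collections::HashMap;

/// Hypothetical bounds mirroring the "length 2-20" filter from the summary.
const MIN_RUN_CHARS: usize = 2;
const MAX_RUN_CHARS: usize = 20;

/// True for characters in the class `[一-龥々]` quoted in the summary.
fn is_kanji(c: char) -> bool {
    ('一'..='龥').contains(&c) || c == '々'
}

/// Counts every maximal kanji run in `text` whose length is within bounds.
fn count_kanji_runs(text: &str, freqs: &mut HashMap<String, u64>) {
    let mut run = String::new();
    let mut run_len = 0usize;
    // Chain a non-kanji sentinel so the final run is flushed too.
    for c in text.chars().chain(std::iter::once(' ')) {
        if is_kanji(c) {
            run.push(c);
            run_len += 1;
        } else {
            if (MIN_RUN_CHARS..=MAX_RUN_CHARS).contains(&run_len) {
                *freqs.entry(run.clone()).or_insert(0) += 1;
            }
            run.clear();
            run_len = 0;
        }
    }
}
```

Because a run is only flushed when a non-kanji character ends it, each counted surface is maximal: `徳川家康` is counted as one surface, never as `徳川` plus `家康`.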

Reading-assignment intentionally deferred: the user picks top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. Mirrors the existing `mine`-then-promote workflow.

Why this approach (vs Sudachi / Wikidata, both tried in PR #243 / Wikidata experiment)

Per `feedback_extras_promotion.md`:

Sudachi naive scan: 1.9M candidates → 2 promoted (yield ~10⁻⁶) — frozen vocab, no frequency signal
Wikidata SPARQL: 5000 → 5 promoted (all mythology/religion); P1814 only covers proper nouns
Corpus mining (this PR): real-text frequency signal, catches modern vocab, surface-first

User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.

Pilot result

`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):

32s end-to-end
304K gap surfaces (freq >= 5) after template/ref/ns filter
Top entries dominated by lattice-composable compounds (Mozc has 徳川/家康 separately and composes via Viterbi). `lextool explain` confirms 徳川家康 / 室町時代 / 令和元年 are all top-1 already.
Real misses surface in the mix: e.g. `宇宙戦艦` (Mozc top-1 returns 宇宙船感). Per-candidate verification via `lextool explain` is the next step before promotion.

No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.

Test plan

`cargo fmt --all --check` / `cargo clippy --workspace --all-features -- -D warnings` / `cargo test --workspace --all-features` all green
12 new unit tests for `scan_kanji_runs` (length filters, iteration mark, multi-occurrence) and `scan_prose_kanji_runs` (template skip, nested templates, `<ref>` skip, self-closing ref, cross-slice state)
Smoke test on synthetic XML (3 pages, 13 surfaces, all hit dict)
Pilot run on jawiki-articles1.bz2: 80K articles in 32s, 304K freq>=5 gap surfaces
Spot-checked `徳川家康`/`室町時代`/`令和元年`/`宇宙戦艦` against `lextool explain` — confirms tool finds genuine miss (`宇宙戦艦`) along with lattice-composable noise
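The template-skip cases in the test plan (nested templates, cross-slice state) rely on a depth counter that survives between input slices. A minimal sketch of that idea, assuming a hypothetical `SkipState` struct and `strip_templates` helper rather than the PR's actual `scan_prose_kanji_runs`:

```rust
/// Hypothetical skip state carried across input slices (for example, lines).
#[derive(Default)]
struct SkipState {
    /// Current `{{ ... }}` nesting depth; 0 means we are in prose.
    tmpl_depth: usize,
}

/// Appends the prose portions of `s` (everything outside templates) to `out`.
fn strip_templates(s: &str, state: &mut SkipState, out: &mut String) {
    let bytes = s.as_bytes();
    let mut i = 0;
    let mut prose_start = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"{{") {
            // Entering a top-level template ends the current prose span.
            if state.tmpl_depth == 0 {
                out.push_str(&s[prose_start..i]);
            }
            state.tmpl_depth += 1;
            i += 2;
        } else if state.tmpl_depth > 0 && bytes[i..].starts_with(b"}}") {
            state.tmpl_depth -= 1;
            i += 2;
            // Leaving the outermost template starts a new prose span.
            if state.tmpl_depth == 0 {
                prose_start = i;
            }
        } else {
            i += 1;
        }
    }
    // Trailing prose is emitted only if no template is still open.
    if state.tmpl_depth == 0 {
        out.push_str(&s[prose_start..]);
    }
}
```

Feeding `東京{{Infobox|名前={{lang|ja|大阪}}` and then `|場所=京都}}名古屋` through the same `SkipState` leaves only `東京名古屋` in `out`, since the template opened in the first slice is still open when the second slice starts.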

Follow-ups (not this PR)

Reading-assignment helper (Sudachi lookup or manual) for the top-N gap surfaces
Verification step that runs `lextool explain` per candidate to filter lattice-composable rows
Document workflow in feedback memory once a real promotion lands

🤖 Generated with Claude Code

Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks, filters to article namespace, extracts maximal kanji runs, and diffs against the build dict's surface set. Outputs `wikipedia.tsv` with `surface\tfreq` rows.

Reading-assignment is intentionally deferred: the user picks top-N gap surfaces and looks up readings before promoting to `extras/<domain>.tsv`, mirroring the existing `mine`-then-promote-by-hand workflow.

Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-composable (徳川家康, 室町時代, 令和元年; Mozc handles these via segment composition) but real misses surface in the mix (e.g. 宇宙戦艦 → Mozc top-1 returns 宇宙船感). Per-candidate verification via `lextool explain` is still required before promotion.

deps: bzip2 0.4 (lex-cli only; same dev-tool scope as the existing zip dep used by `candidates mine`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Adds a new “surface-first” candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary’s surface set, and writes the remaining gaps to a TSV for manual curation into extras/.

Changes:

Added dictool candidates corpus <dump> subcommand to scan .xml / .xml.bz2 Wikipedia dumps and emit wikipedia.tsv (surface + frequency).
Implemented streaming Wikipedia dump scanner with template ({{...}}) and <ref>...</ref> skipping plus kanji-run extraction + unit tests.
Added bzip2 dependency to support streaming decompression of .bz2 dumps.
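The diff-and-emit step described in this overview could look roughly like the sketch below. `gap_rows` is a hypothetical helper, not the PR's `corpus()` function, and the descending-frequency ordering is an assumption made for illustration; only the `surface\tfreq` row shape and the minimum-frequency filter come from the PR text.

```rust
use std::collections::{HashMap, HashSet};

/// Returns `surface\tfreq` rows for surfaces missing from `dict_surfaces`,
/// keeping only those seen at least `min_freq` times.
fn gap_rows(
    freqs: &HashMap<String, u64>,
    dict_surfaces: &HashSet<String>,
    min_freq: u64,
) -> Vec<String> {
    let mut gaps: Vec<(&String, u64)> = freqs
        .iter()
        .filter(|(surface, freq)| **freq >= min_freq && !dict_surfaces.contains(*surface))
        .map(|(surface, freq)| (surface, *freq))
        .collect();
    // Assumed ordering: highest frequency first, ties broken by surface
    // so the output is stable across runs.
    gaps.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    gaps.into_iter()
        .map(|(surface, freq)| format!("{surface}\t{freq}"))
        .collect()
}
```

Each returned string is one line of the TSV; writing them out is then a matter of joining with newlines.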

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.


File	Description
engine/crates/lex-cli/src/commands/candidates_ops.rs	Adds `corpus()` orchestration: run Wikipedia scan, build surface coverage set from build dict, diff + write `wikipedia.tsv`.
engine/crates/lex-cli/src/candidates/wikipedia.rs	New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, and unit tests.
engine/crates/lex-cli/src/candidates/mod.rs	Exposes the new `wikipedia` candidates module.
engine/crates/lex-cli/src/bin/dictool.rs	Adds the `CandidatesAction::Corpus` CLI subcommand and wiring to `candidates_ops::corpus`.
engine/crates/lex-cli/Cargo.toml	Adds `bzip2` dependency for dump decompression.
engine/Cargo.lock	Locks `bzip2` / `bzip2-sys` transitive additions.


1. (IMP) `<ref` prefix match also captured `<references>` (common in Wikipedia citation sections). Since `<references>` closes with `</references>` rather than `</ref>`, `in_ref` got trapped true and silently dropped the rest of the page (and subsequent pages until the next page-boundary reset). Added a strict tag-name boundary check (`<ref` followed by space/`>`/`/`/EOL).
2. (IMP) Block-skip state (`tmpl_depth`, `in_ref`, `buf`) wasn't reset at `<page>` boundaries. Real dumps sometimes contain unbalanced `{{...` / `<ref ...` markup; without a reset, the open-block state leaked into the next page and silently skipped its content. Reset all three at every `<page>` line.
3. (MINOR) CLI `default_value_t = 3` duplicated the `wikipedia::DEFAULT_MIN_FREQ` constant. Reference the constant directly so they can't drift.

Empirical impact on jawiki-articles1.bz2: 304K → 334K gap surfaces (+30K previously lost to the `<references>` trap).

Tests:
- `references_tag_does_not_trap_in_ref` covers both self-closing `<references/>` and `<references>...</references>` forms.
- `page_boundary_resets_block_state` drives `extract_kanji_freqs` over a synthetic 2-page dump where page 1 has an unclosed template and verifies page 2 is still fully scanned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
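The tag-name boundary check from item 1 can be illustrated with a small sketch. The function name `looks_like_ref_open` and its single-argument signature are hypothetical (the PR's `is_ref_open` takes additional arguments); the accepted follow-up characters are the ones listed in the commit message.

```rust
/// Returns true when `rest` begins an opening `<ref` tag, not `<references`.
fn looks_like_ref_open(rest: &str) -> bool {
    match rest.strip_prefix("<ref") {
        // The tag name must end here: end of input, space, `>` or `/`.
        Some(tail) => match tail.bytes().next() {
            None => true,
            Some(b) => b == b' ' || b == b'>' || b == b'/',
        },
        None => false,
    }
}
```

With this check `<ref>`, `<ref name="a" />` and `<ref/>` are treated as reference openings, while `<references/>` and `<references>` are left alone because the byte after `<ref` is `e`.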

`cargo vet check` was failing on the audit job because the new `bzip2` / `bzip2-sys` deps (added in c9684a0 for `dictool candidates corpus`) were unvetted. Add same-pattern exemptions matching the existing `zip` entry — both are pulled in only by the dev/build CLI, not the IME runtime. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The `cargo vet check` run during the audit-fix work stripped the inline comment above `[[exemptions.bumpalo]]` as part of its config-file normalization. Put it back: the explanation of why bumpalo is elevated to `safe-to-deploy` is load-bearing context for future audits.

Copilot

Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

`scripts/check-build-scripts.sh` flagged `bzip2-sys` as a new crate with `build.rs` after PR #244 added the bzip2 dep for the Wikipedia corpus miner. The build.rs is upstream-standard (compiles vendored libbz2 C source via `cc`), same supply-chain posture as the existing audited C/build-script crates in the baseline (libc, ring, rustls, ring-bindgen, etc.). Accept by updating the baseline.

1. (IMP) `<text ... />` self-closing form wasn't handled. The pattern match treated it like a normal opening (`<text>` with no `</text>` on the same line), so `in_text` stuck true for that page. If the next line was XML metadata (e.g. `<title>` of the following page when self-closing immediately precedes `</page>`), it would have been scanned as prose and polluted frequency counts. Detect `/>` before the first `>` of the opening tag and short-circuit. Also reset `in_text` at `<page>` boundaries alongside the other state resets (defence-in-depth).
2. (MINOR) Test helper wrote to a timestamp-based path under `std::env::temp_dir()`, which could collide on parallel test runs and leave files behind on panic. Refactored `extract_kanji_freqs` to split out an `extract_kanji_freqs_from_reader` private API that takes any `impl BufRead`; tests now run against `Cursor<&[u8]>` directly, no filesystem involvement.

The earlier R2 finding about supply-chain updates (Cargo.toml +35) is already addressed in daca284 + 1507c4e; resolving as stale.

Tests:
- New `self_closing_text_tag_is_handled` constructs a 2-page dump where page 1 is `<text bytes="0" />` (self-closing) and verifies page 2's body is still scanned AND page 2's `<title>` metadata is NOT counted (would be if in_text leaked).
- All existing tests migrated to the reader-based helper.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
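Both fixes above (self-closing `<text ... />` handling and a reader-based API that tests can drive with a `Cursor`) can be sketched together. `collect_text_lines` is a hypothetical, heavily simplified stand-in for `extract_kanji_freqs_from_reader`: it assumes `<page>` and `<text ...>` each start a line and that the opening tag fits on one line, and it returns the raw body lines instead of counting kanji runs.

```rust
use std::io::{self, BufRead};

/// Collects the text-body lines of each page from a simplified dump.
fn collect_text_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut in_text = false;
    let mut out = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let trimmed = line.trim();
        if trimmed.starts_with("<page>") {
            // Defence in depth: never let state leak across pages.
            in_text = false;
            continue;
        }
        let mut body = trimmed;
        if !in_text {
            // Outside a body, only an opening `<text ...>` tag is of interest.
            let Some(rest) = trimmed.strip_prefix("<text") else { continue };
            let Some(gt) = rest.find('>') else { continue };
            if rest[..gt].ends_with('/') {
                // Self-closing `<text ... />`: the page has no body.
                continue;
            }
            in_text = true;
            body = &rest[gt + 1..];
        }
        if let Some(end) = body.find("</text>") {
            body = &body[..end];
            in_text = false;
        }
        if !body.is_empty() {
            out.push(body.to_string());
        }
    }
    Ok(out)
}
```

Because the function accepts any `BufRead`, a test can pass `std::io::Cursor::new(xml.as_bytes())` and never touch the filesystem, which is the property the refactor in item 2 is after.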

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

…s WONTFIX doc: `scan_prose_kanji_runs`'s doc comment said outside-block chars were "appended to a small local buffer". The implementation actually slices the input string directly (`&s[prose_start..i]`) and only the inner `scan_kanji_runs` reuses `buf` for the per-run kanji accumulator. Rewrote to match. The other R3 finding (perf: `dict.iter()` in `candidates_ops::corpus` materializes all readings/surfaces) is a dev-tool runtime-profile concern — same posture as the `dictool candidates mine` perf MINORs covered in feedback memory. The pipeline still completes in ~32s for jawiki-articles1.bz2 and the build dict surface set is ~10MB; not worth a lex-core API addition this PR. Resolved as WONTFIX. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

engine/crates/lex-cli/src/candidates/wikipedia.rs:252

This branch slices with &s[i..] and calls s[i..].find('>') / starts_with, but i is advanced byte-by-byte and may not be a UTF-8 character boundary. That can panic at runtime if a non-ASCII byte equals b'<'. Recommend switching the loop to char_indices() or adding s.is_char_boundary(i) checks before any s[i..] slicing/matching.

        if !in_block && b == b'<' && is_ref_open(&s[i..], bytes, i) {
            // Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
            // is multi-token. Cheaply check the next `>`.
            if i > prose_start {
                scan_kanji_runs(&s[prose_start..i], buf, freqs);
            }
            // Find end of opening tag.
            if let Some(rel) = s[i..].find('>') {

PR244 Copilot R4 flagged the byte-indexed loop as potentially panicking on `&s[prose_start..i]` if a UTF-8 continuation byte matched `{` / `}` / `<`. That's not possible per the UTF-8 spec: every byte of a multi-byte sequence is 0x80 or above (lead bytes 0xC2-0xF4, continuation bytes 0x80-0xBF), and our ASCII delimiters are in 0x00-0x7F. Add an inline note so the invariant is visible in the source; this preempts future re-raises without changing behavior. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
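The invariant can be checked directly with standard-library calls. The sketch below is illustrative only; `delimiter_offsets` and `delimiters_are_char_boundaries` are hypothetical helper names, not functions from the PR.

```rust
/// Returns the byte offsets of every ASCII delimiter (`{`, `}`, `<`) in `s`.
fn delimiter_offsets(s: &str) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| matches!(b, b'{' | b'}' | b'<'))
        .map(|(i, _)| i)
        .collect()
}

/// Checks the invariant: every byte of a multi-byte UTF-8 sequence is >= 0x80,
/// so any byte equal to an ASCII delimiter starts a one-byte character and
/// its offset is therefore a valid slicing point.
fn delimiters_are_char_boundaries(s: &str) -> bool {
    delimiter_offsets(s)
        .into_iter()
        .all(|i| s.is_char_boundary(i))
}
```

For an input such as `徳川{{家康}}<ref>室町</ref>`, every offset found by the byte scan is a character boundary, so slicing at those offsets cannot panic.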

The doc block landed as a trailing comment on `let mut prose_start = 0;` which rustfmt then re-indents to a confusing column. Move the comment to its own block above the binding so it formats cleanly.

Copilot AI review requested due to automatic review settings May 9, 2026 18:28

Copilot started reviewing on behalf of send May 9, 2026 18:29

Copilot AI reviewed May 9, 2026


Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated

Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs

Comment thread engine/crates/lex-cli/src/bin/dictool.rs Outdated

send requested a review from Copilot May 14, 2026 10:02

Copilot started reviewing on behalf of send May 14, 2026 10:03

send and others added 2 commits May 14, 2026 19:05

Copilot AI reviewed May 14, 2026


Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated

Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated

Comment thread engine/crates/lex-cli/Cargo.toml

send and others added 2 commits May 14, 2026 19:10

send requested a review from Copilot May 14, 2026 10:14

Copilot started reviewing on behalf of send May 14, 2026 10:14

Copilot AI reviewed May 14, 2026


Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs

Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated

send requested a review from Copilot May 14, 2026 10:19

Copilot AI reviewed May 14, 2026


Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs

send and others added 2 commits May 14, 2026 19:24

fix(candidates): satisfy rustfmt on UTF-8-safety doc comment

e4b6fa6


send merged commit 5c818c4 into main May 14, 2026
10 checks passed

send deleted the feat/extras-corpus-mining branch May 14, 2026 10:29

send merged 9 commits into main from feat/extras-corpus-mining
