feat(candidates): add corpus mining (Wikipedia dump scan)#244
Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams
the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks,
filters to article namespace, extracts maximal kanji runs, and diffs
against the build dict's surface set. Outputs `wikipedia.tsv` with
`surface\tfreq` rows.
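As a rough illustration of the "maximal kanji runs" step, the sketch below counts every unbroken run of kanji in a string. The function names, the `min_len` parameter, and the exact Unicode ranges treated as kanji are assumptions for illustration; the PR does not state which ranges the real scanner uses.

```rust
use std::collections::HashMap;

/// Rough kanji test: CJK Unified Ideographs, Extension A, and the
/// iteration mark 々. This set is an assumption, not taken from the PR.
fn is_kanji(c: char) -> bool {
    matches!(c, '\u{4E00}'..='\u{9FFF}' | '\u{3400}'..='\u{4DBF}' | '々')
}

/// Count every maximal run of kanji in `text`. Runs shorter than
/// `min_len` characters are dropped (`min_len` is a hypothetical knob).
fn count_kanji_runs(text: &str, min_len: usize, freqs: &mut HashMap<String, u64>) {
    let mut run = String::new();
    let mut run_len = 0;
    // A trailing non-kanji sentinel flushes a run that ends the text.
    for c in text.chars().chain(std::iter::once(' ')) {
        if is_kanji(c) {
            run.push(c);
            run_len += 1;
        } else if run_len > 0 {
            if run_len >= min_len {
                // `take` moves the run into the map and leaves `run` empty.
                *freqs.entry(std::mem::take(&mut run)).or_insert(0) += 1;
            } else {
                run.clear();
            }
            run_len = 0;
        }
    }
}
```

Because a run only ends at a non-kanji character, kana-bearing words are split at the kana, which matches the "surface-first, no reading or POS" intent described above.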
Reading-assignment is intentionally deferred — the user picks top-N gap
surfaces and looks up readings before promoting to `extras/<domain>.tsv`,
mirroring the existing `mine`-then-promote-by-hand workflow.
Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes
in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-
composable (徳川家康, 室町時代, 令和元年 — Mozc handles via segment
composition) but real misses surface in the mix (e.g. 宇宙戦艦 →
Mozc top-1 returns 宇宙船感). Per-candidate verification via
`lextool explain` is still required before promotion.
deps: bzip2 0.4 (lex-cli only — same dev-tool scope as the existing
zip dep used by `candidates mine`).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Adds a new “surface-first” candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary’s surface set, and writes the remaining gaps to a TSV for manual curation into extras/.
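The diff-and-write step described here might look roughly like the following. This is a sketch under assumptions: `write_gaps` and its parameters are hypothetical names, the build dictionary is reduced to a plain `HashSet<String>` of surfaces, and the sort order is a guess rather than something the PR specifies.

```rust
use std::collections::{HashMap, HashSet};
use std::io::{self, Write};

/// Write `surface\tfreq` rows for every counted surface that is missing
/// from `known` and occurs at least `min_freq` times. Returns the number
/// of rows written.
fn write_gaps<W: Write>(
    freqs: &HashMap<String, u64>,
    known: &HashSet<String>,
    min_freq: u64,
    mut out: W,
) -> io::Result<usize> {
    let mut gaps: Vec<(&str, u64)> = freqs
        .iter()
        .filter(|(s, f)| **f >= min_freq && !known.contains(s.as_str()))
        .map(|(s, f)| (s.as_str(), *f))
        .collect();
    // Highest frequency first; ties broken by surface for stable output.
    gaps.sort_by(|a, b| b.1.cmp(&a.1).then_with(|| a.0.cmp(b.0)));
    for (surface, freq) in &gaps {
        writeln!(out, "{surface}\t{freq}")?;
    }
    Ok(gaps.len())
}
```

Taking any `Write` keeps the function testable against an in-memory `Vec<u8>`; a real run would pass a buffered file handle for `wikipedia.tsv`.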
Changes:
- Added `dictool candidates corpus <dump>` subcommand to scan `.xml` / `.xml.bz2` Wikipedia dumps and emit `wikipedia.tsv` (surface + frequency).
- Implemented streaming Wikipedia dump scanner with template (`{{...}}`) and `<ref>...</ref>` skipping plus kanji-run extraction + unit tests.
- Added `bzip2` dependency to support streaming decompression of `.bz2` dumps.
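For the template skipping listed above, a minimal sketch of depth-tracked `{{...}}` removal follows. The helper `strip_templates` is hypothetical and returns the surviving prose as a new `String` for clarity; the real scanner reportedly slices the input in place instead of copying.

```rust
/// Return the parts of `line` that lie outside `{{...}}` templates.
/// `depth` carries the nesting level across lines, because templates
/// such as infoboxes routinely span many lines.
fn strip_templates(line: &str, depth: &mut u32) -> String {
    let bytes = line.as_bytes();
    let mut out = String::new();
    let mut start = 0; // byte offset where the current prose slice begins
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"{{") {
            if *depth == 0 {
                // Entering a template from prose: keep what came before.
                out.push_str(&line[start..i]);
            }
            *depth += 1;
            i += 2;
        } else if *depth > 0 && bytes[i..].starts_with(b"}}") {
            *depth -= 1;
            i += 2;
            if *depth == 0 {
                start = i; // prose resumes right after the closing braces
            }
        } else {
            i += 1;
        }
    }
    if *depth == 0 {
        out.push_str(&line[start..]);
    }
    out
}
```

Slicing at `start` and `i` is safe here because both offsets always sit on or just after an ASCII brace, so they are valid UTF-8 boundaries even in Japanese text.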
Reviewed changes
Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| engine/crates/lex-cli/src/commands/candidates_ops.rs | Adds corpus() orchestration: run Wikipedia scan, build surface coverage set from build dict, diff + write wikipedia.tsv. |
| engine/crates/lex-cli/src/candidates/wikipedia.rs | New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, and unit tests. |
| engine/crates/lex-cli/src/candidates/mod.rs | Exposes the new wikipedia candidates module. |
| engine/crates/lex-cli/src/bin/dictool.rs | Adds the CandidatesAction::Corpus CLI subcommand and wiring to candidates_ops::corpus. |
| engine/crates/lex-cli/Cargo.toml | Adds bzip2 dependency for dump decompression. |
| engine/Cargo.lock | Locks bzip2 / bzip2-sys transitive additions. |
1. (IMP) `<ref` prefix match also captured `<references>` (common in
Wikipedia citation sections). Since `<references>` closes with
`</references>` rather than `</ref>`, `in_ref` got trapped true and
silently dropped the rest of the page (and subsequent pages until
the next page-boundary reset). Added a strict tag-name boundary
check (`<ref` followed by space/`>`/`/`/EOL).
2. (IMP) Block-skip state (`tmpl_depth`, `in_ref`, `buf`) wasn't reset
at `<page>` boundaries. Real dumps sometimes contain unbalanced
`{{...` / `<ref ...` markup; without a reset, the open-block state
leaked into the next page and silently skipped its content. Reset
all three at every `<page>` line.
3. (MINOR) CLI `default_value_t = 3` duplicated the
`wikipedia::DEFAULT_MIN_FREQ` constant. Reference the constant
directly so they can't drift.
Empirical impact on jawiki-articles1.bz2: 304K → 334K gap surfaces
(+30K previously lost to the `<references>` trap).
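The tag-name boundary check from item 1 can be sketched as below. The real `is_ref_open` takes additional arguments (it is called as `is_ref_open(&s[i..], bytes, i)` elsewhere in this PR); the single-argument form here is a simplification for illustration.

```rust
/// True when `rest` begins with an opening `<ref` tag rather than a
/// longer tag name such as `<references>`.
fn is_ref_open(rest: &str) -> bool {
    match rest.strip_prefix("<ref") {
        // The byte after the name must terminate it: space, `>`, `/`,
        // or end of line (`None`).
        Some(tail) => matches!(tail.bytes().next(), None | Some(b' ' | b'>' | b'/')),
        None => false,
    }
}
```

With a bare prefix match, `<references>` would set `in_ref` and then wait for a `</ref>` that never arrives; the boundary check rejects it up front.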
Tests:
- `references_tag_does_not_trap_in_ref` covers both self-closing
`<references/>` and `<references>...</references>` forms.
- `page_boundary_resets_block_state` drives `extract_kanji_freqs` over
a synthetic 2-page dump where page 1 has an unclosed template and
verifies page 2 is still fully scanned.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
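The page-boundary reset from item 2 is sketched below on a deliberately crude, whole-line model of block tracking. `BlockState` and `visible_lines` are hypothetical; only the field names `tmpl_depth` and `in_ref` come from the commit message, and the real scanner tracks blocks within a line as well.

```rust
use std::io::{self, BufRead};

/// Block-skip state that must not outlive a page.
#[derive(Default)]
struct BlockState {
    tmpl_depth: u32,
    in_ref: bool,
}

/// Return the lines that start outside any skipped block and open none.
fn visible_lines<R: BufRead>(reader: R) -> io::Result<Vec<String>> {
    let mut state = BlockState::default();
    let mut kept = Vec::new();
    for line in reader.lines() {
        let line = line?;
        let t = line.trim();
        if t.starts_with("<page>") {
            // An unclosed `{{` or `<ref>` on the previous page ends here.
            state = BlockState::default();
            continue;
        }
        if t.starts_with("</page>") {
            continue;
        }
        let visible = state.tmpl_depth == 0 && !state.in_ref;
        let opens = t.matches("{{").count() as u32;
        let closes = t.matches("}}").count() as u32;
        state.tmpl_depth = (state.tmpl_depth + opens).saturating_sub(closes);
        if t.contains("<ref>") {
            state.in_ref = true;
        }
        if t.contains("</ref>") {
            state.in_ref = false;
        }
        if visible && opens == 0 && !t.contains("<ref>") {
            kept.push(t.to_string());
        }
    }
    Ok(kept)
}
```

Feeding it a two-page input where page 1 leaves `{{Infobox` unclosed shows the effect: without the reset the second page's body would be silently skipped.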
`cargo vet check` was failing on the audit job because the new `bzip2` / `bzip2-sys` deps (added in c9684a0 for `dictool candidates corpus`) were unvetted. Add same-pattern exemptions matching the existing `zip` entry: both are pulled in only by the dev/build CLI, not the IME runtime.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The `cargo vet check` run during the audit-fix work stripped the inline comment above `[[exemptions.bumpalo]]` as part of its config-file normalization. Put it back: the explanation of why bumpalo is elevated to `safe-to-deploy` is load-bearing context for future audits.
`scripts/check-build-scripts.sh` flagged `bzip2-sys` as a new crate with `build.rs` after PR #244 added the bzip2 dep for the Wikipedia corpus miner. The build.rs is upstream-standard (compiles vendored libbz2 C source via `cc`), same supply-chain posture as the existing audited C/build-script crates in the baseline (libc, ring, rustls, ring-bindgen, etc.). Accept by updating the baseline.
1. (IMP) `<text ... />` self-closing form wasn't handled. The pattern match treated it like a normal opening (`<text>` with no `</text>` on the same line), so `in_text` stuck true for that page. If the next line was XML metadata (e.g. `<title>` of the following page when self-closing immediately precedes `</page>`), it would have been scanned as prose and polluted frequency counts. Detect `/>` before the first `>` of the opening tag and short-circuit. Also reset `in_text` at `<page>` boundaries alongside the other state resets (defence-in-depth).
2. (MINOR) Test helper wrote to a timestamp-based path under `std::env::temp_dir()`, which could collide on parallel test runs and leave files behind on panic. Refactored `extract_kanji_freqs` to split out an `extract_kanji_freqs_from_reader` private API that takes any `impl BufRead`; tests now run against `Cursor<&[u8]>` directly, no filesystem involvement.
The earlier R2 finding about supply-chain updates (Cargo.toml +35) is already addressed in daca284 + 1507c4e; resolving as stale.
Tests:
- New `self_closing_text_tag_is_handled` constructs a 2-page dump where page 1 is `<text bytes="0" />` (self-closing) and verifies page 2's body is still scanned AND page 2's `<title>` metadata is NOT counted (would be if in_text leaked).
- All existing tests migrated to the reader-based helper.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
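The self-closing check can be sketched as a small classifier for the opening `<text` tag. `TextOpen` and `parse_text_open` are hypothetical names, and the sketch assumes the opening tag fits on one line and that no attribute value contains a raw `>`.

```rust
/// What an opening `<text` tag on a line tells the scanner.
#[derive(Debug, PartialEq)]
enum TextOpen<'a> {
    /// `<text ... />`: the page has no body, so `in_text` stays false.
    SelfClosing,
    /// `<text ...>body`: the remainder of the line after the tag.
    Body(&'a str),
}

/// Classify the first `<text` tag on `line`, if there is one.
fn parse_text_open(line: &str) -> Option<TextOpen<'_>> {
    let start = line.find("<text")?;
    let tag = &line[start..];
    // The first `>` ends the opening tag; a `/` right before it means
    // the element is self-closing.
    let gt = tag.find('>')?;
    if tag[..gt].ends_with('/') {
        Some(TextOpen::SelfClosing)
    } else {
        Some(TextOpen::Body(&tag[gt + 1..]))
    }
}
```

A caller would set `in_text = true` only for the `Body` case (and only if the body does not also contain `</text>`), so an empty page can no longer leave the flag stuck.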
…s WONTFIX
doc: `scan_prose_kanji_runs`'s doc comment said outside-block chars were "appended to a small local buffer". The implementation actually slices the input string directly (`&s[prose_start..i]`) and only the inner `scan_kanji_runs` reuses `buf` for the per-run kanji accumulator. Rewrote to match.
The other R3 finding (perf: `dict.iter()` in `candidates_ops::corpus` materializes all readings/surfaces) is a dev-tool runtime-profile concern, the same posture as the `dictool candidates mine` perf MINORs covered in feedback memory. The pipeline still completes in ~32s for jawiki-articles1.bz2 and the build dict surface set is ~10MB; not worth a lex-core API addition this PR. Resolved as WONTFIX.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Pull request overview
Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.
Comments suppressed due to low confidence (1)
engine/crates/lex-cli/src/candidates/wikipedia.rs:252
- This branch slices with `&s[i..]` and calls `s[i..].find('>')` / `starts_with`, but `i` is advanced byte-by-byte and may not be a UTF-8 character boundary. That can panic at runtime if a non-ASCII byte equals `b'<'`. Recommend switching the loop to `char_indices()` or adding `s.is_char_boundary(i)` checks before any `s[i..]` slicing/matching.
if !in_block && b == b'<' && is_ref_open(&s[i..], bytes, i) {
// Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
// is multi-token. Cheaply check the next `>`.
if i > prose_start {
scan_kanji_runs(&s[prose_start..i], buf, freqs);
}
// Find end of opening tag.
if let Some(rel) = s[i..].find('>') {
PR244 Copilot R4 flagged the byte-indexed loop as potentially panicking
on `&s[prose_start..i]` if a UTF-8 continuation byte matched `{` / `}` /
`<`. That's not possible per the UTF-8 spec: continuation bytes are
0x80-0xBF, and our ASCII delimiters are 0x00-0x7F. Add an inline note
so the invariant is visible in the source — preempts future re-raises
without changing behavior.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
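The invariant can be checked directly: every byte of a multi-byte UTF-8 sequence is 0x80 or above, so a byte equal to an ASCII delimiter is always a complete one-byte character and its offset is a valid slice boundary. The helper below is a hypothetical illustration, not code from the PR.

```rust
/// Byte offsets of every ASCII delimiter (`{`, `}`, `<`) in `s`,
/// found by scanning raw bytes the way the scanner's loop does.
fn delimiter_offsets(s: &str) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| matches!(b, b'{' | b'}' | b'<'))
        .map(|(i, _)| i)
        .collect()
}
```

Calling `s.is_char_boundary(i)` on each returned offset holds for any valid `&str`, which is why `&s[prose_start..i]` cannot panic when `i` was found by matching one of these bytes.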
The doc block landed as a trailing comment on `let mut prose_start = 0;` which rustfmt then re-indents to a confusing column. Move the comment to its own block above the binding so it formats cleanly.
Summary
Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:
Reading assignment is intentionally deferred: the user picks top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. Mirrors the existing `mine`-then-promote workflow.
Why this approach (vs Sudachi / Wikidata, both tried in PR #243 / Wikidata experiment)
Per `feedback_extras_promotion.md`:
User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.
Pilot result
`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):
No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.
Test plan
Follow-ups (not this PR)
🤖 Generated with Claude Code