
feat(candidates): add corpus mining (Wikipedia dump scan) #244

Merged
send merged 9 commits into main from feat/extras-corpus-mining
May 14, 2026

Conversation

@send send commented May 9, 2026

Summary

Surface-first vocabulary mining from a Wikipedia jawiki dump. Adds `dictool candidates corpus <dump>` that:

  • streams the bz2 directly (no full decompress to disk)
  • filters to article namespace (`0`)
  • skips wikitext templates `{{...}}` and `<ref>...</ref>` blocks
  • extracts maximal kanji runs (`[一-龥々]+`, length 2-20)
  • frequency-counts and diffs against the build dict's surface set
  • outputs `wikipedia.tsv` (gitignored) with `surface\tfreq`
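
The kanji-run extraction step in the list above can be sketched roughly as follows. This is a minimal illustration, not the PR's `scan_kanji_runs`: the function names `is_kanji` and `count_kanji_runs` are hypothetical, and the only details taken from the description are the character class `[一-龥々]` and the 2-20 length filter.

```rust
use std::collections::HashMap;

/// Hypothetical helper: true for the characters matched by `[一-龥々]`.
fn is_kanji(c: char) -> bool {
    ('一'..='龥').contains(&c) || c == '々'
}

/// Hypothetical sketch: count every maximal kanji run of 2..=20 characters.
fn count_kanji_runs(text: &str, freqs: &mut HashMap<String, u32>) {
    let mut run = String::new();
    let mut len = 0usize; // run length in characters, not bytes
    // Chain one non-kanji sentinel so a run at the end of `text` is flushed.
    for c in text.chars().chain(std::iter::once(' ')) {
        if is_kanji(c) {
            run.push(c);
            len += 1;
        } else {
            if (2..=20).contains(&len) {
                *freqs.entry(run.clone()).or_insert(0) += 1;
            }
            run.clear();
            len = 0;
        }
    }
}
```

Because only maximal runs are counted, a single kanji surrounded by kana (such as the 開 in 開いた) is dropped by the length filter rather than merged with its neighbours.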

Reading assignment is intentionally deferred: the user picks top-N gap surfaces and looks up readings by hand before promoting to `extras/<domain>.tsv`. Mirrors the existing `mine`-then-promote workflow.

Why this approach (vs Sudachi / Wikidata, both tried in PR #243 / Wikidata experiment)

Per `feedback_extras_promotion.md`:

  • Sudachi naive scan: 1.9M candidates → 2 promoted (yield ~10⁻⁶); frozen vocab, no frequency signal
  • Wikidata SPARQL: 5000 candidates → 5 promoted (all mythology/religion); P1814 only fills proper nouns
  • Corpus mining (this PR): real-text frequency signal, catches modern vocab, surface-first

User insight (this session): forget reading + POS during extraction; just yank kanji runs. Reading lookup is the bottleneck — apply it only to the small post-diff candidate set.
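
A minimal sketch of the post-scan diff, assuming the frequency map and the build dict's surface set are already in memory. The function name `gap_rows`, its parameters, and the sort order are illustrative assumptions; only the `surface\tfreq` row format and the idea of a frequency threshold come from this PR.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical sketch: keep surfaces that are frequent enough and absent
/// from the dictionary, then format them as `surface\tfreq` rows.
fn gap_rows(
    freqs: &HashMap<String, u32>,
    dict_surfaces: &HashSet<String>,
    min_freq: u32,
) -> Vec<String> {
    let mut gaps: Vec<(&str, u32)> = freqs
        .iter()
        .filter(|(s, n)| **n >= min_freq && !dict_surfaces.contains(s.as_str()))
        .map(|(s, n)| (s.as_str(), *n))
        .collect();
    // Most frequent first; ties broken by surface so the output is stable.
    gaps.sort_by(|a, b| b.1.cmp(&a.1).then(a.0.cmp(b.0)));
    gaps.into_iter().map(|(s, n)| format!("{s}\t{n}")).collect()
}
```

Nothing here needs a reading or a part of speech, which is the point of the surface-first approach: the expensive lookup is applied only to the rows this function returns.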

Pilot result

`jawiki-articles1.bz2` (398MB compressed → 1.5GB raw, 80K articles):

  • 32s end-to-end
  • 304K gap surfaces (freq >= 5) after template/ref/ns filter
  • Top entries dominated by lattice-composable compounds (Mozc has 徳川/家康 separately and composes via Viterbi). `lextool explain` confirms 徳川家康 / 室町時代 / 令和元年 are all top-1 already.
  • Real misses surface in the mix: e.g. `宇宙戦艦` (Mozc top-1 returns 宇宙船感). Per-candidate verification via `lextool explain` is the next step before promotion.

No extras additions in this PR — this is the tool only. Future PRs hand-pick from the candidate file.

Test plan

  • `cargo fmt --all --check` / `cargo clippy --workspace --all-features -- -D warnings` / `cargo test --workspace --all-features` all green
  • 12 new unit tests for `scan_kanji_runs` (length filters, iteration mark, multi-occurrence) and `scan_prose_kanji_runs` (template skip, nested templates, `<ref>` skip, self-closing ref, cross-slice state)
  • Smoke test on synthetic XML (3 pages, 13 surfaces, all hit dict)
  • Pilot run on jawiki-articles1.bz2: 80K articles in 32s, 304K freq>=5 gap surfaces
  • Spot-checked `徳川家康`/`室町時代`/`令和元年`/`宇宙戦艦` against `lextool explain` — confirms tool finds genuine miss (`宇宙戦艦`) along with lattice-composable noise

Follow-ups (not this PR)

  • Reading-assignment helper (Sudachi lookup or manual) for the top-N gap surfaces
  • Verification step that runs `lextool explain` per candidate to filter lattice-composable rows
  • Document workflow in feedback memory once a real promotion lands

🤖 Generated with Claude Code

Surface-first vocabulary mining from a Wikipedia jawiki dump. Streams
the bz2 directly, skips wikitext templates `{{...}}` and `<ref>` blocks,
filters to article namespace, extracts maximal kanji runs, and diffs
against the build dict's surface set. Outputs `wikipedia.tsv` with
`surface\tfreq` rows.

Reading-assignment is intentionally deferred — the user picks top-N gap
surfaces and looks up readings before promoting to `extras/<domain>.tsv`,
mirroring the existing `mine`-then-promote-by-hand workflow.

Pilot run on jawiki-articles1 (80K articles, ~1.5GB raw text) finishes
in ~32s and yields 304K freq>=5 gap surfaces. Most are lattice-
composable (徳川家康, 室町時代, 令和元年 — Mozc handles via segment
composition) but real misses surface in the mix (e.g. 宇宙戦艦 →
Mozc top-1 returns 宇宙船感). Per-candidate verification via
`lextool explain` is still required before promotion.

deps: bzip2 0.4 (lex-cli only — same dev-tool scope as the existing
zip dep used by `candidates mine`).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Copilot AI review requested due to automatic review settings May 9, 2026 18:28

Copilot AI left a comment


Pull request overview

Adds a new “surface-first” candidate mining path that scans a Wikipedia XML dump to extract frequent kanji-run surfaces, diffs them against the merged build dictionary’s surface set, and writes the remaining gaps to a TSV for manual curation into extras/.

Changes:

  • Added `dictool candidates corpus <dump>` subcommand to scan `.xml` / `.xml.bz2` Wikipedia dumps and emit `wikipedia.tsv` (surface + frequency).
  • Implemented streaming Wikipedia dump scanner with template (`{{...}}`) and `<ref>...</ref>` skipping plus kanji-run extraction + unit tests.
  • Added `bzip2` dependency to support streaming decompression of `.bz2` dumps.
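
As a rough illustration of the template skipping mentioned above, the sketch below removes nested `{{...}}` blocks with a depth counter that the caller carries across lines. The name `strip_templates` and its signature are hypothetical; the PR's scanner feeds prose slices to the kanji-run counter instead of building a new string, so treat this only as a model of the depth-tracking idea.

```rust
/// Hypothetical sketch: return `s` with every `{{...}}` template removed.
/// `depth` persists between calls so a template may span several lines.
fn strip_templates(s: &str, depth: &mut u32) -> String {
    let bytes = s.as_bytes();
    let mut out = String::new();
    let mut start = 0; // start of the current prose slice
    let mut i = 0;
    while i < bytes.len() {
        if bytes[i..].starts_with(b"{{") {
            if *depth == 0 {
                out.push_str(&s[start..i]); // flush prose before the template
            }
            *depth += 1;
            i += 2;
        } else if *depth > 0 && bytes[i..].starts_with(b"}}") {
            *depth -= 1;
            i += 2;
            if *depth == 0 {
                start = i; // prose resumes after the outermost `}}`
            }
        } else {
            i += 1;
        }
    }
    if *depth == 0 {
        out.push_str(&s[start..]);
    }
    out
}
```

A stray `}}` at depth 0 is treated as ordinary text, so unbalanced closers cannot drive the counter negative.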

Reviewed changes

Copilot reviewed 5 out of 6 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| `engine/crates/lex-cli/src/commands/candidates_ops.rs` | Adds `corpus()` orchestration: run Wikipedia scan, build surface coverage set from build dict, diff + write `wikipedia.tsv`. |
| `engine/crates/lex-cli/src/candidates/wikipedia.rs` | New scanner/extractor for Wikipedia dump streaming, markup skipping, kanji-run frequency counting, and unit tests. |
| `engine/crates/lex-cli/src/candidates/mod.rs` | Exposes the new `wikipedia` candidates module. |
| `engine/crates/lex-cli/src/bin/dictool.rs` | Adds the `CandidatesAction::Corpus` CLI subcommand and wiring to `candidates_ops::corpus`. |
| `engine/crates/lex-cli/Cargo.toml` | Adds `bzip2` dependency for dump decompression. |
| `engine/Cargo.lock` | Locks `bzip2` / `bzip2-sys` transitive additions. |


Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs
Comment thread engine/crates/lex-cli/src/bin/dictool.rs Outdated
1. (IMP) `<ref` prefix match also captured `<references>` (common in
   Wikipedia citation sections). Since `<references>` closes with
   `</references>` rather than `</ref>`, `in_ref` got trapped true and
   silently dropped the rest of the page (and subsequent pages until
   the next page-boundary reset). Added a strict tag-name boundary
   check (`<ref` followed by space/`>`/`/`/EOL).

2. (IMP) Block-skip state (`tmpl_depth`, `in_ref`, `buf`) wasn't reset
   at `<page>` boundaries. Real dumps sometimes contain unbalanced
   `{{...` / `<ref ...` markup; without a reset, the open-block state
   leaked into the next page and silently skipped its content. Reset
   all three at every `<page>` line.

3. (MINOR) CLI `default_value_t = 3` duplicated the
   `wikipedia::DEFAULT_MIN_FREQ` constant. Reference the constant
   directly so they can't drift.

Empirical impact on jawiki-articles1.bz2: 304K → 334K gap surfaces
(+30K previously lost to the `<references>` trap).
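
A minimal sketch of the tag-name boundary check from item 1, assuming the caller passes the slice that starts at `<`. The name `starts_ref_tag` is hypothetical and the signature is simpler than the `is_ref_open` helper in `wikipedia.rs`.

```rust
/// Hypothetical sketch: true only when `rest` opens a `<ref` tag, i.e.
/// `<ref` is followed by a space, `>`, `/`, or the end of the line.
fn starts_ref_tag(rest: &str) -> bool {
    match rest.strip_prefix("<ref") {
        None => false,
        Some(tail) => match tail.bytes().next() {
            None => true, // `<ref` at end of line: tag continues on the next line
            Some(b) => b == b' ' || b == b'>' || b == b'/',
        },
    }
}
```

With this check `<references>` and `<references/>` fall through as ordinary text, so they can no longer set `in_ref` without a matching `</ref>` to clear it.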

Tests:
- `references_tag_does_not_trap_in_ref` covers both self-closing
  `<references/>` and `<references>...</references>` forms.
- `page_boundary_resets_block_state` drives `extract_kanji_freqs` over
  a synthetic 2-page dump where page 1 has an unclosed template and
  verifies page 2 is still fully scanned.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
send and others added 2 commits May 14, 2026 19:05
`cargo vet check` was failing on the audit job because the new
`bzip2` / `bzip2-sys` deps (added in c9684a0 for `dictool candidates
corpus`) were unvetted. Add same-pattern exemptions matching the
existing `zip` entry — both are pulled in only by the dev/build CLI,
not the IME runtime.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A `cargo vet check` run during the audit-fix work stripped the inline
comment above `[[exemptions.bumpalo]]` as part of its config-file
normalization. Put it back: the explanation of why bumpalo is
elevated to `safe-to-deploy` is load-bearing context for future
audits.

Copilot AI left a comment


Pull request overview

Copilot reviewed 6 out of 7 changed files in this pull request and generated 3 comments.

Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated
Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated
Comment thread engine/crates/lex-cli/Cargo.toml
send and others added 2 commits May 14, 2026 19:10
`scripts/check-build-scripts.sh` flagged `bzip2-sys` as a new crate
with `build.rs` after PR #244 added the bzip2 dep for the Wikipedia
corpus miner. The build.rs is upstream-standard (compiles vendored
libbz2 C source via `cc`), same supply-chain posture as the existing
audited C/build-script crates in the baseline (libc, ring, rustls,
ring-bindgen, etc.). Accept by updating the baseline.
1. (IMP) `<text ... />` self-closing form wasn't handled. The pattern
   match treated it like a normal opening (`<text>` with no `</text>`
   on the same line), so `in_text` stuck true for that page. If the
   next line was XML metadata (e.g. `<title>` of the following page
   when self-closing immediately precedes `</page>`), it would have
   been scanned as prose and polluted frequency counts. Detect `/>`
   before the first `>` of the opening tag and short-circuit. Also
   reset `in_text` at `<page>` boundaries alongside the other state
   resets (defence-in-depth).

2. (MINOR) Test helper wrote to a timestamp-based path under
   `std::env::temp_dir()`, which could collide on parallel test runs
   and leave files behind on panic. Refactored `extract_kanji_freqs`
   to split out a private `extract_kanji_freqs_from_reader` API that
   takes any `impl BufRead`; tests now run against `Cursor<&[u8]>`
   directly, no filesystem involvement.

The earlier R2 finding about supply-chain updates (Cargo.toml +35) is
already addressed in daca284 + 1507c4e — resolving as stale.
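
The self-closing detection in item 1 can be sketched as below. The helper name `text_tag_is_self_closing` is hypothetical; the only rule taken from the commit is "look for `/>` at the first `>` of the opening tag".

```rust
/// Hypothetical sketch: true when the `<text ...>` opening tag on this line
/// is self-closing, i.e. its first `>` is immediately preceded by `/`.
fn text_tag_is_self_closing(line: &str) -> bool {
    let Some(open) = line.find("<text") else {
        return false; // no `<text` tag on this line
    };
    let tag = &line[open..];
    match tag.find('>') {
        Some(gt) => tag[..gt].ends_with('/'),
        None => false, // opening tag not finished on this line
    }
}
```

Only the first `>` is inspected, so a `/>` that appears later inside the page body does not count.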

Tests:
- New `self_closing_text_tag_is_handled` constructs a 2-page dump
  where page 1 is `<text bytes="0" />` (self-closing) and verifies
  page 2's body is still scanned AND page 2's `<title>` metadata is
  NOT counted (would be if in_text leaked).
- All existing tests migrated to the reader-based helper.
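
The reader-based split described in item 2 follows a common pattern, sketched here on a deliberately tiny task (counting `<page>` lines) rather than the real extractor. All three function names are hypothetical; the point is only that the core logic takes `impl BufRead`, the path-based entry point is a thin wrapper, and tests feed it an in-memory `Cursor`.

```rust
use std::fs::File;
use std::io::{self, BufRead, BufReader, Cursor};
use std::path::Path;

/// Core logic: works on any buffered reader, so it never touches the disk.
fn count_pages_from_reader(reader: impl BufRead) -> io::Result<usize> {
    let mut pages = 0;
    for line in reader.lines() {
        if line?.trim_start().starts_with("<page>") {
            pages += 1;
        }
    }
    Ok(pages)
}

/// Thin file-based wrapper, analogous to the public path-taking entry point.
fn count_pages(path: &Path) -> io::Result<usize> {
    count_pages_from_reader(BufReader::new(File::open(path)?))
}

/// Test-style entry point: wrap in-memory bytes in a `Cursor`.
fn count_pages_in_str(xml: &str) -> io::Result<usize> {
    count_pages_from_reader(Cursor::new(xml.as_bytes()))
}
```

Tests written against the `Cursor` variant cannot collide on temp-file names and leave nothing behind on panic.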

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 2 comments.

Comment thread engine/crates/lex-cli/src/commands/candidates_ops.rs
Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs Outdated
…s WONTFIX

doc: `scan_prose_kanji_runs`'s doc comment said outside-block chars
were "appended to a small local buffer". The implementation actually
slices the input string directly (`&s[prose_start..i]`) and only the
inner `scan_kanji_runs` reuses `buf` for the per-run kanji accumulator.
Rewrote to match.

The other R3 finding (perf: `dict.iter()` in `candidates_ops::corpus`
materializes all readings/surfaces) is a dev-tool runtime-profile
concern — same posture as the `dictool candidates mine` perf MINORs
covered in feedback memory. The pipeline still completes in ~32s for
jawiki-articles1.bz2 and the build dict surface set is ~10MB; not
worth a lex-core API addition this PR. Resolved as WONTFIX.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@send send requested a review from Copilot May 14, 2026 10:19

Copilot AI left a comment


Pull request overview

Copilot reviewed 7 out of 8 changed files in this pull request and generated 1 comment.

Comments suppressed due to low confidence (1)

engine/crates/lex-cli/src/candidates/wikipedia.rs:252

  • This branch slices with &s[i..] and calls s[i..].find('>') / starts_with, but i is advanced byte-by-byte and may not be a UTF-8 character boundary. That can panic at runtime if a non-ASCII byte equals b'<'. Recommend switching the loop to char_indices() or adding s.is_char_boundary(i) checks before any s[i..] slicing/matching.
        if !in_block && b == b'<' && is_ref_open(&s[i..], bytes, i) {
            // Self-closing `<ref ... />` is one shot; full `<ref>...</ref>`
            // is multi-token. Cheaply check the next `>`.
            if i > prose_start {
                scan_kanji_runs(&s[prose_start..i], buf, freqs);
            }
            // Find end of opening tag.
            if let Some(rel) = s[i..].find('>') {

Comment thread engine/crates/lex-cli/src/candidates/wikipedia.rs
send and others added 2 commits May 14, 2026 19:24
PR244 Copilot R4 flagged the byte-indexed loop as potentially panicking
on `&s[prose_start..i]` if a byte inside a multi-byte UTF-8 sequence
matched `{` / `}` / `<`. That's not possible per the UTF-8 spec: every
byte of a multi-byte sequence (lead or continuation) is 0x80 or above,
and our ASCII delimiters are 0x00-0x7F. Add an inline note so the
invariant is visible in the source; this preempts future re-raises
without changing behavior.
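
The invariant can be checked with a small sketch. The function name `ascii_lt_offsets` is hypothetical; it scans byte-by-byte, as the loop under review does, and the assertions in the usage note rely only on `str::is_char_boundary` from the standard library.

```rust
/// Hypothetical sketch: byte offsets of every ASCII `<` found by a
/// byte-wise scan over a UTF-8 string.
fn ascii_lt_offsets(s: &str) -> Vec<usize> {
    s.bytes()
        .enumerate()
        .filter(|&(_, b)| b == b'<')
        .map(|(i, _)| i)
        .collect()
}
```

For `"漢字<ref>出典</ref>語"` every returned offset satisfies `s.is_char_boundary(i)`, and `"漢字".bytes().all(|b| b >= 0x80)` holds, so slicing at an offset where the byte is an ASCII delimiter cannot land inside a character.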

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
The doc block landed as a trailing comment on `let mut prose_start = 0;`
which rustfmt then re-indents to a confusing column. Move the comment
to its own block above the binding so it formats cleanly.
@send send merged commit 5c818c4 into main May 14, 2026
10 checks passed
@send send deleted the feat/extras-corpus-mining branch May 14, 2026 10:29

2 participants