Add font-aware decoder for LaTeX math glyphs in PDFs #2150
Open
jonasvq wants to merge 3 commits into
PDFs from pdflatex embed Computer Modern math fonts without a ToUnicode CMap, so pdfminer emits unreadable (cid:N) tokens for sums, integrals, roots, delimiters, and Greek.

Add a font-aware decoder that reads each glyph's font name and maps it through per-font CMEX/CMSY/CMMI tables, with cluster-level confidence and a <!-- FORMULA --> fallback for low-confidence runs.

On by default; pass decode_cid=False to keep the raw tokens. Note: this changes default PDF output for math documents.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Author
@microsoft-github-policy-service agree
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fill cmex10/cmsy10/cmmi10 to full 0-127 from the Computer Modern AFM metrics (adds contour integrals, coproduct, square union, big floor/ceil, extensible delimiter pieces; fixes a CMMI epsilon swap where code 15 is epsilon1/varepsilon and 34 is epsilon/lunate).

Extend decoding to the other math fonts that lack ToUnicode:
- CMBSY10 (bold math symbols) and CMMIB10 (bold math italic), whose encodings are identical to CMSY10/CMMI10.
- MSAM10 / MSBM10 (AMS symbols): squares, harpoons, negated relations, blackboard bold A-Z (Letterlike exceptions override the U+1D538 block), Hebrew letters.
- LASY10 (LaTeX symbols), mapped conservatively from the published encoding table since its AFM glyph names are opaque.

Unmappable or ambiguous codes are omitted so they fall through to the existing confidence/fallback path.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
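The blackboard bold rule in the commit message above (Letterlike exceptions overriding the U+1D538 block) could be sketched roughly as follows. This is an illustration, not the code in this PR: the names `LETTERLIKE`, `blackboard_bold`, and `MSBM10_BB` are hypothetical, and the placement of blackboard bold A-Z at codes 65-90 of MSBM10 is an assumption about the table layout. The seven exceptions are the double-struck capitals that Unicode encodes in the Letterlike Symbols block instead of the Mathematical Alphanumeric Symbols block.

```python
# Double-struck capitals that live in Letterlike Symbols; the matching
# positions in the U+1D538 block are reserved, so they must be overridden.
LETTERLIKE = {
    "C": "\u2102",  # ℂ
    "H": "\u210D",  # ℍ
    "N": "\u2115",  # ℕ
    "P": "\u2119",  # ℙ
    "Q": "\u211A",  # ℚ
    "R": "\u211D",  # ℝ
    "Z": "\u2124",  # ℤ
}


def blackboard_bold(letter):
    """Return the double-struck form of an ASCII capital letter."""
    if letter in LETTERLIKE:
        return LETTERLIKE[letter]
    # Otherwise offset into the block that starts at U+1D538 (double-struck A).
    return chr(0x1D538 + ord(letter) - ord("A"))


# Assumption: MSBM10 stores blackboard bold A-Z at the ASCII codes 65-90.
MSBM10_BB = {code: blackboard_bold(chr(code)) for code in range(65, 91)}
```

Checking the exceptions first keeps the block arithmetic from ever producing a reserved code point.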
Problem
PDFs compiled with pdflatex embed Computer Modern math fonts (CMEX/CMSY/CMMI) without a ToUnicode CMap. pdfminer can't map their glyphs, so math comes out as unreadable (cid:12)(cid:104)… tokens: sums, integrals, roots, norms, delimiters, and Greek are all lost.
Fix
A font-aware CID decoder:
- build_cid_map makes one low-level pdfminer pass, reading each unmapped glyph's fontname so the correct per-font encoding table is used (cid:12 is | in CMEX10 but a different glyph elsewhere, so resolution must be font-keyed).
- decode_cids substitutes the tokens, grouping them into clusters (≈ one formula). Each cluster gets a confidence = resolved/total; clusters below 0.6 are wrapped as <!-- FORMULA: ... --> instead of being half-mistranslated. Unknown (font, cid) pairs are never guessed.
Behavior
On by default; pass decode_cid=False to keep the raw tokens. This changes default PDF output for math documents.
Tests
test_pdf_cid.py + a generated test_math_cid.pdf fixture: default-on decoding, opt-out, prose-unaffected regression, clean-Unicode-not-corrupted, and CMSY/CMMI table coverage. Smoke vector added.
MarkItDown — LaTeX math PDF (before)
Induced norm
∥x∥ = (cid:112)⟨x, x⟩ = (cid:32) (cid:88) |x_j|2 (cid:33)1/2
Spectral theorem
A = (cid:90) λ dE(λ)
Variational problem
J[u] = (cid:90) ½ ∥(cid:114)u∥2 − f u dx, (cid:114)2u = f
Inner product of a tensor product
(cid:42) (cid:81) (v_i (cid:10) w_i) (cid:43) = (cid:34) (cid:83) S_k (cid:35) ∩ { z : |z| ≤ 1 }
MarkItDown — LaTeX math PDF (after)
Induced norm
∥x∥ = √⟨x, x⟩ = ( ∑ |x_j|2 )1/2
Spectral theorem
A = ∫ λ dE(λ)
Variational problem
J[u] = ∫ ½ ∥∇u∥2 − f u dx, ∇2u = f
Inner product of a tensor product
⟨ ∏ (v_i ⊗ w_i) ⟩ = [ ∪ S_k ] ∩ { z : |z| ≤ 1 }
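The font-keyed lookup and the cluster confidence rule described under Fix could look roughly like the sketch below. It is a simplified illustration, not the implementation in this PR: `FONT_TABLES`, `base_font`, and `decode_cluster` are hypothetical names, the tables hold only a few codes from the sample above, and a cluster is modeled as a plain list of (fontname, cid) pairs rather than being derived from a pdfminer pass. Stripping a subset tag such as `ABCDEF+` from the font name is also an assumption about how embedded font names appear.

```python
# Hypothetical excerpt of the per-font tables: only codes from the sample above.
FONT_TABLES = {
    "CMEX10": {88: "∑", 90: "∫", 112: "√"},
    "CMSY10": {10: "⊗", 114: "∇"},
}
THRESHOLD = 0.6  # from the description: clusters below 0.6 use the fallback


def base_font(fontname):
    """Drop a PDF subset tag such as 'ABCDEF+' and normalize case."""
    return fontname.split("+")[-1].upper()


def decode_cluster(glyphs):
    """Decode one cluster of (fontname, cid) pairs.

    Returns (text, confidence), where confidence = resolved / total.
    """
    # Resolution is keyed on (font, cid); unknown pairs stay None, never guessed.
    resolved = [FONT_TABLES.get(base_font(f), {}).get(cid) for f, cid in glyphs]
    raw = [f"(cid:{cid})" for _, cid in glyphs]
    hits = sum(r is not None for r in resolved)
    confidence = hits / len(glyphs) if glyphs else 1.0
    if confidence < THRESHOLD:
        # Low confidence: keep the raw tokens, wrapped as a formula marker.
        return "<!-- FORMULA: " + " ".join(raw) + " -->", confidence
    # High confidence: substitute what resolved, leave the rest as raw tokens.
    text = " ".join(r if r is not None else t for r, t in zip(resolved, raw))
    return text, confidence
```

For instance, `decode_cluster([("CMSY10", 114)])` resolves to ∇, while the same code under CMEX10 is absent from this excerpt and falls back to the `<!-- FORMULA: ... -->` wrapper, which is why the lookup has to be keyed on the font as well as the code.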
🤖 Generated with Claude Code.