Add font-aware decoder for LaTeX math glyphs in PDFs #2150

Open
jonasvq wants to merge 3 commits into microsoft:main from jonasvq:fix/markitdown-cid-math-decoding

jonasvq commented Jun 21, 2026

Problem

PDFs compiled with pdflatex embed the Computer Modern math fonts (CMEX/CMSY/CMMI) without a ToUnicode CMap. pdfminer therefore cannot map their glyphs to Unicode, and math comes out as unreadable (cid:12)(cid:104)… tokens: sums, integrals, roots, norms, delimiters, and Greek letters are all lost.

Fix

A font-aware CID decoder:

  • build_cid_map makes one low-level pdfminer pass, reading each unmapped glyph's fontname so the correct per-font encoding table is used (cid:12 is | in CMEX10 but a different glyph elsewhere — resolution must be font-keyed).
  • decode_cids substitutes the tokens, grouping them into clusters (≈ one formula). Each cluster gets a confidence = resolved/total; clusters below 0.6 are wrapped as <!-- FORMULA: ... --> instead of being half-mistranslated. Unknown (font, cid) pairs are never guessed.
  • Tables verified glyph-by-glyph against the embedded font encodings; design-size and Latin Modern variants (CMSY8, LMMI10, …) map to the same family table.
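
For reviewers, the font-keyed lookup and the cluster confidence rule described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: `FAMILY_TABLES`, `font_family`, and `decode_cluster` are hypothetical names, the tables are truncated to a few codes taken from the before/after samples (with the font assignment assumed), and wrapping the raw tokens in the fallback comment is an assumption about what goes inside `<!-- FORMULA: ... -->`.

```python
import re

# Illustrative, heavily truncated per-family tables (code -> Unicode).
# The real tables are assumed to cover far more codes.
FAMILY_TABLES = {
    "CMEX": {12: "|", 88: "∑", 90: "∫", 112: "√"},
    "CMSY": {10: "⊗", 114: "∇"},
}


def font_family(fontname):
    """Reduce a PDF font name to its family key, e.g. 'ABCDEF+CMSY8' -> 'CMSY'."""
    name = fontname.split("+", 1)[-1].upper()  # drop a subset prefix, if any
    name = re.sub(r"\d+$", "", name)           # drop the design size
    if name.startswith("LM"):                  # Latin Modern shares the CM table
        name = "CM" + name[2:]
    return name


def decode_cluster(glyphs, threshold=0.6):
    """Decode one cluster of (fontname, cid) pairs.

    Returns (text, confidence), where confidence = resolved / total.
    Unknown (font, cid) pairs are never guessed; they stay as raw tokens.
    """
    if not glyphs:
        return "", 1.0
    pieces, hits = [], 0
    for fontname, cid in glyphs:
        char = FAMILY_TABLES.get(font_family(fontname), {}).get(cid)
        if char is None:
            pieces.append(f"(cid:{cid})")
        else:
            pieces.append(char)
            hits += 1
    confidence = hits / len(glyphs)
    if confidence < threshold:
        # Low confidence: keep the whole cluster raw inside a marker comment
        # rather than emit a half-translated formula.
        raw = "".join(f"(cid:{cid})" for _, cid in glyphs)
        return f"<!-- FORMULA: {raw} -->", confidence
    return "".join(pieces), confidence
```

Keying the lookup on the family rather than on the bare code is what keeps a code such as 12 from being resolved to the CMEX glyph when it actually came from another font.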

Behavior

On by default; pass decode_cid=False to keep the raw tokens. This changes default PDF output for math documents.

Tests

test_pdf_cid.py + a generated test_math_cid.pdf fixture: default-on decoding, opt-out, prose-unaffected regression, clean-Unicode-not-corrupted, and CMSY/CMMI table coverage. Smoke vector added.

MarkItDown — LaTeX math PDF (before)

Default output without CID decoding. pdfminer cannot map the Computer Modern
math glyphs, so they leak through as raw (cid:N) tokens.

Induced norm

∥x∥ = (cid:112)⟨x, x⟩ = (cid:32) (cid:88) |x_j|2 (cid:33)1/2

Spectral theorem

A = (cid:90) λ dE(λ)

Variational problem

J[u] = (cid:90) ½ ∥(cid:114)u∥2 − f u dx, (cid:114)2u = f

Inner product of a tensor product

(cid:42) (cid:81) (v_i (cid:10) w_i) (cid:43) = (cid:34) (cid:83) S_k (cid:35) ∩ { z : |z| ≤ 1 }

MarkItDown — LaTeX math PDF (after)

Output with CID decoding (now on by default). Each (cid:N) is resolved
font-aware through the Computer Modern encoding tables to real Unicode.

Induced norm

∥x∥ = √⟨x, x⟩ = ( ∑ |x_j|2 )1/2

Spectral theorem

A = ∫ λ dE(λ)

Variational problem

J[u] = ∫ ½ ∥∇u∥2 − f u dx, ∇2u = f

Inner product of a tensor product

⟨ ∏ (v_i ⊗ w_i) ⟩ = [ ∪ S_k ] ∩ { z : |z| ≤ 1 }

🤖 Generated with Claude Code.

PDFs from pdflatex embed Computer Modern math fonts without a ToUnicode
CMap, so pdfminer emits unreadable (cid:N) tokens for sums, integrals,
roots, delimiters, and Greek. Add a font-aware decoder that reads each
glyph's font name and maps it through per-font CMEX/CMSY/CMMI tables,
with cluster-level confidence and a <!-- FORMULA --> fallback for
low-confidence runs.

On by default; pass decode_cid=False to keep the raw tokens. Note: this
changes default PDF output for math documents.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

jonasvq commented Jun 21, 2026

@microsoft-github-policy-service agree

jonasvq and others added 2 commits June 21, 2026 17:33
Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>
Fill cmex10/cmsy10/cmmi10 to full 0-127 from the Computer Modern AFM
metrics (adds contour integrals, coproduct, square union, big floor/
ceil, extensible delimiter pieces; fixes a CMMI epsilon swap where code
15 is epsilon1, the lunate \epsilon, and 34 is epsilon, i.e. \varepsilon).

Extend decoding to the other math fonts that lack ToUnicode:
- CMBSY10 (bold math symbols) and CMMIB10 (bold math italic), whose
  encodings are identical to CMSY10/CMMI10.
- MSAM10 / MSBM10 (AMS symbols): squares, harpoons, negated relations,
  blackboard bold A-Z (Letterlike exceptions override the U+1D538 block),
  Hebrew letters.
- LASY10 (LaTeX symbols), mapped conservatively from the published
  encoding table since its AFM glyph names are opaque.

Unmappable or ambiguous codes are omitted so they fall through to the
existing confidence/fallback path.
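
The blackboard bold note in the MSBM10 item can be illustrated with a short sketch. It assumes the double-struck capitals sit at codes 65-90 in MSBM10 (the ASCII positions of A-Z); `LETTERLIKE`, `double_struck`, and `MSBM_CAPITALS` are hypothetical names, not identifiers from this change. Seven capitals were encoded in the Letterlike Symbols block before the U+1D538 block existed, and their slots in that block are reserved, so a plain offset would yield unassigned code points for them.

```python
# Double-struck capitals that live in Letterlike Symbols, not at U+1D538 + offset.
LETTERLIKE = {
    "C": "\u2102",  # ℂ
    "H": "\u210D",  # ℍ
    "N": "\u2115",  # ℕ
    "P": "\u2119",  # ℙ
    "Q": "\u211A",  # ℚ
    "R": "\u211D",  # ℝ
    "Z": "\u2124",  # ℤ
}


def double_struck(letter):
    """Map an ASCII capital to its double-struck Unicode character."""
    if letter in LETTERLIKE:
        return LETTERLIKE[letter]
    return chr(0x1D538 + ord(letter) - ord("A"))


# Assumed MSBM10 slice: codes 65-90 hold blackboard bold A-Z.
MSBM_CAPITALS = {code: double_struck(chr(code)) for code in range(65, 91)}
```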

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>