Expand LaTeX symbol coverage and merge KaTeX lookup tables #7

Open
chitwitgit wants to merge 10 commits into md2docx:main from chitwitgit:feat/expand-latex-symbol-coverage

Conversation

chitwitgit commented Jun 9, 2026

Summary

Closes #6.

Replaces the hand-maintained LATEX_SYMBOLS map (~120 entries) in lib/src/index.ts with generated symbol data to support substantially more LaTeX commands in DOCX output.

KaTeX v0.16.22 source is fetched at codegen time (MIT) as a practical seed — a large, well-structured baseline that is easy to regenerate. The goal is broad LaTeX command coverage, not parity with any particular renderer; other sources and manual overrides can be layered on later.

This PR folds in the symbol-table optimization previously proposed in #8:

  • Flatten symbol + alias + override entries at codegen into a single katexData.ts lookup map (641 merged entries)
  • Replace 3-table lookup chain with KATEX_SYMBOLS[name]
  • Structural OMML handlers for n-ary operators, binom, stackrel, accents, and font wrappers
  • Skip empty inline/block math to prevent Word document corruption
  • Adds pnpm generate:katex to regenerate tables from KaTeX source
  • Adds lib/scripts/benchmark-bundle-formats.ts to compare serialization formats
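
A minimal sketch of the flattening step in the list above, assuming three small hypothetical input tables. The real generated tables and their shapes are not shown in this description, so `mergeTables`, `lookup`, and every sample entry (including the em-space glyph for `\quad`) are illustrative assumptions rather than the identifiers in `katexData.ts`.

```typescript
type SymbolTable = Readonly<Record<string, string>>;

// Hypothetical sample data; the generated tables hold hundreds of entries.
const sampleSymbols: SymbolTable = { "\\neq": "≠", "\\alpha": "α", "\\wedge": "∧" };
const sampleAliases: SymbolTable = { "\\ne": "\\neq" }; // alias -> canonical command
const sampleOverrides: SymbolTable = { "\\quad": "\u2003" }; // assumed glyph (em space)

/** Resolve aliases and apply overrides once, at codegen time. */
export function mergeTables(
  symbols: SymbolTable,
  aliases: SymbolTable,
  overrides: SymbolTable,
): Record<string, string> {
  const merged: Record<string, string> = { ...symbols };
  for (const [alias, target] of Object.entries(aliases)) {
    const glyph = symbols[target];
    // Aliases that point at an unknown command are dropped.
    if (glyph !== undefined) merged[alias] = glyph;
  }
  // Overrides are applied last, so they win over symbols and aliases.
  for (const [name, glyph] of Object.entries(overrides)) merged[name] = glyph;
  return merged;
}

export const MERGED = mergeTables(sampleSymbols, sampleAliases, sampleOverrides);

/** Single-step runtime lookup that ignores inherited keys such as "constructor". */
export function lookup(name: string): string | undefined {
  return Object.prototype.hasOwnProperty.call(MERGED, name) ? MERGED[name] : undefined;
}
```

Resolving aliases ahead of time is what lets the runtime path collapse to one property read, at the cost of storing the glyph once per alias.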

Bundle size (local pnpm build in lib/)

| Version | gzip CJS |
| --- | --- |
| 3-table lookup | 6,284 B |
| Merged lookup (this PR) | 6,189 B |
| Savings | −116 B (−1.8%) |

Test plan

  • pnpm typecheck passes in lib/
  • pnpm build passes in lib/
  • pnpm test passes in lib/
  • Spot-check DOCX output for common symbols (\neq, \alpha, \sum, \hat{x})

Replace the hand-maintained symbol map with generated tables seeded from
KaTeX v0.16.22 source snippets, add accent and function handling, and
document regeneration via pnpm generate:katex.

Fixes md2docx#6

deepsource-io Bot commented Jun 9, 2026

DeepSource Code Review

We reviewed changes in 20db5a5...19cfdc0 on this pull request. Below is the summary for the review, and you can see the individual issues we found as inline review comments.

Code Review Summary

| Analyzer | Updated (UTC) |
| --- | --- |
| JavaScript | Jun 21, 2026 2:35 p.m. |

chitwitgit commented Jun 9, 2026

Author

Summary

Follow-up to #7.

PR #7 replaces the hand-maintained symbol map with KaTeX-generated tables (~567 symbols). That covers most macros where the fix is simply resolving a command to the right Unicode glyph (e.g. \wedge).

The port still had gaps: some common macros need more than symbol lookup. N-ary operators must emit m:nary, not a ∫/∏ character in a text run. Layout commands like \binom and \stackrel need delimiter and limit structures. Font wrappers should render their argument, not the macro name. A few KaTeX macro-only symbols (\quad, \ne, \cdots) were also missing from the generated overrides.

This PR adds explicit handlers for those cases (~115 lines on top of #7).

What changed

lib/src/index.ts (+~110 lines)

| Category | Macros | OMML |
| --- | --- | --- |
| N-ary operators | \prod, \int, \oint, \bigcup, \bigcap, \bigoplus, \bigotimes | m:nary |
| Layout | \binom{a}{b} | m:d (round brackets + fraction) |
| Layout | \stackrel{a}{b} | m:limLoc + MathLimitUpper |
| Accents | \overline, \widetilde | createMathAccentCharacter |
| Font/text | \mathrm, \mathit, \textbf, \textit, \underline, \overbrace, \underbrace | argument content only (no literal macro name) |
| Skip | \boxed, \boldsymbol | no literal fallback |
| Other | \newline | space (non-empty OMML) |

\sum unchanged — already handled via MathSum.
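
The categories in the table above can be read as a dispatch step that runs before plain symbol lookup. The sketch below is only an illustration of that idea: `MacroKind`, `classifyMacro`, and the set names are hypothetical and not taken from `lib/src/index.ts`; the macro lists are copied from the table.

```typescript
type MacroKind = "nary" | "layout" | "accent" | "argumentOnly" | "skip" | "space" | "symbol";

// Macro lists copied from the table; the grouping names are assumptions.
const NARY = new Set([
  "\\prod", "\\int", "\\oint", "\\bigcup", "\\bigcap", "\\bigoplus", "\\bigotimes",
]);
const LAYOUT = new Set(["\\binom", "\\stackrel"]);
const ACCENT = new Set(["\\overline", "\\widetilde"]);
const ARGUMENT_ONLY = new Set([
  "\\mathrm", "\\mathit", "\\textbf", "\\textit", "\\underline", "\\overbrace", "\\underbrace",
]);
const SKIP = new Set(["\\boxed", "\\boldsymbol"]);

/** Decide which handler a macro needs; anything unlisted falls through to symbol lookup. */
export function classifyMacro(name: string): MacroKind {
  if (NARY.has(name)) return "nary";
  if (LAYOUT.has(name)) return "layout";
  if (ACCENT.has(name)) return "accent";
  if (ARGUMENT_ONLY.has(name)) return "argumentOnly";
  if (SKIP.has(name)) return "skip";
  if (name === "\\newline") return "space";
  return "symbol";
}
```

Checking the structural categories first keeps a command like `\int` from being resolved to a bare glyph in a text run.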

lib/scripts/generate-katex-data.ts (+4 lines)

Codegen fixes for macro-only symbols missed by simple alias parsing:

  • \quad, \qquad — fixed \\\\hskip regex for vendored KaTeX file
  • \ne (alias of \neq)
  • \cdots (via \@cdots)

Regenerated katexMeta.ts: 21 → 25 KATEX_SYMBOL_OVERRIDES.

Test plan

  • pnpm build passes in lib/
  • pnpm test passes in lib/
  • Spot-check in Word: \binom{n}{k}, \int_0^1, \stackrel{def}{=}, \mathrm{ABC}, \prod_{i=1}^n

chitwitgit force-pushed the feat/expand-latex-symbol-coverage branch from feae6c4 to 977dd65 on June 9, 2026 08:46
Map n-ary operators, binom, stackrel, accents, and font wrappers to proper
Word OMML instead of Unicode fallbacks; fix quad/ne/cdots codegen overrides.
chitwitgit force-pushed the feat/expand-latex-symbol-coverage branch from 977dd65 to f05e5b5 on June 9, 2026 08:50
chitwitgit and others added 2 commits June 10, 2026 10:15
Log console errors and omit unrenderable OMML instead of emitting empty <m:oMath> elements that break Microsoft Word.
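
One way to picture the guard described in this commit message, as a hedged sketch: the hypothetical `wrapOMath` helper below works on already-serialized OMML strings and returns `null` instead of an empty `<m:oMath>` element. The plugin itself presumably works with node objects rather than raw strings, so treat this as an illustration of the rule, not the actual code.

```typescript
/**
 * Wrap serialized OMML children in <m:oMath>, or return null when nothing
 * renderable is left so the caller can omit the element entirely.
 */
export function wrapOMath(children: readonly string[]): string | null {
  const nonEmpty = children.filter((child) => child.trim() !== "");
  if (nonEmpty.length === 0) {
    console.error("Skipping math block: no renderable OMML content");
    return null;
  }
  return `<m:oMath>${nonEmpty.join("")}</m:oMath>`;
}
```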
mayank1513 self-requested a review on June 14, 2026 09:52
@mayank1513

Contributor

Thanks for the work on this PR. The refactoring appears cleaner and more extensible, especially around n-ary operators and symbol handling.

One concern is that the bundle size increases by roughly 3×, while the generated DOCX output appears unchanged in my testing, and matrix support still doesn't seem to work in either implementation.

Could you update the root sample.md with examples of LaTeX syntax that is now supported or unblocked by this redesign? Concrete examples would make it much easier to evaluate the practical benefits of the changes.

Thanks again for the contribution and for improving the maintainability of the math plugin.

@mayank1513

Contributor

Also, can you please specify the exact origin of the KaTeX files you have added under scripts/data? I would prefer that the script itself fetch them from the authentic source; I think users would trust that more.

chitwitgit commented Jun 14, 2026

Author

> Thanks for the work on this PR. The refactoring appears cleaner and more extensible, especially around n-ary operators and symbol handling.
>
> One concern is that the bundle size increases by roughly 3×, while the generated DOCX output appears unchanged in my testing, and matrix support still doesn't seem to work in either implementation.
>
> Could you update the root sample.md with examples of LaTeX syntax that is now supported or unblocked by this redesign? Concrete examples would make it much easier to evaluate the practical benefits of the changes.
>
> Thanks again for the contribution and for improving the maintainability of the math plugin.

Thanks for taking the time to review this — really appreciate the feedback.

You're right that a quick pass over the current sample.md won't show much difference. Most of what's in there (\alpha, \frac, \sum, etc.) already worked with the old hand-maintained symbol map. The changes show up in the long tail: macros that used to render as plain text in Word.

That's actually what pushed me to open this PR. I kept hitting cases like \triangle, \wedge, and \ne where some symbols worked and others didn't — the macro name would just appear literally in the document. I've been using the expanded version via my published fork @chitwitgit/m2d-math and those cases are now fixed.

I'll update sample.md with a short section of concrete before/after examples — things like \triangle ABC, \binom{n}{k}, \stackrel{def}{=}, \prod_{i=1}^n, accents, and font wrappers — so the benefit is easier to see when converting to DOCX.

On bundle size: fair concern. It does grow (~3×), mostly from the generated symbol tables (~567 entries vs ~120 before). The runtime logic is actually a bit cleaner, but the tradeoff is real: you sacrifice bundle size to support a broader set of LaTeX symbols and macros. I don't think the original hand-maintained set is comprehensive enough for anyone trying to use this seriously, and adding symbols one by one as use cases come up doesn't scale well. That's why I vendored a relatively complete set from KaTeX and run codegen from it. That said, I'm still happy to optimize where it makes sense. There are ways to transform the data so it minifies better, or to use alternative data structures to save space. Happy to discuss trimming the symbol set if you'd prefer a smaller default (though partial coverage tends to bring back the inconsistency I was trying to fix).

Add concrete newly-supported LaTeX examples to sample.md so DOCX
benefits are easy to spot. Replace vendored KaTeX snippets with
codegen that fetches KaTeX v0.16.22 from GitHub (symbols.js,
macros.js, functions/op.js).
@chitwitgit

Author

> Also, can you please specify the exact origin of the KaTeX files you have added under scripts/data? I would prefer that the script itself fetch them from the authentic source; I think users would trust that more.

Please check 1fe6f27.

The files under lib/scripts/data/ were verbatim excerpts from KaTeX v0.16.22 (MIT):

  • src/symbols.js
  • src/macros.js
  • src/functions/op.js

Source: https://github.com/KaTeX/KaTeX/tree/v0.16.22/src

Per your suggestion, I've removed the vendored copies and updated pnpm generate:katex to fetch those three files at codegen time from the tagged release:

https://raw.githubusercontent.com/KaTeX/KaTeX/v0.16.22/src/...

The version is pinned via a KATEX_VERSION constant in lib/scripts/generate-katex-data.ts, and the generated outputs (katexSymbols.ts, katexMeta.ts) include a header noting the source URL. Regeneration is auditable — run the script and it pulls directly from KaTeX's repo.
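
A rough sketch of what pinned fetching could look like, under stated assumptions. Only the `KATEX_VERSION` name, the three file paths, and the raw.githubusercontent.com URL pattern come from this comment; `KATEX_SOURCE_FILES`, `katexRawUrl`, `fetchKatexSources`, and the injected `FetchText` callback are hypothetical names chosen for illustration.

```typescript
const KATEX_VERSION = "0.16.22";

// The three upstream files named in this comment.
export const KATEX_SOURCE_FILES = [
  "src/symbols.js",
  "src/macros.js",
  "src/functions/op.js",
] as const;

/** Hypothetical transport callback, injected so the logic can be tested offline. */
type FetchText = (url: string) => Promise<string>;

/** Build the raw URL for a file at the pinned release tag. */
export function katexRawUrl(path: string, version: string = KATEX_VERSION): string {
  return `https://raw.githubusercontent.com/KaTeX/KaTeX/v${version}/${path}`;
}

/** Download every source file and return a map of path -> file contents. */
export async function fetchKatexSources(fetchText: FetchText): Promise<Map<string, string>> {
  const sources = new Map<string, string>();
  for (const path of KATEX_SOURCE_FILES) {
    sources.set(path, await fetchText(katexRawUrl(path)));
  }
  return sources;
}
```

On Node 18 or later, the callback could wrap the global `fetch`, check `res.ok`, and return `res.text()`. Keeping the tag in a single constant means a version bump is a one-line diff that reviewers can audit.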

KaTeX is used only as a structured seed for command → Unicode mappings at build time, not as a runtime dependency.

Also included in the same commit: the sample.md update from my earlier reply — concrete examples like \triangle, \wedge, \ne, \binom, \stackrel, accents, and n-ary limits so the before/after benefit is easier to see when converting to DOCX.

@chitwitgit

Author

Follow-up: opened #8 with a small bundle optimization on top of this PR.

It merges the three generated lookup tables (katexSymbols, alias/override in katexMeta) into a single katexData.ts with one flat map at runtime. Symbol coverage is unchanged — same 641 entries, just a cleaner layout that minifies slightly better.

Local pnpm build in lib/:

I also added a benchmark-bundle-formats.ts script that compared 8 serialization approaches (JSON.parse blobs, tuple arrays, gzip+base64, Map, etc.) — merged object literal was the best balance of size and maintainability. Happy to fold #8 into this PR instead if you'd prefer one merge.
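
For readers curious how such a comparison can be set up, here is a small sketch that gzips a few candidate module encodings of the same table and ranks them by size. It is not the contents of `benchmark-bundle-formats.ts`: the sample table, the three encodings, and the function names are assumptions, and it covers only a subset of the approaches listed above.

```typescript
import { gzipSync } from "node:zlib";

// Hypothetical sample; a real benchmark would load the full generated table.
const sampleTable: Record<string, string> = {
  "\\alpha": "α",
  "\\beta": "β",
  "\\neq": "≠",
  "\\wedge": "∧",
  "\\triangle": "△",
};

/** Size in bytes of the gzip-compressed module source. */
export function gzipBytes(source: string): number {
  return gzipSync(source).length;
}

/** Encode the same table as three alternative module sources. */
export function candidateModules(table: Record<string, string>): Record<string, string> {
  const json = JSON.stringify(table);
  return {
    objectLiteral: `export const T=${json};`,
    jsonParse: `export const T=JSON.parse(${JSON.stringify(json)});`,
    tupleArray: `export const T=Object.fromEntries(${JSON.stringify(Object.entries(table))});`,
  };
}

/** Rank the candidates from smallest to largest gzip size. */
export function rankBySize(
  candidates: Record<string, string>,
): Array<{ name: string; bytes: number }> {
  return Object.entries(candidates)
    .map(([name, source]) => ({ name, bytes: gzipBytes(source) }))
    .sort((a, b) => a.bytes - b.bytes);
}

export const ranking = rankBySize(candidateModules(sampleTable));
```

Results on a five-entry sample say little about the real table, so the ranking is only meaningful when run against the full generated data and the actual minified bundle.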

chitwitgit and others added 2 commits June 18, 2026 09:25
Flatten symbol, alias, and override entries into one generated
katexData.ts object literal for a smaller minified bundle (~1.7%
gzip savings vs three-table lookup). Add a benchmark script to compare
serialization formats.
chitwitgit changed the title from "Expand LaTeX symbol coverage using codegen from vendored tables" to "Expand LaTeX symbol coverage and merge KaTeX lookup tables" on Jun 18, 2026
@chitwitgit

Author

Hi @mayank1513 — following up on your earlier note to merge both branches and rebase on main.

Both feat/expand-latex-symbol-coverage and the symbol-table optimization from #8 are now folded into this PR, rebased on latest main, and #8 has been closed as superseded.

Build and tests pass locally (pnpm build, pnpm test in lib/). Would appreciate a review when you have a chance — happy to address any feedback.

codecov Bot commented Jun 20, 2026


Codecov Report

❌ Patch coverage is 98.44961% with 2 lines in your changes missing coverage. Please review.

| Files with missing lines | Patch % | Lines |
| --- | --- | --- |
| lib/src/index.ts | 98.41% | 2 Missing ⚠️ |

@mayank1513

Contributor

Thanks a lot for the updates. Word is crashing while trying to open the DOCX file generated after building your branch.

Refactor the LaTeX→OMML mapper so n-ary operators, accents, and scripts emit
schema-valid OMML, generate operator tables from KaTeX at codegen time, and
add fixture-based tests validated against the Microsoft 365 OOXML schema.
@chitwitgit

Author

Hi @mayank1513 — thanks for flagging the Word crash. I've pushed 19cfdc0 with a fix and regression coverage.

Root cause: several OMML structures we were emitting were schema-invalid for Word — most notably m:nary (missing or mis-ordered m:sub/m:sup/m:e children), standalone accent characters instead of proper m:acc wrappers, and ad-hoc script attachment via extra fields on MathRun nodes.

What changed:

  • Refactored the LaTeX→OMML mapper to use explicit pending markers (n-ary, accent, script, limits-text) instead of mutating MathRun objects
  • m:nary now always emits sub/sup/base in the required fixed order
  • Accents use m:acc with OMML combining marks (e.g. \hat{x}, \overline{AB}, \vec{v})
  • N-ary / integral / limits-text operator tables are generated from KaTeX at codegen time (KATEX_NARY_OPS, KATEX_INTEGRAL_OPS, KATEX_LIMITS_TEXT_OPS)
  • Added 32 markdown fixtures + a combined all-fixtures document; each is validated against the Microsoft 365 OOXML schema via @xarsh/ooxml-validator in CI (pnpm test — 35 tests, all passing)
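
To make the ordering rule concrete, here is a hedged sketch that serializes `m:nary` and `m:acc` as raw XML strings. The mapper in this PR presumably builds these through its DOCX library rather than string templates, so `naryXml`, `accentXml`, `mathRun`, and the `undOvr` limit placement are illustrative assumptions. What the sketch does reflect is the point above: `m:sub`, `m:sup`, and `m:e` are always emitted, in that order, with a missing limit hidden rather than omitted.

```typescript
const escapeXml = (text: string): string =>
  text.replace(/&/g, "&amp;").replace(/</g, "&lt;").replace(/>/g, "&gt;");

/** A plain math text run. */
export const mathRun = (text: string): string => `<m:r><m:t>${escapeXml(text)}</m:t></m:r>`;

export interface NaryParts {
  chr: string;  // operator glyph, e.g. "∏"
  base: string; // serialized OMML for the operand
  sub?: string; // serialized OMML for the lower limit
  sup?: string; // serialized OMML for the upper limit
}

/** Emit m:nary with sub, sup, and e always present and in fixed order. */
export function naryXml({ chr, base, sub, sup }: NaryParts): string {
  const props =
    `<m:chr m:val="${chr}"/>` +
    `<m:limLoc m:val="undOvr"/>` + // assumed placement; integrals often use subSup
    (sub === undefined ? `<m:subHide m:val="1"/>` : "") +
    (sup === undefined ? `<m:supHide m:val="1"/>` : "");
  return (
    `<m:nary><m:naryPr>${props}</m:naryPr>` +
    `<m:sub>${sub ?? ""}</m:sub><m:sup>${sup ?? ""}</m:sup><m:e>${base}</m:e></m:nary>`
  );
}

/** Wrap a base in m:acc with a combining mark, e.g. "\u0302" for \hat. */
export function accentXml(mark: string, base: string): string {
  return `<m:acc><m:accPr><m:chr m:val="${mark}"/></m:accPr><m:e>${base}</m:e></m:acc>`;
}
```

For example, `naryXml({ chr: "∏", sub: mathRun("i=1"), sup: mathRun("n"), base: mathRun("x") })` covers the `\prod_{i=1}^n` case, and `accentXml("\u0302", mathRun("x"))` covers `\hat{x}`.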

I've manually spot-checked the generated DOCX files in Word as well — they open cleanly now, including the combined fixture document that exercises every supported case in one file.

Could you try rebuilding from latest feat/expand-latex-symbol-coverage and opening the DOCX again? If Word still crashes on your side, a minimal markdown snippet that reproduces it would help narrow it down quickly.

Development

Successfully merging this pull request may close these issues.

Expand LaTeX symbol coverage beyond hand-maintained map