Expand LaTeX symbol coverage and merge KaTeX lookup tables#7
Expand LaTeX symbol coverage and merge KaTeX lookup tables#7chitwitgit wants to merge 10 commits into
Conversation
Replace the hand-maintained symbol map with generated tables seeded from KaTeX v0.16.22 source snippets, add accent and function handling, and document regeneration via pnpm generate:katex. Fixes md2docx#6
|
|
Overall Grade |
Security Reliability Complexity Hygiene |
Code Review Summary
| Analyzer | Status | Updated (UTC) | Details |
|---|---|---|---|
| JavaScript | Jun 21, 2026 2:35p.m. | Review ↗ |
Important
AI Review is run only on demand for your team. We're only showing results of static analysis review right now. To trigger AI Review, comment @deepsourcebot review on this thread.
SummaryFollow-up to #7. PR #7 replaces the hand-maintained symbol map with KaTeX-generated tables (~567 symbols). That covers most macros where the fix is simply resolving a command to the right Unicode glyph (e.g. The port still had gaps: some common macros need more than symbol lookup. N-ary operators must emit This PR adds explicit handlers for those cases (~115 lines on top of #7). What changed
|
| Category | Macros | OMML |
|---|---|---|
| N-ary operators | \prod, \int, \oint, \bigcup, \bigcap, \bigoplus, \bigotimes |
m:nary |
| Layout | \binom{a}{b} |
m:d (round brackets + fraction) |
| Layout | \stackrel{a}{b} |
m:limLoc + MathLimitUpper |
| Accents | \overline, \widetilde |
createMathAccentCharacter |
| Font/text | \mathrm, \mathit, \textbf, \textit, \underline, \overbrace, \underbrace |
argument content only (no literal macro name) |
| Skip | \boxed, \boldsymbol |
no literal fallback |
| Other | \newline |
space (non-empty OMML) |
\sum unchanged — already handled via MathSum.
lib/scripts/generate-katex-data.ts (+4 lines)
Codegen fixes for macro-only symbols missed by simple alias parsing:
\quad,\qquad— fixed\\\\hskipregex for vendored KaTeX file\ne→≠(alias of\neq)\cdots→⋯(via\@cdots)
Regenerated katexMeta.ts: 21 → 25 KATEX_SYMBOL_OVERRIDES.
Test plan
-
pnpm buildpasses inlib/ -
pnpm testpasses inlib/ - Spot-check in Word:
\binom{n}{k},\int_0^1,\stackrel{def}{=},\mathrm{ABC},\prod_{i=1}^n
feae6c4 to
977dd65
Compare
Map n-ary operators, binom, stackrel, accents, and font wrappers to proper Word OMML instead of Unicode fallbacks; fix quad/ne/cdots codegen overrides.
977dd65 to
f05e5b5
Compare
Log console errors and omit unrenderable OMML instead of emitting empty <m:oMath> elements that break Microsoft Word.
|
Thanks for the work on this PR. The refactoring appears cleaner and more extensible, especially around n-ary operators and symbol handling. One concern is that the bundle size increases by roughly 3×, while the generated DOCX output appears unchanged in my testing, and matrix support still doesn't seem to work in either implementation. Could you update the root Thanks again for the contribution and for improving the maintainability of the math plugin. |
|
Also, can you please specify exact origin of the katext files you have added under scripts/data. I would prefer if we can fetch from authentic source in the script itself - I think users would trust that more. |
Thanks for taking the time to review this — really appreciate the feedback. You're right that a quick pass over the current That's actually what pushed me to open this PR. I kept hitting cases like I'll update On bundle size — fair concern. It does grow (~3×), mostly from the generated symbol tables (~567 entries vs ~120 before). The runtime logic is actually a bit cleaner, but the tradeoff is real: you sacrifice bundle size to support a broader set of LaTeX symbols and macros. I don't think the original hand-maintained set is comprehensive enough for anyone trying to use this seriously, and adding symbols one-by-one as use cases come up doesn't scale well — that's why I vendored a relatively complete set from KaTeX and codegen from it. That said, I'm still happy to optimize where it makes sense. There are definitely ways to transform the data to make it minimize better or even use alternative data structures to save space. Happy to discuss trimming the symbol set if you'd prefer a smaller default (though partial coverage tends to bring back the inconsistency I was trying to fix). |
Add concrete newly-supported LaTeX examples to sample.md so DOCX benefits are easy to spot. Replace vendored KaTeX snippets with codegen that fetches KaTeX v0.16.22 from GitHub (symbols.js, macros.js, functions/op.js).
Please check The files under
Source: https://github.com/KaTeX/KaTeX/tree/v0.16.22/src Per your suggestion, I've removed the vendored copies and updated
The version is pinned via a KaTeX is used only as a structured seed for command → Unicode mappings at build time, not as a runtime dependency. Also included in the same commit: the |
|
Follow-up: opened #8 with a small bundle optimization on top of this PR. It merges the three generated lookup tables ( Local
I also added a |
Flatten symbol, alias, and override entries into one generated katexData.ts object literal for a smaller minified bundle (~1.7% gzip savings vs three-table lookup). Add a benchmark script to compare serialization formats.
|
Hi @mayank1513 — following up on your earlier note to merge both branches and rebase on Both Build and tests pass locally ( |
Codecov Report❌ Patch coverage is
📢 Thoughts on this report? Let us know! |
|
Thanks a lot for the updates. Word is crashing while trying to open the DOCX file generated after building your branch. |
Refactor the LaTeX→OMML mapper so n-ary operators, accents, and scripts emit schema-valid OMML, generate operator tables from KaTeX at codegen time, and add fixture-based tests validated against the Microsoft 365 OOXML schema.
|
Hi @mayank1513 — thanks for flagging the Word crash. I've pushed Root cause: several OMML structures we were emitting were schema-invalid for Word — most notably What changed:
I've manually spot-checked the generated DOCX files in Word as well — they open cleanly now, including the combined fixture document that exercises every supported case in one file. Could you try rebuilding from latest |
Summary
Closes #6.
Replaces the hand-maintained
LATEX_SYMBOLSmap (~120 entries) inlib/src/index.tswith generated symbol data to support substantially more LaTeX commands in DOCX output.KaTeX v0.16.22 source is fetched at codegen time (MIT) as a practical seed — a large, well-structured baseline that is easy to regenerate. The goal is broad LaTeX command coverage, not parity with any particular renderer; other sources and manual overrides can be layered on later.
This PR folds in the symbol-table optimization previously proposed in #8:
katexData.tslookup map (641 merged entries)KATEX_SYMBOLS[name]pnpm generate:katexto regenerate tables from KaTeX sourcelib/scripts/benchmark-bundle-formats.tsto compare serialization formatsBundle size (local
pnpm buildinlib/)Test plan
pnpm typecheckpasses inlib/pnpm buildpasses inlib/pnpm testpasses inlib/\neq,\alpha,\sum,\hat{x})