feat(llms): replace MDX source stripping with HTML -> Markdown pipeline for .md generation #339

Open
viktorkombov wants to merge 17 commits into vnext from vkombov/convert-api-links-to-md-in-llms-md

@viktorkombov viktorkombov commented Jun 18, 2026

Closes #338

Replaces the previous MDX-stripping approach (stripMdxForLlms) with a full
HTML→Markdown conversion pipeline that operates on the already-rendered Astro
output. Every .md file is now generated from the final rendered HTML — with
all JSX components resolved, ApiLinks pointing to real URLs, platform-specific
content inlined, and Shiki syntax highlighting stripped to raw code.

Why HTML→Markdown

The previous approach worked on MDX source files and required reverse-engineering
what each component renders to. Each new component or edge case (code fences,
DocsAside callouts, inline <br>, <div> wrappers) meant another fragile regex.
Working on the rendered HTML eliminates that class of problem entirely: the
browser-facing content is the source of truth.

Changes

New: src/html-to-md.ts

  • buildHtmlToMdConverter() — configured Turndown + GFM tables, created once
    per build and shared across all pages for performance
  • htmlPageToMd(htmlPath, siteUrl, td, sourceRef) — converts a single built
    HTML file to Markdown; returns '' when the file is missing or has no content
  • Shiki code blocks are extracted via raw-HTML regex before JSDOM parses the
    page, so token-span soup never reaches Turndown and code content is never mutated
  • DocsAside callouts → GFM blockquotes (> **Info:**)
  • Sample iframes → [title](url) links (fixes empty "And the result is:" seams)
  • Typography normalization via TYPOGRAPHIC_MAP — converts curly quotes, dashes,
    NBSP, narrow no-break space (\u202F), ®, etc. to ASCII without
    romanizing CJK characters (preserving Japanese/Korean builds)
  • <meta charset> stripped before JSDOM to prevent double-encoding mojibake
  • Internal doc links absolutized with .md extension for LLM-to-LLM navigation
  • sourceRef parameter: warnings point at the source .mdx file, not the
    built .html artifact
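
As a rough illustration of two items from the list above, here is a minimal sketch of pulling Shiki blocks out of the raw HTML before any DOM parsing, and of mapping typographic characters to ASCII. It is not the code from src/html-to-md.ts. The helper names extractShikiBlocks, decodeEntities, and normalizeTypography, the @@FENCE_n@@ placeholder, the assumed Shiki markup (a <pre> whose class list contains "shiki" and which carries data-language), and the map entries shown are all assumptions made for the example.

```typescript
// Sketch only: names and the assumed Shiki markup are illustrative, not the PR's code.
interface ExtractedFence {
  lang: string;
  code: string;
}

// Assumes each highlighted block is a <pre> whose class list contains "shiki".
const SHIKI_PRE_RE = /<pre\b([^>]*\bclass="[^"]*\bshiki\b[^"]*"[^>]*)>([\s\S]*?)<\/pre>/g;

function decodeEntities(s: string): string {
  return s
    .replace(/&lt;/g, "<")
    .replace(/&gt;/g, ">")
    .replace(/&quot;/g, '"')
    .replace(/&#39;|&#x27;/gi, "'")
    .replace(/&amp;/g, "&"); // last, so "&amp;lt;" decodes to "&lt;" rather than "<"
}

// Replace every Shiki block with a placeholder paragraph and keep the raw code aside,
// so a later HTML -> Markdown pass never sees the token spans.
function extractShikiBlocks(html: string): { html: string; fences: ExtractedFence[] } {
  const fences: ExtractedFence[] = [];
  const stripped = html.replace(SHIKI_PRE_RE, (_whole: string, attrs: string, inner: string) => {
    const lang = /\bdata-language="([^"]*)"/.exec(attrs)?.[1] ?? "";
    const code = decodeEntities(inner.replace(/<[^>]+>/g, ""));
    fences.push({ lang, code });
    return `<p>@@FENCE_${fences.length - 1}@@</p>`;
  });
  return { html: stripped, fences };
}

// Hypothetical subset of a typographic map; the PR's TYPOGRAPHIC_MAP may differ.
const TYPOGRAPHIC_MAP: Record<string, string> = {
  "\u2018": "'",   // left single quote
  "\u2019": "'",   // right single quote
  "\u201C": '"',   // left double quote
  "\u201D": '"',   // right double quote
  "\u2013": "-",   // en dash
  "\u2014": "-",   // em dash
  "\u00A0": " ",   // no-break space
  "\u202F": " ",   // narrow no-break space
  "\u00AE": "(R)", // registered sign
};

// Character class built from the map keys; none of them is special inside [...].
const TYPOGRAPHIC_RE = new RegExp(`[${Object.keys(TYPOGRAPHIC_MAP).join("")}]`, "g");

// Only the listed code points are rewritten, so CJK text passes through unchanged.
function normalizeTypography(text: string): string {
  return text.replace(TYPOGRAPHIC_RE, (ch) => TYPOGRAPHIC_MAP[ch] ?? ch);
}
```

In this sketch the placeholders would be swapped back for fenced blocks after the Markdown conversion, and normalizeTypography would run only on the converted prose, which is one way to keep code content byte-for-byte intact.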

Updated: src/integration.ts

  • generateLlmsMdFiles() extracted with typed GenerateLlmsMdOptions interface
  • withUtf8Bom() utility — idempotent BOM prepend, enabled for non-English
    builds (navLang !== 'en') so static preview servers render JP/KR correctly
  • Per-page and build-level selector guards: warns or fails with exitCode 1
    when Shiki selectors stop matching or >10% of pages produce empty output
  • Progress logging every 50 pages with elapsed time and fence-count summary
  • llms-small.txt fence regex uses a backreference to handle nested Markdown fences correctly
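
The utilities in this list are small enough to sketch. What follows is an illustration under stated assumptions, not the code in src/integration.ts. withUtf8Bom follows the description above (an idempotent BOM prepend). FENCE_RE, countFences, and emptyRatioExceeded are hypothetical names showing how a backreference keeps nested fences intact and how a 10% empty-page threshold could be checked.

```typescript
const UTF8_BOM = "\uFEFF";

// Idempotent: calling it twice never produces a double BOM.
function withUtf8Bom(text: string): string {
  return text.startsWith(UTF8_BOM) ? text : UTF8_BOM + text;
}

// Hypothetical fence matcher. Group 1 captures the opening run of backticks and
// \1 requires the closing line to repeat exactly that run, so a three-backtick
// fence nested inside a four-backtick fence does not end the outer block early.
const FENCE_RE = /^(`{3,})[^\n`]*\n(?:[\s\S]*?\n)?\1[ \t]*$/gm;

function countFences(md: string): number {
  return (md.match(FENCE_RE) ?? []).length;
}

// Build-level guard sketch: true when more than `threshold` of pages came out empty.
function emptyRatioExceeded(emptyPages: number, totalPages: number, threshold = 0.1): boolean {
  return totalPages > 0 && emptyPages / totalPages > threshold;
}
```

A caller could set process.exitCode = 1 when emptyRatioExceeded returns true, which would match the fail-the-build behaviour described in the list, though how the PR wires this up is not shown here.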

Acceptance checklist

  • Code blocks present and content-correct in generated .md files
  • No mojibake in any language build (EN, JP, KR)
  • DocsAside callouts render as > **Info:** / > **Warning:** blockquotes
  • Sample seams resolved: [title](demo-url) links where iframes were
  • API References section populated with resolved links
  • Internal links absolutized with .md suffix
  • Angular, React, WebComponents, Blazor (EN + JP) builds pass selector guard

@viktorkombov viktorkombov added ❌ status: awaiting-test PRs awaiting manual verification 🛠️ status: in-development Issues and PRs with active development on them and removed ❌ status: awaiting-test PRs awaiting manual verification labels Jun 18, 2026
@viktorkombov viktorkombov marked this pull request as draft June 18, 2026 13:47
viktorkombov and others added 3 commits June 19, 2026 14:02
- Introduced a new module `html-to-md.ts` for converting built Astro HTML pages to Markdown.
- Integrated `turndown` and `turndown-plugin-gfm` for Markdown formatting.
- Updated `package.json` to include new dependencies for Markdown conversion.
- Modified `integration.ts` to utilize the new HTML to Markdown conversion, replacing the previous MDX stripping logic.
- Enhanced logging for Markdown generation process and added selector guard for code blocks.
Comment thread src/html-to-md.ts Fixed
Comment thread src/html-to-md.ts Fixed
- generate Markdown from rendered HTML
- create full, abridged, and topic-specific LLM bundles
- improve Unicode, code fence, link, and metadata handling
- add conversion diagnostics and update related documentation
@viktorkombov viktorkombov added ❌ status: awaiting-test PRs awaiting manual verification and removed 🛠️ status: in-development Issues and PRs with active development on them labels Jun 22, 2026
@viktorkombov viktorkombov marked this pull request as ready for review June 22, 2026 16:39
@viktorkombov viktorkombov changed the title feat(llms): convert ApiLink components to markdown links in generated .md files feat(llms): replace MDX source stripping with HTML -> Markdown pipeline for .md generation Jun 23, 2026
Comment thread .github/workflows/lint.yml Outdated
Comment thread docs/angular/src/content/en/components/maps/map-api.mdx Outdated
viktorkombov and others added 5 commits June 23, 2026 19:07
Description fixes in MDX source files:
- Replace dash-prefixed API-list and changelog-fragment llms.description
  values with proper prose sentences (6 files: map-api EN/xplat, slider-ticks
  EN/JP, general-changelog-dv-react/wc JP)
- Remove stray trailing `{` from 14 JP llms.description fields (partial
  copy-paste of a template token)
- Fix AI toolchain description in xplat JP to use {ProductName} instead of
  hardcoded platform list (was incorrect for Blazor JP)
- Fix toc.json typo: チャートのテータ注釈 → チャートのデータ注釈

llms.ts / buildLlmsTxt localization:
- Add JP navigation-bucket labels to IGDOCS_BROAD_SECTIONS so 概要 and
  other JP toc headers are stripped from label prefixes (fixes "概要 ... Overview"
  double-label in angular-jp llms.txt)
- Add LLMS_TXT_STRINGS map with JP translations for the Documentation sets
  section heading and the two built-in doc-set links
- Add navLang and localizedDescription parameters to buildLlmsTxt so JP
  builds emit localized header blockquote and section labels

integration.ts / astro configs:
- Thread localizedDescription through CreateDocsSiteOptions → createDocsSite
  → siteMetaIntegration → buildLlmsTxt
- Supply JP localizedDescription in docs/angular/astro.config.ts and
  docs/xplat/astro.config.ts (per-platform JP descriptions)
- Make diagnosticSourceCandidates async (fsp.access instead of fs.existsSync)
  and replace hardcoded grid-type regex with generic segment-based lookup
…tion references (#354)

* Fix malformed tags, platform inconsistencies, and incorrect documentation references

* revert not needed change in igniteui licensing

* Revert not needed changes

* Additional fixed code snippets

* Resolve comments from code review
…o vkombov/convert-api-links-to-md-in-llms-md
Labels

❌ status: awaiting-test PRs awaiting manual verification

Development

Successfully merging this pull request may close these issues.

Improve generated .md documentation for LLMs across all platforms

3 participants