DOC-13760 produce markdown per page (WIP) by osfameron · Pull Request #863 · couchbase/docs-site

osfameron · 2025-11-27T17:49:20Z

This PR uses Antora's extension mechanism to generate .md files at the time of publishing the site.

We publish them as *.md in the same directory, and link to the file with a <link rel="alternate" ...> for each page.
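As a minimal sketch of the path mapping described above: the helper name `markdownAlternate` is hypothetical, and the `type="text/markdown"` attribute is an assumption, since the PR only shows `<link rel="alternate" ...>`. The href mirrors the `${page.out.rootPath}/${path}` form used in the extension.

```javascript
// Hypothetical helper: derive the sibling .md path and an alternate link for a page.
function markdownAlternate (htmlOutPath, rootPath) {
  // Only a trailing .html extension is swapped; other paths pass through unchanged.
  const mdPath = htmlOutPath.replace(/\.html$/, '.md')
  // type="text/markdown" is an assumption, not taken from the PR.
  const link = `<link rel="alternate" type="text/markdown" href="${rootPath}/${mdPath}">`
  return { mdPath, link }
}

console.log(markdownAlternate('server/current/install/index.html', '../../..'))
```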

Current approach

To avoid the memory issues detailed below, this run only creates the .md files and llms.txt

This means the generation will need to be split into 2 parts:

the main job runs as normal, with the UI adding the <link rel="alternate" ...> to each .html page
(At this point, Antora doesn't know anything about the .md files at all)
a job running with this supplied extension will generate the .md files and llms.txt
The built output will get copied and merged in the live S3 bucket.
This means that the .html and .md files will be eventually consistent
There is some additional work needed to make sure that old .md files are purged when the corresponding .html file is deleted.
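The purge step mentioned above could be sketched roughly as follows. This assumes we can obtain a flat list of object keys from the bucket and that every .md file is generated from a sibling .html file; `findStaleMarkdown` is a hypothetical helper, not part of this PR.

```javascript
// Hypothetical helper: list .md keys whose sibling .html key no longer exists.
function findStaleMarkdown (keys) {
  const htmlKeys = new Set(keys.filter((k) => k.endsWith('.html')))
  return keys
    .filter((k) => k.endsWith('.md'))
    .filter((k) => !htmlKeys.has(k.replace(/\.md$/, '.html')))
}

const keys = ['a/index.html', 'a/index.md', 'a/removed.md', 'llms.txt']
console.log(findStaleMarkdown(keys)) // [ 'a/removed.md' ]
```

Files such as llms.txt are ignored because only keys ending in .md are considered.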

Old description

There are a few issues to resolve.

FATAL: memory usage is too high and the Node runtime exits with FATAL ERROR: Ineffective mark-compacts near heap limit Allocation failed - JavaScript heap out of memory
I tried to account for this with NODE_OPTIONS: --max_old_space_size=16384 (doubled from previous 8192, already increased from default 4096) with no impact.
As this creates a file for Every Page stored within memory, this was always a possibility.
MITIGATION 1 (TODO): increase the heap size again
MITIGATION 2 (TODO): try only producing the .md file for the latest versions of components?
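Mitigation 2 might look something like the sketch below. It uses plain objects as stand-ins for Antora's components and pages; the field names (`name`, `latest.version`, `src.component`, `src.version`) are assumed to match the content catalog, and `pagesForLatestVersions` is a hypothetical helper.

```javascript
// Hypothetical helper: keep only pages that belong to the latest version of their component.
function pagesForLatestVersions (components, pages) {
  // Map component name -> latest version string.
  const latest = new Map(components.map((c) => [c.name, c.latest.version]))
  return pages.filter((p) => latest.get(p.src.component) === p.src.version)
}

const sampleComponents = [{ name: 'server', latest: { version: '8.0' } }]
const samplePages = [
  { src: { component: 'server', version: '8.0' } },
  { src: { component: 'server', version: '7.6' } }
]
console.log(pagesForLatestVersions(sampleComponents, samplePages).length) // 1
```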
MARKUP production: as we are going from AsciiDoc -> HTML -> Markdown, this is a little lossy. This POC inspects the HTML to try to rebuild admonitions (as GitHub Flavored Markdown alerts).
We could try using OpenDevise's https://github.com/opendevise/downdoc but the Antora pipeline doesn't currently seem to expose the collated AsciiDoc markup. For example, looking at the Generator Events, at the contentClassified stage we have the AsciiDoc source (but without includes processed), and at the following documentsConverted stage we get the output HTML, with no intermediate step.
We could potentially use Antora Assembler, but that works on a whole site nav, whereas we're looking at producing Markdown for every single file.
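To illustrate the admonition rebuilding mentioned above, here is a rough sketch of rendering an admonition as a GitHub alert. It is not the code in this PR; `toGfmAlert` is a hypothetical helper that takes the admonition type and its already-converted Markdown text.

```javascript
// The five alert types GitHub recognises; they match AsciiDoc's admonition names.
const ALERT_TYPES = new Set(['NOTE', 'TIP', 'IMPORTANT', 'WARNING', 'CAUTION'])

// Hypothetical helper: wrap text in a blockquote, with an alert marker when the type is known.
function toGfmAlert (type, text) {
  const label = type.toUpperCase()
  const lines = text.trim().split('\n')
  // Unknown types fall back to a plain blockquote rather than an invalid alert.
  const header = ALERT_TYPES.has(label) ? [`[!${label}]`] : []
  return [...header, ...lines].map((line) => `> ${line}`).join('\n')
}

console.log(toGfmAlert('note', 'Back up your data first.'))
```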

Copilot

Pull request overview

This PR implements an Antora extension to generate Markdown files from HTML output for LLM consumption. The extension creates .md files alongside HTML pages and adds alternate link metadata for each page.

Key Changes:

New Antora extension that converts HTML to Markdown using JSDOM and semantic markdown conversion
Configuration updates across preview and staging playbooks to enable the extension
Addition of new npm dependencies for HTML-to-Markdown conversion

Reviewed changes

Copilot reviewed 6 out of 7 changed files in this pull request and generated 4 comments.

File	Description
package.json	Adds dependencies for DOM parsing and HTML-to-Markdown conversion
lib/markdown-for-llm.js	Implements the core extension logic for converting pages to Markdown
home/preview/DOC-13760-produce-markdown-per-page.yml	Configures UI bundle override for preview environment
antora-playbook.preview.yml	Registers the markdown extension and adds analytics branch
antora-playbook-staging-chatbot.yml	Registers extension, updates UI bundle, and modifies SDK branches
antora-playbook-staging-chatbot.diff.yml	Adds extension registration and uncomments UI bundle configuration

Copilot · 2025-11-27T17:50:16Z

lib/markdown-for-llm.js

+function overrideElementProcessing (element) {
+
+  if (element.tagName?.toLowerCase() === 'a' 
+   && element.className === 'anchor' ) {


The condition uses className (string) for comparison, but the admonition check below uses classList.contains(). Use classList.contains('anchor') for consistency and to handle multiple classes correctly.

Suggested change

-   && element.className === 'anchor' ) {
+   && element.classList?.contains('anchor') ) {

Copilot · 2025-11-27T17:50:16Z

lib/markdown-for-llm.js

+    element.classList.remove('admonitionblock')
+    const admonition = element.className.toUpperCase()


After removing 'admonitionblock' from classList, element.className may contain other classes or be empty. This will produce incorrect admonition types. Extract the admonition class name before removing 'admonitionblock' or use a more specific selector to identify the admonition type.

Suggested change

-    element.classList.remove('admonitionblock')
-    const admonition = element.className.toUpperCase()
+    // Extract admonition type before removing 'admonitionblock'
+    const admonitionType = Array.from(element.classList).find(cls => cls !== 'admonitionblock') || '';
+    element.classList.remove('admonitionblock')
+    const admonition = admonitionType.toUpperCase()
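A slightly stricter variation on the suggestion above, as a sketch only: instead of taking the first class that is not 'admonitionblock', look for a class that names a known admonition type. `admonitionTypeFrom` is a hypothetical helper, and `classNames` stands in for `Array.from(element.classList)` so the example runs without a DOM.

```javascript
const ADMONITION_TYPES = ['note', 'tip', 'important', 'warning', 'caution']

// Hypothetical helper: pick the admonition type out of a list of class names.
function admonitionTypeFrom (classNames) {
  const found = classNames.find((cls) => ADMONITION_TYPES.includes(cls.toLowerCase()))
  // Return an empty string when no known type is present.
  return found ? found.toUpperCase() : ''
}

console.log(admonitionTypeFrom(['admonitionblock', 'note', 'extra'])) // NOTE
```

This avoids picking up unrelated classes that happen to appear before the admonition type.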

Copilot · 2025-11-27T17:50:17Z

lib/markdown-for-llm.js

+    const path = page.out.path.replace(/\.html$/, '.md')
+
+    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
+    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${path}`
+
+    siteCatalog.addFile({
+      contents: Buffer.from(markdown),
+      out: { path }


[nitpick] The variable name path shadows the outer scope if Node's path module is imported elsewhere. Consider renaming to mdPath or markdownPath to avoid potential confusion.

Suggested change

-    const path = page.out.path.replace(/\.html$/, '.md')
-
-    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
-    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${path}`
-
-    siteCatalog.addFile({
-      contents: Buffer.from(markdown),
-      out: { path }
+    const mdPath = page.out.path.replace(/\.html$/, '.md')
+
+    // tell docs-ui to output <link rel="alternate" ...> for the markdown page.
+    page.asciidoc.attributes["page-markdown-alt"] = `${page.out.rootPath}/${mdPath}`
+
+    siteCatalog.addFile({
+      contents: Buffer.from(markdown),
+      out: { path: mdPath }

Copilot · 2025-11-27T17:50:17Z

lib/markdown-for-llm.js

+module.exports.register = function ({ playbook, config }) {
+    const logger = this.getLogger('markdown-for-llm')
+
+    this.on('navigationBuilt', async ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {


The event handler is marked async but contains no await operations. Remove the async keyword as it's unnecessary and may create unneeded promise overhead when processing all pages.

Suggested change

-    this.on('navigationBuilt', async ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {
+    this.on('navigationBuilt', ({ playbook, siteAsciiDocConfig, siteCatalog, uiCatalog, contentCatalog }) => {

Reduce memory usage by *only* saving the Markdown version in this run. We can do this as a separate build and scp the output to the same bucket, e.g. this content would be overlaid on the *standard* build (which would only need to add the <link href="..."> URL feature). Rewrite relative hrefs to the .md target.
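The href rewriting mentioned above could be sketched as below. This is an illustration under assumptions, not the code in the PR: `rewriteHref` is a hypothetical helper, and it treats any link without a scheme as site-internal.

```javascript
// Hypothetical helper: point site-internal .html links at their .md siblings.
function rewriteHref (href) {
  // Leave absolute URLs (scheme or protocol-relative) and in-page anchors untouched.
  if (/^([a-z][a-z0-9+.-]*:|\/\/|#)/i.test(href)) return href
  // Swap a .html extension that is followed by end of string, a query, or a fragment.
  return href.replace(/\.html(?=$|[?#])/, '.md')
}

console.log(rewriteHref('../install/index.html#prerequisites'))
```

Root-relative links such as /server/current/index.html are also rewritten, on the assumption that they point within the same site.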

osfameron added 2 commits November 27, 2025 16:15

DOC-13760 produce markup file per published page

7c1cfa4

DOC-13760 test on staging chatbot

b04dafc

osfameron requested review from a team, Copilot and pavel-blagodov November 27, 2025 17:49

Copilot AI reviewed Nov 27, 2025

osfameron added 11 commits November 27, 2025 18:48

DOC-13760 remove unneeded async

9423a4a

copilot review suggests this may introduce unneeded overhead. As we're getting Node memory errors, then we may as well try!

DOC-13760 tidy up

974763a

fix page.pub.url to ensure that the site nav is also updated

DOC-13760 create an initial llms.txt

1d0c518

tidy

74aa07b

tidy

88e984e

Merge branch 'master' into DOC-13760-produce-markdown-per-page

f8e9ca5

reduce memory usage by explicitly freeing DOM

3c3d75b

change markdown parsing dependency

1756c9d

Use the one used by https://github.com/cerbos/antora-llm-generator

remove chunking, fix links

e166068

workaround for mangled noopener links in nav

ebcdf96

osfameron wants to merge 13 commits into master from DOC-13760-produce-markdown-per-page

1 participant
