feat(web): markdown content negotiation via Accept header#21

Merged
jhb-dev merged 14 commits into main from claude/review-open-prs-Bc4WA on Apr 5, 2026

Conversation

@jhb-dev jhb-dev commented Apr 4, 2026

Summary

Adds markdown content negotiation — when a client sends Accept: text/markdown, the server returns a Markdown version of the page instead of HTML.

How it works

  1. web/src/integrations/htmlToMarkdown.ts — Astro build integration that hooks into astro:build:done. After all static HTML files are emitted, it extracts <main> content, converts it to Markdown via turndown, prepends YAML frontmatter (title + description), and writes a .md file alongside each .html file.

  2. web/middleware.ts — Native Vercel middleware using @vercel/functions rewrite(). When Accept: text/markdown is present, rewrites the request path to the .md file. Excludes /api, /preview, /_astro, /assets, /favicon, and paths with file extensions.
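The rewrite decision in step 2 can be sketched as a pure function. This is an illustrative sketch only, not the code from `web/middleware.ts`: the helper name `markdownRewritePath` and the constant names are hypothetical, and the real middleware hands the resulting path to `rewrite()` from `@vercel/functions`.

```typescript
// Hypothetical helper: returns the .md path to rewrite to, or null to leave
// the request untouched. Prefixes mirror the exclusions listed in the PR.
const EXCLUDED_PREFIXES = ["/api", "/preview", "/_astro", "/assets", "/favicon"];

// Matches a file extension at the end of the path, e.g. "/logo.svg".
const FILE_EXTENSION = /\.\w+$/;

function markdownRewritePath(pathname: string, accept: string | null): string | null {
  // Only negotiate when the client explicitly asks for Markdown.
  if (!accept || !accept.includes("text/markdown")) return null;
  // Skip API routes, previews, build assets and the favicon.
  if (EXCLUDED_PREFIXES.some((prefix) => pathname.startsWith(prefix))) return null;
  // Skip anything that already names a concrete file.
  if (FILE_EXTENSION.test(pathname)) return null;
  return `${pathname}.md`;
}
```

Keeping the decision separate from the platform call makes the exclusions easy to unit-test without deploying.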

Usage

# HTML (default)
curl https://jhb.software/de/articles/my-article

# Markdown
curl -H "Accept: text/markdown" https://jhb.software/de/articles/my-article

Test plan

  • pnpm build in web/ succeeds and .md files are generated alongside .html
  • curl -H "Accept: text/markdown" <url> returns markdown
  • Normal browser requests still return HTML
  • /preview, /api, and static assets are excluded from rewriting
  • pnpm lint and pnpm check pass in web/

Closes #15, supersedes #16

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q

Copilot AI and others added 6 commits March 15, 2026 11:15
- Add `lexicalToMarkdown` utility that converts Payload's Lexical
  rich-text node tree to a Markdown string (headings, paragraphs,
  lists, blockquotes, links, images, code blocks, inline formatting)
- Add `[lang]/[...path].md.ts` Astro static endpoint that
  pre-generates a `.md` file with YAML front-matter for every article
  in every locale at build time
- Update `web/vercel.json` to prepend a `has`-based rewrite rule that
  transparently serves the pre-generated `.md` file to any request
  whose `Accept` header contains `text/markdown`

Co-authored-by: jhb-dev <20878653+jhb-dev@users.noreply.github.com>
- Remove articles-only filter: getStaticPaths now generates .md files
  for every page collection (articles, projects, authors, pages)
- Add per-collection frontmatter and richtext extraction:
  - projects: body richtext + customer/tags metadata
  - authors: description richtext + name/profession metadata
  - pages: extract all RichTextBlocks from sections[] with section
    titles (##) and subtitles (###) as Markdown headings
- Add richTextToMarkdown() helper to reduce duplication
- Guard project.tags with null-safety and use empty-string join for
  contentParts to avoid collapsing existing double newlines

Co-authored-by: jhb-dev <20878653+jhb-dev@users.noreply.github.com>
… conversion

- Remove `[lang]/[...path].md.ts` Astro endpoint and `lexicalToMarkdown.ts`
  utility (Lexical-specific approach, limited to certain CMS fields)
- Remove has-based markdown rewrite from vercel.json

- Add `scripts/html-to-markdown.mjs`: post-build script that scans
  `.vercel/output/static/` for all HTML pages, extracts `<main>` content,
  converts it to Markdown via turndown (removing nav/aside/header/footer),
  prepends YAML frontmatter (title + description from <head>) and saves as
  .md alongside the .html — generic, works for all current and future pages

- Add `src/middleware.ts`: Astro Edge Middleware (edgeMiddleware: true was
  already set) that rewrites requests carrying `Accept: text/markdown` to
  `/path.md`; skips known static asset extensions to avoid false rewrites

- Add turndown@7.2.2 and node-html-parser@7.1.0 to devDependencies
- Update build script: `astro build && node scripts/html-to-markdown.mjs`

Co-authored-by: jhb-dev <20878653+jhb-dev@users.noreply.github.com>
…ation

- Add `web/src/integrations/htmlToMarkdown.ts`: TypeScript Astro integration
  using the `astro:build:done` hook; uses async fs/promises APIs throughout;
  uses AstroIntegrationLogger for diagnostics
- Register `htmlToMarkdown()` in `web/astro.config.mjs` under `integrations`
- Remove `web/scripts/html-to-markdown.mjs` (replaced by integration)
- Revert `web/package.json` build script to just `astro build`
- Add `@types/turndown@^5.0.6` as devDependency; restore removed `dotenv`

Co-authored-by: jhb-dev <20878653+jhb-dev@users.noreply.github.com>
- Replace Astro edge middleware with native Vercel middleware.ts using
  @vercel/functions rewrite, avoiding upstream bug (withastro/astro#16156)
  where Astro's edge middleware drops HTTP method and body
- Lazy-load TurndownService and node-html-parser inside the build hook
  to avoid unnecessary work during dev server startup
- Improve YAML frontmatter escaping to handle newlines, carriage returns,
  and tabs in addition to backslashes and quotes
- Restore accidentally removed eslint devDependency
- Add @vercel/functions as explicit dependency
- Remove edgeMiddleware: true from Astro vercel adapter config

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
@vercel

vercel bot commented Apr 4, 2026

The latest updates on your projects. Learn more about Vercel for GitHub.

| Project | Deployment | Actions | Updated (UTC) |
| --- | --- | --- | --- |
| website | Ready | Preview, Comment | Apr 5, 2026 5:40pm |
| website-cms | Ready | Preview, Comment | Apr 5, 2026 5:40pm |

@claude

claude bot commented Apr 4, 2026

PR Review

Potential Bugs

Root path / not handled by middleware

The matcher regex /((?!api|preview|_astro|assets|favicon)(?!.*\\.\\w+$).+) uses .+ which requires at least one character after the leading /. This means a request to the root path / with Accept: text/markdown won't be matched — the capture group is empty, so .+ fails and middleware is never invoked. The homepage would always return HTML regardless of the Accept header.

Fix: change .+ to .* or add a separate matcher entry for /.
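The difference between `.+` and `.*` can be checked in isolation. The two expressions below are plain `RegExp` approximations of the matcher (anchored with `^` and `$`); Vercel compiles its own pattern from the matcher string, so treat this as an illustration of the quantifier behaviour rather than of the platform's exact matching.

```typescript
// `.+` needs at least one character after the leading slash.
const requiresOneChar = /^\/((?!api|preview|_astro|assets|favicon)(?!.*\.\w+$).+)$/;
// `.*` also accepts an empty remainder, so "/" matches.
const allowsEmpty = /^\/((?!api|preview|_astro|assets|favicon)(?!.*\.\w+$).*)$/;

const samples = ["/", "/de", "/de/articles/my-article", "/api/health", "/favicon.ico"];
for (const path of samples) {
  // Prints the path followed by the result for each variant.
  console.log(path, requiresOneChar.test(path), allowsEmpty.test(path));
}
```

Only the root path differs between the two variants; the excluded prefixes and file extensions are rejected by both.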


Performance Considerations

console.log / console.warn in the hot path (web/middleware.ts, lines 26–32)

These run on every matched request in production. The console.warn is a defensive guard that should never fire in practice (it only triggers if Vercel's matcher misfires), and the console.log logs every markdown rewrite. Consider removing them or gating behind a debug flag (e.g., an env var) to avoid unnecessary I/O on each request.


Minor / Style

  • Unused logic for method: The method variable (line 21) is extracted only to be interpolated into the console.log. If the log is removed, the variable can go with it.
  • User-controlled header in logs: accept.slice(0, 80) is logged — truncation is good, but if this ends up in a structured logging system, even truncated user-controlled values can cause noise. Not a real security issue here, just worth being aware of.

Everything else looks good

@jhb-dev jhb-dev changed the base branch from copilot/add-markdown-negotiation to main April 4, 2026 10:57
@jhb-dev jhb-dev changed the title refactor: use Vercel middleware for markdown content negotiation feat: markdown content negotiation via Accept header Apr 4, 2026
@jhb-dev
Contributor Author

jhb-dev commented Apr 4, 2026

No, the root path is not a problem, since there is no root page (only localized ones).

@claude

claude bot commented Apr 4, 2026

Review

Potential Bug: URL path mismatch between middleware and integration

The middleware rewrites /path to /path.md, but Astro's static build typically outputs path/index.html for clean URLs — meaning the integration creates path/index.md, not path.md. The rewrite would 404 for most pages.

This only works if Astro is configured with trailingSlash: 'never' (which outputs flat path.html files). The current astro.config.mjs diff doesn't show this setting. Verify the output structure — does the build produce dist/de/articles/my-article.html or dist/de/articles/my-article/index.html?

If output is index.html-based, the middleware rewrite target needs to be path + '/index.md' (or handle both cases).
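One way to "handle both cases" is at build time rather than in the middleware: map either output layout to the same flat `.md` file, so the `/path.md` rewrite target resolves regardless of how Astro emitted the HTML. The sketch below is hypothetical (the name `markdownOutputPath` is not from the PR) and works on POSIX-style paths relative to the output directory. The root `index.html` case is an added assumption so that the result never escapes the output directory.

```typescript
// Hypothetical helper: maps an emitted HTML file (relative to the output dir)
// to the Markdown file that should be written for it.
function markdownOutputPath(htmlRelPath: string): string | null {
  if (!htmlRelPath.endsWith(".html")) return null;
  const withoutExt = htmlRelPath.slice(0, -".html".length);
  // Assumption: keep the root page inside the output dir instead of
  // deriving a name from the parent directory.
  if (withoutExt === "index") return "index.md";
  // "path/index.html" layout: write "path.md" next to the directory.
  if (withoutExt.endsWith("/index")) {
    return `${withoutExt.slice(0, -"/index".length)}.md`;
  }
  // Flat "path.html" layout: write "path.md" alongside it.
  return `${withoutExt}.md`;
}
```

With this mapping, `de/articles/my-article/index.html` and `de/articles/my-article.html` both produce `de/articles/my-article.md`.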

Missing Content-Type response header

The middleware rewrites to the static .md file but doesn't set Content-Type: text/markdown. Vercel will serve .md files as text/plain by default, so the response won't advertise the correct media type even though the client explicitly requested text/markdown.

Consider adding a vercel.json headers rule to set Content-Type: text/markdown; charset=utf-8 for all /**/*.md paths.

.md files are publicly accessible without the Accept header

The generated .md files are static artifacts, so any client can fetch /about.md directly without sending Accept: text/markdown. May be intentional — worth documenting either way.

console.log on every markdown request

The log in middleware.ts fires on every markdown request in production. Consider gating behind a DEBUG env var or removing before merge.

Minor: unused method variable

const method = request.method is only referenced in the debug log. If the log is removed, it becomes dead code.


Overall: Clean approach — the integration/middleware split is a solid workaround for the Astro edge middleware bug (withastro/astro#16156). Main concern is path resolution: verify the dist/ output structure to confirm path.md is what Vercel resolves when serving /path.

claude added 2 commits April 4, 2026 11:27
- Add <link rel="alternate" type="text/markdown"> to every page's <head>
  so crawlers/agents can discover the markdown version
- Add /llms.txt endpoint listing all pages per locale with markdown
  negotiation instructions, following the llmstxt.org convention
- Reference llms.txt in robots.txt for discoverability

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
- Add Content-Type: text/markdown header for .md files in vercel.json
- Remove console.log/warn from middleware (production hot path)
- Remove unused method variable

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
@claude

claude bot commented Apr 4, 2026

PR Review: Markdown Content Negotiation

Overall this is a clean, well-structured implementation. A few issues worth addressing:

Bugs / Correctness

1. Redundant exclusion logic in middleware
The config.matcher already excludes paths with file extensions ((?!.*\\.\\w+$)), making the hasFileExtension guard inside the function body redundant. The early return on hasFileExtension can never be reached for requests that pass the matcher. Consider removing one or aligning them.

2. <link rel="alternate"> emitted unconditionally
SEOMetadata.astro always emits <link rel="alternate" type="text/markdown" href="{canonicalUrl}.md" />, but the integration skips pages without a <main> element (no .md file is written for them). This means the alternate link advertises a URL that will 404 for those pages.

3. YAML frontmatter escaping is incomplete
escapeYaml handles backslash, double-quote, and a few control characters. Since values are always double-quoted in the output this is mostly safe, but using the yaml package (already in the lockfile) to serialize frontmatter would be more robust and eliminate the custom escaping entirely.

4. llms.txt uses raw paths as link labels
The sitemap entries use the raw path as the link label (e.g. - [/de/articles/my-article](https://...)). Human-readable titles would be more useful for LLM consumers. If getSitemap does not expose titles, using the last path segment is at minimum better than the full path.

Performance

5. Sequential file I/O in the build integration
The for loop over htmlFiles awaits each readFile/writeFile in sequence. Replacing with Promise.all(htmlFiles.map(...)) (or a concurrency-limited variant) would parallelize the work significantly for large sites.

6. Two HTML parsers in the build toolchain
node-html-parser handles <main>, <title>, and <meta> extraction, while turndown brings its own DOM implementation (@mixmark-io/domino). Both are build-only dependencies so runtime impact is zero, but worth noting.

Security / Edge Cases

7. No file-size guard in the build integration
readFile loads each HTML file fully into memory. A simple size check before processing would make this more defensive against unexpectedly large generated pages.

8. noindex pages get a public .md counterpart
Pages with noindex still get a .md file written and served. The alternate link also points to it. This is probably acceptable for a static site, but worth being aware of if any pages are intentionally kept low-profile.


The middleware design rationale (native Vercel vs Astro edge middleware) is well-documented and the upstream bug reference is a helpful trail for future debugging.

Prevent search engines from indexing the markdown versions of pages
and the llms.txt endpoint via X-Robots-Tag: noindex header.

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
@claude

claude bot commented Apr 4, 2026

Review

Overall this is a well-structured feature. A few issues worth addressing:

Bugs

Trailing slash → wrong .md path (middleware.ts:25)

Paths like /de/articles/my-article/ (which Astro commonly emits) would be rewritten to /de/articles/my-article/.md instead of /de/articles/my-article.md. The rewrite should strip a trailing slash first:

```ts
const normalizedPath = path.endsWith('/') && path !== '/' ? path.slice(0, -1) : path
return rewrite(new URL(normalizedPath + '.md', request.url))
```

Root path / not matched (middleware.ts:30)

The matcher pattern /.+ requires at least one character after the leading /, so a request to / with Accept: text/markdown is never rewritten. Probably not an issue if the homepage always has a locale prefix, but worth confirming.

link rel="alternate" for pages without a .md file (SEOMetadata.astro:26)

The <link rel="alternate" type="text/markdown" href="..."> is added to every page, but htmlToMarkdown.ts skips pages that have no <main> element. Clients following the alternate link would get a 404.

Performance

Sequential file I/O in the build hook (htmlToMarkdown.ts:74–92)

The for...of loop with await readFile / await writeFile processes files one at a time. On a large site this can noticeably slow the build. Consider batching with Promise.all:

```ts
await Promise.all(htmlFiles.map(async (htmlPath) => { … }))
```
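An unbounded `Promise.all` opens every file at once, which can exhaust file descriptors on very large sites. A concurrency-limited variant is sketched below; `mapWithConcurrency` is a hypothetical helper, not part of the PR or of any library it uses.

```typescript
// Runs `worker` over `items` with at most `limit` calls in flight,
// returning results in input order.
async function mapWithConcurrency<T, R>(
  items: readonly T[],
  limit: number,
  worker: (item: T, index: number) => Promise<R>,
): Promise<R[]> {
  const results = new Array<R>(items.length);
  let next = 0;

  // Each lane repeatedly claims the next unprocessed index until none remain.
  async function runLane(): Promise<void> {
    while (next < items.length) {
      const index = next++;
      results[index] = await worker(items[index] as T, index);
    }
  }

  const laneCount = Math.max(1, Math.min(limit, items.length));
  await Promise.all(Array.from({ length: laneCount }, () => runLane()));
  return results;
}
```

In the integration this could be called as `await mapWithConcurrency(htmlFiles, 16, convertOne)`, where `convertOne` stands for the existing per-file read, convert and write step and `16` is an arbitrary starting limit.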

Nits

  • The hasFileExtension guard on line 22 of middleware.ts is redundant — the config.matcher regex already excludes paths with file extensions ((?!.*\\.\\w+$)). Removing it simplifies the condition to just if (!wantsMarkdown) return.
  • turndown and node-html-parser are build-only dependencies but are listed under devDependencies in package.json — looks correct. Just confirming intentional.

Dynamic import() fails in astro:build:done because Vite's module
runner is already closed by that point. Move turndown and
node-html-parser back to top-level imports.

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
The Vercel adapter copies static files to .vercel/output/static/
before astro:build:done fires, so .md files written only to the
Astro output dir are not included in the deployment. Now also
processes the Vercel static output directory if it exists.

https://claude.ai/code/session_01CwURAMew6D9fXyr4kA1w2Q
@claude

claude bot commented Apr 4, 2026

PR Review

Overall: Clean implementation. A few items worth addressing before merge.

Bugs / Issues

1. Fragile YAML escaping (htmlToMarkdown.ts)
The custom escapeYaml function does not handle all YAML edge cases - strings starting with colon, bracket, or brace, or containing bare hash can still produce invalid frontmatter. Using JSON.stringify() is safer since JSON strings are valid YAML scalars.

2. Possible invalid rel=alternate href (SEOMetadata.astro)
If canonicalUrl ends with a trailing slash (e.g. https://example.com/de/articles/foo/), the alternate href becomes https://example.com/de/articles/foo/.md - a broken URL. Strip the trailing slash before appending with canonicalUrl.replace(/\/$/, '') + '.md'.
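For the YAML escaping point, a minimal sketch of the `JSON.stringify()` approach follows. It relies on JSON string literals being valid YAML double-quoted scalars; the helper name `buildFrontmatter` is hypothetical, and the two fields follow the title + description frontmatter described in the PR summary.

```typescript
// Hypothetical replacement for a hand-written escapeYaml():
// JSON.stringify() quotes the value and escapes quotes, backslashes,
// newlines and other control characters in one step.
function buildFrontmatter(fields: { title: string; description: string }): string {
  return [
    "---",
    `title: ${JSON.stringify(fields.title)}`,
    `description: ${JSON.stringify(fields.description)}`,
    "---",
    "",
  ].join("\n");
}

console.log(buildFrontmatter({
  title: 'Tips: "quoting" #1',
  description: "line one\nline two",
}));
```

Because every value is emitted as a quoted scalar, leading colons, brackets, braces and `#` characters need no special handling.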

Performance

Sequential file I/O in the integration
The for...of with await serialises all HTML reads and writes. For a large site, concurrent processing would be faster using Promise.all with htmlFiles.map.

Security / Visibility

Static .md files are publicly accessible by direct URL
The files are emitted as static assets, so /de/articles/foo.md is reachable without sending Accept: text/markdown. The X-Robots-Tag: noindex header handles crawlers, but these files are fully public. Probably an acceptable trade-off, but worth calling out explicitly.

Accept header matching is a substring check
accept.includes('text/markdown') would match a hypothetical text/markdown-extra. A stricter regex check would be more correct for proper content negotiation.
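A stricter check could parse the header into media ranges instead of using a substring or regex test. The sketch below is a hypothetical helper (`acceptsMarkdown` is not in the PR). It deliberately ignores wildcards such as `*/*` and `text/*`, on the assumption that only an explicit `text/markdown` should opt a client into the Markdown variant.

```typescript
// Returns true only if the Accept header lists text/markdown explicitly
// with a non-zero quality value.
function acceptsMarkdown(accept: string | null): boolean {
  if (!accept) return false;
  return accept.split(",").some((range) => {
    const [mediaType = "", ...params] = range
      .split(";")
      .map((part) => part.trim().toLowerCase());
    if (mediaType !== "text/markdown") return false;
    // A range with q=0 means "not acceptable".
    const q = params.find((param) => param.startsWith("q="));
    return q === undefined || Number(q.slice(2)) > 0;
  });
}
```

This rejects `text/markdown-extra` while still accepting `text/html, text/markdown;q=0.9`.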

Minor

  • The LLMs-Txt: directive in robots.txt is non-standard; current robots.txt parsers will silently ignore it. The comment pointing to llmstxt.org is helpful context, but this has no effect on compliant crawlers today.
  • The hasFileExtension check inside middleware.ts is redundant given the matcher regex already excludes such paths - harmless, but could be removed for clarity.

@claude

claude bot commented Apr 4, 2026

Code Review

Overall this is well-structured. A few things worth addressing:

Potential Bugs

Inconsistent logging in integration (web/src/integrations/htmlToMarkdown.ts): The integration mixes console.log/console.warn with the Astro logger object. The logger is already available in the hook — use it throughout for consistent output (e.g. logger.warn instead of console.warn, logger.info instead of console.log).

Redundant file extension check in middleware (web/middleware.ts): The hasFileExtension runtime check at line 21 is already covered by the config.matcher regex ((?!.*\\.\\w+$)). One of them is unreachable — simplify by removing the runtime check.

Performance

Sequential file I/O (htmlToMarkdown.ts lines 55–74): The for loop processes files one at a time. For larger sites this will be slow. Consider batching with Promise.all (or a concurrency-limited variant):

await Promise.all(htmlFiles.map(async (htmlPath) => { ... }))

Double processing: The integration runs convertHtmlToMarkdown on both astroDir and vercelStaticDir. For large sites this doubles the conversion work. It would be worth logging a warning if the Vercel directory is detected so it's clear this is expected behaviour (it's a bit surprising at first read).

Security

No significant concerns. The YAML frontmatter escaping is adequate for double-quoted values. X-Robots-Tag: noindex on .md files is a good call.

Minor

  • turndown in devDependencies and @vercel/functions in dependencies — correct, since turndown is build-time only.
  • The link rel="alternate" in SEOMetadata.astro works correctly given trailingSlash: 'never' in the Astro config.

@claude

claude bot commented Apr 5, 2026

placeholder comment test

@claude

claude bot commented Apr 5, 2026

PR Review: feat: markdown content negotiation via Accept header

Overall this is a clean implementation. A few issues worth addressing:

Bugs

1. Root index.html generates an out-of-directory .md file (htmlToMarkdown.ts:75-84)

The index.html to parent-dir .md logic breaks for the site root. For dist/index.html:

  • parentDir = dist/
  • parentParentDir = parent of dist
  • mdPath = ../dist.md -- written outside the static output dir

The middleware would rewrite / to /.md, but no .md file is ever created at that path. This results in a 404 for root-path markdown requests.

2. Pages without <main> silently 404 on markdown requests (htmlToMarkdown.ts:56-59)

When no <main> element is found, the page is skipped and no .md file is written. But the middleware still rewrites Accept: text/markdown requests for those URLs to .md, resulting in a user-facing 404. Consider generating a minimal fallback markdown file or adding these paths to the middleware exclusion list.

3. Query parameters preserved in rewrite (middleware.ts:26)

rewrite(new URL(path + '.md', request.url)) preserves query params from the original request URL (e.g. /page?foo=bar to /page.md?foo=bar). Static file serving ignores them, but it could confuse caching layers. Consider: rewrite(new URL(path + '.md', url.origin)).
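The query-parameter point is worth verifying before changing the code. Under the WHATWG URL resolution rules, a reference that carries its own path does not inherit the base URL's query string. The check below assumes `path` is the request's `pathname` (the middleware source is not shown in this thread) and uses `example.com` as a stand-in host.

```typescript
// Base URL with a query string, as the middleware would see it.
const requestUrl = new URL("https://example.com/page?foo=bar");

// Same construction as in the review comment: pathname + ".md" resolved
// against the full request URL.
const target = new URL(`${requestUrl.pathname}.md`, requestUrl.href);

console.log(target.href);   // https://example.com/page.md
console.log(target.search); // empty string: the base query is not carried over
```

The query would only survive if the target were built by mutating the original URL object (for example by appending to its `pathname`) or if `path` itself included the search string.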

Performance

4. Sequential HTML conversion (htmlToMarkdown.ts:60)

The main loop processes files one at a time with await inside for...of. For sites with many pages this slows the build unnecessarily. await Promise.all(htmlFiles.map(...)) would parallelize I/O and conversion.

5. Double conversion for Vercel output

The integration converts all HTML twice when the Vercel output dir exists. Converting once and copying the .md files would be more efficient.

Minor

  • Redundant hasFileExtension guard (middleware.ts:22): The config.matcher regex already excludes paths with file extensions ((?!.*\\.\\w+$)), making the in-function check dead code.
  • llms.txt link text (llms.txt.ts:31): Uses entry.path as the display text rather than page titles -- LLMs benefit more from descriptive titles per the llmstxt.org spec.

@jhb-dev jhb-dev changed the title feat: markdown content negotiation via Accept header feat(web): markdown content negotiation via Accept header Apr 5, 2026
@jhb-dev jhb-dev merged commit 095256d into main Apr 5, 2026
7 of 8 checks passed