Strip inline data URI images from LLM markdown output #2974

Open
gadenbuie wants to merge 1 commit into r-lib:main from gadenbuie:feat/strip-base64-images

@gadenbuie commented Feb 20, 2026

Fixes #2973

Adds simplify_inline_images() to the HTML-to-markdown conversion pipeline in convert_md(). Before pandoc conversion, it replaces each <img> tag whose src is a data: URI (including base64-encoded images) with a text placeholder, so that large encoded strings do not pollute the .md output consumed by LLMs:

  • Images with alt text become [Image: <alt>]
  • Images without alt text become [Image]
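
To illustrate the placeholder rules above, here is a minimal sketch in Python. It is not the code in this PR: the actual simplify_inline_images() is written in R and is not shown in this description. The function name is reused only for readability, and the regex-based parsing and the `_attr` helper are assumptions made for this example; a real implementation would more likely work on a parsed HTML tree.

```python
import html
import re

# Matches a whole <img ...> tag. Regex parsing is an assumption of this
# sketch; it will mis-handle attribute values that contain ">".
IMG_TAG = re.compile(r"<img\b[^>]*>", re.IGNORECASE)


def _attr(tag, name):
    """Return the unescaped value of a quoted attribute, or None if absent.

    Hypothetical helper. Requiring whitespace before the name avoids
    matching attributes such as data-src or data-alt.
    """
    pattern = rf"""\s{name}\s*=\s*(["'])(.*?)\1"""
    m = re.search(pattern, tag, re.IGNORECASE | re.DOTALL)
    return html.unescape(m.group(2)) if m else None


def simplify_inline_images(html_text):
    """Replace <img> tags that have a data: URI src with text placeholders."""

    def replace(match):
        tag = match.group(0)
        src = _attr(tag, "src")
        # Leave ordinary images (file paths, http URLs) untouched.
        if src is None or not src.strip().lower().startswith("data:"):
            return tag
        alt = (_attr(tag, "alt") or "").strip()
        return f"[Image: {alt}]" if alt else "[Image]"

    return IMG_TAG.sub(replace, html_text)


page = (
    '<p><img src="data:image/png;base64,iVBORw0KGgo=" alt="A plot"></p>'
    '<img src="data:image/gif;base64,R0lGOD="/>'
    '<img src="logo.png" alt="Logo">'
)
result = simplify_inline_images(page)
print(result)
```

In this sketch the first two images become `[Image: A plot]` and `[Image]`, while the `logo.png` image is left for the markdown converter to handle as usual.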

Replace base64-encoded and other data URI images with text placeholders
(using alt text when available) during HTML-to-markdown conversion. This
prevents large base64 strings from wasting LLM context tokens.
Development

Successfully merging this pull request may close these issues.

Don't include base64 encoded images in LLM-facing markdown files
