
Feature/distill token optimization #3181

Open
Tomo1912 wants to merge 16 commits into modelcontextprotocol:main from Tomo1912:feature/distill-token-optimization

Conversation


@Tomo1912 commented Jan 5, 2026

Description

Add optional distill parameter to the fetch server that aggressively cleans HTML content to minimize token usage. When enabled, removes scripts, styles, navigation, headers, footers, ads, and other non-essential elements before conversion to markdown. Achieves 72.8% average token reduction across real-world tests.

Server Details

  • Server: fetch
  • Changes to: tools (added distill parameter to fetch tool)
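A call with the new parameter might look like the following sketch. The argument shapes here are illustrative (the URL and surrounding tool-call envelope are hypothetical); the point is that `distill` is an additive, opt-in field:

```python
# Hypothetical MCP tool-call arguments for the fetch tool with the
# proposed opt-in flag. Existing callers that omit "distill" are
# unaffected because it defaults to false.
standard_call = {
    "name": "fetch",
    "arguments": {"url": "https://example.com/article"},
}
distilled_call = {
    "name": "fetch",
    "arguments": {"url": "https://example.com/article", "distill": True},
}
```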

Motivation and Context

LLM token costs are a significant operational expense. The current fetch server returns the full page content, including navigation menus, ads, scripts, and other UI clutter, wasting tokens on non-content elements.

Test Results:

| Website     | Standard Tokens | Distilled Tokens | Reduction |
|-------------|-----------------|------------------|-----------|
| MCP Docs    | 2,154           | 13               | 99.4%     |
| TechCrunch  | 506             | 263              | 48.0%     |
| Python Docs | 662             | 629              | 5.0%      |
| Average     | 1,107           | 302              | 72.8%     |

How Has This Been Tested?

  • Tested with Claude Desktop as MCP client
  • Tested against multiple real-world websites (documentation sites, news sites, technical docs)
  • Verified backward compatibility - existing calls without distill parameter work unchanged

Breaking Changes

None. The distill parameter defaults to False, maintaining full backward compatibility.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Protocol Documentation
  • My changes follow MCP security best practices
  • I have updated the server's README accordingly
  • I have tested this with an LLM client
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have documented all environment variables and configuration options

Additional context

No new dependencies required - uses existing readabilipy for content extraction.

@Tomo1912 force-pushed the feature/distill-token-optimization branch 5 times, most recently from 4322b3e to a3cff79 on January 8, 2026 at 21:40
Add distill parameter to aggressively clean HTML before processing:
- Remove scripts, styles, navigation, headers, footers
- Remove ads, sidebars, popups, cookie banners
- Remove social widgets and non-content elements
- Normalize whitespace

Typical token reduction: 60-85%

This is an opt-in feature (distill=false by default) to maintain
backward compatibility.
@Tomo1912 force-pushed the feature/distill-token-optimization branch from a3cff79 to 98c0dd3 on January 8, 2026 at 22:04
@cliffhall added the enhancement (New feature or request) and server-fetch (Reference implementation for the Fetch MCP server - src/fetch) labels on Apr 20, 2026
@cliffhall
Member

@claude review


claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 1m 33s


PR Review: Feature/distill token optimization

  • Gather context and read changed files
  • Review server.py changes
  • Review README.md changes
  • Post full review feedback

Summary

The feature idea — reducing token usage from fetched web content — is a reasonable one, but this implementation has several significant technical problems that should be addressed before merging.


Issues

1. Redundant / counterproductive architecture (Major)

The distill_html function is applied before readabilipy.simple_json_from_html_string(html, use_readability=True). The readabilipy library is a Python port of Mozilla's Readability.js, which already does content extraction: stripping navs, footers, ads, sidebars, and returning only article-body HTML. Pre-processing with distill_html may actually impair Readability's heuristics by removing structural signals it relies on to locate the main content.

The 99.4% token reduction on MCP Docs (2,154 → 13 tokens) is a red flag here — 13 tokens is nearly empty output, suggesting core content is being stripped alongside boilerplate.

The distill path is effectively doing: strip HTML with fragile regexes → feed already-degraded HTML to Readability → convert to Markdown. A better design would either post-process Readability's output, or explore whether readabilipy has additional options to tighten its output.
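As an illustration of the post-processing direction, a lightweight pass over the markdown that Readability-plus-conversion produces can trim whitespace bloat without touching content Readability judged to be the article body. A stdlib-only sketch (the readabilipy extraction and markdown conversion are assumed to have already run; the function name is hypothetical):

```python
import re

def compact_markdown(md: str) -> str:
    """Post-process extracted markdown: strip trailing whitespace and
    collapse runs of blank lines, leaving the article content intact."""
    lines = [line.rstrip() for line in md.splitlines()]
    text = "\n".join(lines)
    # Collapse 3+ consecutive newlines down to a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```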

2. ReDoS risk in the ad-pattern regex (Major / Security)

server.py line 63:

r'<[^>]+(class|id)=["\'][^"\']*\b(ad|ads|...)\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
  • [\s\S]*?</[^>]+> uses a lazy quantifier that exhibits O(n²) worst-case backtracking when the closing pattern is never found (e.g., unclosed tags in malformed HTML).
  • The closing tag </[^>]+> matches any closing tag, not the specific matching opening tag — so <div class="ad">...</span> would incorrectly match and consume content.
  • On adversarially crafted or simply malformed HTML from the web, this could cause significant slowdowns or incorrect behaviour.

A fix should either bound the match to the specific opening tag, or better, drop the regex approach entirely (see issue 3).
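The mismatched-closing-tag behaviour is easy to reproduce. A trimmed-down version of the pattern (simplified here to a single keyword for illustration) consumes through an unrelated closing tag:

```python
import re

# Simplified form of the ad-stripping pattern under review. Note that
# the trailing </[^>]+> matches ANY closing tag, not the tag that
# opened the block.
AD_PATTERN = re.compile(
    r'<[^>]+(class|id)=["\'][^"\']*\bad\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
)

html = '<div class="ad">banner</span><p>real content</p>'
# The regex deletes through the unrelated </span>, not a matching </div>.
cleaned = AD_PATTERN.sub("", html)
```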

3. Nested-tag regex is inherently broken (Major)

The per-tag regex:

html = re.sub(rf'<{tag}[^>]*>[\s\S]*?</{tag}>', '', html, flags=re.IGNORECASE)

Fails silently on:

  • Nested tags: <nav><nav>inner</nav>outer</nav> — only the inner </nav> is consumed; outer</nav> remains.
  • Self-closing or void usage: <nav /> is not matched.
  • Attributes containing >: rare but valid in older HTML; [^>]* breaks on this.

HTML cannot be reliably parsed with regex. This should use lxml (already available as a transitive dependency of readabilipy) or html.parser from the stdlib.
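A stdlib-only sketch of what parser-based stripping could look like: depth tracking handles nested and self-closing tags correctly, and the whole document is processed in a single parse pass (the tag set is abbreviated; a real implementation would also filter by class/id attributes):

```python
from html.parser import HTMLParser

STRIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TagStripper(HTMLParser):
    """Drop STRIP_TAGS subtrees, including nested occurrences, by
    tracking depth instead of regex-matching open/close pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

def strip_tags(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Unlike the regex version, the nested case from above (`<nav><nav>inner</nav>outer</nav>`) is removed in full, because the parser only decrements the skip depth on a matching close.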

4. No tests (Major)

The repository already has a test suite for the fetch server (test(fetch): add unit tests — commit 71d5d71). This PR adds no tests for distill_html or the distill parameter integration. Given the complexity of the regex logic, unit tests are especially important here.

5. Undefined raw=True + distill=True interaction (Minor)

When force_raw=True, extract_content_from_html is never called, so distill silently has no effect. The parameter description and README should document this — or the combination should raise an error/warning.

# fetch_url line 204 — distill is silently ignored when raw=True
if is_page_html and not force_raw:
    return extract_content_from_html(page_raw, distill=distill), ""
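One possible shape for surfacing the conflict instead of silently ignoring it (the guard function is hypothetical; parameter names mirror the PR's):

```python
import warnings

def check_distill_args(force_raw: bool, distill: bool) -> None:
    """Warn when distill is requested but raw mode would bypass HTML
    extraction entirely, so the flag would silently do nothing."""
    if force_raw and distill:
        warnings.warn(
            "distill=True has no effect when raw=True: raw mode skips "
            "HTML extraction entirely",
            UserWarning,
            stacklevel=2,
        )
```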

6. Performance: multiple regex passes over large HTML (Minor)

Six separate re.sub calls iterate over potentially megabyte-scale HTML strings before handing off to Readability. An HTML-parser-based approach (decompose by tag, decompose by class/id) would be a single parse pass and far more efficient.


README change

The documentation update (README.md line 19) is accurate and well-written. No issues there.


Verdict

Request changes. The feature concept is valid and worth pursuing, but the core implementation should be reworked to:

  1. Use an actual HTML parser (lxml/html.parser/BeautifulSoup) instead of regex for structural stripping
  2. Reconsider whether pre-processing before Readability or post-processing after Readability produces better results
  3. Add unit tests covering distill_html and the distill=True fetch path
  4. Document the raw=True + distill=True interaction

