
Feature/distill token optimization #3181

Open
Tomo1912 wants to merge 16 commits into modelcontextprotocol:main from Tomo1912:feature/distill-token-optimization

Conversation


@Tomo1912 commented Jan 5, 2026

Description

Add optional distill parameter to the fetch server that aggressively cleans HTML content to minimize token usage. When enabled, removes scripts, styles, navigation, headers, footers, ads, and other non-essential elements before conversion to markdown. Achieves 72.8% average token reduction across real-world tests.

Server Details

  • Server: fetch
  • Changes to: tools (added distill parameter to fetch tool)
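A call with the new parameter might look like the following sketch. The argument shapes here are illustrative (the URL and surrounding tool-call envelope are hypothetical); the point is that `distill` is an additive, opt-in field:

```python
# Hypothetical MCP tool-call arguments for the fetch tool with the
# proposed opt-in flag. Existing callers that omit "distill" are
# unaffected because it defaults to false.
standard_call = {
    "name": "fetch",
    "arguments": {"url": "https://example.com/article"},
}
distilled_call = {
    "name": "fetch",
    "arguments": {"url": "https://example.com/article", "distill": True},
}
```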

Motivation and Context

LLM token costs are a significant operational expense. The current fetch server returns the full page content, including navigation menus, ads, scripts, and other UI clutter, wasting tokens on non-content elements.

Test Results:

| Website     | Standard Tokens | Distilled Tokens | Reduction |
|-------------|-----------------|------------------|-----------|
| MCP Docs    | 2,154           | 13               | 99.4%     |
| TechCrunch  | 506             | 263              | 48.0%     |
| Python Docs | 662             | 629              | 5.0%      |
| Average     | 1,107           | 302              | 72.8%     |

How Has This Been Tested?

  • Tested with Claude Desktop as MCP client
  • Tested against multiple real-world websites (documentation sites, news sites, technical docs)
  • Verified backward compatibility - existing calls without distill parameter work unchanged

Breaking Changes

None. The distill parameter defaults to False, maintaining full backward compatibility.

Types of changes

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds functionality)
  • Breaking change (fix or feature that would cause existing functionality to change)
  • Documentation update

Checklist

  • I have read the MCP Protocol Documentation
  • My changes follow MCP security best practices
  • I have updated the server's README accordingly
  • I have tested this with an LLM client
  • My code follows the repository's style guidelines
  • New and existing tests pass locally
  • I have added appropriate error handling
  • I have documented all environment variables and configuration options

Additional context

No new dependencies required - uses existing readabilipy for content extraction.

@Tomo1912 force-pushed the feature/distill-token-optimization branch 5 times, most recently from 4322b3e to a3cff79 on January 8, 2026 at 21:40
Add distill parameter to aggressively clean HTML before processing:
- Remove scripts, styles, navigation, headers, footers
- Remove ads, sidebars, popups, cookie banners
- Remove social widgets and non-content elements
- Normalize whitespace

Typical token reduction: 60-85%

This is an opt-in feature (distill=false by default) to maintain
backward compatibility.
@Tomo1912 force-pushed the feature/distill-token-optimization branch from a3cff79 to 98c0dd3 on January 8, 2026 at 22:04
@cliffhall added the enhancement (New feature or request) and server-fetch (Reference implementation for the Fetch MCP server - src/fetch) labels on Apr 20, 2026
@cliffhall
Member

@claude review


claude Bot commented Apr 21, 2026

Claude finished @cliffhall's task in 1m 33s


PR Review: Feature/distill token optimization

  • Gather context and read changed files
  • Review server.py changes
  • Review README.md changes
  • Post full review feedback

Summary

The feature idea — reducing token usage from fetched web content — is a reasonable one, but this implementation has several significant technical problems that should be addressed before merging.


Issues

1. Redundant / counterproductive architecture (Major)

The distill_html function is applied before readabilipy.simple_json_from_html_string(html, use_readability=True). The readabilipy library is a Python port of Mozilla's Readability.js, which already does content extraction: stripping navs, footers, ads, sidebars, and returning only article-body HTML. Pre-processing with distill_html may actually impair Readability's heuristics by removing structural signals it relies on to locate the main content.

The 99.4% token reduction on MCP Docs (2,154 → 13 tokens) is a red flag here — 13 tokens is nearly empty output, suggesting core content is being stripped alongside boilerplate.

The distill path is effectively doing: strip HTML with fragile regexes → feed already-degraded HTML to Readability → convert to Markdown. A better design would either post-process Readability's output, or explore whether readabilipy has additional options to tighten its output.
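As an illustration of the post-processing direction, a lightweight pass over the markdown that Readability-plus-conversion produces can trim whitespace bloat without touching content Readability judged to be the article body. A stdlib-only sketch (the readabilipy extraction and markdown conversion are assumed to have already run; the function name is hypothetical):

```python
import re

def compact_markdown(md: str) -> str:
    """Post-process extracted markdown: strip trailing whitespace and
    collapse runs of blank lines, leaving the article content intact."""
    lines = [line.rstrip() for line in md.splitlines()]
    text = "\n".join(lines)
    # Collapse 3+ consecutive newlines down to a single paragraph break.
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()
```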

2. ReDoS risk in the ad-pattern regex (Major / Security)

server.py line 63:

r'<[^>]+(class|id)=["\'][^"\']*\b(ad|ads|...)\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
  • [\s\S]*?</[^>]+> uses a lazy quantifier that exhibits O(n²) worst-case backtracking when the closing pattern is never found (e.g., unclosed tags in malformed HTML).
  • The closing tag </[^>]+> matches any closing tag, not the specific matching opening tag — so <div class="ad">...</span> would incorrectly match and consume content.
  • On adversarially crafted or simply malformed HTML from the web, this could cause significant slowdowns or incorrect behaviour.

A fix should either bound the match to the specific opening tag, or better, drop the regex approach entirely (see issue 3).
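The mismatched-closing-tag behaviour is easy to reproduce. A trimmed-down version of the pattern (simplified here to a single keyword for illustration) consumes through an unrelated closing tag:

```python
import re

# Simplified form of the ad-stripping pattern under review. Note that
# the trailing </[^>]+> matches ANY closing tag, not the tag that
# opened the block.
AD_PATTERN = re.compile(
    r'<[^>]+(class|id)=["\'][^"\']*\bad\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
)

html = '<div class="ad">banner</span><p>real content</p>'
# The regex deletes through the unrelated </span>, not a matching </div>.
cleaned = AD_PATTERN.sub("", html)
```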

3. Nested-tag regex is inherently broken (Major)

The per-tag regex:

html = re.sub(rf'<{tag}[^>]*>[\s\S]*?</{tag}>', '', html, flags=re.IGNORECASE)

Fails silently on:

  • Nested tags: <nav><nav>inner</nav>outer</nav> — only the inner </nav> is consumed; outer</nav> remains.
  • Self-closing or void usage: <nav /> is not matched.
  • Attributes containing >: rare but valid in older HTML; [^>]* breaks on this.

HTML cannot be reliably parsed with regex. This should use lxml (already available as a transitive dependency of readabilipy) or html.parser from the stdlib.
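A stdlib-only sketch of what parser-based stripping could look like: depth tracking handles nested and self-closing tags correctly, and the whole document is processed in a single parse pass (the tag set is abbreviated; a real implementation would also filter by class/id attributes):

```python
from html.parser import HTMLParser

STRIP_TAGS = {"script", "style", "nav", "header", "footer", "aside"}

class TagStripper(HTMLParser):
    """Drop STRIP_TAGS subtrees, including nested occurrences, by
    tracking depth instead of regex-matching open/close pairs."""

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self.out = []
        self.skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in STRIP_TAGS:
            self.skip_depth += 1
        elif self.skip_depth == 0:
            self.out.append(self.get_starttag_text())

    def handle_endtag(self, tag):
        if tag in STRIP_TAGS:
            if self.skip_depth > 0:
                self.skip_depth -= 1
        elif self.skip_depth == 0:
            self.out.append(f"</{tag}>")

    def handle_data(self, data):
        if self.skip_depth == 0:
            self.out.append(data)

def strip_tags(html: str) -> str:
    parser = TagStripper()
    parser.feed(html)
    parser.close()
    return "".join(parser.out)
```

Unlike the regex version, the nested case from above (`<nav><nav>inner</nav>outer</nav>`) is removed in full, because the parser only decrements the skip depth on a matching close.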

4. No tests (Major)

The repository already has a test suite for the fetch server (test(fetch): add unit tests — commit 71d5d71). This PR adds no tests for distill_html or the distill parameter integration. Given the complexity of the regex logic, unit tests are especially important here.

5. Undefined raw=True + distill=True interaction (Minor)

When force_raw=True, extract_content_from_html is never called, so distill silently has no effect. The parameter description and README should document this — or the combination should raise an error/warning.

# fetch_url line 204 — distill is silently ignored when raw=True
if is_page_html and not force_raw:
    return extract_content_from_html(page_raw, distill=distill), ""
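One possible shape for surfacing the conflict instead of silently ignoring it (the guard function is hypothetical; parameter names mirror the PR's):

```python
import warnings

def check_distill_args(force_raw: bool, distill: bool) -> None:
    """Warn when distill is requested but raw mode would bypass HTML
    extraction entirely, so the flag would silently do nothing."""
    if force_raw and distill:
        warnings.warn(
            "distill=True has no effect when raw=True: raw mode skips "
            "HTML extraction entirely",
            UserWarning,
            stacklevel=2,
        )
```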

6. Performance: multiple regex passes over large HTML (Minor)

Six separate re.sub calls iterate over potentially megabyte-scale HTML strings before handing off to Readability. An HTML-parser-based approach (decompose by tag, decompose by class/id) would be a single parse pass and far more efficient.


README change

The documentation update (README.md line 19) is accurate and well-written. No issues there.


Verdict

Request changes. The feature concept is valid and worth pursuing, but the core implementation should be reworked to:

  1. Use an actual HTML parser (lxml/html.parser/BeautifulSoup) instead of regex for structural stripping
  2. Reconsider whether pre-processing before Readability or post-processing after Readability produces better results
  3. Add unit tests covering distill_html and the distill=True fetch path
  4. Document the raw=True + distill=True interaction

