Feature/distill token optimization#3181
Feature/distill token optimization#3181Tomo1912 wants to merge 16 commits intomodelcontextprotocol:mainfrom
Conversation
4322b3e to
a3cff79
Compare
Add distill parameter to aggressively clean HTML before processing: - Remove scripts, styles, navigation, headers, footers - Remove ads, sidebars, popups, cookie banners - Remove social widgets and non-content elements - Normalize whitespace Typical token reduction: 60-85% This is an opt-in feature (distill=false by default) to maintain backward compatibility.
a3cff79 to
98c0dd3
Compare
|
@claude review |
|
Claude finished @cliffhall's task in 1m 33s —— View job PR Review: Feature/distill token optimization
SummaryThe feature idea — reducing token usage from fetched web content — is a reasonable one, but this implementation has several significant technical problems that should be addressed before merging. Issues1. Redundant / counterproductive architecture (Major)The The 99.4% token reduction on MCP Docs (2,154 → 13 tokens) is a red flag here — 13 tokens is nearly empty output, suggesting core content is being stripped alongside boilerplate. The 2. ReDoS risk in the ad-pattern regex (Major / Security)r'<[^>]+(class|id)=["\'][^"\']*\b(ad|ads|...)\b[^"\']*["\'][^>]*>[\s\S]*?</[^>]+>'
3. Nested-tag regex is inherently broken (Major)The per-tag regex: html = re.sub(rf'<{tag}[^>]*>[\s\S]*?</{tag}>', '', html, flags=re.IGNORECASE)Fails silently on:
HTML cannot be reliably parsed with regex. This should use 4. No tests (Major)The repository already has a test suite for the fetch server ( 5. Undefined
|
Description
Add optional
distillparameter to the fetch server that aggressively cleans HTML content to minimize token usage. When enabled, removes scripts, styles, navigation, headers, footers, ads, and other non-essential elements before conversion to markdown. Achieves 72.8% average token reduction across real-world tests.Server Details
distillparameter tofetchtool)Motivation and Context
LLM token costs are a significant operational expense. Current
web-fetchreturns full HTML including navigation menus, ads, scripts, and UI clutter - wasting tokens on non-content elements.Test Results:
How Has This Been Tested?
distillparameter work unchangedBreaking Changes
None. The
distillparameter defaults toFalse, maintaining full backward compatibility.Types of changes
Checklist
Additional context
No new dependencies required - uses existing
readabilipyfor content extraction.