
Scrapling MCP Server


A Model Context Protocol (MCP) server that provides web scraping capabilities with an integrated stealth-aware scraping engine. Built on top of FastMCP and leveraging the Scrapling library.

Overview

The Scrapling MCP Server enables AI agents to:

  • Fetch web content reliably from websites with varying levels of anti-bot protection
  • Render JavaScript when necessary to access dynamically loaded content
  • Bypass common anti-bot measures through configurable stealth settings
  • Handle session-based scraping for websites requiring authentication
  • Extract structured data using CSS selectors from scraped pages

Key Features

| Feature | Description |
| --- | --- |
| JavaScript Rendering | Full browser-based rendering for dynamic content |
| Stealth Modes | Multiple pre-configured stealth levels (Minimal, Standard, Maximum) |
| Cloudflare Support | Automatic Cloudflare challenge detection and solving |
| Session Management | Persistent sessions for stateful scraping |
| Proxy Rotation | Support for proxy lists with automatic rotation |
| Retry Logic | Exponential backoff with configurable retry attempts |
| CSS Extraction | Structured data extraction using CSS selectors |
| URL Validation | Built-in SSRF protection and security checks |
| MCP Integration | Native MCP protocol support for AI agent integration |
| Spider Framework | Scrapy-like API with async callbacks and concurrent crawling |
| Camoufox Integration | Modified Firefox browser with stealth patches |
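The Retry Logic row above amounts to exponential backoff. A minimal sketch of the idea, assuming a doubling delay schedule with a little jitter (the server's actual parameters may differ):

```python
import asyncio
import random

# Sketch of retry with exponential backoff: wait 1s, 2s, 4s, ... between
# attempts, plus jitter so parallel scrapers don't retry in lockstep.
# The schedule and parameter names here are illustrative assumptions.
async def with_retry(fetch, max_retries=3, base_delay=1.0):
    for attempt in range(max_retries + 1):
        try:
            return await fetch()
        except Exception:
            if attempt == max_retries:
                raise  # out of attempts; surface the last error
            await asyncio.sleep(base_delay * 2 ** attempt + random.random() * 0.1)
```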

Installation

# Clone the repository
git clone https://github.com/seszele64/scrapling-mcp.git
cd scrapling-mcp

# Install dependencies
pip install -e .

# Or install with dev dependencies
pip install -e ".[dev]"

Quick Start

import asyncio

from mcp_scraper.stealth import scrape_with_retry, format_response, get_standard_stealth

async def scrape_example():
    # Use the standard stealth settings
    config = get_standard_stealth()

    # Scrape a URL, retrying on failure
    page = await scrape_with_retry(
        url="https://example.com",
        config=config,
        max_retries=3
    )

    # Format the response into a plain dict
    result = format_response(page, "https://example.com")
    print(f"Title: {result.get('title')}")
    print(f"Content: {(result.get('text') or '')[:200]}...")

asyncio.run(scrape_example())

Available MCP Tools

| Tool | Description |
| --- | --- |
| scrape_simple | Fast HTTP scraping with TLS fingerprinting |
| scrape_stealth | Browser automation with configurable stealth |
| scrape_session | Session-based scraping with persistent state |
| extract_structured | Extract structured data using CSS selectors |
| scrape_batch | Process multiple URLs in sequence |

For detailed documentation on all tools and their parameters, see AGENTS.md.
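As an illustration of what agent-side tool calls might carry, here are hypothetical request payloads for two of the tools. The parameter names (`url`, `selectors`) are assumptions for the sketch, not the server's actual schema; AGENTS.md documents the real parameters.

```python
# Hypothetical (tool_name, arguments) pairs an MCP client might send.
# Field names are illustrative, not taken from the server's schema.
example_calls = [
    ("scrape_simple", {"url": "https://example.com"}),
    ("extract_structured", {
        "url": "https://example.com",
        # CSS selectors keyed by the field name you want back
        "selectors": {"title": "h1", "links": "a::attr(href)"},
    }),
]
```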

Documentation

Configuration

Create a .env file based on .env.example:

# Proxy URL for requests (optional)
PROXY_URL=

# Default timeout in seconds (1-300)
DEFAULT_TIMEOUT=30

# Logging level
LOG_LEVEL=INFO

# Maximum retry attempts (0-10)
MAX_RETRIES=3
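Once the variables are set, they can be read with the standard library alone. A minimal sketch (`load_settings` is a hypothetical helper, not part of the package; the clamping ranges come from the comments above):

```python
import os

# Read the .env-backed settings and clamp numeric values to their
# documented ranges. Variable names match .env.example.
def load_settings(env=os.environ):
    timeout = int(env.get("DEFAULT_TIMEOUT", "30"))
    retries = int(env.get("MAX_RETRIES", "3"))
    return {
        "proxy_url": env.get("PROXY_URL") or None,   # optional
        "timeout": min(max(timeout, 1), 300),        # documented range: 1-300 s
        "log_level": env.get("LOG_LEVEL", "INFO"),
        "max_retries": min(max(retries, 0), 10),     # documented range: 0-10
    }
```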

Stealth Levels

| Level | Use Case | Timeout |
| --- | --- | --- |
| Minimal | Fast, simple sites | 15s |
| Standard | Most scraping tasks | 30s |
| Maximum | Protected sites, Cloudflare | 60s |
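Conceptually, each level is a preset configuration. The sketch below shows one plausible shape for such presets; the field names are assumptions, not the library's API (only get_standard_stealth appears in the Quick Start above):

```python
from dataclasses import dataclass

# Illustrative only: a plausible shape for stealth presets matching the
# table above. Field names are assumptions, not the library's API.
@dataclass(frozen=True)
class StealthConfig:
    level: str
    timeout: int           # seconds, per the Timeout column
    solve_cloudflare: bool # Maximum is the Cloudflare-oriented preset

STEALTH_LEVELS = {
    "minimal":  StealthConfig("minimal", 15, solve_cloudflare=False),
    "standard": StealthConfig("standard", 30, solve_cloudflare=False),
    "maximum":  StealthConfig("maximum", 60, solve_cloudflare=True),
}
```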

License

MIT License - see LICENSE for details.
