
Scrapfly Crawler API Examples

This directory contains examples demonstrating the Scrapfly Crawler API integration.

Setup

Get Your API Key

Get your API key from https://scrapfly.io/dashboard

Configure Your API Key

You have two options to provide your API key:

Option A: Environment Variable (Recommended)

Export the API key in your terminal:

export SCRAPFLY_API_KEY='scp-live-your-key-here'

Then run any example:

python3 sync_crawl.py

Option B: .env File

  1. Copy the example .env file:
cp .env.example .env
  2. Edit .env and replace the placeholder with your actual API key:
SCRAPFLY_API_KEY=scp-live-your-actual-key-here
  3. Run any example (the .env file will be loaded automatically):
python3 sync_crawl.py

Note: Install python-dotenv for automatic .env file loading: pip install python-dotenv

If you don't install it, the examples will still work with environment variables exported in your shell.
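For reference, the optional loading works roughly like this (a sketch; the actual loading code lives in each example file):

import os

try:
    from dotenv import load_dotenv  # optional dependency
    load_dotenv()  # reads a .env file from the current directory, if present
except ImportError:
    pass  # no python-dotenv: rely on variables exported in the shell

api_key = os.environ.get('SCRAPFLY_API_KEY')
if not api_key:
    raise SystemExit('SCRAPFLY_API_KEY environment variable not set')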

Quick Start

The easiest way to use the Crawler API is with the high-level Crawl object (see quickstart.py):

from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key='your-key')

# Method chaining for concise usage
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=5
    )
).crawl().wait()

# Get results
pages = crawl.warc().get_pages()
for page in pages:
    print(f"{page['url']} ({page['status_code']})")

Examples

  • quickstart.py - Simplest example using high-level Crawl API with method chaining
  • sync_crawl.py - Low-level API example showing start, poll, and download workflow
  • demo_markdown.py - Build LLM.txt files from crawled documentation with batch content retrieval
  • webhook_example.py - Handle Crawler API webhooks for real-time event notifications (a minimal receiver sketch follows this list)
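Webhook handling details live in webhook_example.py; as a rough sketch, a receiver is just an HTTP endpoint that accepts the JSON events the Crawler API POSTs to it (the event schema is not documented here, so the handler below only prints what it receives):

import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class WebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        # Read and parse the JSON body posted by the Crawler API
        length = int(self.headers.get('Content-Length', 0))
        event = json.loads(self.rfile.read(length) or b'{}')
        print('received event:', event)  # real field handling: see webhook_example.py
        self.send_response(200)
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), WebhookHandler).serve_forever()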

Crawl Object Features

The Crawl object provides a stateful, high-level interface:

Methods

  • crawl() - Start the crawler job
  • wait(poll_interval=5, max_wait=None, verbose=False) - Wait for completion
  • status(refresh=True) - Get current status
  • warc(artifact_type='warc') - Download WARC artifact
  • har() - Download HAR (HTTP Archive) artifact with timing data
  • read(url, format='html') - Get content for specific URL
  • read_batch(urls, formats=['html']) - Get content for multiple URLs efficiently (up to 100 per request)
  • read_iter(pattern, format='html') - Iterate through URLs matching a wildcard pattern (both shown in pattern 5 under Usage Patterns)
  • stats() - Get comprehensive statistics

Properties

  • uuid - Crawler job UUID
  • started - Whether crawler has been started

Usage Patterns

1. Method Chaining (Most Concise)

crawl = Crawl(client, config).crawl().wait()
pages = crawl.warc().get_pages()

2. Step-by-Step (More Control)

crawl = Crawl(client, config)
crawl.crawl()
crawl.wait(verbose=True, max_wait=300)

# Check status
status = crawl.status()
print(f"Crawled {status.urls_crawled} URLs")

# Get results
artifact = crawl.warc()
pages = artifact.get_pages()

3. Read Specific URLs

# Get content for a specific URL
html = crawl.read('https://example.com/page1')
if html:
    print(html.decode('utf-8'))

4. Statistics

stats = crawl.stats()
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs crawled: {stats['urls_crawled']}")
print(f"Crawl rate: {stats['crawl_rate']:.1f}%")
print(f"Total size: {stats['total_size_kb']:.2f} KB")

Configuration Options

The CrawlerConfig class supports all crawler parameters:

config = CrawlerConfig(
    url='https://example.com',
    page_limit=100,
    max_depth=3,
    exclude_paths=['/admin/*', '/api/*'],
    include_paths=['/products/*'],
    content_formats=['html', 'markdown'],
    # ... and many more options
)

See CrawlerConfig class documentation for all available parameters.

Artifact Formats

WARC Format

The crawler returns results in WARC (Web ARChive) format by default, which is automatically parsed:

artifact = crawl.warc()

# Easy way: Get all pages as dictionaries
pages = artifact.get_pages()
for page in pages:
    url = page['url']
    status_code = page['status_code']
    headers = page['headers']
    content = page['content']  # bytes

# Memory-efficient: Iterate one record at a time
for record in artifact.iter_responses():
    print(f"{record.url}: {len(record.content)} bytes")

# Save to file
artifact.save('results.warc.gz')

HAR Format

HAR (HTTP Archive) format includes detailed timing information for performance analysis:

artifact = crawl.har()

# Access timing data
for entry in artifact.iter_responses():
    print(f"{entry.url}")
    print(f"  Status: {entry.status_code}")
    print(f"  Total time: {entry.time}ms")
    print(f"  Content type: {entry.content_type}")

    # Detailed timing breakdown
    timings = entry.timings
    print(f"  DNS: {timings.get('dns', 0)}ms")
    print(f"  Connect: {timings.get('connect', 0)}ms")
    print(f"  Wait: {timings.get('wait', 0)}ms")
    print(f"  Receive: {timings.get('receive', 0)}ms")

# Same easy interface as WARC
pages = artifact.get_pages()

Error Handling

from scrapfly import Crawl, CrawlerConfig

try:
    crawl = Crawl(client, config)
    crawl.crawl().wait(max_wait=300)

    if crawl.status().is_complete:
        pages = crawl.warc().get_pages()
        print(f"Success! Got {len(pages)} pages")
    elif crawl.status().is_failed:
        print("Crawler failed")

except RuntimeError as e:
    print(f"Error: {e}")

Troubleshooting

"SCRAPFLY_API_KEY environment variable not set"

Make sure you've either:

  1. Exported the environment variable: export SCRAPFLY_API_KEY='your-key'
  2. Created a .env file with your API key

"Invalid API key" error

Double-check that:

  1. Your API key is correct and starts with scp-live-
  2. You have an active Scrapfly subscription
  3. You're using the correct API key from your dashboard
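
A quick sanity check for the key from Python (a diagnostic sketch; it cannot verify your subscription status):

import os

key = os.environ.get('SCRAPFLY_API_KEY', '')
print('key present:', bool(key))
print('expected prefix:', key.startswith('scp-live-'))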

Import errors for dotenv

The python-dotenv package is optional. If you see import warnings, you can either:

  1. Install it: pip install python-dotenv
  2. Ignore them - environment variables will still work

Learn More