This directory contains examples demonstrating the Scrapfly Crawler API integration.
Get your API key from https://scrapfly.io/dashboard
You have two options to provide your API key:
Option 1: Export the API key in your terminal:

```bash
export SCRAPFLY_API_KEY='scp-live-your-key-here'
```

Then run any example:

```bash
python3 sync_crawl.py
```

Option 2: Use a `.env` file:

- Copy the example `.env` file:

  ```bash
  cp .env.example .env
  ```

- Edit `.env` and replace the placeholder with your actual API key:

  ```
  SCRAPFLY_API_KEY=scp-live-your-actual-key-here
  ```

- Run any example (the `.env` file will be loaded automatically):

  ```bash
  python3 sync_crawl.py
  ```

Note: Install `python-dotenv` for automatic `.env` file loading:

```bash
pip install python-dotenv
```

If you don't install it, the examples will still work with environment variables exported in your shell.
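For reference, here is a minimal sketch of this optional loading pattern (`load_dotenv()` is the standard python-dotenv entry point; the fallback mirrors the behavior described above):

```python
# Minimal sketch of optional .env loading with a shell-variable fallback.
import os

try:
    from dotenv import load_dotenv  # optional dependency
    load_dotenv()  # reads SCRAPFLY_API_KEY from ./.env if the file exists
except ImportError:
    pass  # python-dotenv not installed; rely on exported shell variables

api_key = os.environ.get('SCRAPFLY_API_KEY')
if not api_key:
    raise SystemExit('SCRAPFLY_API_KEY is not set')
```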
The easiest way to use the Crawler API is with the high-level Crawl object (see quickstart.py):
```python
from scrapfly import ScrapflyClient, CrawlerConfig, Crawl

client = ScrapflyClient(key='your-key')

# Method chaining for concise usage
crawl = Crawl(
    client,
    CrawlerConfig(
        url='https://web-scraping.dev/products',
        page_limit=5
    )
).crawl().wait()

# Get results
pages = crawl.warc().get_pages()
for page in pages:
    print(f"{page['url']} ({page['status_code']})")
```

This directory includes the following examples:

- quickstart.py - Simplest example, using the high-level `Crawl` API with method chaining
- sync_crawl.py - Low-level API example showing the start, poll, and download workflow
- demo_markdown.py - Build LLM.txt files from crawled documentation with batch content retrieval
- webhook_example.py - Handle Crawler API webhooks for real-time event notifications (see the sketch after this list)
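As a rough companion to webhook_example.py, the sketch below shows a minimal receiver using only the standard library. The payload field names (`event`, `crawler_uuid`) are assumptions for illustration, not the documented webhook schema:

```python
# Hypothetical webhook receiver sketch; payload fields are assumed, not
# taken from Scrapfly's documented schema.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class CrawlerWebhookHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get('Content-Length', 0))
        payload = json.loads(self.rfile.read(length) or b'{}')
        # Inspect a real payload to learn the actual field names.
        print(f"event={payload.get('event')} job={payload.get('crawler_uuid')}")
        self.send_response(200)  # acknowledge promptly to avoid redelivery
        self.end_headers()

if __name__ == '__main__':
    HTTPServer(('0.0.0.0', 8080), CrawlerWebhookHandler).serve_forever()
```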
The Crawl object provides a stateful, high-level interface:
- `crawl()` - Start the crawler job
- `wait(poll_interval=5, max_wait=None, verbose=False)` - Wait for completion
- `status(refresh=True)` - Get the current status
- `warc(artifact_type='warc')` - Download the WARC artifact
- `har()` - Download the HAR (HTTP Archive) artifact with timing data
- `read(url, format='html')` - Get content for a specific URL
- `read_batch(urls, formats=['html'])` - Get content for multiple URLs efficiently (up to 100 per request)
- `read_iter(pattern, format='html')` - Iterate through URLs matching a wildcard pattern (see the sketch below)
- `stats()` - Get comprehensive statistics

Properties:

- `uuid` - Crawler job UUID
- `started` - Whether the crawler has been started
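To illustrate the batch helpers, here is a hedged sketch; the call signatures come from the list above, but the return shapes are assumptions, so check the client's docstrings before relying on them:

```python
# Sketch only: return shapes below are assumed, not documented in this README.
urls = [
    'https://web-scraping.dev/products',
    'https://web-scraping.dev/reviews',
]

# One round trip for up to 100 URLs.
batch = crawl.read_batch(urls, formats=['markdown'])
print(f"retrieved {len(batch)} documents")  # assumes a sized collection

# Stream documents whose URL matches a wildcard pattern.
for doc in crawl.read_iter('https://web-scraping.dev/docs/*', format='markdown'):
    print(doc)  # assumes each item carries one document's content
```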
Typical usage:

```python
# One-liner with method chaining
crawl = Crawl(client, config).crawl().wait()
pages = crawl.warc().get_pages()
```

```python
# Or step by step, with explicit control
crawl = Crawl(client, config)
crawl.crawl()
crawl.wait(verbose=True, max_wait=300)

# Check status
status = crawl.status()
print(f"Crawled {status.urls_crawled} URLs")

# Get results
artifact = crawl.warc()
pages = artifact.get_pages()
```

```python
# Get content for a specific URL
html = crawl.read('https://example.com/page1')
if html:
    print(html.decode('utf-8'))
```

```python
# Get comprehensive statistics
stats = crawl.stats()
print(f"URLs discovered: {stats['urls_discovered']}")
print(f"URLs crawled: {stats['urls_crawled']}")
print(f"Crawl rate: {stats['crawl_rate']:.1f}%")
print(f"Total size: {stats['total_size_kb']:.2f} KB")
```

The CrawlerConfig class supports all crawler parameters:
```python
config = CrawlerConfig(
    url='https://example.com',
    page_limit=100,
    max_depth=3,
    exclude_paths=['/admin/*', '/api/*'],
    include_paths=['/products/*'],
    content_formats=['html', 'markdown'],
    # ... and many more options
)
```

See the CrawlerConfig class documentation for all available parameters.
The crawler returns results in WARC (Web ARChive) format by default, which is automatically parsed:
```python
artifact = crawl.warc()

# Easy way: get all pages as dictionaries
pages = artifact.get_pages()
for page in pages:
    url = page['url']
    status_code = page['status_code']
    headers = page['headers']
    content = page['content']  # bytes

# Memory-efficient: iterate one record at a time
for record in artifact.iter_responses():
    print(f"{record.url}: {len(record.content)} bytes")

# Save to file
artifact.save('results.warc.gz')
```

The HAR (HTTP Archive) format includes detailed timing information for performance analysis:
```python
artifact = crawl.har()

# Access timing data
for entry in artifact.iter_responses():
    print(f"{entry.url}")
    print(f"  Status: {entry.status_code}")
    print(f"  Total time: {entry.time}ms")
    print(f"  Content type: {entry.content_type}")

    # Detailed timing breakdown
    timings = entry.timings
    print(f"  DNS: {timings.get('dns', 0)}ms")
    print(f"  Connect: {timings.get('connect', 0)}ms")
    print(f"  Wait: {timings.get('wait', 0)}ms")
    print(f"  Receive: {timings.get('receive', 0)}ms")

# Same easy interface as WARC
pages = artifact.get_pages()
```

Handle errors by checking the job status and catching runtime exceptions:

```python
from scrapfly import Crawl, CrawlerConfig
try:
    crawl = Crawl(client, config)
    crawl.crawl().wait(max_wait=300)

    if crawl.status().is_complete:
        pages = crawl.warc().get_pages()
        print(f"Success! Got {len(pages)} pages")
    elif crawl.status().is_failed:
        print("Crawler failed")
except RuntimeError as e:
    print(f"Error: {e}")
```

If the examples can't find your API key, make sure you've either:
- Exported the environment variable: `export SCRAPFLY_API_KEY='your-key'`
- Created a `.env` file with your API key
If your API key is rejected, double-check that:
- Your API key is correct and starts with `scp-live-`
- You have an active Scrapfly subscription
- You're using the correct API key from your dashboard
The python-dotenv package is optional. If you see import warnings, you can either:

- Install it: `pip install python-dotenv`
- Ignore them - environment variables will still work