A high-performance, distributed, not-so-baby web crawler built in Go.
BabyCrawler is a scalable, cloud-native web crawler designed to harvest billions of web pages. It decouples Fetching (Network I/O) from Parsing (CPU) to maximize throughput, using Redis for coordination and S3/MinIO for storage.
BabyCrawler implements a distributed microservices architecture using the Producer-Consumer and Claim Check patterns; a minimal sketch of the fetcher's side of this flow follows the architecture list below.
- The Frontier (Redis): Manages the queue of URLs to be visited and handles deduplication.
- Fetcher Service (The Hunter):
  - Pulls URLs from the Frontier.
  - Checks robots.txt and per-domain rate limits.
  - Downloads the HTML and uploads it to S3 (MinIO).
  - Pushes a "Claim Check" (a reference ID) to the parsing queue.
- Parser Service (The Butcher):
  - Pulls the Claim Check from the parsing queue.
  - Downloads the raw HTML from S3.
  - Extracts new links and normalizes them.
  - Pushes the new links back to the Frontier.
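To make the Claim Check flow concrete, here is a minimal sketch of the fetcher's half of the loop. The Redis key names (`frontier:queue`, `parse:queue`) are hypothetical and the go-redis/minio-go wiring is illustrative; the real names and wiring live in the project source:

```go
package main

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"io"
	"log"
	"net/http"

	"github.com/minio/minio-go/v7"
	"github.com/minio/minio-go/v7/pkg/credentials"
	"github.com/redis/go-redis/v9"
)

// fetchOne pops a URL from the frontier, downloads it, stores the body in
// S3/MinIO, and pushes only the object key (the "Claim Check") to the
// parsing queue, so the payload never touches Redis.
func fetchOne(ctx context.Context, rdb *redis.Client, s3 *minio.Client) error {
	url, err := rdb.LPop(ctx, "frontier:queue").Result()
	if err != nil {
		return err // redis.Nil means the frontier is empty
	}

	resp, err := http.Get(url)
	if err != nil {
		return err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	if err != nil {
		return err
	}

	// Derive a stable object key from the URL.
	sum := sha256.Sum256([]byte(url))
	key := hex.EncodeToString(sum[:]) + ".html"
	_, err = s3.PutObject(ctx, "crawled-data", key,
		bytes.NewReader(body), int64(len(body)),
		minio.PutObjectOptions{ContentType: "text/html"})
	if err != nil {
		return err
	}

	// The Claim Check: a lightweight reference the parser redeems later.
	return rdb.RPush(ctx, "parse:queue", key).Err()
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	s3, err := minio.New("localhost:9000", &minio.Options{
		Creds:  credentials.NewStaticV4("admin", "password", ""),
		Secure: false,
	})
	if err != nil {
		log.Fatal(err)
	}
	if err := fetchOne(ctx, rdb, s3); err != nil {
		log.Println(err)
	}
}
```

The parser mirrors this flow: it pops the key from the parsing queue, downloads the object from S3, extracts and normalizes links, and pushes the new URLs back to the Frontier.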
- Distributed Design: Scale Fetchers and Parsers independently.
- Politeness: Per-domain rate limiting (Redis token bucket / spin-lock; see the sketch after this list).
- Compliance: Automatic robots.txt parsing and enforcement.
- Fault Tolerance: Dead Letter Queue (DLQ) for failed requests.
- Storage Efficient: Uses Claim Check Pattern to keep Redis lightweight (HTML stored in S3).
- Observability: Structured JSON logging via zerolog.
- Cloud Native: Fully containerized with Docker & Docker Compose.
- Metrics: Exposes Prometheus metrics (see `--metrics-port`).
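One simple way to realize the per-domain politeness described above is a Redis counter with a TTL. The sketch below is a fixed-window stand-in for the token bucket, using a hypothetical `rate:<domain>` key scheme; it illustrates the idea rather than BabyCrawler's exact implementation:

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	"github.com/redis/go-redis/v9"
)

// allow reports whether a fetch of the given domain is permitted right now.
// It INCRs a per-domain counter and expires it after one second, so each
// domain gets at most `limit` fetches per one-second window.
func allow(ctx context.Context, rdb *redis.Client, domain string, limit int64) (bool, error) {
	key := "rate:" + domain
	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First request in this window: start the one-second TTL.
		if err := rdb.Expire(ctx, key, time.Second).Err(); err != nil {
			return false, err
		}
	}
	return n <= limit, nil
}

func main() {
	ctx := context.Background()
	rdb := redis.NewClient(&redis.Options{Addr: "localhost:6379"})
	ok, err := allow(ctx, rdb, "example.com", 2)
	if err != nil {
		log.Fatal(err)
	}
	fmt.Println("fetch allowed:", ok)
}
```

A fetcher worker would call `allow` before issuing the HTTP request and requeue the URL (or back off briefly) when it returns `false`.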
Best use: highly efficient fetching and storing of static content, for example:
- Collecting high volumes of data for LLM datasets
- SEO / Link analysis (site graphs, finding broken links, etc.)
- Web archiving
- Vulnerability Scanning (filtering for secrets, keys, etc.)
This project is released under the MIT license. Please feel free to reach out if you find other creative uses for this project.
- Docker & Docker Compose
- Go 1.21+ (Optional, for local dev)
- Make (Optional)
The easiest way to run the full stack (Redis + MinIO + Crawler + Parser):
```bash
# 1. Start the infrastructure and services
make up

# OR
docker-compose up --build
```

This will automatically:
- Start Redis & MinIO.
- Create the S3 bucket (crawled-data).
- Launch the Fetcher.
- Launch the Parser.
To increase parsing throughput, scale the parser service horizontally:
```bash
# Run 5 concurrent parser instances
make scale

# OR
docker-compose up -d --scale parser=5 --no-recreate
```

BabyCrawler is built with Cobra, offering a robust CLI; a minimal sketch of how the flags below are wired up follows the flag reference.
Global flags (shared by the crawler and the parser):

- `--redis-addr`: Address of Redis server (default: `localhost:6379`)
- `--redis-pass`: Password for Redis
- `--redis-db`: Redis DB number (default: `0`)
- `--s3-endpoint`: S3 Endpoint URL (default: `http://localhost:9000`)
- `--s3-bucket`: S3 Bucket name (default: `crawled-data`)
- `--s3-region`: S3 Region (default: `us-east-1`)
- `--s3-user`: S3 Access Key / User (default: `admin`)
- `--s3-pass`: S3 Secret Key / Password (default: `password`)
Crawler flags:

- `--seed`: Comma-separated list of start URLs
- `--workers`: Number of crawler workers
- `--metrics-port`: Port for the crawler's metrics server (default: `9190`)
Parser flags:

- `--workers`: Number of parser workers (default: `10`)
- `--metrics-port`: Port for the parser's metrics server (default: `9191`)
- `--cross-domain`: Allow crawling links across different domains (default: `false`)
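For illustration, here is roughly how such flags can be wired up with Cobra. The flag names match the reference above, but the command body is a hypothetical stub rather than the project's actual entry point, and defaults are illustrative where the reference does not pin them down:

```go
package main

import (
	"fmt"
	"log"

	"github.com/spf13/cobra"
)

func main() {
	var (
		seed      string
		redisAddr string
		workers   int
	)

	cmd := &cobra.Command{
		Use:   "crawler",
		Short: "Fetches URLs from the frontier and stores pages in S3",
		RunE: func(cmd *cobra.Command, args []string) error {
			// Stub: the real command would construct the Redis/S3 clients
			// and start the worker pool.
			fmt.Printf("seed=%q redis=%s workers=%d\n", seed, redisAddr, workers)
			return nil
		},
	}

	cmd.Flags().StringVar(&seed, "seed", "", "Comma-separated list of start URLs")
	cmd.Flags().StringVar(&redisAddr, "redis-addr", "localhost:6379", "Address of Redis server")
	cmd.Flags().IntVar(&workers, "workers", 10, "Number of crawler workers")

	if err := cmd.Execute(); err != nil {
		log.Fatal(err)
	}
}
```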
```bash
go run cmd/crawler/main.go --help
```

```
Usage:
  crawler [flags]

Flags:
      --seed string          Comma-separated list of start URLs
      --redis-addr string    Address of Redis server (default "localhost:6379")
      --s3-endpoint string   S3 Endpoint URL (default "http://localhost:9000")
      --s3-bucket string     S3 Bucket name (default "crawled-data")
```

Example:

```bash
go run cmd/crawler/main.go --seed "https://github.com,https://google.com"
```

```bash
go run cmd/parser/main.go --help
```
```
Usage:
  parser [flags]

Flags:
      --redis-addr string    Address of Redis server (default "localhost:6379")
      --s3-endpoint string   S3 Endpoint URL (default "http://localhost:9000")
```

If you want to run the Go binaries outside of Docker (for debugging), make sure Redis and MinIO are running:
```bash
# Start Infra
docker-compose up -d redis minio create-buckets

# Run Crawler
make run-crawler

# Run Parser (in a separate terminal)
make run-parser
```