BabyCrawler 🕷️

 A high-performance, distributed, not-so-baby web crawler built in Go.

BabyCrawler is a scalable, cloud-native web crawler designed to harvest billions of web pages. It decouples Fetching (Network I/O) from Parsing (CPU) to maximize throughput, using Redis for coordination and S3/MinIO for storage.

💡 Inspo

Web Crawler deep dive

🏗️ Architecture

BabyCrawler implements a Distributed Microservices Architecture using the Producer-Consumer and Claim Check patterns.

  1. The Frontier (Redis): Manages the queue of URLs to be visited and handles deduplication.
  2. Fetcher Service (The Hunter):
    • Pulls URLs from the Frontier.
    • Checks robots.txt rules and enforces per-domain rate limits.
    • Downloads HTML and uploads it to S3 (MinIO).
    • Pushes a "Claim Check" (Reference ID) to the Parsing Queue (see the sketch after this list).
  3. Parser Service (The Butcher):
    • Pulls the Claim Check from the Parsing Queue.
    • Downloads the raw HTML from S3.
    • Extracts new links and normalizes them.
    • Pushes new links back to the Frontier.
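
The Claim Check hand-off is the core of this split: the heavy HTML payload goes to object storage, and only a small reference travels through Redis. Below is a minimal Go sketch of both sides, using the go-redis and minio-go clients; the queue name parse:queue and the helper names are illustrative, not the project's actual wiring (the bucket name crawled-data is the one created in the Quick Start below).

// Sketch of the Claim Check pattern between Fetcher and Parser.
// Queue and function names are illustrative.
package claimcheck

import (
	"bytes"
	"context"
	"crypto/sha256"
	"encoding/hex"
	"time"

	"github.com/minio/minio-go/v7"
	"github.com/redis/go-redis/v9"
)

// Fetcher side: store the HTML body in S3/MinIO, then enqueue only the object key.
func publishClaimCheck(ctx context.Context, rdb *redis.Client, s3 *minio.Client, pageURL string, html []byte) error {
	sum := sha256.Sum256([]byte(pageURL))
	objectKey := hex.EncodeToString(sum[:]) + ".html"

	// The heavy payload goes to object storage.
	if _, err := s3.PutObject(ctx, "crawled-data", objectKey,
		bytes.NewReader(html), int64(len(html)),
		minio.PutObjectOptions{ContentType: "text/html"}); err != nil {
		return err
	}

	// Only the lightweight reference (the "claim check") goes through Redis.
	return rdb.LPush(ctx, "parse:queue", objectKey).Err()
}

// Parser side: pop a claim check, redeem it against S3, and hand the HTML to the link extractor.
func consumeClaimCheck(ctx context.Context, rdb *redis.Client, s3 *minio.Client) (*minio.Object, error) {
	// BRPOP returns [queueName, value]; it errors with redis.Nil if the timeout elapses.
	res, err := rdb.BRPop(ctx, 5*time.Second, "parse:queue").Result()
	if err != nil {
		return nil, err
	}
	return s3.GetObject(ctx, "crawled-data", res[1], minio.GetObjectOptions{})
}

Keeping Redis payloads down to object keys is what lets the Frontier and Parsing Queue stay lightweight even as the stored HTML grows into billions of pages.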

✨ Features

  • Distributed Design: Scale Fetchers and Parsers independently.
  • Politeness: Per-domain rate limiting (Redis Token Bucket/Spin-lock); see the sketch after this list.
  • Compliance: Automatic robots.txt parsing and enforcement.
  • Fault Tolerance: Dead Letter Queue (DLQ) for failed requests.
  • Storage Efficient: Uses Claim Check Pattern to keep Redis lightweight (HTML stored in S3).
  • Observability: Structured JSON logging via zerolog.
  • Cloud Native: Fully containerized with Docker & Docker Compose.
  • Metrics: Prometheus Metrics.
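
As one way to picture the politeness check, here is a small Go sketch of a per-domain limiter backed by Redis. It uses a fixed-window counter (INCR + EXPIRE) as a simplified stand-in for the token bucket mentioned above; key names and the limit are illustrative, not the project's actual values.

// Sketch: per-domain politeness check backed by Redis.
// A fixed-window counter standing in for the token bucket; names and limits are illustrative.
package politeness

import (
	"context"
	"net/url"
	"time"

	"github.com/redis/go-redis/v9"
)

const perDomainLimit = 2 // max fetches per domain per one-second window

// Allow reports whether a fetch of rawURL is currently within budget for its domain.
func Allow(ctx context.Context, rdb *redis.Client, rawURL string) (bool, error) {
	u, err := url.Parse(rawURL)
	if err != nil {
		return false, err
	}
	key := "politeness:" + u.Hostname()

	n, err := rdb.Incr(ctx, key).Result()
	if err != nil {
		return false, err
	}
	if n == 1 {
		// First hit in this window: start the refill timer.
		rdb.Expire(ctx, key, time.Second)
	}
	return n <= perDomainLimit, nil
}

A worker that gets false can either re-queue the URL or back off and retry, which is where the spin-lock behaviour in the feature list comes in.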

Use Cases

Best Use: Highly efficient fetching and storage of static content, such as:

  • Collecting high volumes of data for LLM datasets
  • SEO / Link analysis (site graphs, finding broken links, etc.)
  • Web archiving
  • Vulnerability Scanning (filtering for secrets, keys, etc.)

This project is released under the MIT license. Feel free to reach out if you find other creative uses for it.

🚀 Getting Started

Prerequisites

  • Docker & Docker Compose
  • Go 1.21+ (Optional, for local dev)
  • Make (Optional)

Quick Start (Docker)

The easiest way to run the full stack (Redis + MinIO + Crawler + Parser):

# 1. Start the infrastructure and services
make up
# OR
docker-compose up --build

This will automatically:

  • Start Redis & MinIO.
  • Create the S3 bucket (crawled-data).
  • Launch the Fetcher.
  • Launch the Parser.

Scaling

To increase parsing throughput, scale the parser service horizontally:

# Run 5 concurrent parser instances
make scale
# OR
docker-compose up -d --scale parser=5 --no-recreate

🛠️ CLI Usage

BabyCrawler is built with cobra, offering a robust CLI.

Common Flags (Crawler AND Parser)

  • --redis-addr: Address of Redis server (default: localhost:6379)
  • --redis-pass: Password for Redis
  • --redis-db: Redis DB number (default: 0)
  • --s3-endpoint: S3 Endpoint URL (default: http://localhost:9000)
  • --s3-bucket: S3 Bucket name (default: crawled-data)
  • --s3-region: S3 Region (default: us-east-1)
  • --s3-user: S3 Access Key / User (default: admin)
  • --s3-pass: S3 Secret Key / Password (default: password)
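
As an illustration of how flags like these are typically declared with cobra, here is a sketch; the flag names and defaults mirror the list above, but the command wiring is illustrative rather than the project's actual cmd/crawler/main.go.

// Sketch: declaring shared flags with cobra. Illustrative wiring only.
package main

import (
	"fmt"

	"github.com/spf13/cobra"
)

func main() {
	var (
		redisAddr  string
		s3Endpoint string
		s3Bucket   string
	)

	rootCmd := &cobra.Command{
		Use:   "crawler",
		Short: "BabyCrawler fetcher service",
		RunE: func(cmd *cobra.Command, args []string) error {
			fmt.Printf("redis=%s s3=%s bucket=%s\n", redisAddr, s3Endpoint, s3Bucket)
			return nil // a real service would start its worker pool here
		},
	}

	rootCmd.Flags().StringVar(&redisAddr, "redis-addr", "localhost:6379", "Address of Redis server")
	rootCmd.Flags().StringVar(&s3Endpoint, "s3-endpoint", "http://localhost:9000", "S3 Endpoint URL")
	rootCmd.Flags().StringVar(&s3Bucket, "s3-bucket", "crawled-data", "S3 Bucket name")

	if err := rootCmd.Execute(); err != nil {
		fmt.Println(err)
	}
}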

Crawler Specific Flags

  • --seed: Comma-separated list of start URLs.
  • --workers: Number of crawler workers
  • --metrics-port: Port for Metrics server (crawler) (default: 9190)

Parser Specific Flags

  • --workers: Number of parser workers (default: 10)
  • --metrics-port: Port for Metrics server (parser) (default: 9191)
  • --cross-domain: Allow the crawler to follow links across different domains (default: false)
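
For intuition, a short Go sketch of the kind of check --cross-domain implies: when it is false, only links on the same host as the page they were found on go back to the Frontier. The function name and structure are illustrative, not the project's actual code.

// Sketch: link filtering as implied by --cross-domain. Names are illustrative.
package parser

import "net/url"

func shouldEnqueue(sourceURL, linkURL string, crossDomain bool) bool {
	src, err := url.Parse(sourceURL)
	if err != nil {
		return false
	}
	link, err := url.Parse(linkURL)
	if err != nil {
		return false
	}
	// Resolve relative links against the page they were found on.
	abs := src.ResolveReference(link)

	if crossDomain {
		return true
	}
	return abs.Hostname() == src.Hostname()
}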

The Fetcher (Crawler)

go run cmd/crawler/main.go --help

Usage:
  crawler [flags]

Flags:
      --seed string         Comma-separated list of start URLs
      --redis-addr string   Address of Redis server (default "localhost:6379")
      --s3-endpoint string  S3 Endpoint URL (default "http://localhost:9000")
      --s3-bucket string    S3 Bucket name (default "crawled-data")

Example:

go run cmd/crawler/main.go --seed "https://github.com,https://google.com"

The Parser

go run cmd/parser/main.go --help

Usage:
  parser [flags]

Flags:
      --redis-addr string   Address of Redis server (default "localhost:6379")
      --s3-endpoint string  S3 Endpoint URL (default "http://localhost:9000")

🧪 Development

Running Locally

If you want to run the Go binaries outside of Docker (for debugging), ensure you have Redis and MinIO running:

# Start Infra
docker-compose up -d redis minio create-buckets

# Run Crawler
make run-crawler

# Run Parser (in a separate terminal)
make run-parser
