Housefly: A Hands-On Web Scraping Playground

Housefly is an interactive learning project designed to teach web scraping through structured challenges. Each chapter includes a companion website built specifically to be scraped, allowing you to practice in a controlled environment.

🌐 Translations: العربية · Español · فارسی · ગુજરાતી · हिन्दी · 日本語 · Русский · தமிழ் · Türkçe · اردو · 中文

Features

Realistic Web Scraping Challenges – Work with purpose-built websites.
Structured Learning – Progress through 11 guided exercises.
Automated Solution Checking – Verify your scrapers against expected outputs.
Progressive Hints – Get help when you're stuck without seeing the full solution.
Watch Mode – Auto-validate as you code.

Getting Started

Clone the Repository

git clone https://github.com/jonaylor89/housefly.git
cd housefly

Install Dependencies

pnpm install

Start the Chapter Servers

turbo dev

This starts all chapter target websites on fixed local ports (3001–3011).
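
As a small illustration of the port layout, the helper below maps a chapter number to its local base URL. It assumes chapter N is served on port 3000 + N, which is consistent with the 3001–3011 range above but is not confirmed by the project; `chapterUrl` is a hypothetical name, not part of the Housefly codebase.

```typescript
// Hypothetical helper: maps a chapter number to its local target site.
// Assumption: chapter N listens on port 3000 + N (chapters 1-11 -> 3001-3011).
const FIRST_CHAPTER = 1;
const LAST_CHAPTER = 11;

function chapterUrl(chapter: number): string {
  if (
    !Number.isInteger(chapter) ||
    chapter < FIRST_CHAPTER ||
    chapter > LAST_CHAPTER
  ) {
    throw new RangeError(
      `chapter must be an integer from ${FIRST_CHAPTER} to ${LAST_CHAPTER}`,
    );
  }
  return `http://localhost:${3000 + chapter}`;
}

console.log(chapterUrl(1)); // prints "http://localhost:3001"
```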

Navigate to a Chapter

Each chapter is in exercises/chapter-NN/ with a starter workspace, expected output, hints, and a reference solution.

Write Your Scraper

Edit the starter code in exercises/chapter-NN/starter/src/index.ts.
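
A starter file might begin along these lines. This is a minimal sketch, not the reference solution: it assumes the Chapter 1 site is reachable at http://localhost:3001, relies on the built-in `fetch` in Node 18+, and pulls the page title with a regular expression only to stay dependency-free (the chapter list names Cheerio as the parsing tool). `extractTitle` and `scrapeTitle` are illustrative names.

```typescript
// Extract the text of the first <title> element, or null if there is none.
// A regex keeps this sketch dependency-free; a real HTML parser such as
// Cheerio is more robust for anything beyond a trivial page.
function extractTitle(html: string): string | null {
  const match = /<title[^>]*>([\s\S]*?)<\/title>/i.exec(html);
  return match?.[1]?.trim() ?? null;
}

// Fetch a page and return its title. The default URL is an assumption.
async function scrapeTitle(
  url: string = "http://localhost:3001",
): Promise<string | null> {
  const response = await fetch(url);
  if (!response.ok) {
    throw new Error(`Request failed: ${response.status} ${response.statusText}`);
  }
  return extractTitle(await response.text());
}
```

With the chapter servers running, `scrapeTitle().then(console.log)` would print the page title. The output shape each chapter actually expects is defined in its expected/ folder.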

Validate Your Answer

# Using pnpm scripts
pnpm run validate -- <chapter>

# Or directly
pnpm tsx packages/cli/src/main.ts validate <chapter>

# Short alias
pnpm run ca <chapter>

Get Hints

pnpm run hint -- <chapter>

Watch Mode (auto-revalidate on save)

pnpm run watch -- <chapter>

Project Structure

housefly/
├── apps/
│   ├── tutorial/               # Next.js tutorial site
│   ├── chapter1/               # Target website for Chapter 1 (port 3001)
│   ├── chapter2/               # Target website for Chapter 2 (port 3002)
│   └── ...                     # Chapters 3–11
├── exercises/
│   ├── chapter-01/
│   │   ├── starter/src/        # Student workspace (edit this!)
│   │   ├── solution/src/       # Reference solution
│   │   ├── expected/           # Expected output
│   │   ├── chapter.config.ts   # Chapter metadata & hints
│   │   └── hints.md            # Progressive hints
│   └── ...                     # Chapters 02–11
├── packages/
│   ├── scraper-kit/            # Shared scraping utilities
│   ├── test-harness/           # Validation engine (Node/tsx)
│   └── cli/                    # housefly CLI tool
├── scripts/
│   └── verify_rearchitecture.sh  # Smoke-test script
└── turbo.json                  # Turborepo configuration

CLI Commands

Command	Description
`housefly run <chapter>`	Execute a chapter's starter code
`housefly validate <chapter>`	Run + compare against expected output
`housefly validate --all`	Validate all chapters (CI mode)
`housefly watch <chapter>`	Re-validate on file changes
`housefly hint <chapter>`	Show next progressive hint
`housefly reset <chapter>`	Restore starter files to original
`housefly open <chapter>`	Open exercise folder
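
The `validate` command is described as running a scraper and comparing the result against expected output. Conceptually, that comparison could look like the sketch below. It illustrates the general idea only and is not the actual test-harness code: `compareOutput` and its result shape are hypothetical, and the real harness may normalize or diff output differently.

```typescript
import { isDeepStrictEqual } from "node:util";

interface ValidationResult {
  passed: boolean;
  message: string;
}

// Hypothetical comparison step: deep-compare scraper output with the expected value.
function compareOutput(actual: unknown, expected: unknown): ValidationResult {
  if (isDeepStrictEqual(actual, expected)) {
    return { passed: true, message: "Output matches expected result." };
  }
  return {
    passed: false,
    message: `Expected ${JSON.stringify(expected)} but got ${JSON.stringify(actual)}`,
  };
}
```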

Chapters

#	Topic	Techniques
1	Hello World Scraping	HTTP fetch, Cheerio basics
2	Lists and Selectors	CSS selectors, data extraction
3	AI-Assisted Scraping	OpenAI API, LLM parsing
4	Dynamic Content	Playwright, JS-rendered pages
5	Infinite Scroll	Scroll detection, lazy loading
6	Multi-Page Crawling	Crawlee, link following
7	API Pagination	REST APIs, pagination
8	Authentication & Forms	Login flows, multi-step forms
9	GraphQL Scraping	GraphQL queries, mutations
10	Media Extraction	PDFs, images, videos
11	Polite Scraping	robots.txt, rate limiting, CAPTCHAs
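
Chapter 11 lists rate limiting among its techniques. As a rough illustration of the idea (not the chapter's solution), the sketch below processes items one at a time with a fixed pause between them. `politeMap`, the delay value, and the injected `task` function are all hypothetical.

```typescript
const sleep = (ms: number): Promise<void> =>
  new Promise((resolve) => setTimeout(resolve, ms));

// Run `task` for each item sequentially, waiting `delayMs` between calls
// so the target server never sees more than one request at a time.
async function politeMap<T, R>(
  items: readonly T[],
  task: (item: T) => Promise<R>,
  delayMs: number,
): Promise<R[]> {
  const results: R[] = [];
  let first = true;
  for (const item of items) {
    if (!first) await sleep(delayMs); // no pause before the first request
    first = false;
    results.push(await task(item));
  }
  return results;
}
```

A scraper could pass a list of URLs and a fetch-based `task`, for example `politeMap(urls, (u) => fetch(u).then((r) => r.text()), 1000)`, to space requests one second apart.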

Add Env Vars (Optional)

Some challenges require third-party APIs (e.g., OpenAI). Copy the template and fill in your keys:

cp .env.template .env
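
Scraper code can then read a key from the environment. In this sketch, `OPENAI_API_KEY` is an assumed variable name (check .env.template for the actual names), `requireEnv` is a hypothetical helper, and the .env file is assumed to have been loaded into the process environment already.

```typescript
// Hypothetical helper: return a required environment variable,
// or fail early with a message that points at the .env file.
function requireEnv(
  name: string,
  env: Record<string, string | undefined> = process.env,
): string {
  const value = env[name];
  if (value === undefined || value.trim() === "") {
    throw new Error(`Missing environment variable ${name}; add it to your .env file.`);
  }
  return value;
}
```

A chapter that needs the key could call `requireEnv("OPENAI_API_KEY")` at startup, so a missing key surfaces before any request is made.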

Contributing

Pull requests and suggestions are welcome! Feel free to open issues for bug reports or feature requests.

License

MIT License

Ready to Start Scraping?

👉 Try Housefly Now

Disclaimer

Housefly is intended for educational purposes. Scraping websites without their permission can violate their terms of service and, if done at an industrial scale, could get you into trouble.
