Housefly is an interactive learning project designed to teach web scraping through structured challenges. Each chapter includes a companion website built specifically to be scraped, allowing you to practice in a controlled environment.
🌐 Translations: العربية · Español · فارسی · ગુજરાતી · हिन्दी · 日本語 · Русский · தமிழ் · Türkçe · اردو · 中文
- Realistic Web Scraping Challenges – Work with purpose-built websites.
- Structured Learning – Progress through 11 guided exercises.
- Automated Solution Checking – Verify your scrapers against expected outputs.
- Progressive Hints – Get help when you're stuck without seeing the full solution.
- Watch Mode – Auto-validate as you code.
- Clone the Repository
git clone https://github.com/jonaylor89/housefly.git
cd housefly
- Install Dependencies
pnpm install
- Start the Chapter Servers
turbo dev
This starts all chapter target websites on fixed local ports (3001–3011).
- Navigate to a Chapter
Each chapter is in exercises/chapter-NN/ with a starter workspace, expected output, hints, and a reference solution.
- Write Your Scraper
Edit the starter code in exercises/chapter-NN/starter/src/index.ts.
- Validate Your Answer
# Using pnpm scripts
pnpm run validate -- <chapter>
# Or directly
pnpm tsx packages/cli/src/main.ts validate <chapter>
# Short alias
pnpm run ca <chapter>
- Get Hints
pnpm run hint -- <chapter>
- Watch Mode (auto-revalidate on save)
pnpm run watch -- <chapter>
housefly/
├── apps/
│ ├── tutorial/ # Next.js tutorial site
│ ├── chapter1/ # Target website for Chapter 1 (port 3001)
│ ├── chapter2/ # Target website for Chapter 2 (port 3002)
│ └── ... # Chapters 3–11
├── exercises/
│ ├── chapter-01/
│ │ ├── starter/src/ # Student workspace (edit this!)
│ │ ├── solution/src/ # Reference solution
│ │ ├── expected/ # Expected output
│ │ ├── chapter.config.ts # Chapter metadata & hints
│ │ └── hints.md # Progressive hints
│ └── ... # Chapters 02–11
├── packages/
│ ├── scraper-kit/ # Shared scraping utilities
│ ├── test-harness/ # Validation engine (Node/tsx)
│ └── cli/ # housefly CLI tool
├── scripts/
│ └── verify_rearchitecture.sh # Smoke-test script
└── turbo.json # Turborepo configuration
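The layout above suggests a simple mapping from a chapter number to its target site and starter file. The sketch below spells that mapping out. The helper names `chapterUrl` and `starterEntry` are hypothetical and not part of the housefly packages, and the `3000 + N` port rule is an assumption inferred from the ports listed for chapters 1 and 2 and the 3001–3011 range.

```typescript
// Hypothetical helpers for illustration; not part of the housefly codebase.
const FIRST_CHAPTER = 1;
const LAST_CHAPTER = 11;
const BASE_PORT = 3000; // assumption: chapter N is served on port 3000 + N

// Reject anything that is not a whole chapter number in the documented range.
function assertChapter(chapter: number): void {
  if (!Number.isInteger(chapter) || chapter < FIRST_CHAPTER || chapter > LAST_CHAPTER) {
    throw new RangeError(
      `chapter must be an integer from ${FIRST_CHAPTER} to ${LAST_CHAPTER}, got ${chapter}`,
    );
  }
}

// Local URL of the target website for a chapter.
function chapterUrl(chapter: number): string {
  assertChapter(chapter);
  return `http://localhost:${BASE_PORT + chapter}`;
}

// Path of the starter file to edit, using the zero-padded chapter-NN folder name.
function starterEntry(chapter: number): string {
  assertChapter(chapter);
  const nn = String(chapter).padStart(2, "0");
  return `exercises/chapter-${nn}/starter/src/index.ts`;
}

console.log(chapterUrl(4)); // http://localhost:3004
console.log(starterEntry(4)); // exercises/chapter-04/starter/src/index.ts
```

Note the difference in naming: the apps use unpadded names (`chapter1`), while the exercise folders are zero-padded (`chapter-01`).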
| Command | Description |
|---|---|
| `housefly run <chapter>` | Execute a chapter's starter code |
| `housefly validate <chapter>` | Run + compare against expected output |
| `housefly validate --all` | Validate all chapters (CI mode) |
| `housefly watch <chapter>` | Re-validate on file changes |
| `housefly hint <chapter>` | Show next progressive hint |
| `housefly reset <chapter>` | Restore starter files to original |
| `housefly open <chapter>` | Open exercise folder |
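To make "compare against expected output" concrete, here is a minimal sketch of a structural comparison between a scraper's result and an expected value. It is not the test-harness implementation, whose details are not described here. The `diffJson` function, the `Mismatch` shape, and the assumption that outputs are JSON-like values are all illustrative.

```typescript
// Illustrative only: assumes scraper output and expected output are JSON-like values.
type Json = null | boolean | number | string | Json[] | { [key: string]: Json };

interface Mismatch {
  path: string; // location of the difference, e.g. "$.items[1]"
  expected: Json | undefined; // undefined means the key or index is missing
  actual: Json | undefined;
}

function isObject(value: Json | undefined): value is { [key: string]: Json } {
  return typeof value === "object" && value !== null && !Array.isArray(value);
}

// Walk both values in parallel and collect every leaf-level difference.
function diffJson(expected: Json | undefined, actual: Json | undefined, path = "$"): Mismatch[] {
  if (Array.isArray(expected) && Array.isArray(actual)) {
    const out: Mismatch[] = [];
    const length = Math.max(expected.length, actual.length);
    for (let i = 0; i < length; i++) {
      out.push(...diffJson(expected[i], actual[i], `${path}[${i}]`));
    }
    return out;
  }
  if (isObject(expected) && isObject(actual)) {
    const out: Mismatch[] = [];
    const keys = Array.from(new Set(Object.keys(expected).concat(Object.keys(actual))));
    for (const key of keys) {
      out.push(...diffJson(expected[key], actual[key], `${path}.${key}`));
    }
    return out;
  }
  // Primitives, or values of different kinds: compare directly.
  return expected === actual ? [] : [{ path, expected, actual }];
}

const expectedOutput = { title: "Hello", items: ["a", "b"] };
const actualOutput = { title: "Hello", items: ["a", "c"] };
console.log(diffJson(expectedOutput, actualOutput));
// one mismatch at "$.items[1]": expected "b", actual "c"
```

Reporting a path for each mismatch, rather than a single pass/fail flag, makes it easier to see which selector or field a scraper got wrong.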
| # | Topic | Techniques |
|---|---|---|
| 1 | Hello World Scraping | HTTP fetch, Cheerio basics |
| 2 | Lists and Selectors | CSS selectors, data extraction |
| 3 | AI-Assisted Scraping | OpenAI API, LLM parsing |
| 4 | Dynamic Content | Playwright, JS-rendered pages |
| 5 | Infinite Scroll | Scroll detection, lazy loading |
| 6 | Multi-Page Crawling | Crawlee, link following |
| 7 | API Pagination | REST APIs, pagination |
| 8 | Authentication & Forms | Login flows, multi-step forms |
| 9 | GraphQL Scraping | GraphQL queries, mutations |
| 10 | Media Extraction | PDFs, images, videos |
| 11 | Polite Scraping | robots.txt, rate limiting, CAPTCHAs |
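As a taste of the robots.txt topic in Chapter 11, the sketch below checks whether a path is disallowed for a given user agent. It is deliberately simplified and is not taken from the chapter's solution: it handles only `User-agent` and `Disallow` lines, uses plain prefix matching, ignores `Allow` rules and wildcards, and merges the `*` group with an exact-name group instead of letting the more specific group win. The function names are hypothetical.

```typescript
// Simplified illustration; a real crawler should use a full robots.txt parser.
function disallowedPaths(robotsTxt: string, userAgent = "*"): string[] {
  const rules: string[] = [];
  let applies = false; // does the current group apply to our user agent?
  let inAgentRun = false; // consecutive User-agent lines share one group

  for (const rawLine of robotsTxt.split(/\r?\n/)) {
    const line = rawLine.replace(/#.*$/, "").trim(); // strip comments
    const sep = line.indexOf(":");
    if (line === "" || sep === -1) continue;

    const field = line.slice(0, sep).trim().toLowerCase();
    const value = line.slice(sep + 1).trim();

    if (field === "user-agent") {
      const matches = value === "*" || value.toLowerCase() === userAgent.toLowerCase();
      applies = inAgentRun ? applies || matches : matches;
      inAgentRun = true;
    } else {
      inAgentRun = false;
      // An empty Disallow value means "nothing is disallowed", so skip it.
      if (field === "disallow" && applies && value !== "") rules.push(value);
    }
  }
  return rules;
}

function isAllowed(robotsTxt: string, path: string, userAgent = "*"): boolean {
  return !disallowedPaths(robotsTxt, userAgent).some((prefix) => path.startsWith(prefix));
}

const sampleRobots = [
  "User-agent: *",
  "Disallow: /private",
  "",
  "User-agent: badbot",
  "Disallow: /",
].join("\n");

console.log(isAllowed(sampleRobots, "/public/page")); // true
console.log(isAllowed(sampleRobots, "/private/data")); // false
console.log(isAllowed(sampleRobots, "/public/page", "badbot")); // false
```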
Some challenges require third-party APIs (e.g., OpenAI). Copy the template and fill in your keys:
cp .env.template .env
Pull requests and suggestions are welcome! Feel free to open issues for bug reports or feature requests.
MIT License
This project is for educational purposes. Scraping websites that don't permit it can violate their terms of service, and doing so at an industrial scale can get you into trouble.
