# Firecrawl Shopping Scraper

A product extraction agent built with the **Upsonic AI Agent Framework** and **FirecrawlTools**. Point it at any shopping website and it scrapes the page, extracts product names, prices, and descriptions, and returns the results as a clean, sorted table.

The example targets [books.toscrape.com](http://books.toscrape.com), a public demo bookstore built for scraping practice, but the same pattern works for any publicly accessible e-commerce site.

## Features

- **Single-page scraping**: Fetches a shop page and converts it to clean Markdown via Firecrawl
- **LLM-powered extraction**: The agent reads the Markdown and pulls out structured product data without custom parsers or CSS selectors
- **Minimal tool surface**: Only `scrape_url` is enabled, so the agent cannot accidentally crawl, search, or batch-scrape
- **Sorted output**: Products are returned as a Markdown table ordered by price descending, with a summary line showing the total count and price range
- **Extensible**: Switch to `crawl_website` for multi-page crawling or `extract_data` for schema-driven JSON extraction

## Prerequisites

- Python 3.10+
- A Firecrawl API key (sign up for free at [firecrawl.dev](https://firecrawl.dev))
- An Anthropic API key (or swap the model for any Upsonic-supported provider)

## Installation

1. Navigate to this directory:

   ```bash
   cd examples/firecrawl_shopping_scraper
   ```

2. Create and activate a virtual environment:

   ```bash
   # With uv (recommended)
   uv venv && source .venv/bin/activate

   # With pip
   python3 -m venv .venv && source .venv/bin/activate
   ```

3. Install dependencies:

   ```bash
   # With uv
   uv pip install -r requirements.txt

   # With pip
   pip install -r requirements.txt
   ```

4. Set up your environment variables:

   ```bash
   cp .env.example .env
   ```

   Then open `.env` and fill in your keys:

   ```bash
   FIRECRAWL_API_KEY=fc-your-key-here
   ANTHROPIC_API_KEY=your-anthropic-key-here
   ```

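Before running, you can sanity-check that both keys are actually visible to Python. This is a small sketch (not part of `main.py`; the `missing_keys` helper is hypothetical) that reports any required variable that is unset or empty:

```python
import os

REQUIRED_KEYS = ("FIRECRAWL_API_KEY", "ANTHROPIC_API_KEY")

def missing_keys(env=os.environ):
    """Return the names of required API keys that are unset or empty."""
    return [key for key in REQUIRED_KEYS if not env.get(key)]

if __name__ == "__main__":
    missing = missing_keys()
    if missing:
        print("Missing keys:", ", ".join(missing))
    else:
        print("All API keys are set.")
```

If you use a `.env` file rather than exporting the variables in your shell, make sure whatever loads it (for example, `python-dotenv`) runs before this check.
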
## Usage

Run the agent:

```bash
python main.py
# or
uv run main.py
```

Example output:

```
Found 20 products · Price range: £10.00 - £59.69

| #  | Book Title                                    | Price  | Rating |
|----|-----------------------------------------------|--------|--------|
| 1  | Libertarianism for Beginners                  | £59.69 | Two    |
| 2  | It's Only the Himalayas                       | £52.29 | Two    |
| 3  | The Black Maria                               | £52.15 | One    |
| 4  | Starving Hearts (Triangular Trade Trilogy...) | £13.99 | Two    |
...
```

To target a different shop, change the URL in the task description inside `main.py`:

```python
task = Task(
    description="""
    Scrape https://your-target-shop.com and extract all visible products.
    For each product return name, price, and a short description (1-2 sentences).
    Format as a Markdown table sorted by price descending.
    """
)
```

## Project Structure

```
firecrawl_shopping_scraper/
├── main.py            # Agent setup and task definition
├── requirements.txt   # Python dependencies
├── .env.example       # Environment variable template
└── README.md          # This file
```

## How It Works

1. **FirecrawlTools is configured** with only `scrape_url` enabled. This keeps the agent focused and prevents it from issuing unnecessary crawl or search calls.

2. **The task description** tells the agent which page to scrape and exactly what to extract. No custom parser is needed; the LLM reads the Markdown Firecrawl returns and identifies product blocks by structure and context.

3. **Firecrawl fetches the page** and returns it as clean Markdown, stripping navigation, ads, and boilerplate so the LLM gets a compact, structured representation of the content.

4. **The agent extracts and formats** each product row into a Markdown table, sorts by price descending, and prepends a summary line.
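The sort-and-summarize step can be illustrated in plain Python. This is not code from `main.py` (the agent performs this transformation via the LLM); it is just a sketch of the same operation on hypothetical extracted `(name, price)` pairs:

```python
import re

def product_table(products):
    """Sort (name, price_string) pairs by price descending and render a
    Markdown table with a summary line, mirroring the agent's output format."""
    def price(pair):
        # Pull the numeric part out of a price string such as "£59.69".
        return float(re.search(r"[\d.]+", pair[1]).group())

    rows = sorted(products, key=price, reverse=True)
    lines = [
        f"Found {len(rows)} products · Price range: {rows[-1][1]} - {rows[0][1]}",
        "",
        "| # | Name | Price |",
        "|---|------|-------|",
    ]
    lines += [f"| {i} | {name} | {p} |" for i, (name, p) in enumerate(rows, 1)]
    return "\n".join(lines)

print(product_table([("It's Only the Himalayas", "£52.29"),
                     ("Libertarianism for Beginners", "£59.69")]))
```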

### Extending the example

To crawl multiple pages instead of just the homepage, enable `crawl_website`:

```python
firecrawl = FirecrawlTools(
    enable_scrape=False,
    enable_crawl=True,
    enable_crawl_management=True,
)

task = Task(
    description="""
    Crawl http://books.toscrape.com up to 5 pages and extract every product:
    name, price, and rating. Return a single Markdown table sorted by price descending.
    """
)
```

To get structured JSON output directly from Firecrawl's LLM extraction layer, enable `extract_data`:

```python
firecrawl = FirecrawlTools(
    enable_scrape=False,
    enable_extract=True,
)

task = Task(
    description="""
    Use extract_data on http://books.toscrape.com/* with this schema:
    {"products": [{"name": "string", "price": "string", "rating": "string"}]}
    Return the raw result.
    """
)
```
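
The JSON that comes back from this schema can then be post-processed in ordinary Python. A minimal sketch (the `products` key matches the schema above; the `summarize` helper and sample payload are hypothetical):

```python
import json

def summarize(raw):
    """Parse an extract_data-style result and return a one-line summary."""
    data = json.loads(raw) if isinstance(raw, str) else raw
    products = data["products"]
    # Strip the currency symbol and sort numerically.
    prices = sorted(float(p["price"].lstrip("£$")) for p in products)
    return (f"Found {len(products)} products · "
            f"Price range: £{prices[0]:.2f} - £{prices[-1]:.2f}")

raw = '{"products": [{"name": "The Black Maria", "price": "£52.15", "rating": "One"}]}'
print(summarize(raw))
```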