
ForgeCrawl — Master Design Document

Version: 1.0
Date: March 3, 2026
Author: cschweda
Status: Draft


1. Project Overview

ForgeCrawl is a self-hosted, authenticated web scraper that converts website content into clean Markdown optimized for LLM consumption. Inspired by Firecrawl, ForgeCrawl fills a gap in the open-source ecosystem: a self-hosted solution with proper authentication, a management UI, and LLM-ready output — without SaaS dependencies or public endpoints.

Core Value Proposition

  • Self-hosted & private: Runs on your own infrastructure with no public API endpoints
  • Authenticated: Built-in bcrypt + JWT auth with first-run admin registration, multi-user support, and API key access
  • LLM-optimized: Outputs clean Markdown with optional RAG-ready chunking and metadata
  • Full rendering: Puppeteer-based JavaScript rendering for SPAs and dynamic content
  • Login-gated scraping: Can authenticate against target sites to scrape protected content

What ForgeCrawl Is Not

  • A public API service or SaaS platform
  • A search engine crawler or indexer
  • A general-purpose web archiver

2. Tech Stack

| Layer | Technology | Version |
| --- | --- | --- |
| Framework | Nuxt 4 | >=4.3.1 |
| UI | Nuxt UI | 4+ |
| Package Manager | pnpm | latest |
| Runtime | Node.js | latest LTS |
| Auth | Built-in (bcrypt + jose JWT) | n/a |
| Database | SQLite via better-sqlite3 | latest |
| ORM | Drizzle ORM | latest |
| JS Rendering | Puppeteer | latest |
| Content Extraction | Mozilla Readability | latest |
| HTML to Markdown | Turndown | latest |
| Process Manager | PM2 | latest |
| Hosting | DigitalOcean Droplet | see Deployment section |

Key Libraries

  • puppeteer — headless Chromium for JS-rendered pages
  • @mozilla/readability — article content extraction (same engine as Firefox Reader View)
  • turndown — HTML-to-Markdown conversion with plugin support
  • turndown-plugin-gfm — GitHub Flavored Markdown tables, strikethrough, task lists
  • better-sqlite3 — embedded SQLite database (zero-config, WAL mode)
  • drizzle-orm — type-safe ORM for SQLite (and Postgres, if upgrading later)
  • bcrypt — password hashing (12 salt rounds)
  • jose — JWT signing and verification (HS256)
  • cheerio — fast HTML parsing for non-JS pages
  • robots-parser — robots.txt compliance

3. Architecture

+------------------------------------------------------+
|                  Nuxt 4 Application                  |
|                                                      |
|  +--------------+   +-----------------------------+  |
|  |  Nuxt UI     |   |  Nitro Server               |  |
|  |  Admin Panel |   |                             |  |
|  |              |   |  +-----------------------+  |  |
|  |  - Dashboard |   |  |  Auth Middleware      |  |  |
|  |  - Scrapes   |   |  |  (bcrypt/JWT +        |  |  |
|  |  - Users     |   |  |   API Key validation) |  |  |
|  |  - Settings  |   |  +-----------+-----------+  |  |
|  |  - API Keys  |   |              |              |  |
|  +--------------+   |  +-----------v-----------+  |  |
|                     |  |  API Routes           |  |  |
|                     |  |  /api/scrape          |  |  |
|                     |  |  /api/crawl           |  |  |
|                     |  |  /api/jobs            |  |  |
|                     |  |  /api/admin/*         |  |  |
|                     |  +-----------+-----------+  |  |
|                     |              |              |  |
|                     |  +-----------v-----------+  |  |
|                     |  |  Scraping Engine      |  |  |
|                     |  |                       |  |  |
|                     |  |  HTTP Fetch ----+     |  |  |
|                     |  |  Puppeteer -----+     |  |  |
|                     |  |                 v     |  |  |
|                     |  |  Readability -> Clean |  |  |
|                     |  |  Turndown ----> MD    |  |  |
|                     |  |  Chunker -----> RAG   |  |  |
|                     |  +-----------------------+  |  |
|                     +-----------------------------+  |
+---------------------------+--------------------------+
                            |
              +-------------v--------------+
              |    SQLite (local file)     |
              |                            |
              |  - Auth (users, sessions)  |
              |  - Jobs, scrape history    |
              |  - WAL mode, Drizzle ORM   |
              +----------------------------+

              +----------------------------+
              |   Local Filesystem (opt.)  |
              |                            |
              |  /data/scrapes/            |
              |    +-- {id}/               |
              |    |   +-- raw.html        |
              |    |   +-- content.md      |
              |    |   +-- chunks.json     |
              +----------------------------+

Request Flow

  1. Client sends request (browser session or API key in header)
  2. Nitro middleware validates authentication
  3. Route handler dispatches to scraping engine
  4. Engine fetches page (HTTP or Puppeteer based on config)
  5. Readability extracts article content
  6. Turndown converts to Markdown
  7. Optional: Chunker splits into RAG-ready segments
  8. Results stored per configuration (database, filesystem, or both)
  9. Response returned to client
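
As an illustration, steps 2-9 collapse into a single Nitro route handler. This is a sketch only: runScrape and its option names are placeholders, not the engine's actual API.

// server/api/scrape.post.ts (sketch; runScrape is a placeholder name)
import { runScrape } from '../engine/scraper'

export default defineEventHandler(async (event) => {
  // Steps 1-2: authentication already validated by server/middleware/auth.ts
  const body = await readBody<{ url: string; render_js?: boolean }>(event)
  if (!body?.url) {
    throw createError({ statusCode: 400, statusMessage: 'url is required' })
  }

  // Steps 3-8: fetch, extract, convert, optionally chunk, then store
  const result = await runScrape(body.url, { renderJs: body.render_js ?? true })

  // Step 9: return the result to the client
  return { id: result.id, markdown: result.markdown, chunks: result.chunks }
})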

4. Data Model

Core Tables (SQLite via Drizzle ORM)

The database schema is defined in TypeScript using Drizzle ORM (server/db/schema.ts). See Document 11 for the complete schema definition. Key tables:

  • app_config — first-run detection and application settings
  • users — built-in auth (email, bcrypt password hash, role)
  • api_keys — bcrypt-hashed API keys with prefix for lookup
  • scrape_jobs — job tracking (single, crawl, batch)
  • scrape_results — scraped content (Markdown, HTML, metadata)
  • scrape_chunks — RAG-ready content chunks with token counts
  • job_queue — SQLite-backed job queue (pending, locked, completed)
  • site_credentials — encrypted credentials for login-gated scraping
  • usage_log — per-user action tracking

The database auto-creates on first run. SQLite runs in WAL mode with busy_timeout = 5000 for concurrent read/write safety. Drizzle Kit handles migrations automatically on startup.
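
A minimal sketch of that setup (the database path is illustrative):

// server/db/index.ts: minimal sketch of the SQLite + Drizzle setup
import Database from 'better-sqlite3'
import { drizzle } from 'drizzle-orm/better-sqlite3'
import * as schema from './schema'

const sqlite = new Database('data/forgecrawl.db')  // path is illustrative
sqlite.pragma('journal_mode = WAL')   // concurrent reads while writing
sqlite.pragma('busy_timeout = 5000')  // wait up to 5s on a locked database

export const db = drizzle(sqlite, { schema })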

Supabase/Postgres migration path: The same Drizzle schema can target Postgres by swapping sqliteTable for pgTable. Set NUXT_DB_BACKEND=supabase in your env to switch. See Document 11, Section 10.
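
For flavor, a hypothetical excerpt of the users table in Drizzle's SQLite dialect. Column details are illustrative; the authoritative schema lives in Document 11.

// server/db/schema.ts (hypothetical excerpt), SQLite dialect.
// A Postgres build would import pgTable et al. from drizzle-orm/pg-core.
import { sqliteTable, text, integer } from 'drizzle-orm/sqlite-core'

export const users = sqliteTable('users', {
  id: integer('id').primaryKey({ autoIncrement: true }),
  email: text('email').notNull().unique(),
  passwordHash: text('password_hash').notNull(),  // bcrypt, 12 rounds
  role: text('role').notNull().default('user'),   // 'admin' | 'user'
})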


5. Authentication Architecture

First-Run Flow

  1. App boots, checks app_config for setup_complete key
  2. If not found, redirects all routes to /setup
  3. /setup page collects admin email + password
  4. Hashes password with bcrypt (12 rounds), creates user row with role: 'admin'
  5. Sets setup_complete: true in app_config
  6. Issues JWT session cookie and redirects to dashboard

Session Auth (Browser)

  • Login endpoint validates password with bcrypt, issues signed JWT (jose, HS256)
  • JWT stored in HTTP-only, Secure, SameSite=Lax cookie (forgecrawl_session)
  • Server middleware validates JWT on every /api/* request
  • 7-day expiry with sliding renewal
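
A sketch of the signing and verification half of this flow with jose; the file location matches server/auth/jwt.ts in the project structure, but function names are illustrative.

// server/auth/jwt.ts: sketch of HS256 sign/verify with jose
import { SignJWT, jwtVerify } from 'jose'

// NUXT_AUTH_SECRET is validated at startup in practice
const secret = new TextEncoder().encode(process.env.NUXT_AUTH_SECRET ?? '')

export async function issueSession(userId: number): Promise<string> {
  return new SignJWT({ sub: String(userId) })
    .setProtectedHeader({ alg: 'HS256' })
    .setIssuedAt()
    .setExpirationTime('7d')  // 7-day expiry, renewed on activity
    .sign(secret)
}

export async function verifySession(token: string): Promise<string> {
  const { payload } = await jwtVerify(token, secret)  // throws if invalid or expired
  return payload.sub as string
}

The login handler would then attach the token with setCookie(event, 'forgecrawl_session', token, { httpOnly: true, secure: true, sameSite: 'lax' }).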

API Key Auth (Programmatic)

  • Admin or user generates API key in dashboard
  • Key shown once, stored as bcrypt hash
  • Requests include Authorization: Bearer bc_xxxxxxxxxxxx
  • Server middleware checks api_keys table, validates hash
  • Falls back to session cookie JWT if no API key header
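
The generate-once, store-hash pattern might look like the following sketch; the prefix length and return shape are assumptions.

// server/api/auth/api-keys.post.ts (excerpt): generate once, store the hash
import { randomBytes } from 'node:crypto'
import bcrypt from 'bcrypt'

export async function generateApiKey(userId: number) {
  const key = `bc_${randomBytes(24).toString('base64url')}`
  const prefix = key.slice(0, 11)            // plaintext prefix for table lookup
  const keyHash = await bcrypt.hash(key, 12)
  // INSERT { userId, prefix, keyHash } into api_keys,
  // then return key to the client exactly once
  return { key, prefix }
}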

Auth Middleware Priority

  1. Check for API key header -> validate against api_keys table
  2. Check for session cookie -> validate JWT signature and expiry
  3. Reject with 401
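
Mapped onto a Nitro server middleware, that order looks roughly like this sketch. findKeyByPrefix is an assumed helper, and public routes (login, setup, health) would need an allow-list that is omitted here.

// server/middleware/auth.ts: sketch of the priority order above
import bcrypt from 'bcrypt'
import { verifySession } from '../auth/jwt'
import { findKeyByPrefix } from '../db'  // assumed helper, not in the documented tree

export default defineEventHandler(async (event) => {
  if (!event.path.startsWith('/api/')) return

  // 1. API key header
  const header = getHeader(event, 'authorization')
  if (header?.startsWith('Bearer bc_')) {
    const key = header.slice('Bearer '.length)
    const record = await findKeyByPrefix(key.slice(0, 11))
    if (record && (await bcrypt.compare(key, record.keyHash))) {
      event.context.userId = record.userId
      return
    }
  }

  // 2. Session cookie
  const token = getCookie(event, 'forgecrawl_session')
  if (token) {
    try {
      event.context.userId = await verifySession(token)
      return
    } catch { /* fall through to reject */ }
  }

  // 3. Reject
  throw createError({ statusCode: 401, statusMessage: 'Unauthorized' })
})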

6. Scraping Engine Design

Pipeline

URL Input
    |
    +-- Config Check: render_js?
    |   +-- YES -> Puppeteer (launch browser, navigate, wait)
    |   +-- NO  -> HTTP fetch (lighter, faster)
    |
    v
Raw HTML
    |
    +-- Readability: extract article content
    |   (removes nav, ads, sidebars, footers)
    |
    v
Clean HTML
    |
    +-- Turndown: convert to Markdown
    |   (with GFM plugin for tables)
    |
    v
Markdown Output
    |
    +-- Optional: RAG Chunker
    |   (split by headings, paragraphs, token count)
    |
    v
Storage (Database / Filesystem / Both)
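
Condensed into code, the extract-and-convert core might look like the sketch below. One assumption to note: Readability needs a DOM document, so jsdom is used here even though it is not in the stack table; the real engine may obtain a DOM differently. fetchWithPuppeteer is sketched under Puppeteer Management below.

// server/engine/scraper.ts (condensed sketch): fetch -> Readability -> Turndown
import { Readability } from '@mozilla/readability'
import { JSDOM } from 'jsdom'  // assumption: gives Readability a DOM
import TurndownService from 'turndown'
import { gfm } from 'turndown-plugin-gfm'
import { fetchWithPuppeteer } from './browser'

const turndown = new TurndownService({ headingStyle: 'atx' })
turndown.use(gfm)  // tables, strikethrough, task lists

export async function scrapeToMarkdown(url: string, renderJs = true) {
  // Fetch: Puppeteer for JS-heavy pages, plain HTTP otherwise
  const html = renderJs
    ? await fetchWithPuppeteer(url)
    : await (await fetch(url)).text()

  // Extract: Readability drops nav, ads, sidebars, footers
  const article = new Readability(new JSDOM(html, { url }).window.document).parse()
  if (!article?.content) throw new Error(`No extractable content at ${url}`)

  // Convert: clean HTML to Markdown
  return { title: article.title, markdown: turndown.turndown(article.content) }
}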

Puppeteer Management

  • Single browser instance shared across requests (reuse pages)
  • Configurable concurrency limit (default: 3 pages)
  • Auto-restart on crash
  • Page timeout: 30 seconds default
  • Network idle wait: networkidle2 (2 or fewer connections for 500ms)
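
A sketch of that policy follows; the concurrency semaphore is omitted for brevity, and the --no-sandbox flag (common on minimal Linux hosts) carries security tradeoffs worth reviewing.

// server/engine/browser.ts: shared instance, auto-restart, 30s timeout
import puppeteer, { type Browser } from 'puppeteer'

let browser: Browser | null = null

async function getBrowser(): Promise<Browser> {
  if (!browser) {
    browser = await puppeteer.launch({ args: ['--no-sandbox'] })
    browser.on('disconnected', () => { browser = null })  // relaunch on next call
  }
  return browser
}

export async function fetchWithPuppeteer(url: string): Promise<string> {
  const page = await (await getBrowser()).newPage()
  try {
    // networkidle2: proceed once 2 or fewer connections remain for 500ms
    await page.goto(url, { waitUntil: 'networkidle2', timeout: 30_000 })
    return await page.content()
  } finally {
    await page.close()  // close the page; keep the browser warm for reuse
  }
}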

Content Extraction Options

| Option | Description | Default |
| --- | --- | --- |
| render_js | Use Puppeteer for JS rendering | true |
| wait_for | CSS selector to wait for before extraction | null |
| timeout | Page load timeout (ms) | 30000 |
| include_links | Preserve hyperlinks in Markdown | true |
| include_images | Include image references | false |
| selectors.include | CSS selectors to include | null |
| selectors.exclude | CSS selectors to remove before extraction | null |
| chunk.enabled | Split into RAG chunks | false |
| chunk.max_tokens | Max tokens per chunk | 512 |
| chunk.overlap | Token overlap between chunks | 50 |
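
Expressed as a TypeScript type, the table above maps to roughly this shape (illustrative only):

// shared/types.ts (illustrative): the options table as a type
export interface ScrapeOptions {
  render_js: boolean              // default: true
  wait_for: string | null         // CSS selector to await before extraction
  timeout: number                 // page load timeout in ms, default 30000
  include_links: boolean          // default: true
  include_images: boolean         // default: false
  selectors: {
    include: string[] | null      // keep only these subtrees
    exclude: string[] | null      // strip these before extraction
  }
  chunk: {
    enabled: boolean              // default: false
    max_tokens: number            // default: 512
    overlap: number               // default: 50
  }
}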

7. Deployment — DigitalOcean Droplet

Minimum Specifications

| Resource | Minimum | Recommended |
| --- | --- | --- |
| CPU | 2 vCPU | 4 vCPU |
| RAM | 4 GB | 8 GB |
| Storage | 50 GB SSD | 80 GB SSD |
| OS | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS |

Why 4GB minimum: Puppeteer/Chromium needs ~200-400MB per browser instance. Add the Node process, OS overhead, and headroom for concurrent scrapes, and a 2GB droplet will OOM under load. 4GB leaves comfortable room for 2-3 concurrent Puppeteer pages.

PM2 Configuration

// ecosystem.config.cjs
// Public defaults come from forgecrawl.config.ts (baked into the build).
// Only secrets and deployment-specific overrides go here.
module.exports = {
  apps: [{
    name: 'forgecrawl',
    script: '.output/server/index.mjs',
    instances: 1,
    exec_mode: 'fork',
    env: {
      NODE_ENV: 'production',
      PORT: 3000,
      // Secrets loaded from .env — do not hardcode here
    },
    max_memory_restart: '3G',
    log_date_format: 'YYYY-MM-DD HH:mm:ss',
    error_file: '/var/log/forgecrawl/error.log',
    out_file: '/var/log/forgecrawl/output.log',
  }]
};

Nginx Reverse Proxy

server {
    listen 80;
    server_name forgecrawl.yourdomain.com;
    return 301 https://$server_name$request_uri;
}

server {
    listen 443 ssl http2;
    server_name forgecrawl.yourdomain.com;

    ssl_certificate /etc/letsencrypt/live/forgecrawl.yourdomain.com/fullchain.pem;
    ssl_certificate_key /etc/letsencrypt/live/forgecrawl.yourdomain.com/privkey.pem;

    add_header X-Frame-Options "SAMEORIGIN" always;
    add_header X-Content-Type-Options "nosniff" always;
    add_header Referrer-Policy "strict-origin-when-cross-origin" always;

    location / {
        proxy_pass http://127.0.0.1:3000;
        proxy_http_version 1.1;
        proxy_set_header Upgrade $http_upgrade;
        proxy_set_header Connection 'upgrade';
        proxy_set_header Host $host;
        proxy_set_header X-Real-IP $remote_addr;
        proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
        proxy_set_header X-Forwarded-Proto $scheme;
        proxy_cache_bypass $http_upgrade;
    }

    proxy_read_timeout 120s;
    proxy_send_timeout 120s;
    client_max_body_size 10M;
}

8. Phase Plan

| Phase | Deliverable | Key Features |
| --- | --- | --- |
| 1 | Foundation & Auth | Nuxt 4 scaffold, SQLite + Drizzle ORM, built-in bcrypt/JWT auth, first-run admin setup, basic single-URL HTTP scrape returning Markdown, Docker Compose |
| 2 | Puppeteer & Storage | JS rendering engine, Readability extraction, configurable storage (database + filesystem), improved Markdown quality |
| 3 | Job Queue & Crawling | SQLite-backed queue with Redis-swappable interface, multi-page site crawling, depth limits, robots.txt, rate limiting, crawl progress UI |
| 4 | API Keys & Multi-User | API key generation and management, user CRUD for admin, per-user usage tracking, programmatic API access |
| 5 | RAG & Advanced | Token-aware chunking with metadata, login-gated site scraping, export formats (JSON, JSONL), production hardening, monitoring |

Each phase produces a testable, working deliverable. No phase depends on future phases to function.


9. Project Structure

forgecrawl/
+-- forgecrawl.config.ts           # Single source of truth for public config
+-- docker-compose.yml              # Primary deployment
+-- docker-compose.prod.yml         # Production overlay (Nginx + SSL)
+-- .env                            # Secrets only (gitignored)
+-- .env.example                    # Secret key templates (committed)
+-- .dockerignore
+-- .gitignore
+-- pnpm-workspace.yaml
+-- package.json                    # Root workspace config
+-- pnpm-lock.yaml
+-- .npmrc
+-- README.md
+-- LICENSE                         # MIT
|
+-- packages/
|   +-- app/                        # The scraper application (Nuxt 4)
|   |   +-- Dockerfile
|   |   +-- package.json            # @forgecrawl/app
|   |   +-- nuxt.config.ts          # Imports from ../../forgecrawl.config.ts
|   |   +-- ecosystem.config.cjs    # PM2 config (bare-metal only)
|   |   +-- app/                     # Nuxt 4 app directory (srcDir)
|   |   |   +-- app.vue
|   |   |   +-- pages/
|   |   |   |   +-- index.vue              # Dashboard
|   |   |   |   +-- setup.vue              # First-run registration
|   |   |   |   +-- login.vue
|   |   |   |   +-- scrapes/
|   |   |   |   |   +-- index.vue          # Scrape history
|   |   |   |   |   +-- [id].vue           # Scrape detail
|   |   |   |   +-- crawls/
|   |   |   |   |   +-- index.vue
|   |   |   |   |   +-- [id].vue
|   |   |   |   +-- admin/
|   |   |   |       +-- users.vue
|   |   |   |       +-- api-keys.vue
|   |   |   |       +-- settings.vue
|   |   |   |       +-- credentials.vue
|   |   |   +-- components/
|   |   |   |   +-- ScrapeForm.vue
|   |   |   |   +-- MarkdownPreview.vue
|   |   |   |   +-- JobProgress.vue
|   |   |   |   +-- AdminLayout.vue
|   |   |   +-- composables/
|   |   |   |   +-- useScrape.ts
|   |   |   |   +-- useAuth.ts
|   |   |   |   +-- useAdmin.ts
|   |   |   +-- middleware/
|   |   |   |   +-- auth.ts
|   |   |   |   +-- setup.global.ts
|   |   |   +-- assets/
|   |   |       +-- css/
|   |   |           +-- main.css          # App-level CSS overrides
|   |   +-- shared/                  # Nuxt 4: code shared between app/ and server/
|   |   +-- server/
|   |   |   +-- middleware/
|   |   |   |   +-- auth.ts
|   |   |   +-- api/
|   |   |   |   +-- scrape.post.ts
|   |   |   |   +-- scrape/
|   |   |   |   |   +-- batch.post.ts      # Batch scrape (Phase 3)
|   |   |   |   +-- crawl.post.ts
|   |   |   |   +-- jobs/
|   |   |   |   +-- results/
|   |   |   |   +-- admin/
|   |   |   |   |   +-- cleanup.post.ts    # Maintenance (Phase 5)
|   |   |   |   +-- auth/
|   |   |   |   |   +-- setup.post.ts
|   |   |   |   |   +-- login.post.ts
|   |   |   |   |   +-- logout.post.ts
|   |   |   |   |   +-- me.get.ts
|   |   |   |   |   +-- api-keys.post.ts
|   |   |   |   +-- health.get.ts
|   |   |   +-- engine/
|   |   |   |   +-- scraper.ts
|   |   |   |   +-- fetcher.ts
|   |   |   |   +-- extractor.ts
|   |   |   |   +-- converter.ts
|   |   |   |   +-- chunker.ts
|   |   |   |   +-- browser.ts
|   |   |   |   +-- cache.ts
|   |   |   |   +-- pdf-extractor.ts    # PDF support (Phase 2)
|   |   |   |   +-- docx-extractor.ts   # DOCX support (Phase 2)
|   |   |   |   +-- sitemap.ts          # Sitemap discovery (Phase 3)
|   |   |   +-- queue/
|   |   |   |   +-- interface.ts
|   |   |   |   +-- sqlite.ts
|   |   |   +-- db/
|   |   |   |   +-- index.ts
|   |   |   |   +-- schema.ts
|   |   |   |   +-- backend.ts
|   |   |   |   +-- migrations/
|   |   |   +-- auth/
|   |   |   |   +-- password.ts
|   |   |   |   +-- jwt.ts
|   |   |   +-- storage/
|   |   |   |   +-- interface.ts
|   |   |   |   +-- database.ts
|   |   |   |   +-- filesystem.ts
|   |   |   +-- utils/
|   |   |       +-- robots.ts
|   |   |       +-- rate-limiter.ts
|   |   |       +-- url.ts
|   |   +-- data/                   # Filesystem storage (gitignored)
|   |
|   +-- web/                        # Marketing and documentation site
|       +-- package.json            # @forgecrawl/web
|       +-- nuxt.config.ts
|       +-- app/
|       +-- public/
|
+-- nginx/                          # Production Nginx config
    +-- nginx.conf

10. Configuration

Public Configuration — forgecrawl.config.ts

All non-secret project variables live in a single file at the monorepo root: forgecrawl.config.ts. This is the single source of truth for defaults (ports, timeouts, concurrency, storage mode, app metadata, etc.). The Nuxt app imports this file in nuxt.config.ts via the toRuntimeConfig() helper.

Do not scatter public configuration defaults across .env, nuxt.config.ts, or individual source files. If a value is not a secret, it belongs in forgecrawl.config.ts.
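
An illustrative shape for the file and its toRuntimeConfig() helper; the key names here are examples, not the real config surface:

// forgecrawl.config.ts (illustrative shape only)
export const config = {
  app: { name: 'ForgeCrawl', version: '1.0.0' },
  server: { port: 3000 },
  scraper: { renderJs: true, timeoutMs: 30_000, concurrency: 3 },
  storage: { mode: 'both' as 'database' | 'filesystem' | 'both' },
}

// Maps public defaults into Nuxt's runtimeConfig; secrets arrive separately
// via NUXT_* environment variables at startup.
export function toRuntimeConfig() {
  return {
    scraper: config.scraper,               // server-only
    storage: config.storage,               // server-only
    public: { appName: config.app.name },  // exposed to the client
  }
}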

Secret Variables — .env

Only secrets and environment-specific overrides belong in .env (gitignored). The .env.example file (committed) provides key templates without values.

# .env — secrets only
NUXT_AUTH_SECRET=                  # Min 32 chars, signs JWTs (auto-generated if empty)
NUXT_ENCRYPTION_KEY=               # AES-256-GCM key for site credentials (Phase 5)
NUXT_ALERT_WEBHOOK=                # Discord/Slack webhook URL (optional)

# Optional: Supabase backend (replaces SQLite)
# NUXT_SUPABASE_URL=
# NUXT_SUPABASE_KEY=
# NUXT_SUPABASE_SERVICE_KEY=

How It Works

forgecrawl.config.ts (public defaults)
        │
        ├──→ packages/app/nuxt.config.ts (imports toRuntimeConfig())
        │         └──→ runtimeConfig available in server via useRuntimeConfig()
        │
        └──→ packages/web/nuxt.config.ts (imports config.app for metadata)

.env (secrets only)
        └──→ Nuxt auto-maps NUXT_* vars to runtimeConfig at startup

11. Design Decisions & Rationale

| Decision | Rationale |
| --- | --- |
| forgecrawl.config.ts as config source of truth | Single file for all public defaults prevents scattered configuration across .env, nuxt.config.ts, and source files. Secrets stay in .env only. |
| Nuxt 4 directory structure (app/, server/, shared/) | Clear separation of client and server code. app/ for pages/components/composables, server/ at package root, shared/ for cross-boundary code. |
| SQLite over Supabase as default | Zero-config, zero-cost, no external dependency. Single-file backup. Supabase available as an optional upgrade via Drizzle ORM backend swap. |
| Puppeteer over Playwright | Smaller footprint, Chromium-only. Puppeteer's networkidle2 is well-suited for scraping. |
| SQLite queue over Redis | Fewer moving parts. Queue interface is abstract, so a Redis swap is config-only. SQLite WAL mode handles concurrent reads during writes. |
| pnpm workspace monorepo | Separates scraper app (packages/app) from marketing site (packages/web). Shared tooling, independent deploys. |
| PM2 fork mode (not cluster) | Puppeteer shares browser state. Cluster mode would spawn multiple browser instances and OOM. Scale via queue concurrency instead. |
| Turndown over remark/rehype | Purpose-built for HTML-to-Markdown. Plugin system handles edge cases. Lighter than a remark pipeline. |
| Configurable storage | Database-only hits limits with large HTML blobs. Filesystem is cheaper for raw storage. Both gives metadata queries plus cheap blob storage. |

12. Success Criteria

Each phase is considered complete when:

  1. All specified features are implemented and functional
  2. The application can be deployed to a fresh DO droplet via documented steps
  3. Authentication works (first-run setup or login required for all routes)
  4. The scraping pipeline produces clean, usable Markdown from test URLs
  5. No critical security vulnerabilities in the deployment

References

  • Firecrawl (github.com/mendableai/firecrawl) — Inspiration
  • Mozilla Readability (github.com/mozilla/readability)
  • Turndown (github.com/mixmark-io/turndown)
  • Puppeteer (pptr.dev)
  • Drizzle ORM (orm.drizzle.team)
  • better-sqlite3 (github.com/WiseLibs/better-sqlite3)
  • Nuxt 4 (nuxt.com/docs)
  • Nuxt UI (ui.nuxt.com)