Version: 1.0
Date: March 3, 2026
Author: cschweda
Status: Draft
ForgeCrawl is a self-hosted, authenticated web scraper that converts website content into clean Markdown optimized for LLM consumption. Inspired by Firecrawl, ForgeCrawl fills a gap in the open-source ecosystem: a self-hosted solution with proper authentication, a management UI, and LLM-ready output — without SaaS dependencies or public endpoints.
- Self-hosted & private: Runs on your own infrastructure with no public API endpoints
- Authenticated: Built-in bcrypt + JWT auth with first-run admin registration, multi-user support, and API key access
- LLM-optimized: Outputs clean Markdown with optional RAG-ready chunking and metadata
- Full rendering: Puppeteer-based JavaScript rendering for SPAs and dynamic content
- Login-gated scraping: Can authenticate against target sites to scrape protected content
- A public API service or SaaS platform
- A search engine crawler or indexer
- A general-purpose web archiver
| Layer | Technology | Version |
|---|---|---|
| Framework | Nuxt 4 | >=4.3.1 |
| UI | Nuxt UI | 4+ |
| Package Manager | pnpm | latest |
| Runtime | Node.js | latest LTS |
| Auth | Built-in (bcrypt + jose JWT) | — |
| Database | SQLite via better-sqlite3 | latest |
| ORM | Drizzle ORM | latest |
| JS Rendering | Puppeteer | latest |
| Content Extraction | Mozilla Readability | latest |
| HTML to Markdown | Turndown | latest |
| Process Manager | PM2 | latest |
| Hosting | DigitalOcean Droplet | see Deployment section |
- puppeteer — headless Chromium for JS-rendered pages
- @mozilla/readability — article content extraction (same engine as Firefox Reader View)
- turndown — HTML-to-Markdown conversion with plugin support
- turndown-plugin-gfm — GitHub Flavored Markdown tables, strikethrough, task lists
- better-sqlite3 — embedded SQLite database (zero-config, WAL mode)
- drizzle-orm — type-safe ORM for SQLite (and Postgres, if upgrading later)
- bcrypt — password hashing (12 salt rounds)
- jose — JWT signing and verification (HS256)
- cheerio — fast HTML parsing for non-JS pages
- robots-parser — robots.txt compliance
+-----------------------------------------------------+
| Nuxt 4 Application |
| |
| +--------------+ +----------------------------+ |
| | Nuxt UI | | Nitro Server | |
| | Admin Panel | | | |
| | | | +----------------------+ | |
| | - Dashboard | | | Auth Middleware | | |
| | - Scrapes | | | (bcrypt/JWT + | | |
| | - Users | | | API Key validation)| | |
| | - Settings | | +----------+-----------+ | |
| | - API Keys | | | | |
| +--------------+ | +----------v-----------+ | |
| | | API Routes | | |
| | | /api/scrape | | |
| | | /api/crawl | | |
| | | /api/jobs | | |
| | | /api/admin/* | | |
| | +----------+-----------+ | |
| | | | |
| | +----------v-----------+ | |
| | | Scraping Engine | | |
| | | | | |
| | | HTTP Fetch ---+ | | |
| | | Puppeteer ----+ | | |
| | | v | | |
| | | Readability -> Clean | | |
| | | Turndown --> MD | | |
| | | Chunker --> RAG | | |
| | +-----------------------+ | |
| +----------------------------+ |
+-------------------------+----------------------------+
|
+-------------v--------------+
| SQLite (local file) |
| |
| - Auth (users, sessions) |
| - Jobs, scrape history |
| - WAL mode, Drizzle ORM |
+----------------------------+
+----------------------------+
| Local Filesystem (opt.) |
| |
| /data/scrapes/ |
| +-- {id}/ |
| | +-- raw.html |
| | +-- content.md |
| | +-- chunks.json |
+----------------------------+
- Client sends request (browser session or API key in header)
- Nitro middleware validates authentication
- Route handler dispatches to scraping engine
- Engine fetches page (HTTP or Puppeteer based on config)
- Readability extracts article content
- Turndown converts to Markdown
- Optional: Chunker splits into RAG-ready segments
- Results stored per configuration (database, filesystem, or both)
- Response returned to client
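The request flow above can be sketched as a composed pipeline. The function and field names below are illustrative stand-ins, not ForgeCrawl's actual module API; the real steps call Puppeteer, Readability, and Turndown.

```typescript
// Illustrative sketch of the request flow above. Step names and the
// context shape are hypothetical; the real stages call Puppeteer,
// Readability, and Turndown.
interface ScrapeContext {
  url: string;
  html?: string;      // raw HTML after fetch
  markdown?: string;  // output of the conversion step
}

type Step = (ctx: ScrapeContext) => ScrapeContext;

// Each stage reads the context, adds its output, and passes it on.
const fetchPage: Step = (ctx) => ({ ...ctx, html: `<article>${ctx.url}</article>` });
const extract: Step = (ctx) => ctx; // Readability stand-in: keep main content
const convert: Step = (ctx) => ({ ...ctx, markdown: ctx.html?.replace(/<[^>]+>/g, "") });

function runPipeline(url: string, steps: Step[]): ScrapeContext {
  return steps.reduce((ctx, step) => step(ctx), { url } as ScrapeContext);
}

const result = runPipeline("https://example.com", [fetchPage, extract, convert]);
```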
The database schema is defined in TypeScript using Drizzle ORM (server/db/schema.ts). See Document 11 for the complete schema definition. Key tables:
- app_config — first-run detection and application settings
- users — built-in auth (email, bcrypt password hash, role)
- api_keys — bcrypt-hashed API keys with prefix for lookup
- scrape_jobs — job tracking (single, crawl, batch)
- scrape_results — scraped content (Markdown, HTML, metadata)
- scrape_chunks — RAG-ready content chunks with token counts
- job_queue — SQLite-backed job queue (pending, locked, completed)
- site_credentials — encrypted credentials for login-gated scraping
- usage_log — per-user action tracking
The database auto-creates on first run. SQLite runs in WAL mode with busy_timeout = 5000 for concurrent read/write safety. Drizzle Kit handles migrations automatically on startup.
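The pragmas described above might be applied at bootstrap roughly like this; the database path is a placeholder, not ForgeCrawl's actual layout.

```typescript
// Sketch: applying WAL mode and busy_timeout with better-sqlite3.
// The database path is a placeholder.
import Database from "better-sqlite3";

const db = new Database("data/forgecrawl.db");

// WAL lets readers proceed while a single writer appends to the log;
// busy_timeout makes a blocked writer retry for up to 5s instead of
// failing immediately with SQLITE_BUSY.
db.pragma("journal_mode = WAL");
db.pragma("busy_timeout = 5000");
```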
Supabase/Postgres migration path: The same Drizzle schema can target Postgres by swapping sqliteTable for pgTable. Set NUXT_DB_BACKEND=supabase in your env to switch. See Document 11, Section 10.
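A minimal sketch of what the dialect swap could look like with Drizzle: the same logical table defined against `sqlite-core` and `pg-core`. Column names here are illustrative, not the actual ForgeCrawl schema.

```typescript
// Sketch of the dialect swap: the same logical table defined once per
// Drizzle dialect. Column names are illustrative, not the real schema.
import { sqliteTable, text as sqliteText, integer as sqliteInt } from "drizzle-orm/sqlite-core";
import { pgTable, text as pgText, integer as pgInt } from "drizzle-orm/pg-core";

// SQLite flavor (default backend)
export const usersSqlite = sqliteTable("users", {
  id: sqliteInt("id").primaryKey(),
  email: sqliteText("email").notNull(),
});

// Postgres flavor (selected when NUXT_DB_BACKEND=supabase)
export const usersPg = pgTable("users", {
  id: pgInt("id").primaryKey(),
  email: pgText("email").notNull(),
});
```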
- App boots, checks `app_config` for the `setup_complete` key
- If not found, redirects all routes to `/setup`
- The `/setup` page collects admin email + password
- Hashes the password with bcrypt (12 rounds), creates a user row with `role: 'admin'`
- Sets `setup_complete: true` in `app_config`
- Issues a JWT session cookie and redirects to the dashboard
- Login endpoint validates password with bcrypt, issues signed JWT (jose, HS256)
- JWT stored in an HTTP-only, Secure, SameSite=Lax cookie (`forgecrawl_session`)
- Server middleware validates the JWT on every `/api/*` request
- 7-day expiry with sliding renewal
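One way the sliding renewal could work, sketched with an assumed renew-at-half-life threshold (the actual threshold is not specified here):

```typescript
// Sketch of 7-day sliding renewal: re-issue the cookie when less than
// half the session lifetime remains. The 50% threshold is an
// assumption, not ForgeCrawl's documented value.
const SESSION_TTL_MS = 7 * 24 * 60 * 60 * 1000;

function shouldRenew(expiresAtMs: number, nowMs: number): boolean {
  const remaining = expiresAtMs - nowMs;
  return remaining > 0 && remaining < SESSION_TTL_MS / 2;
}

const now = Date.now();
// 2 days left out of 7 -> renew; 6 days left -> keep the current token.
const renewSoon = shouldRenew(now + 2 * 24 * 60 * 60 * 1000, now);
const keepToken = shouldRenew(now + 6 * 24 * 60 * 60 * 1000, now);
```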
- Admin or user generates API key in dashboard
- Key shown once, stored as bcrypt hash
- Requests include `Authorization: Bearer bc_xxxxxxxxxxxx`
- Server middleware checks the `api_keys` table and validates the hash
- Falls back to the session cookie JWT if no API key header is present
1. Check for API key header -> validate against api_keys table
2. Check for session cookie -> validate JWT signature and expiry
3. Reject with 401
URL Input
|
+-- Config Check: render_js?
| +-- YES -> Puppeteer (launch browser, navigate, wait)
| +-- NO -> HTTP fetch (lighter, faster)
|
v
Raw HTML
|
+-- Readability: extract article content
| (removes nav, ads, sidebars, footers)
|
v
Clean HTML
|
+-- Turndown: convert to Markdown
| (with GFM plugin for tables)
|
v
Markdown Output
|
+-- Optional: RAG Chunker
| (split by headings, paragraphs, token count)
|
v
Storage (Database / Filesystem / Both)
- Single browser instance shared across requests (reuse pages)
- Configurable concurrency limit (default: 3 pages)
- Auto-restart on crash
- Page timeout: 30 seconds default
- Network idle wait: `networkidle2` (2 or fewer connections for 500 ms)
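The concurrency limit could be enforced with a small promise semaphore wrapped around Puppeteer page work; this is a sketch, not the engine's actual implementation:

```typescript
// Sketch: a minimal promise semaphore that would cap concurrent
// Puppeteer pages (default limit: 3). Not the actual engine code.
class Semaphore {
  private queue: (() => void)[] = [];
  private active = 0;
  constructor(private readonly limit: number) {}

  async run<T>(task: () => Promise<T>): Promise<T> {
    if (this.active >= this.limit) {
      // Park until a running task releases a slot.
      await new Promise<void>((resolve) => this.queue.push(resolve));
    }
    this.active++;
    try {
      return await task();
    } finally {
      this.active--;
      this.queue.shift()?.(); // wake the next waiter, if any
    }
  }
}

const pagePool = new Semaphore(3); // matches the default of 3 pages
```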
| Option | Description | Default |
|---|---|---|
| `render_js` | Use Puppeteer for JS rendering | `true` |
| `wait_for` | CSS selector to wait for before extraction | `null` |
| `timeout` | Page load timeout (ms) | `30000` |
| `include_links` | Preserve hyperlinks in Markdown | `true` |
| `include_images` | Include image references | `false` |
| `selectors.include` | CSS selectors to include | `null` |
| `selectors.exclude` | CSS selectors to remove before extraction | `null` |
| `chunk.enabled` | Split into RAG chunks | `false` |
| `chunk.max_tokens` | Max tokens per chunk | `512` |
| `chunk.overlap` | Token overlap between chunks | `50` |
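A minimal sketch of the chunking options in action, approximating tokens with whitespace-separated words (a real implementation would use a proper tokenizer):

```typescript
// Sketch of chunk.max_tokens / chunk.overlap behavior, approximating
// tokens as whitespace-separated words. A real chunker would count
// tokens with a tokenizer and also split on headings.
interface ChunkOptions {
  maxTokens: number; // chunk.max_tokens
  overlap: number;   // chunk.overlap
}

function chunkText(text: string, { maxTokens, overlap }: ChunkOptions): string[] {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks: string[] = [];
  // Window advance; guard against overlap >= maxTokens looping forever.
  const step = Math.max(1, maxTokens - overlap);
  for (let start = 0; start < words.length; start += step) {
    chunks.push(words.slice(start, start + maxTokens).join(" "));
    if (start + maxTokens >= words.length) break; // last window hit the end
  }
  return chunks;
}
```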
| Resource | Minimum | Recommended |
|---|---|---|
| CPU | 2 vCPU | 4 vCPU |
| RAM | 4 GB | 8 GB |
| Storage | 50 GB SSD | 80 GB SSD |
| OS | Ubuntu 24.04 LTS | Ubuntu 24.04 LTS |
Why 4 GB minimum: Puppeteer/Chromium needs roughly 200-400 MB per browser instance. Add the Node process, OS overhead, and headroom for concurrent scrapes, and a 2 GB droplet will hit the OOM killer. 4 GB leaves comfortable room for 2-3 concurrent Puppeteer pages.
// ecosystem.config.cjs
// Public defaults come from forgecrawl.config.ts (baked into the build).
// Only secrets and deployment-specific overrides go here.
module.exports = {
apps: [{
name: 'forgecrawl',
script: '.output/server/index.mjs',
instances: 1,
exec_mode: 'fork',
env: {
NODE_ENV: 'production',
PORT: 3000,
// Secrets loaded from .env — do not hardcode here
},
max_memory_restart: '3G',
log_date_format: 'YYYY-MM-DD HH:mm:ss',
error_file: '/var/log/forgecrawl/error.log',
out_file: '/var/log/forgecrawl/output.log',
}]
};

# nginx.conf
server {
listen 80;
server_name forgecrawl.yourdomain.com;
return 301 https://$server_name$request_uri;
}
server {
listen 443 ssl http2;
server_name forgecrawl.yourdomain.com;
ssl_certificate /etc/letsencrypt/live/forgecrawl.yourdomain.com/fullchain.pem;
ssl_certificate_key /etc/letsencrypt/live/forgecrawl.yourdomain.com/privkey.pem;
add_header X-Frame-Options "SAMEORIGIN" always;
add_header X-Content-Type-Options "nosniff" always;
add_header Referrer-Policy "strict-origin-when-cross-origin" always;
location / {
proxy_pass http://127.0.0.1:3000;
proxy_http_version 1.1;
proxy_set_header Upgrade $http_upgrade;
proxy_set_header Connection 'upgrade';
proxy_set_header Host $host;
proxy_set_header X-Real-IP $remote_addr;
proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
proxy_set_header X-Forwarded-Proto $scheme;
proxy_cache_bypass $http_upgrade;
}
proxy_read_timeout 120s;
proxy_send_timeout 120s;
client_max_body_size 10M;
}

| Phase | Deliverable | Key Features |
|---|---|---|
| 1 | Foundation & Auth | Nuxt 4 scaffold, SQLite + Drizzle ORM, built-in bcrypt/JWT auth, first-run admin setup, basic single-URL HTTP scrape returning Markdown, Docker Compose |
| 2 | Puppeteer & Storage | JS rendering engine, Readability extraction, configurable storage (database + filesystem), improved Markdown quality |
| 3 | Job Queue & Crawling | SQLite-backed queue with Redis-swappable interface, multi-page site crawling, depth limits, robots.txt, rate limiting, crawl progress UI |
| 4 | API Keys & Multi-User | API key generation and management, user CRUD for admin, per-user usage tracking, programmatic API access |
| 5 | RAG & Advanced | Token-aware chunking with metadata, login-gated site scraping, export formats (JSON, JSONL), production hardening, monitoring |
Each phase produces a testable, working deliverable. No phase depends on future phases to function.
forgecrawl/
+-- forgecrawl.config.ts # Single source of truth for public config
+-- docker-compose.yml # Primary deployment
+-- docker-compose.prod.yml # Production overlay (Nginx + SSL)
+-- .env # Secrets only (gitignored)
+-- .env.example # Secret key templates (committed)
+-- .dockerignore
+-- .gitignore
+-- pnpm-workspace.yaml
+-- package.json # Root workspace config
+-- pnpm-lock.yaml
+-- .npmrc
+-- README.md
+-- LICENSE # MIT
+--
+-- packages/
| +-- app/ # The scraper application (Nuxt 4)
| | +-- Dockerfile
| | +-- package.json # @forgecrawl/app
| | +-- nuxt.config.ts # Imports from ../../forgecrawl.config.ts
| | +-- ecosystem.config.cjs # PM2 config (bare-metal only)
| | +-- app/ # Nuxt 4 app directory (srcDir)
| | | +-- app.vue
| | | +-- pages/
| | | | +-- index.vue # Dashboard
| | | | +-- setup.vue # First-run registration
| | | | +-- login.vue
| | | | +-- scrapes/
| | | | | +-- index.vue # Scrape history
| | | | | +-- [id].vue # Scrape detail
| | | | +-- crawls/
| | | | | +-- index.vue
| | | | | +-- [id].vue
| | | | +-- admin/
| | | | +-- users.vue
| | | | +-- api-keys.vue
| | | | +-- settings.vue
| | | | +-- credentials.vue
| | | +-- components/
| | | | +-- ScrapeForm.vue
| | | | +-- MarkdownPreview.vue
| | | | +-- JobProgress.vue
| | | | +-- AdminLayout.vue
| | | +-- composables/
| | | | +-- useScrape.ts
| | | | +-- useAuth.ts
| | | | +-- useAdmin.ts
| | | +-- middleware/
| | | | +-- auth.ts
| | | | +-- setup.global.ts
| | | +-- assets/
| | | +-- css/
| | | +-- main.css # App-level CSS overrides
| | +-- shared/ # Nuxt 4: code shared between app/ and server/
| | +-- server/
| | | +-- middleware/
| | | | +-- auth.ts
| | | +-- api/
| | | | +-- scrape.post.ts
| | | | +-- scrape/
| | | | | +-- batch.post.ts # Batch scrape (Phase 3)
| | | | +-- crawl.post.ts
| | | | +-- jobs/
| | | | +-- results/
| | | | +-- admin/
| | | | | +-- cleanup.post.ts # Maintenance (Phase 5)
| | | | +-- auth/
| | | | | +-- setup.post.ts
| | | | | +-- login.post.ts
| | | | | +-- logout.post.ts
| | | | | +-- me.get.ts
| | | | | +-- api-keys.post.ts
| | | | +-- health.get.ts
| | | +-- engine/
| | | | +-- scraper.ts
| | | | +-- fetcher.ts
| | | | +-- extractor.ts
| | | | +-- converter.ts
| | | | +-- chunker.ts
| | | | +-- browser.ts
| | | | +-- cache.ts
| | | | +-- pdf-extractor.ts # PDF support (Phase 2)
| | | | +-- docx-extractor.ts # DOCX support (Phase 2)
| | | | +-- sitemap.ts # Sitemap discovery (Phase 3)
| | | +-- queue/
| | | | +-- interface.ts
| | | | +-- sqlite.ts
| | | +-- db/
| | | | +-- index.ts
| | | | +-- schema.ts
| | | | +-- backend.ts
| | | | +-- migrations/
| | | +-- auth/
| | | | +-- password.ts
| | | | +-- jwt.ts
| | | +-- storage/
| | | | +-- interface.ts
| | | | +-- database.ts
| | | | +-- filesystem.ts
| | | +-- utils/
| | | +-- robots.ts
| | | +-- rate-limiter.ts
| | | +-- url.ts
| | +-- data/ # Filesystem storage (gitignored)
| |
| +-- web/ # Marketing and documentation site
| +-- package.json # @forgecrawl/web
| +-- nuxt.config.ts
| +-- app/
| +-- public/
+--
+-- nginx/ # Production Nginx config
+-- nginx.conf
All non-secret project variables live in a single file at the monorepo root: forgecrawl.config.ts. This is the single source of truth for defaults (ports, timeouts, concurrency, storage mode, app metadata, etc.). The Nuxt app imports this file in nuxt.config.ts via the toRuntimeConfig() helper.
Do not scatter public configuration defaults across .env, nuxt.config.ts, or individual source files. If a value is not a secret, it belongs in forgecrawl.config.ts.
Only secrets and environment-specific overrides belong in .env (gitignored). The .env.example file (committed) provides key templates without values.
# .env — secrets only
NUXT_AUTH_SECRET= # Min 32 chars, signs JWTs (auto-generated if empty)
NUXT_ENCRYPTION_KEY= # AES-256-GCM key for site credentials (Phase 5)
NUXT_ALERT_WEBHOOK= # Discord/Slack webhook URL (optional)
# Optional: Supabase backend (replaces SQLite)
# NUXT_SUPABASE_URL=
# NUXT_SUPABASE_KEY=
# NUXT_SUPABASE_SERVICE_KEY=

forgecrawl.config.ts (public defaults)
│
├──→ packages/app/nuxt.config.ts (imports toRuntimeConfig())
│ └──→ runtimeConfig available in server via useRuntimeConfig()
│
└──→ packages/web/nuxt.config.ts (imports config.app for metadata)
.env (secrets only)
└──→ Nuxt auto-maps NUXT_* vars to runtimeConfig at startup
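A hypothetical sketch of the `toRuntimeConfig()` helper; the exact config shape and key names are assumptions, not the real implementation:

```typescript
// Hypothetical sketch of toRuntimeConfig(): public defaults from
// forgecrawl.config.ts become runtimeConfig keys, and secrets are
// declared empty so Nuxt fills them from NUXT_* env vars at startup.
// The shape and key names are assumptions.
interface ForgeCrawlConfig {
  port: number;
  scrapeTimeoutMs: number;
  concurrency: number;
}

const defaults: ForgeCrawlConfig = { port: 3000, scrapeTimeoutMs: 30000, concurrency: 3 };

function toRuntimeConfig(config: ForgeCrawlConfig) {
  return {
    authSecret: "",    // <- overridden by NUXT_AUTH_SECRET
    encryptionKey: "", // <- overridden by NUXT_ENCRYPTION_KEY
    scraper: { timeoutMs: config.scrapeTimeoutMs, concurrency: config.concurrency },
    public: { port: config.port }, // exposed to the client bundle
  };
}

const runtimeConfig = toRuntimeConfig(defaults);
```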
| Decision | Rationale |
|---|---|
| `forgecrawl.config.ts` as config source of truth | Single file for all public defaults prevents scattered configuration across .env, nuxt.config.ts, and source files. Secrets stay in .env only. |
| Nuxt 4 directory structure (`app/`, `server/`, `shared/`) | Clear separation of client and server code. `app/` for pages/components/composables, `server/` at package root, `shared/` for cross-boundary code. |
| SQLite over Supabase as default | Zero-config, zero-cost, no external dependency. Single file backup. Supabase available as optional upgrade via Drizzle ORM backend swap. |
| Puppeteer over Playwright | Smaller footprint, Chromium-only. Puppeteer's networkidle2 is well-suited for scraping. |
| SQLite queue over Redis | Fewer moving parts. Queue interface is abstract so Redis swap is config-only. SQLite WAL mode handles concurrent reads during writes. |
| pnpm workspace monorepo | Separates scraper app (packages/app) from marketing site (packages/web). Shared tooling, independent deploys. |
| PM2 fork mode (not cluster) | Puppeteer shares browser state. Cluster mode would spawn multiple browser instances and OOM. Scale via queue concurrency instead. |
| Turndown over remark/rehype | Purpose-built for HTML to Markdown. Plugin system handles edge cases. Lighter than remark pipeline. |
| Configurable storage | Database-only hits limits with large HTML blobs. Filesystem is cheaper for raw storage. Both gives metadata queries + cheap blob storage. |
Each phase is considered complete when:
- All specified features are implemented and functional
- The application can be deployed to a fresh DO droplet via documented steps
- Authentication works (first-run setup or login required for all routes)
- The scraping pipeline produces clean, usable Markdown from test URLs
- No critical security vulnerabilities in the deployment
- Firecrawl (github.com/mendableai/firecrawl) — Inspiration
- Mozilla Readability (github.com/mozilla/readability)
- Turndown (github.com/mixmark-io/turndown)
- Puppeteer (pptr.dev)
- Drizzle ORM (orm.drizzle.team)
- better-sqlite3 (github.com/WiseLibs/better-sqlite3)
- Nuxt 4 (nuxt.com/docs)
- Nuxt UI (ui.nuxt.com)