ClawBench is a benchmarking framework for evaluating web agents in real browser environments. It records user/agent interactions, HTTP requests, action screenshots, and full MP4 video recordings of each session.
Each test case runs in an isolated container (Docker or Podman) with a Chrome browser, a custom recording extension, and an AI agent. The framework captures everything the agent does and uses a request interceptor to detect task completion.
- Dependencies
- Quick Start
- Architecture
- Data Output
- Building the Container
- API Endpoints
- OpenClaw Agent Integration
- Synthetic User Profile
- Tool Restrictions
- Request Interceptor
- Test Driver
- Human Mode
- License
- Acknowledgments
- Python 3.11+
- uv (Python package manager — installs all Python deps automatically)
- Docker or Podman (container runtime)
Python dependencies (fpdf2, huggingface_hub, pyyaml) are managed by uv and installed automatically on first run.
# 1. Set up PurelyMail credentials:
cp .env.example .env
# Edit .env and fill in PURELY_MAIL_API_KEY and PURELY_MAIL_DOMAIN
# 2. Launch the interactive menu:
./run.shThe menu will guide you through configuring models, selecting test cases, and running benchmarks.
If you choose the run_single, the program will start the the test-driver/run.py to run on the docker sandbox:
uv run --project test-driver test-driver/run.py test-cases/[test-case] [model]
This will trigger the run.py main() function.
- The sandbox will be built.
- This function will fetch the email and then save the personal login information to the sandbox.
- start run
docker_runin the run.py.
┌─────────────────────────────────────────────────┐
│ Container (Docker / Podman) │
│ │
│ ┌───────────┐ DOM events ┌──────────────┐ │
│ │ content.js├──────────────►│ background.js│ │
│ │ (per tab) │ │ (service │ │
│ └───────────┘ │ worker) │ │
│ └──┬──────┬────┘ │
│ │ │ │
│ actions │ │ screenshots
│ │ │ │
│ ┌──────────┐ ┌──────▼──────▼────┐ │
│ │ Xvfb │◄──ffmpeg──►│ FastAPI Server │ │
│ │ :99 │ x11grab │ :7878 │ │
│ └──────────┘ └──────────────────┘ │
│ │ │
│ ┌──────────┐ ┌───────▼─────────┐ │
│ │ Chromium │ │ /data │ │
│ │ :9222 CDP│ │ actions.jsonl │ │
│ └──────────┘ │ requests.jsonl │ │
│ │ screenshots/ │ │
│ │ recording.mp4 │ │
│ └─────────────────┘ │
└─────────────────────────────────────────────────┘
All data is stored under /data in the container:
/data/
actions.jsonl # One JSON object per line, every DOM event
requests.jsonl # One JSON object per line, every HTTP request
agent-messages.jsonl # OpenClaw conversation transcript (thinking, text, tool calls)
screenshots/ # Timestamped PNGs, one per action
1710000001234.png
1710000002345.png
recording.mp4 # Full session video (H.264, 15fps)
interception.json # Details of any blocked requests by the interceptor (if applicable)
Each line is a JSON object:
{"type": "click", "timestamp": 1710000001234, "url": "https://example.com/", "target": {"tagName": "A", "id": "", "className": "btn", "textContent": "Submit", "xpath": "/html[1]/body[1]/div[1]/a[1]"}, "x": 255, "y": 245}
{"type": "keydown", "timestamp": 1710000002345, "url": "https://example.com/", "target": {...}, "key": "Enter"}
{"type": "input", "timestamp": 1710000003456, "url": "https://example.com/", "target": {...}, "value": "search query"}
{"type": "pageLoad", "timestamp": 1710000004567, "url": "https://example.com/results", "title": "Results"}Captured event types: pageLoad, click, keydown, keyup, input, scroll, change, submit.
agent-messages.jsonl contains the full OpenClaw conversation transcript. Each line is a JSON object:
type: "session"— session metadata (version, id, timestamp)type: "message"— conversation turn, withmessage.roleandmessage.content[]:
message.role |
Content types | Description |
|---|---|---|
user |
text |
The instruction prompt |
assistant |
text, thinking, toolCall |
Model response, reasoning, actions |
toolResult |
text |
Tool execution results |
requests.jsonl logs every HTTP request made by the browser during the session (excluding internal extension/server traffic). Each line:
{"timestamp": 1710000001.234, "url": "https://example.com/api?q=test", "method": "POST", "headers": {"Content-Type": "application/json"}, "body": {"action": "send"}, "query_params": {"q": "test"}, "resource_type": "XHR"}| Field | Description |
|---|---|
timestamp |
Unix epoch (float) |
url |
Full request URL |
method |
HTTP method (GET, POST, etc.) |
headers |
Request headers (object) |
body |
Parsed request body (JSON object, form dict, raw string, or null) |
query_params |
Parsed URL query parameters (object) |
resource_type |
Resource type: Document, Script, Stylesheet, XHR, Fetch, Image, Font, etc. |
Requests to localhost:7878 (extension server) and chrome-extension:// URLs are filtered out.
The framework supports both Docker and Podman (which works without root privileges). Set the CONTAINER_ENGINE environment variable to force one:
export CONTAINER_ENGINE=podman # or dockerIf unset, it auto-detects the one available on the system.
# Using docker:
docker build -t clawbench .
# Using podman (rootless, no sudo required):
podman build -t clawbench .| Port | Service | Purpose |
|---|---|---|
| 7878 | FastAPI server | Action/screenshot API, session control |
| 9223 | CDP (via socat) | Playwright/DevTools connection to Chromium |
| Method | Path | Description |
|---|---|---|
| GET | /api/status |
Health check |
| POST | /api/action |
Record a browser action (JSON body) |
| POST | /api/screenshot |
Store a screenshot (base64 PNG in JSON) |
| POST | /api/stop |
Signal session stop, finalize bookkeeping |
| POST | /api/stop-recording |
Stop ffmpeg recording, finalize MP4 |
The container uses OpenClaw as the agent driver to perform actions on the in-container Chromium browser via CDP. All agent actions are transparently recorded by the existing extension and server infrastructure.
| Variable | Example | Description |
|---|---|---|
MODEL_NAME |
claude-sonnet-4-6, gemini-3-flash-preview |
Model identifier |
BASE_URL |
https://api.openai.com/v1 |
API base URL |
API_TYPE |
openai-completions |
API type (openai-completions, anthropic-messages, etc.) |
API_KEY |
sk-ant-..., AIza... |
API key |
INSTRUCTION |
"Go to example.com and…" |
Task prompt for the agent |
TIME_LIMIT_S |
300 |
Watchdog timeout in seconds (default: 600) |
THINKING_LEVEL |
high, low, off |
Reasoning depth (default: medium) |
TEMPERATURE |
0.5 |
Sampling temperature (optional) |
MAX_TOKENS |
4096 |
Max output tokens (optional) |
The entrypoint (entrypoint.sh) orchestrates the following sequence:
- Xvfb — virtual display at
:99(1920x1080) - FastAPI server — data collection API on port 7878, starts ffmpeg screen recording
- Chromium — with Chrome extension loaded, CDP on port 9222
- socat — forwards port 9223 (external) to 9222 (internal CDP)
- setup-openclaw.sh — generates
~/.openclaw/openclaw.jsonand auth credentials from env vars - CDP health check — polls
http://127.0.0.1:9222/json/versionuntil Chrome is ready - OpenClaw gateway — local mode, manages agent execution and browser tool
- OpenClaw agent — runs
openclaw agent --session-id clawbench --message "$INSTRUCTION" --local - Watchdog — monitors
/data/actions.jsonl; stops when the eval interceptor matches (via/data/.stop-requested), after 900s of no new actions, or whenTIME_LIMIT_Sis reached - Cleanup — kills OpenClaw processes, calls
POST /api/stopfor bookkeeping, waits 15s grace period for recording to capture end result, callsPOST /api/stop-recordingto finalize MP4, exits
setup-openclaw.sh generates two files at runtime:
~/.openclaw/openclaw.json— gateway config (local mode), model provider settings, and browser profile pointing tohttp://127.0.0.1:9222(the in-container Chrome CDP endpoint)~/.openclaw/agents/main/agent/auth-profiles.json— API key credentials for the configured provider
The provider's baseUrl and api type are passed directly from the model config in models.yaml via BASE_URL and API_TYPE environment variables.
OpenClaw's built-in browser tool uses chrome-devtools-mcp to control the browser. However, as of v2026.3.13, the existing-session driver hardcodes --autoConnect when launching chrome-devtools-mcp, which only discovers Chrome via the DevToolsActivePort file in the default user data directory. It ignores the cdpUrl set in the browser profile config and never passes --browserUrl to chrome-devtools-mcp. This means the browser tool cannot connect to our Chromium instance running on port 9222. See openclaw/openclaw#47879.
Workaround applied in the Dockerfile:
- OpenClaw is pinned to
v2026.3.13 - A
sedpatch replaces"--autoConnect"with"--browserUrl","http://127.0.0.1:9222"across all bundled dist files - Chromium is launched with
--remote-allow-origins=*(required for chrome-devtools-mcp's internal WebSocket connections on Chrome 132+)
Once #47879 is resolved upstream, the version pin and patch can be removed.
Each container has a /my-info/ directory (read-only) containing a dummy user's identity and credentials:
/my-info/
alex_green_personal_info.json # Full profile (contact, address, education, work, etc.)
email_credentials.json # Auto-generated email + password + login URL
alex_green_resume.pdf # Resume PDF with dynamic email in header
The email field in both the personal info JSON and the resume PDF is updated each run to match the disposable email created for that session. The agent is instructed to read these files when it needs personal details for form filling, registration, etc.
Source templates live in shared/ (personal info) and test-driver/resume_template.json (resume). The PDF is generated at runtime by test-driver/generate_resume_pdf.py.
The exec tool is set to allowlist security mode in the generated OpenClaw config. Only safe, read-only commands are permitted (ls, cat, find, file, grep, sort, head, tail, jq, cut, uniq, tr, wc). Commands that could bypass the browser (e.g., curl, python, node, wget, smtplib) are blocked. The agent uses cat to read files in /my-info/ (the core files are listed in the instruction prompt, but ls is still available for extra info discovery).
The agent instruction prompt also explicitly requires browser-only task completion.
The interceptor blocks critical, irreversible HTTP requests (checkout, form submission, review posting, etc.) to prevent the agent from causing real-world side effects during evaluation. It does not validate task completion — evaluation is handled separately by evaluators post-session.
- Mount a JSON config at
/eval-schema.jsonin the container - The server connects to Chrome via CDP (
Fetchdomain) and intercepts all requests - When a request matches the
url_pattern(regex),method, and optionalbody/paramsfilters, the request is blocked - The blocked request's details are saved to
interception.json, the agent is killed, and the recording stops
The eval schema has two required fields (url_pattern, method) and two optional fields (body, params) for disambiguation.
{
"url_pattern": "inbox\\.purelymail\\.com",
"method": "POST",
"body": { "_action": "send" }
}The optional body and params are flat key-value maps — each key must match exactly in the request data. Use them when the same URL + method serves multiple actions (e.g., login vs send on the same endpoint, or different GraphQL operations).
For tasks behind payment walls or other natural blockers (agent has no valid credit card), use the placeholder pattern that never matches:
{
"url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
"method": "POST"
}The interceptor is only needed for actions that would have irreversible real-world consequences without a payment wall:
| Block | Examples |
|---|---|
| Yes | Public reviews, listings, job applications, contact forms, email sends, appointment bookings, website creation |
| No | Purchases, subscriptions, donations (payment wall), cart additions (reversible), searches (reversible), account creation (benign) |
/data/interception.json:
{
"intercepted": true,
"request": {
"url": "https://inbox.purelymail.com/action",
"method": "POST",
"params": {},
"body": {"_action": "send", "_to": "user@example.com"}
}
}The test driver (test-driver/run.py) automates running test cases end-to-end: creates a disposable email via PurelyMail, launches a container, enforces a time limit, collects results, and cleans up. Test cases are defined in test-cases/ with a task.json validated by test-cases/task.schema.json.
# Interactive menu (configure models, select cases, choose run mode):
./run.sh
# Or run directly:
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf qwen3.5-397b-a17b
# Human mode (no agent — you control the browser via noVNC):
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf --human
# Batch: all models x cases 1-50, 3 concurrent
uv run test-driver/batch.py --all-models --case-range 1-50 --max-concurrent 3
See test-driver/README.md for full documentation.
Human mode lets you perform test cases manually in the browser instead of using an AI agent. This is useful for collecting human baselines, debugging test cases, or verifying that a task is completable.
The container runs the same infrastructure (Xvfb, Chromium, extension, FastAPI server, ffmpeg recording) but instead of launching an OpenClaw agent, it exposes the browser via noVNC — a browser-based VNC client.
# Via interactive menu (select "Human mode"):
./run.sh
# Or directly:
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf --humanAfter the container starts, open the noVNC URL printed in the terminal:
noVNC ready: http://localhost:6080/vnc.html
You'll see the Chromium browser in your web browser. Complete the task manually — all your actions, screenshots, and HTTP requests are recorded just like in agent mode.
The session stops when any of these happen:
- Eval interceptor matches — the target HTTP request was detected (task completed)
- VNC disconnect — you close the noVNC tab (15s grace period for reconnection)
- Time limit — the configured time limit expires
After the session ends, results are collected in the same format as agent runs (actions, screenshots, recording, interception).
This project is licensed under the Apache License 2.0 — see the LICENSE file for details.
ClawBench uses the following open-source projects:
- noVNC (MPL 2.0) — browser-based VNC client for human mode
- websockify (LGPL 3.0) — WebSocket-to-TCP proxy for VNC
- OpenClaw — AI agent driver for browser automation