ClawBench

ClawBench is a benchmarking framework for evaluating web agents in real browser environments. It records user/agent interactions, HTTP requests, action screenshots, and full MP4 video recordings of each session.

Each test case runs in an isolated container (Docker or Podman) with a Chrome browser, a custom recording extension, and an AI agent. The framework captures everything the agent does and uses a request interceptor to detect and block the task's final, irreversible request, which ends the session.

Dependencies

Python 3.11+
uv (Python package manager — installs all Python deps automatically)
Docker or Podman (container runtime)

Python dependencies (fpdf2, huggingface_hub, pyyaml) are managed by uv and installed automatically on first run.

Quick Start

# 1. Set up PurelyMail credentials:
cp .env.example .env
# Edit .env and fill in PURELY_MAIL_API_KEY and PURELY_MAIL_DOMAIN

# 2. Launch the interactive menu:
./run.sh

The menu will guide you through configuring models, selecting test cases, and running benchmarks.

What's under the hood?

If you choose run_single in the menu, the program starts test-driver/run.py, which runs the test case in the container sandbox:

uv run --project test-driver test-driver/run.py test-cases/[test-case] [model]

This invokes the main() function in run.py, which does the following:

Builds the sandbox container.
Fetches the email account and saves the personal login information into the sandbox.
Calls docker_run in run.py to start the run.

Architecture

┌─────────────────────────────────────────────────┐
│  Container (Docker / Podman)                    │
│                                                 │
│  ┌───────────┐   DOM events  ┌──────────────┐   │
│  │ content.js├──────────────►│ background.js│   │
│  │ (per tab) │               │  (service    │   │
│  └───────────┘               │   worker)    │   │
│                              └──┬──────┬────┘   │
│                                 │      │        │
│                         actions │      │ screenshots
│                                 │      │        │
│  ┌──────────┐            ┌──────▼──────▼────┐   │
│  │  Xvfb    │◄──ffmpeg──►│  FastAPI Server  │   │
│  │ :99      │  x11grab   │  :7878           │   │
│  └──────────┘            └──────────────────┘   │
│                                  │              │
│  ┌──────────┐            ┌───────▼─────────┐    │
│  │ Chromium │            │     /data       │    │
│  │ :9222 CDP│            │  actions.jsonl  │    │
│  └──────────┘            │  requests.jsonl │    │
│                          │  screenshots/   │    │
│                          │  recording.mp4  │    │
│                          └─────────────────┘    │
└─────────────────────────────────────────────────┘

Data Output

All data is stored under /data in the container:

/data/
  actions.jsonl          # One JSON object per line, every DOM event
  requests.jsonl         # One JSON object per line, every HTTP request
  agent-messages.jsonl   # OpenClaw conversation transcript (thinking, text, tool calls)
  screenshots/           # Timestamped PNGs, one per action
    1710000001234.png
    1710000002345.png
  recording.mp4          # Full session video (H.264, 15fps)
  interception.json      # Details of the request blocked by the interceptor (if any)

Action Format (JSONL)

Each line is a JSON object:

{"type": "click", "timestamp": 1710000001234, "url": "https://example.com/", "target": {"tagName": "A", "id": "", "className": "btn", "textContent": "Submit", "xpath": "/html[1]/body[1]/div[1]/a[1]"}, "x": 255, "y": 245}
{"type": "keydown", "timestamp": 1710000002345, "url": "https://example.com/", "target": {...}, "key": "Enter"}
{"type": "input", "timestamp": 1710000003456, "url": "https://example.com/", "target": {...}, "value": "search query"}
{"type": "pageLoad", "timestamp": 1710000004567, "url": "https://example.com/results", "title": "Results"}

Captured event types: pageLoad, click, keydown, keyup, input, scroll, change, submit.
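
As a rough illustration of consuming this file, the sketch below tallies events by type. The function name `count_event_types` is hypothetical and not part of ClawBench; it relies only on the `type` field shown above.

```python
import json
from collections import Counter
from typing import Iterable


def count_event_types(lines: Iterable[str]) -> Counter:
    """Count actions by their "type" field, skipping blank lines."""
    counts: Counter = Counter()
    for line in lines:
        line = line.strip()
        if not line:
            continue
        event = json.loads(line)
        counts[event.get("type", "unknown")] += 1
    return counts


# Two in-memory lines for demonstration; for a real run, pass an open
# file handle for actions.jsonl instead.
sample = [
    '{"type": "click", "timestamp": 1710000001234, "url": "https://example.com/"}',
    '{"type": "pageLoad", "timestamp": 1710000004567, "url": "https://example.com/results", "title": "Results"}',
]
print(count_event_types(sample))
```

Because the function accepts any iterable of strings, it works equally on a file object, a list, or a generator that streams lines from a copied results directory.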

Agent Messages Format (JSONL)

agent-messages.jsonl contains the full OpenClaw conversation transcript. Each line is a JSON object:

type: "session" — session metadata (version, id, timestamp)
type: "message" — conversation turn, with message.role and message.content[]:

`message.role`	Content types	Description
`user`	`text`	The instruction prompt
`assistant`	`text`, `thinking`, `toolCall`	Model response, reasoning, actions
`toolResult`	`text`	Tool execution results
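
The sketch below shows one way to walk the transcript and collect which content types each role produced. It assumes that every entry in `message.content[]` is an object carrying its own `type` field (`text`, `thinking`, or `toolCall`); that layout is inferred from the table above rather than confirmed, and `content_types_by_role` is a hypothetical helper.

```python
import json
from collections import defaultdict
from typing import Iterable


def content_types_by_role(lines: Iterable[str]) -> dict[str, set[str]]:
    """Map each message.role to the set of content types it produced."""
    seen: dict[str, set[str]] = defaultdict(set)
    for line in lines:
        if not line.strip():
            continue
        entry = json.loads(line)
        if entry.get("type") != "message":
            continue  # skip "session" metadata lines
        message = entry.get("message", {})
        role = message.get("role", "unknown")
        for item in message.get("content", []):
            # Assumption: each content item is a dict with a "type" key.
            seen[role].add(item.get("type", "unknown"))
    return dict(seen)
```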

HTTP Requests Format (JSONL)

requests.jsonl logs every HTTP request made by the browser during the session (excluding internal extension/server traffic). Each line:

{"timestamp": 1710000001.234, "url": "https://example.com/api?q=test", "method": "POST", "headers": {"Content-Type": "application/json"}, "body": {"action": "send"}, "query_params": {"q": "test"}, "resource_type": "XHR"}

Field	Description
`timestamp`	Unix epoch (float)
`url`	Full request URL
`method`	HTTP method (GET, POST, etc.)
`headers`	Request headers (object)
`body`	Parsed request body (JSON object, form dict, raw string, or null)
`query_params`	Parsed URL query parameters (object)
`resource_type`	Resource type: Document, Script, Stylesheet, XHR, Fetch, Image, Font, etc.

Requests to localhost:7878 (extension server) and chrome-extension:// URLs are filtered out.
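
When looking for the request that completed a task, it often helps to narrow the log to state-changing API calls. The following is a minimal sketch under that assumption; `filter_requests` and its default arguments are hypothetical, and only the `method` and `resource_type` fields documented above are used.

```python
import json
from typing import Iterable, Iterator


def filter_requests(
    lines: Iterable[str],
    method: str = "POST",
    resource_types: tuple[str, ...] = ("XHR", "Fetch"),
) -> Iterator[dict]:
    """Yield logged requests matching the given method and resource types."""
    for line in lines:
        if not line.strip():
            continue
        request = json.loads(line)
        if (
            request.get("method") == method
            and request.get("resource_type") in resource_types
        ):
            yield request


sample_requests = [
    '{"timestamp": 1710000001.234, "url": "https://example.com/api?q=test", '
    '"method": "POST", "headers": {}, "body": {"action": "send"}, '
    '"query_params": {"q": "test"}, "resource_type": "XHR"}',
    '{"timestamp": 1710000002.0, "url": "https://example.com/logo.png", '
    '"method": "GET", "headers": {}, "body": null, '
    '"query_params": {}, "resource_type": "Image"}',
]
for req in filter_requests(sample_requests):
    print(req["method"], req["url"])
```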

Building the Container

Container engine

The framework supports both Docker and Podman (which works without root privileges). Set the CONTAINER_ENGINE environment variable to force one:

export CONTAINER_ENGINE=podman  # or docker

If unset, it auto-detects the one available on the system.
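
The selection logic can be pictured roughly as follows. This is a sketch, not the framework's actual code: the function name `detect_container_engine` and the probe order (Docker first, then Podman) are assumptions.

```python
import os
import shutil
from typing import Callable, Mapping, Optional


def detect_container_engine(
    env: Optional[Mapping[str, str]] = None,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> str:
    """Return the engine forced via CONTAINER_ENGINE, else the first one on PATH."""
    env = os.environ if env is None else env
    forced = env.get("CONTAINER_ENGINE")
    if forced:
        return forced
    # Assumption: docker is probed before podman; the real order may differ.
    for candidate in ("docker", "podman"):
        if which(candidate):
            return candidate
    raise RuntimeError("neither docker nor podman was found on PATH")
```

Passing `env` and `which` as parameters keeps the function easy to exercise without touching the real environment.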

Build

# Using docker:
docker build -t clawbench .

# Using podman (rootless, no sudo required):
podman build -t clawbench .

Ports

Port	Service	Purpose
7878	FastAPI server	Action/screenshot API, session control
9223	CDP (via socat)	Playwright/DevTools connection to Chromium

API Endpoints

Method	Path	Description
GET	`/api/status`	Health check
POST	`/api/action`	Record a browser action (JSON body)
POST	`/api/screenshot`	Store a screenshot (base64 PNG in JSON)
POST	`/api/stop`	Signal session stop, finalize bookkeeping
POST	`/api/stop-recording`	Stop ffmpeg recording, finalize MP4
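
For orientation, here is a hedged sketch of a client for two of these endpoints using only the standard library. It assumes port 7878 is reachable from where the script runs and that `/api/action` accepts a body shaped like the action records described earlier; neither detail is confirmed here, and both helper names are hypothetical.

```python
import json
import urllib.request

# Assumption: port 7878 is reachable at localhost from the caller.
BASE_URL = "http://localhost:7878"


def build_action_request(action: dict, base_url: str = BASE_URL) -> urllib.request.Request:
    """Build (but do not send) a POST /api/action request with a JSON body."""
    return urllib.request.Request(
        f"{base_url}/api/action",
        data=json.dumps(action).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def check_status(base_url: str = BASE_URL, timeout: float = 5.0) -> int:
    """Call GET /api/status and return the HTTP status code."""
    with urllib.request.urlopen(f"{base_url}/api/status", timeout=timeout) as resp:
        return resp.status
```

To send a built request, pass it to `urllib.request.urlopen`. In normal operation the extension posts actions itself, so a client like this is mainly useful for debugging.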

OpenClaw Agent Integration

The container uses OpenClaw as the agent driver to perform actions on the in-container Chromium browser via CDP. All agent actions are transparently recorded by the existing extension and server infrastructure.

Environment Variables

Variable	Example	Description
`MODEL_NAME`	`claude-sonnet-4-6`, `gemini-3-flash-preview`	Model identifier
`BASE_URL`	`https://api.openai.com/v1`	API base URL
`API_TYPE`	`openai-completions`	API type (`openai-completions`, `anthropic-messages`, etc.)
`API_KEY`	`sk-ant-...`, `AIza...`	API key
`INSTRUCTION`	`"Go to example.com and…"`	Task prompt for the agent
`TIME_LIMIT_S`	`300`	Watchdog timeout in seconds (default: 600)
`THINKING_LEVEL`	`high`, `low`, `off`	Reasoning depth (default: `medium`)
`TEMPERATURE`	`0.5`	Sampling temperature (optional)
`MAX_TOKENS`	`4096`	Max output tokens (optional)

Container Lifecycle with OpenClaw

The entrypoint (entrypoint.sh) orchestrates the following sequence:

Xvfb — virtual display at :99 (1920x1080)
FastAPI server — data collection API on port 7878, starts ffmpeg screen recording
Chromium — with Chrome extension loaded, CDP on port 9222
socat — forwards port 9223 (external) to 9222 (internal CDP)
setup-openclaw.sh — generates ~/.openclaw/openclaw.json and auth credentials from env vars
CDP health check — polls http://127.0.0.1:9222/json/version until Chrome is ready
OpenClaw gateway — local mode, manages agent execution and browser tool
OpenClaw agent — runs openclaw agent --session-id clawbench --message "$INSTRUCTION" --local
Watchdog — monitors /data/actions.jsonl; stops when the eval interceptor matches (via /data/.stop-requested), after 900s of no new actions, or when TIME_LIMIT_S is reached
Cleanup — kills OpenClaw processes, calls POST /api/stop for bookkeeping, waits 15s grace period for recording to capture end result, calls POST /api/stop-recording to finalize MP4, exits
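
The three stop conditions in the Watchdog step can be sketched as a pure function. This is an illustration only, not the logic in entrypoint.sh: the function name `stop_reason`, its parameters, and the order of the checks are assumptions, while the 900 s idle limit and the 600 s default come from the text above.

```python
from typing import Optional

IDLE_LIMIT_S = 900          # no new actions for this long ends the session
DEFAULT_TIME_LIMIT_S = 600  # default for TIME_LIMIT_S


def stop_reason(
    now: float,
    started_at: float,
    last_action_at: float,
    stop_requested: bool,
    time_limit_s: float = DEFAULT_TIME_LIMIT_S,
    idle_limit_s: float = IDLE_LIMIT_S,
) -> Optional[str]:
    """Return why the session should stop, or None to keep running.

    stop_requested stands in for the presence of /data/.stop-requested;
    last_action_at stands in for the last time actions.jsonl grew.
    """
    if stop_requested:
        return "interceptor"
    if now - started_at >= time_limit_s:
        return "time_limit"
    if now - last_action_at >= idle_limit_s:
        return "idle"
    return None
```

A polling loop would call this every few seconds with the current time and the modification time of /data/actions.jsonl, then run the cleanup step once a reason is returned.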

OpenClaw Configuration

setup-openclaw.sh generates two files at runtime:

~/.openclaw/openclaw.json — gateway config (local mode), model provider settings, and browser profile pointing to http://127.0.0.1:9222 (the in-container Chrome CDP endpoint)
~/.openclaw/agents/main/agent/auth-profiles.json — API key credentials for the configured provider

The provider's baseUrl and api type are passed directly from the model config in models.yaml via BASE_URL and API_TYPE environment variables.

OpenClaw Browser Patch

OpenClaw's built-in browser tool uses chrome-devtools-mcp to control the browser. However, as of v2026.3.13, the existing-session driver hardcodes --autoConnect when launching chrome-devtools-mcp, which only discovers Chrome via the DevToolsActivePort file in the default user data directory. It ignores the cdpUrl set in the browser profile config and never passes --browserUrl to chrome-devtools-mcp. This means the browser tool cannot connect to our Chromium instance running on port 9222. See openclaw/openclaw#47879.

Workaround applied in the Dockerfile:

OpenClaw is pinned to v2026.3.13
A sed patch replaces "--autoConnect" with "--browserUrl","http://127.0.0.1:9222" across all bundled dist files
Chromium is launched with --remote-allow-origins=* (required for chrome-devtools-mcp's internal WebSocket connections on Chrome 132+)

Once #47879 is resolved upstream, the version pin and patch can be removed.
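
The Dockerfile performs the substitution with sed; the Python snippet below mirrors the same string replacement purely to make the effect visible. `patch_autoconnect` is a hypothetical name, and the sample input is an invented fragment, not actual OpenClaw source.

```python
def patch_autoconnect(source: str, cdp_url: str = "http://127.0.0.1:9222") -> str:
    """Replace the hardcoded "--autoConnect" argument with an explicit --browserUrl."""
    return source.replace('"--autoConnect"', f'"--browserUrl","{cdp_url}"')


# Invented example of an argument list as it might appear in a bundled file.
before = 'const args = ["chrome-devtools-mcp", "--autoConnect"];'
print(patch_autoconnect(before))
```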

Synthetic User Profile

Each container has a /my-info/ directory (read-only) containing a dummy user's identity and credentials:

/my-info/
  alex_green_personal_info.json   # Full profile (contact, address, education, work, etc.)
  email_credentials.json          # Auto-generated email + password + login URL
  alex_green_resume.pdf           # Resume PDF with dynamic email in header

The email field in both the personal info JSON and the resume PDF is updated each run to match the disposable email created for that session. The agent is instructed to read these files when it needs personal details for form filling, registration, etc.

Source templates live in shared/ (personal info) and test-driver/resume_template.json (resume). The PDF is generated at runtime by test-driver/generate_resume_pdf.py.

Tool Restrictions

The exec tool is set to allowlist security mode in the generated OpenClaw config. Only safe, read-only commands are permitted (ls, cat, find, file, grep, sort, head, tail, jq, cut, uniq, tr, wc). Commands that could bypass the browser (e.g., curl, python, node, wget, smtplib) are blocked. The agent uses cat to read files in /my-info/ (the core files are listed in the instruction prompt, but ls is still available for extra info discovery).

The agent instruction prompt also explicitly requires browser-only task completion.
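
To make the allowlist idea concrete, the sketch below checks that every command in a simple pipeline starts with a permitted program. It is not OpenClaw's enforcement mechanism: `is_allowed` is a hypothetical helper, and it deliberately ignores shell features such as command substitution and redirection that a real policy would have to handle.

```python
import shlex

ALLOWED = frozenset(
    {"ls", "cat", "find", "file", "grep", "sort", "head", "tail",
     "jq", "cut", "uniq", "tr", "wc"}
)
# Tokens that start a new command in a simple pipeline or command list.
OPERATORS = frozenset({"|", "||", "&&", ";", "&"})


def is_allowed(command: str) -> bool:
    """Return True if each command in the line begins with an allowlisted program."""
    lexer = shlex.shlex(command, posix=True, punctuation_chars=True)
    lexer.whitespace_split = True
    try:
        tokens = list(lexer)
    except ValueError:  # e.g. an unclosed quote
        return False
    expect_command = True
    seen_command = False
    for token in tokens:
        if token in OPERATORS:
            expect_command = True
        elif expect_command:
            if token not in ALLOWED:
                return False
            expect_command = False
            seen_command = True
    return seen_command
```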

Request Interceptor

The interceptor blocks critical, irreversible HTTP requests (checkout, form submission, review posting, etc.) to prevent the agent from causing real-world side effects during evaluation. It does not validate task completion — evaluation is handled separately by evaluators post-session.

How It Works

Mount a JSON config at /eval-schema.json in the container
The server connects to Chrome via CDP (Fetch domain) and intercepts all requests
When a request matches the url_pattern (regex), method, and optional body/params filters, the request is blocked
The blocked request's details are saved to interception.json, the agent is killed, and the recording stops

Schema Format

The eval schema has two required fields (url_pattern, method) and two optional fields (body, params) for disambiguation.

{
  "url_pattern": "inbox\\.purelymail\\.com",
  "method": "POST",
  "body": { "_action": "send" }
}

The optional body and params are flat key-value maps — each key must match exactly in the request data. Use them when the same URL + method serves multiple actions (e.g., login vs send on the same endpoint, or different GraphQL operations).

For tasks behind payment walls or other natural blockers (agent has no valid credit card), use the placeholder pattern that never matches:

{
  "url_pattern": "__PLACEHOLDER_WILL_NOT_MATCH__",
  "method": "POST"
}
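
The matching rule described above can be sketched as a small predicate. This is an approximation for illustration, not the server's code: `matches_schema` is a hypothetical name, and it assumes the pattern is applied with a regex search anywhere in the URL, that the method comparison is case-insensitive, and that body and params values are compared by plain equality.

```python
import re
from typing import Any, Mapping, Optional


def matches_schema(
    schema: Mapping[str, Any],
    url: str,
    method: str,
    body: Any = None,
    params: Optional[Mapping[str, Any]] = None,
) -> bool:
    """Return True if a request satisfies url_pattern, method, and optional filters."""
    if not re.search(schema["url_pattern"], url):
        return False
    if schema["method"].upper() != method.upper():
        return False
    for field, actual in (("body", body), ("params", params)):
        expected = schema.get(field) or {}
        # A raw-string or null body cannot satisfy a key-value filter.
        actual_map = actual if isinstance(actual, Mapping) else {}
        for key, value in expected.items():
            if key not in actual_map or actual_map[key] != value:
                return False
    return True


send_schema = {
    "url_pattern": "inbox\\.purelymail\\.com",
    "method": "POST",
    "body": {"_action": "send"},
}
print(matches_schema(
    send_schema,
    "https://inbox.purelymail.com/action",
    "POST",
    body={"_action": "send", "_to": "user@example.com"},
))
```

Extra keys in the request body are ignored, so the filter only needs the fields that distinguish one action from another.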

When to Block

The interceptor is only needed for actions that would have irreversible real-world consequences and are not already gated by a payment wall:

Block	Examples
Yes	Public reviews, listings, job applications, contact forms, email sends, appointment bookings, website creation
No	Purchases, subscriptions, donations (payment wall), cart additions (reversible), searches (reversible), account creation (benign)

Interception Output

/data/interception.json:

{
  "intercepted": true,
  "request": {
    "url": "https://inbox.purelymail.com/action",
    "method": "POST",
    "params": {},
    "body": {"_action": "send", "_to": "user@example.com"}
  }
}

Test Driver

The test driver (test-driver/run.py) automates running test cases end-to-end: creates a disposable email via PurelyMail, launches a container, enforces a time limit, collects results, and cleans up. Test cases are defined in test-cases/ with a task.json validated by test-cases/task.schema.json.

# Interactive menu (configure models, select cases, choose run mode):
./run.sh

# Or run directly:
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf qwen3.5-397b-a17b

# Human mode (no agent — you control the browser via noVNC):
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf --human

# Batch: all models x cases 1-50, 3 concurrent
uv run test-driver/batch.py --all-models --case-range 1-50 --max-concurrent 3

See test-driver/README.md for full documentation.

Human Mode

Human mode lets you perform test cases manually in the browser instead of using an AI agent. This is useful for collecting human baselines, debugging test cases, or verifying that a task is completable.

The container runs the same infrastructure (Xvfb, Chromium, extension, FastAPI server, ffmpeg recording) but instead of launching an OpenClaw agent, it exposes the browser via noVNC — a browser-based VNC client.

Usage

# Via interactive menu (select "Human mode"):
./run.sh

# Or directly:
uv run --project test-driver test-driver/run.py test-cases/886-entertainment-hobbies-experience-topgolf --human

After the container starts, open the noVNC URL printed in the terminal:

noVNC ready: http://localhost:6080/vnc.html

You'll see the Chromium browser in your web browser. Complete the task manually — all your actions, screenshots, and HTTP requests are recorded just like in agent mode.

How it ends

The session stops when any of these happen:

Eval interceptor matches — the target HTTP request was detected (task completed)
VNC disconnect — you close the noVNC tab (15s grace period for reconnection)
Time limit — the configured time limit expires

After the session ends, results are collected in the same format as agent runs (actions, screenshots, recording, interception).

License

This project is licensed under the Apache License 2.0 — see the LICENSE file for details.

Acknowledgments

ClawBench uses the following open-source projects:

noVNC (MPL 2.0) — browser-based VNC client for human mode
websockify (LGPL 3.0) — WebSocket-to-TCP proxy for VNC
OpenClaw — AI agent driver for browser automation
