Commit f1e24a3

Merge pull request #132 from agent-diff-bench/fixes-kdd
DB Performance
2 parents 26ceaaf + 8448109 commit f1e24a3

19 files changed

Lines changed: 3971 additions & 697 deletions

AGENTS.md

Lines changed: 247 additions & 0 deletions
@@ -0,0 +1,247 @@
# AGENTS.md — Agent-Diff Developer Guide

## Project Overview

Agent-Diff is a benchmarking platform for evaluating AI agents that interact with
real-world SaaS APIs (Slack, Linear, Box, Google Calendar). It provides **isolated,
reproducible environments** backed by PostgreSQL schema cloning.

## Architecture

```
┌──────────────────────────┐       ┌──────────────────────┐
│  Evaluation Client       │       │  Agent Sandbox       │
│  (prime eval / SDK)      │──────▶│  (Docker container)  │
│                          │       │                      │
│  1. initEnv              │       │  Runs agent code     │
│  2. startRun             │       │  Makes API calls ──┐ │
│  3. evaluateRun          │       └────────────────────┼─┘
│  4. getResults           │                            │
└──────────┬───────────────┘                            │
           │                                            │
           ▼                                            ▼
┌──────────────────────────────────────────────────────────┐
│  AgentDiff Backend (FastAPI/Starlette)                   │
│                                                          │
│  Platform API (/api/platform/*)                          │
│    - initEnv, startRun, evaluateRun, diffRun             │
│    - Template & test suite management                    │
│                                                          │
│  Service APIs (/api/env/{env_id}/services/{service}/*)   │
│    - Box REST API replica (/services/box/2.0/*)          │
│    - Slack API replica (/services/slack/*)               │
│    - Linear GraphQL replica (/services/linear/*)         │
│    - Calendar API replica (/services/calendar/*)         │
│                                                          │
│  Middleware:                                             │
│    PlatformMiddleware → API key auth for platform calls  │
│    IsolationMiddleware → per-env DB session + auth       │
└──────────────────────────────────────────────────────────┘
```

## Environment Lifecycle

### 1. Create an Isolated Environment (initEnv)

Every evaluation starts by creating an isolated copy of a template database schema.

**Via SDK (Python):**
```python
from agent_diff import AgentDiff

client = AgentDiff(
    api_key="ad_live_sk_...",
    base_url="https://api.agentdiff.dev",  # or http://localhost:8000
)

env = client.init_env(
    templateService="box",  # "box" | "linear" | "slack" | "calendar"
    templateName="box_default",  # name of the seeded template
    impersonateUserId="27512847635",  # user ID from the seed data
)
# env.environmentId → hex string, e.g. "824d0c408eeb42368f20e24d2d9f03c3"
# env.environmentUrl → "/api/env/{env_id}/services/box"
```

**Via curl:**
```bash
curl -X POST https://api.agentdiff.dev/api/platform/initEnv \
  -H "X-API-Key: ad_live_sk_..." \
  -H "Content-Type: application/json" \
  -d '{
    "templateService": "box",
    "templateName": "box_default",
    "impersonateUserId": "27512847635"
  }'
```

**What happens internally:**
1. `templateManager.resolve_init_template()` finds the template by service+name
2. `CoreIsolationEngine.create_environment()` clones the template PostgreSQL schema
3. A new `state_<uuid>` schema is created with all tables and data copied
4. A `RunTimeEnvironment` record is stored in the meta schema with TTL

### 2. Make API Calls Against the Environment

Once the environment is created, API calls go to the service replica endpoints:

```
Base URL: {base_url}/api/env/{env_id}/services/{service}

Box:      /api/env/{env_id}/services/box/2.0/search?query=fomc
Linear:   /api/env/{env_id}/services/linear/graphql
Slack:    /api/env/{env_id}/services/slack/conversations.list
Calendar: /api/env/{env_id}/services/calendar/calendars/{calendarId}/events
```

Each request goes through `IsolationMiddleware`, which:
1. Validates the API key via the control plane (`get_principal_id`)
2. Looks up the environment in the meta DB to get `impersonate_user_id`
3. Opens a DB session scoped to the environment's `state_<uuid>` schema
4. Passes the request to the service route handler
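
As a rough illustration of the URL layout above, the helper below assembles a replica URL from its parts. The function `service_url` and its parameter names are hypothetical and not part of the SDK; only the path shape is taken from the examples above.

```python
from urllib.parse import urlencode


def service_url(base_url: str, env_id: str, service: str, path: str, **params: str) -> str:
    """Build {base_url}/api/env/{env_id}/services/{service}/{path}[?query]."""
    url = f"{base_url.rstrip('/')}/api/env/{env_id}/services/{service}/{path.lstrip('/')}"
    if params:
        # Query parameters are URL-encoded and appended after '?'.
        url += "?" + urlencode(params)
    return url


# Hypothetical helper applied to the Box search example from the list above.
box_search = service_url(
    "http://localhost:8000",
    "824d0c408eeb42368f20e24d2d9f03c3",
    "box",
    "2.0/search",
    query="fomc",
)
print(box_search)
```

An agent would send its requests to URLs of this shape, with the same API key header used for the platform calls (assuming the replica accepts `X-API-Key`, as the curl example for `initEnv` does).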

### 3. Start a Run & Evaluate

```python
run = client.start_run(envId=env.environmentId)
# ... agent makes API calls that modify the environment ...
result = client.evaluate_run(runId=run.runId, expectedOutput={...})
results = client.get_results_for_run(runId=run.runId)
```

### 4. Cleanup

```python
client.delete_env(envId=env.environmentId)
```
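
The four lifecycle steps can be combined so that cleanup always runs, even when the agent or the evaluation raises. The sketch below is one possible arrangement, not the project's own harness: `run_episode` and the `agent` callback are hypothetical names, and `client` is assumed to expose the SDK methods shown above.

```python
from typing import Any, Callable


def run_episode(client: Any, agent: Callable[[Any], None], expected_output: dict) -> Any:
    """Create an environment, run the agent, evaluate, and always clean up."""
    env = client.init_env(
        templateService="box",
        templateName="box_default",
        impersonateUserId="27512847635",
    )
    try:
        run = client.start_run(envId=env.environmentId)
        agent(env)  # the agent makes API calls against env.environmentUrl
        client.evaluate_run(runId=run.runId, expectedOutput=expected_output)
        return client.get_results_for_run(runId=run.runId)
    finally:
        # Delete the environment even if the agent or evaluation fails.
        client.delete_env(envId=env.environmentId)
```

Putting `delete_env` in a `finally` block avoids leaving `state_<uuid>` schemas behind until their TTL expires when a run fails partway through.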

## Available Templates

| Service  | Template Name  | Impersonate User ID                  |
|----------|----------------|--------------------------------------|
| box      | box_default    | 27512847635                          |
| linear   | linear_default | 2790a7ee-fde0-4537-9588-e233aa5a68d1 |
| slack    | slack_default  | U01AGENBOT9                          |
| calendar | calendar_base  | (varies by seed)                     |

## Writing Tests

### Integration Tests (in-process, no HTTP server)

Tests create environments via `core_isolation_engine.create_environment()` and
wire up an `AsyncClient` with middleware that injects the DB session:

```python
import pytest_asyncio
from httpx import ASGITransport, AsyncClient
from starlette.applications import Starlette


@pytest_asyncio.fixture
async def box_client(test_user_id, core_isolation_engine, session_manager, environment_handler):
    env_result = core_isolation_engine.create_environment(
        template_schema="box_default",
        ttl_seconds=3600,
        created_by=test_user_id,
        impersonate_user_id="27512847635",
    )

    async def add_db_session(request, call_next):
        with session_manager.with_session_for_environment(env_result.environment_id) as session:
            request.state.db_session = session
            request.state.environment_id = env_result.environment_id
            request.state.impersonate_user_id = "27512847635"
            request.state.impersonate_email = None
            response = await call_next(request)
            return response

    from src.services.box.api.routes import routes as box_routes
    app = Starlette(routes=box_routes)
    app.middleware("http")(add_db_session)

    transport = ASGITransport(app=app)
    async with AsyncClient(transport=transport, base_url="http://test") as client:
        yield client

    # Teardown: drop the cloned schema once the test has finished.
    environment_handler.drop_schema(env_result.schema_name)
```

### Running Tests

```bash
cd backend
# Requires DATABASE_URL in .env or environment
pytest tests/performance/test_box_bench_perf.py -v -s
pytest tests/integration/ -v
```

## Running Evaluations Locally

```bash
# 1. Activate the bench environment's venv
source third_party/prime-environments/environments/agent_diff_bench/.venv/bin/activate

# 2. Install the environment package
cd third_party/prime-environments/environments/agent_diff_bench
uv pip install -e .

# 3. Run evaluation (from the agent_diff_bench directory)
uv run prime eval run agent-diff-bench \
  -m "openai/gpt-5-mini" \
  -n 5 -r 3 -s \
  -a '{"agentdiff_api_key": "ad_live_sk_..."}'
```

Results are saved to: `third_party/prime-environments/environments/agent_diff_bench/eval_results/`

## Database Seeding

Templates are seeded from JSON files in `backend/seeds/` (Docker) or `examples/` (local).

Seed scripts in `backend/utils/`:
- `seed_box_template.py`: creates box_default, box_base templates
- `seed_linear_template.py`: creates linear_default, linear_base, linear_expanded
- `seed_slack_template.py`: creates slack_default, slack_bench_default
- `seed_calendar_template.py`: creates calendar_base
- `seed_tests.py`: loads test suite JSON files

On Railway, seeding runs automatically on deploy when the `SEED=true` env var is set.
The Dockerfile startup script runs Alembic migrations, then all seed scripts.

## Performance Profiling

Performance-critical code paths emit log lines tagged `[PERF]`:

- **Middleware**: `[PERF] GET /api/env/.../services/box/... | service=box total=Xms auth=Xms meta_db=Xms handler=Xms status=N`
- **Box operations**: `[PERF] search_content TOTAL=Xms`, `[PERF] get_folder_by_id(...) time=Xms`
- **Box schema**: `[PERF] File._get_path_collection depth=N time=Xms`
- **Calendar**: `[PERF] Calendar events_list took Xms`

Filter with: `grep "\[PERF\]"` in Railway logs.
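
Beyond `grep`, the middleware lines can be parsed into numbers for aggregation. The sketch below assumes the format emitted by `IsolationMiddleware` in this commit (`... | service=... total=...ms auth=...ms meta_db=...ms handler=...ms status=...`); `parse_perf_line` and the sample log line are hypothetical, made up for illustration.

```python
import re
from typing import Dict, Optional, Union

# Matches only the middleware-style [PERF] line; other [PERF] lines return None.
PERF_RE = re.compile(
    r"\[PERF\] (?P<method>[A-Z]+) (?P<path>\S+) \| service=(?P<service>\S+) "
    r"total=(?P<total>\d+)ms auth=(?P<auth>\d+)ms "
    r"meta_db=(?P<meta_db>\d+)ms handler=(?P<handler>\d+)ms "
    r"status=(?P<status>\d+)"
)


def parse_perf_line(line: str) -> Optional[Dict[str, Union[str, int]]]:
    """Extract method, path, service, timings (ms) and status from one log line."""
    match = PERF_RE.search(line)
    if match is None:
        return None
    fields: Dict[str, Union[str, int]] = dict(match.groupdict())
    for key in ("total", "auth", "meta_db", "handler", "status"):
        fields[key] = int(fields[key])
    return fields


# Made-up sample line in the assumed format.
sample = (
    "INFO [PERF] GET /api/env/abc123/services/box/2.0/search | service=box "
    "total=42ms auth=5ms meta_db=7ms handler=29ms status=200"
)
parsed = parse_perf_line(sample)
print(parsed)
```

Comparing `total` against `auth + meta_db + handler` per request gives a quick view of which phase dominates.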

## Key Directories

```
backend/
  src/
    platform/            # Platform API (initEnv, runs, evaluation)
    services/
      box/               # Box API replica
      slack/             # Slack API replica
      linear/            # Linear API replica
      calendar/          # Calendar API replica
  tests/
    integration/         # Full-stack integration tests
    performance/         # Performance/benchmark tests
    validation/          # API parity tests
    unit/                # Unit tests
  utils/                 # Seed scripts
  seeds/                 # Seed data JSON files (for Docker)

sdk/agent-diff-python/   # Python SDK (agent_diff package)

examples/
  box/                   # Box seed data + test suites
  linear/                # Linear seed data + test suites
  slack/                 # Slack seed data + test suites
  calendar/              # Calendar seed data

third_party/prime-environments/environments/agent_diff_bench/
  agent_diff_bench.py    # Entry point for prime eval
  src/environment.py     # Environment setup (initEnv, startRun, etc.)
```

backend/src/platform/api/middleware.py

Lines changed: 22 additions & 1 deletion
@@ -1,6 +1,7 @@
 from __future__ import annotations

 import logging
+import time

 from starlette.middleware.base import BaseHTTPMiddleware
 from starlette.requests import Request
@@ -86,6 +87,8 @@ async def dispatch(self, request: Request, call_next) -> Response:
         if not path.startswith("/api/env/"):
             return await call_next(request)

+        t_total_start = time.perf_counter()
+
         try:
             path_after_prefix = path[len("/api/env/") :]
             env_id = path_after_prefix.split("/")[0] if path_after_prefix else ""
@@ -106,8 +109,11 @@ async def dispatch(self, request: Request, call_next) -> Response:
                     status_code=status.HTTP_401_UNAUTHORIZED,
                 )

+            t_auth_start = time.perf_counter()
             principal_id = await get_principal_id(api_key_hdr, action="api_request")
+            t_auth_ms = (time.perf_counter() - t_auth_start) * 1000

+            t_meta_start = time.perf_counter()
             with self.session_manager.with_meta_session() as meta_session:
                 request.state.principal_id = principal_id

@@ -125,11 +131,26 @@ async def dispatch(self, request: Request, call_next) -> Response:
                         logger.debug(
                             f"Could not load impersonation data for env {env_id}: {e}"
                         )
+            t_meta_ms = (time.perf_counter() - t_meta_start) * 1000

+            t_handler_start = time.perf_counter()
             with self.session_manager.with_session_for_environment(env_id) as session:
                 request.state.db_session = session
                 request.state.environment_id = env_id
-                return await call_next(request)
+                response = await call_next(request)
+                t_handler_ms = (time.perf_counter() - t_handler_start) * 1000
+
+                t_total_ms = (time.perf_counter() - t_total_start) * 1000
+                # Extract service from path for easier log filtering
+                parts = path_after_prefix.split("/")
+                service_name = parts[2] if len(parts) > 2 else "unknown"
+                logger.info(
+                    f"[PERF] {request.method} {path} | service={service_name} "
+                    f"total={t_total_ms:.0f}ms auth={t_auth_ms:.0f}ms "
+                    f"meta_db={t_meta_ms:.0f}ms handler={t_handler_ms:.0f}ms "
+                    f"status={response.status_code}"
+                )
+                return response

         except PermissionError as exc:
             return JSONResponse(
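
The paired `time.perf_counter()` calls in this diff could also be expressed with a small context manager. The following is only a sketch of that alternative, not what the commit does: `timed` and the `timings` dictionary are hypothetical names, and `time.sleep` stands in for `await call_next(request)`.

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(timings: Dict[str, float], name: str) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block in milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs on both success and exception, so failed requests are timed too.
        timings[name] = (time.perf_counter() - start) * 1000


timings: Dict[str, float] = {}
with timed(timings, "total"):
    with timed(timings, "handler"):
        time.sleep(0.01)  # stand-in for the request handler

print(" ".join(f"{key}={value:.0f}ms" for key, value in timings.items()))
```

One difference from the inline version: because the duration is stored in a `finally` block, a phase that raises still gets a timing entry, whereas the inline code skips the `[PERF]` log line when an exception propagates.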
Lines changed: 57 additions & 0 deletions
@@ -0,0 +1,57 @@
"""Add composite indexes for calendar event queries

Adds composite indexes on calendar_events to optimize the most common
query patterns: time-range filtering, status filtering, and sync-token
incremental queries.

Revision ID: a1b2c3d4e5f6
Revises: merge_heads_20260130
Create Date: 2026-02-11 12:00:00.000000

"""

from typing import Sequence, Union

from alembic import op


# revision identifiers, used by Alembic.
revision: str = "a1b2c3d4e5f6"
down_revision: Union[str, None] = "merge_heads_20260130"
branch_labels: Union[str, Sequence[str], None] = None
depends_on: Union[str, Sequence[str], None] = None


def upgrade() -> None:
    # Composite index for the most common list_events query pattern:
    # WHERE calendar_id = X AND status != 'cancelled' AND start_datetime < Y
    op.create_index(
        "ix_event_cal_status_start",
        "calendar_events",
        ["calendar_id", "status", "start_datetime"],
        unique=False,
    )

    # Composite index for time-range queries (list_events with timeMin/timeMax, freebusy):
    # WHERE calendar_id = X AND start_datetime >= Y AND end_datetime <= Z
    op.create_index(
        "ix_event_cal_start_end",
        "calendar_events",
        ["calendar_id", "start_datetime", "end_datetime"],
        unique=False,
    )

    # Composite index for sync-token incremental queries:
    # WHERE calendar_id = X AND updated_at > Y
    op.create_index(
        "ix_event_cal_updated",
        "calendar_events",
        ["calendar_id", "updated_at"],
        unique=False,
    )


def downgrade() -> None:
    op.drop_index("ix_event_cal_updated", table_name="calendar_events")
    op.drop_index("ix_event_cal_start_end", table_name="calendar_events")
    op.drop_index("ix_event_cal_status_start", table_name="calendar_events")
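
To see how a composite index of this shape serves the sync-token query pattern, the sketch below builds a toy table in an in-memory SQLite database and inspects the query plan. SQLite is only a stand-in here: the project uses PostgreSQL, whose planner behaves differently, and the column types are assumptions rather than the real `calendar_events` schema.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Simplified, assumed schema; only the indexed column names come from the migration.
    CREATE TABLE calendar_events (
        id INTEGER PRIMARY KEY,
        calendar_id TEXT NOT NULL,
        status TEXT NOT NULL,
        start_datetime TEXT,
        end_datetime TEXT,
        updated_at TEXT
    );
    CREATE INDEX ix_event_cal_updated ON calendar_events (calendar_id, updated_at);
    """
)

# Equality on the leading column plus a range on the second column.
plan_rows = conn.execute(
    "EXPLAIN QUERY PLAN "
    "SELECT id FROM calendar_events WHERE calendar_id = ? AND updated_at > ?",
    ("cal_1", "2026-02-01T00:00:00Z"),
).fetchall()
plan_text = " ".join(row[-1] for row in plan_rows)
print(plan_text)
conn.close()
```

On PostgreSQL the equivalent check would be `EXPLAIN` against a seeded environment schema; whether each of the three indexes is actually chosen depends on table statistics and on the predicates involved.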
