README.md
Lines changed: 89 additions & 19 deletions
@@ -5,8 +5,11 @@
Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.
- **Slack** – core Web API coverage for conversations, chat, reactions, users, etc. Full list here [`backend/src/services/slack/README.md`](backend/src/services/slack/README.md). A few examples:
+
- **Box** – REST API for file/folder management, search, comments, tags, shared links, hubs, and content versioning. See [`backend/src/services/box/README.md`](backend/src/services/box/README.md). 27 endpoints.
-
```python
"chat.postMessage"    # post messages in seeded channels/DMs
"conversations.open"  # spin up IM/MPIM threads
"reactions.add"       # add emoji reactions to seeded messages
```
+
- **Google Calendar** – REST API for calendar CRUD, events, recurring series, free/busy queries, ACL rules, calendar list management, and push notifications. See [`backend/src/services/calendar/README.md`](backend/src/services/calendar/README.md). 37 endpoints.
-
- **Linear** – GraphQL API. See [`backend/src/services/linear/README.md`](backend/src/services/linear/README.md).
+
- **Linear** – GraphQL API for issue tracking, teams, workflow states, labels, comments, relations, and memberships. See [`backend/src/services/linear/README.md`](backend/src/services/linear/README.md). 19 endpoints.
- **Slack** – Web API for conversations, messaging, reactions, threading, users, and channels. See [`backend/src/services/slack/README.md`](backend/src/services/slack/README.md). 25 endpoints.
The SDK provides **code execution proxies**: tools that you add to an agent's toolbox in the Vercel AI SDK, LangChain, or OpenAI Agents so that the LLM writes Python or Bash code to talk to the Slack or Linear API. Requests are automatically intercepted and routed to isolated test environments, so agents interact with service replicas without any code changes. See more in the **[Python SDK](sdk/agent-diff-python/README.md)**.
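To make the interception idea concrete, here is a minimal sketch of how a request aimed at a real API host could be rewritten to point at an isolated replica. It is an illustration only, not the SDK's implementation: the `SERVICE_HOSTS` mapping, the `route_to_sandbox` helper, the `/env/<env_id>/<service>` path layout, and the `http://localhost:8000` base URL are all assumptions made for this example.

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from real API hosts to replica service names.
SERVICE_HOSTS = {
    "slack.com": "slack",
    "api.linear.app": "linear",
}


def route_to_sandbox(url: str, env_id: str,
                     sandbox_base: str = "http://localhost:8000") -> str:
    """Rewrite a real API URL to a (hypothetical) per-environment replica URL."""
    parts = urlsplit(url)
    service = SERVICE_HOSTS.get(parts.hostname or "")
    if service is None:
        return url  # not a replicated service: leave the request untouched
    base = urlsplit(sandbox_base)
    # Assumed path layout: /env/<env_id>/<service>/<original path>
    new_path = f"/env/{env_id}/{service}{parts.path}"
    return urlunsplit((base.scheme, base.netloc, new_path,
                       parts.query, parts.fragment))


print(route_to_sandbox("https://slack.com/api/chat.postMessage", "env-123"))
```

Because the rewrite happens at the URL level, the agent's generated code can keep calling the real hostnames. Only the transport layer needs to know about the sandbox.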
-
## Benchmark & Training
+
## Paper
+
> **Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation**
+
> Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
If you use Agent-Diff in your research, please cite:
-
- **HuggingFace Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) – 224 tasks across all 4 services (80/20 train/test split, stratified by service)
-
- **Prime Intellect Environment**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) – run evaluations or RL training via Hosted Training
-
- **Paper**: [AgentDiff: Agentic API Evaluation via State Differencing (KDD 2026 pre-print)](https://drive.google.com/file/d/1BlmJTSMX7ohwvD1aYBByg7_Y815fgsxp/view?usp=sharing)
+
```bibtex
@article{pysklo2025agentdiff,
  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
  journal={arXiv preprint arXiv:2602.11224},
  year={2025}
}
```
+
## Run Evaluations
+
The fastest way to run Agent-Diff evaluations is via **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — run evals or RL training with no setup required.
+
Alternatively, run locally or self-hosted using the SDK (see [To run evaluations](#to-run-evaluations) below).
+
**Resources:**
+
- **Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) – 224 tasks across all 4 services (80/20 train/test split)
+
- **Prime Intellect**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) – hosted evaluations & RL training
+
## Benchmark
+
The Agent-Diff benchmark comprises **224 tasks** across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks range from single-step CRUD operations to long-horizon, multi-entity workflows that require search, conditional logic, and coordinated state changes.
| **Operation profile** _(% of tasks, non-exclusive)_ ||||||
| Search | 92 | 77 | 89 | 64 | 80 |
| Create | 58 | 78 | 63 | 88 | 73 |
| Read | 54 | 82 | 14 | 68 | 55 |
| Update | 62 | 93 | 70 | 37 | 66 |
| Delete | 19 | 53 | 7 | 24 | 26 |
|||||||
| **Entity scope** ||||||
| Single-entity | 28 | 11 | 33 | 33 | 105 |
| Multi-entity | 20 | 49 | 24 | 26 | 119 |
|||||||
| **Information availability** ||||||
| Explicit | 6 | 10 | 25 | 36 | 77 |
| Implicit | 42 | 50 | 32 | 23 | 147 |
|||||||
| **Prompt ambiguity** ||||||
| Low | 24 | 13 | 37 | 27 | 101 |
| Medium | 17 | 45 | 19 | 22 | 103 |
| High | 7 | 2 | 1 | 10 | 20 |
+
Tasks are characterized along five dimensions: _task horizon_ (minimum API calls under an optimal policy), _operation profile_ (which CRUD primitives are required), _entity scope_ (single vs. multi-entity state changes), _information availability_ (whether identifiers are given explicitly or must be discovered), and _prompt ambiguity_ (how underspecified the target is).
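The count-based rows of the table above (entity scope, information availability, prompt ambiguity) each partition the same 224 tasks, which can be checked with a few lines of arithmetic. The sketch below treats the columns positionally, since the header row is not part of this excerpt. It assumes the first four values in each row are per-service counts and the fifth is the row total.

```python
# Count rows copied from the table; columns are positional:
# four per-service counts followed by the row total (an assumption,
# since the header row is not shown in this excerpt).
counts = {
    "entity scope": {
        "Single-entity": [28, 11, 33, 33, 105],
        "Multi-entity": [20, 49, 24, 26, 119],
    },
    "information availability": {
        "Explicit": [6, 10, 25, 36, 77],
        "Implicit": [42, 50, 32, 23, 147],
    },
    "prompt ambiguity": {
        "Low": [24, 13, 37, 27, 101],
        "Medium": [17, 45, 19, 22, 103],
        "High": [7, 2, 1, 10, 20],
    },
}

column_sums_by_dimension = {}
for dimension, rows in counts.items():
    for label, values in rows.items():
        *per_service, total = values
        # Each row's total should equal the sum of its per-service counts.
        assert sum(per_service) == total, (dimension, label)
    # Summing each column over a dimension's categories gives tasks per service.
    column_sums_by_dimension[dimension] = [sum(col) for col in zip(*rows.values())]

for dimension, sums in column_sums_by_dimension.items():
    print(f"{dimension}: {sums}")
```

Every dimension yields the same column sums, with a final column of 224. This matches the stated benchmark size.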
+
### Results (No-Docs Baseline)
+
| Model | Box | Calendar | Linear | Slack |**Overall**| Pass % | Cost/test | Score/$ |
Scores are assertion-weighted and reported per service (95% Bayesian CrI). In the no-docs baseline, agents receive no API documentation and must discover endpoints through exploration. Each task is run for 3 trials. Full methodology and documentation-ablation results are in the [paper](https://arxiv.org/abs/2602.11224).
## Evaluations & Test Suites
Collections of test cases with assertions that you can run against agent runs using evaluations.
+
- **[box_bench.json](examples/box/testsuites/box_bench.json)** - test cases covering file/folder operations, search, tags, comments, hubs, and content versioning
+
- **[calendar_bench.json](examples/calendar/testsuites/calendar_bench.json)** - test cases covering event CRUD, recurring events, free/busy queries, ACL management, and calendar lifecycle
+
- **[linear_bench.json](examples/linear/testsuites/linear_bench.json)** - test cases covering issue management, labels, comments, workflow states, and team operations
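The suites above pair tasks with assertions on state changes. As a rough illustration of what a state diff and an assertion over it could look like, the sketch below compares two snapshots of a table and reports inserts, updates, and deletes. The snapshot shape (row id mapped to field values), the `diff_state` helper, and the sample assertion are hypothetical. The actual test-suite schema and diff format live in the linked JSON files and are not reproduced here.

```python
from typing import Any, Dict

Row = Dict[str, Any]
Snapshot = Dict[str, Row]  # hypothetical shape: row id -> field values


def diff_state(before: Snapshot, after: Snapshot) -> Dict[str, Dict[str, Any]]:
    """Return inserted, updated, and deleted rows between two snapshots."""
    inserts = {k: after[k] for k in sorted(after.keys() - before.keys())}
    deletes = {k: before[k] for k in sorted(before.keys() - after.keys())}
    updates: Dict[str, Dict[str, Any]] = {}
    for k in sorted(before.keys() & after.keys()):
        fields = sorted(before[k].keys() | after[k].keys())
        # Record (old, new) pairs only for fields whose value changed.
        changed = {f: (before[k].get(f), after[k].get(f))
                   for f in fields if before[k].get(f) != after[k].get(f)}
        if changed:
            updates[k] = changed
    return {"inserts": inserts, "updates": updates, "deletes": deletes}


before = {
    "msg-1": {"channel": "general", "text": "hello"},
    "msg-2": {"channel": "random", "text": "old"},
}
after = {
    "msg-1": {"channel": "general", "text": "hello, world"},
    "msg-3": {"channel": "general", "text": "new message"},
}
diff = diff_state(before, after)

# Hypothetical assertion: exactly one message was inserted, in "general".
inserted = list(diff["inserts"].values())
assert len(inserted) == 1 and inserted[0]["channel"] == "general"
print(diff)
```

Sorting the keys keeps the output order stable across runs, which is one simple way to make such a diff deterministic.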