Commit 76898ab

Merge pull request #137 from agent-diff-bench/fixes-kdd

Update README with arXiv paper, benchmark tables, and all 4 services

2 parents: c86a80d + b319803

1 file changed: README.md (89 additions, 19 deletions)
@@ -5,8 +5,11 @@
 Run it locally (or deploy it). Agents call sandboxed replicas of APIs that behave like the real ones, and you get deterministic diffs of every state change — no external services, no side effects, no rate limits.
 
 <p align="center">
+  <a href="https://arxiv.org/abs/2602.11224">Paper (arXiv)</a> •
   <a href="https://agentdiff.dev">Website</a> •
   <a href="https://agentdiff.mintlify.app/introduction">Docs</a> •
+  <a href="https://huggingface.co/datasets/hubertmarek/agent-diff-bench">Dataset</a> •
+  <a href="https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench">Prime Intellect</a> •
   <a href="mailto:hubert@uni.minerva.edu">Feedback</a>
 </p>
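The "deterministic diffs of every state change" contract can be illustrated with a toy snapshot diff. This is a sketch only, not the actual Agent-Diff engine, and the entity keys are hypothetical:

```python
# Toy illustration of state-diff evaluation (NOT the real Agent-Diff engine):
# snapshot the replica's state before and after the agent acts, then diff.
def state_diff(before: dict, after: dict) -> dict:
    return {
        "added":   {k: after[k] for k in after.keys() - before.keys()},
        "removed": {k: before[k] for k in before.keys() - after.keys()},
        "changed": {k: (before[k], after[k])
                    for k in before.keys() & after.keys()
                    if before[k] != after[k]},
    }

# Hypothetical Slack-like state: the agent posted a message and set a topic.
before = {"channel:general": {"topic": ""}}
after = {"channel:general": {"topic": "standup"}, "message:1": {"text": "hi"}}

diff = state_diff(before, after)
assert set(diff["added"]) == {"message:1"}
assert set(diff["changed"]) == {"channel:general"}
```

An evaluation then asserts on the diff rather than on the agent's transcript, which is what makes the check deterministic and repeatable.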

@@ -109,23 +112,15 @@ client.delete_env(envId=env.environmentId)
 
 ## Supported APIs
 
-- **Slack** – core Web API coverage for conversations, chat, reactions, users, etc. Full list here [`backend/src/services/slack/README.md`](backend/src/services/slack/README.md). A few examples:
+- **Box** – REST API for file/folder management, search, comments, tags, shared links, hubs, and content versioning. See [`backend/src/services/box/README.md`](backend/src/services/box/README.md). 27 endpoints.
 
-```python
-"chat.postMessage"    # post messages in seeded channels/DMs
-"conversations.open"  # spin up IM/MPIM threads
-"reactions.add"       # add emoji reactions to seeded messages
-```
+- **Google Calendar** – REST API for calendar CRUD, events, recurring series, free/busy queries, ACL rules, calendar list management, and push notifications. See [`backend/src/services/calendar/README.md`](backend/src/services/calendar/README.md). 37 endpoints.
 
-- **Linear** – GraphQL API. See [`backend/src/services/linear/README.md`](backend/src/services/linear/README.md).
+- **Linear** – GraphQL API for issue tracking, teams, workflow states, labels, comments, relations, and memberships. See [`backend/src/services/linear/README.md`](backend/src/services/linear/README.md). 19 endpoints.
 
-```python
-"issues"          # list/filter issues with pagination
-"teams"           # list teams
-"issueCreate"     # create new issue
-"issueUpdate"     # update issue (state, assignee, priority, etc.)
-"commentCreate"   # add comment to issue
-```
+- **Slack** – Web API for conversations, messaging, reactions, threading, users, and channels. See [`backend/src/services/slack/README.md`](backend/src/services/slack/README.md). 25 endpoints.
+
+> **108 unique endpoints** across all 4 services.
 
 ## Templates, Seeds & Environments
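Since the replicas mirror the originals' wire formats, agent code targets them exactly as it would the real services. A rough sketch of the two styles involved (payloads and IDs hypothetical; the method and mutation names follow the real Slack Web API and Linear GraphQL schema):

```python
from urllib.parse import urlencode

# Slack-style Web API call: form-encoded POST body (hypothetical IDs),
# as sent to a method such as chat.postMessage.
slack_body = urlencode({"channel": "C123", "text": "standup at 10"})

# Linear-style GraphQL mutation (hypothetical team ID).
linear_mutation = """
mutation {
  issueCreate(input: {teamId: "TEAM1", title: "Fix login bug"}) {
    issue { id title }
  }
}
"""

print(slack_body)  # channel=C123&text=standup+at+10
```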

@@ -149,18 +144,93 @@ client.delete_env(envId=env.environmentId)
 The SDK provides **code execution proxies**: tools for AI agents. Add them to your toolbox in the Vercel AI SDK, LangChain, or OpenAI Agents, and the LLM writes Python or Bash code to talk to the Slack or Linear API. Requests are automatically intercepted and routed to isolated test environments, so agents interact with service replicas without any code changes. See more in: **[Python SDK](sdk/agent-diff-python/README.md)**
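The interception idea behind those proxies can be sketched in a few lines: rewrite the real API host to the sandboxed replica before a request leaves the agent. The hosts and path layout below are hypothetical, not the SDK's actual routing:

```python
from urllib.parse import urlsplit, urlunsplit

# Hypothetical mapping from real API hosts to a local Agent-Diff replica.
# The actual SDK proxies are more involved; this only shows the routing idea.
REPLICA_HOSTS = {
    "slack.com": "localhost:8000/slack",
    "api.linear.app": "localhost:8000/linear",
}

def route_to_replica(url: str) -> str:
    """Rewrite a real-API URL so it targets the sandboxed replica instead."""
    parts = urlsplit(url)
    target = REPLICA_HOSTS.get(parts.netloc)
    if target is None:
        return url  # not a proxied service; leave the request untouched
    host, _, prefix = target.partition("/")
    path = ("/" + prefix if prefix else "") + parts.path
    return urlunsplit(("http", host, path, parts.query, parts.fragment))

print(route_to_replica("https://slack.com/api/chat.postMessage"))
# http://localhost:8000/slack/api/chat.postMessage
```

Because the rewrite happens at the transport layer, agent code written against the real endpoints needs no changes.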

-## Benchmark & Training
+## Paper
+
+> **Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation**
+> Hubert M. Pysklo, Artem Zhuravel, Patrick D. Watson
+> *Pre-print. Under review for KDD 2026.*
+> [arXiv:2602.11224](https://arxiv.org/abs/2602.11224)
+
+If you use Agent-Diff in your research, please cite:

-- **HuggingFace Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) — 224 tasks across all 4 services (80/20 train/test split, stratified by service)
-- **Prime Intellect Environment**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) — run evaluations or RL training via Hosted Training
-- **Paper**: [AgentDiff: Agentic API Evaluation via State Differencing (KDD 2026 pre-print)](https://drive.google.com/file/d/1BlmJTSMX7ohwvD1aYBByg7_Y815fgsxp/view?usp=sharing)
+```bibtex
+@article{pysklo2025agentdiff,
+  title={Agent-Diff: Benchmarking LLM Agents on Enterprise API Tasks via Code Execution with State-Diff-Based Evaluation},
+  author={Pysklo, Hubert M. and Zhuravel, Artem and Watson, Patrick D.},
+  journal={arXiv preprint arXiv:2602.11224},
+  year={2025}
+}
+```
+## Run Evaluations
+
+The fastest way to run Agent-Diff evaluations is via **[Prime Intellect](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench)** — run evals or RL training with no setup required.
+
+Alternatively, run locally or self-hosted using the SDK (see [To run evaluations](#to-run-evaluations) below).
+
+**Resources:**
+- **Dataset**: [hubertmarek/agent-diff-bench](https://huggingface.co/datasets/hubertmarek/agent-diff-bench) — 224 tasks across all 4 services (80/20 train/test split)
+- **Prime Intellect**: [agent-diff-bench on Prime Lab](https://app.primeintellect.ai/dashboard/environments/hubert-marek/agent-diff-bench) — hosted evaluations & RL training
+## Benchmark
+
+The Agent-Diff benchmark comprises **224 tasks** across four enterprise services, each evaluated via deterministic state-diff contracts. Tasks range from single-step CRUD operations to long-horizon, multi-entity workflows requiring search, conditional logic, and coordinated state changes.
+
+### Task Distribution
+
+| Metric | Box | Calendar | Linear | Slack | **Total** |
+|---|---|---|---|---|---|
+| Tasks | 48 | 60 | 57 | 59 | **224** |
+| Task horizon _n*_ (range) | 1–13 | 1–24 | 1–13 | 1–14 | 1–24 |
+| Task horizon _n*_ (mean) | 4.6 | 5.9 | 5.2 | 5.6 | 5.3 |
+| | | | | | |
+| **Operation profile** _(% of tasks, non-exclusive)_ | | | | | |
+| Search | 92 | 77 | 89 | 64 | 80 |
+| Create | 58 | 78 | 63 | 88 | 73 |
+| Read | 54 | 82 | 14 | 68 | 55 |
+| Update | 62 | 93 | 70 | 37 | 66 |
+| Delete | 19 | 53 | 7 | 24 | 26 |
+| | | | | | |
+| **Entity scope** | | | | | |
+| Single-entity | 28 | 11 | 33 | 33 | 105 |
+| Multi-entity | 20 | 49 | 24 | 26 | 119 |
+| | | | | | |
+| **Information availability** | | | | | |
+| Explicit | 6 | 10 | 25 | 36 | 77 |
+| Implicit | 42 | 50 | 32 | 23 | 147 |
+| | | | | | |
+| **Prompt ambiguity** | | | | | |
+| Low | 24 | 13 | 37 | 27 | 101 |
+| Medium | 17 | 45 | 19 | 22 | 103 |
+| High | 7 | 2 | 1 | 10 | 20 |
+
+Tasks are characterized along five dimensions: _task horizon_ (minimum API calls under an optimal policy), _operation profile_ (which CRUD primitives are required), _entity scope_ (single- vs. multi-entity state changes), _information availability_ (whether identifiers are given explicitly or must be discovered), and _prompt ambiguity_ (how underspecified the target is).
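As a quick arithmetic check on the table: the per-service task counts and each mutually exclusive breakdown (entity scope, information availability, prompt ambiguity) should all sum to the 224-task total:

```python
# Cross-check the task-distribution table's totals.
tasks = {"Box": 48, "Calendar": 60, "Linear": 57, "Slack": 59}
assert sum(tasks.values()) == 224

# Each exclusive breakdown partitions the same 224 tasks (Total column).
entity_scope = {"single-entity": 105, "multi-entity": 119}
info_availability = {"explicit": 77, "implicit": 147}
prompt_ambiguity = {"low": 101, "medium": 103, "high": 20}
for breakdown in (entity_scope, info_availability, prompt_ambiguity):
    assert sum(breakdown.values()) == 224

print("task-distribution totals are consistent")
```

(The operation-profile rows are percentages of non-exclusive categories, so they are not expected to sum to 100.)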
+### Results (No-Docs Baseline)
+
+| Model | Box | Calendar | Linear | Slack | **Overall** | Pass % | Cost/test | Score/$ |
+|---|---|---|---|---|---|---|---|---|
+| deepseek-v3.2 | 76.6 | **87.5** | **94.8** | **86.1** | **88.1** | 76 | $0.03 | 2,938 |
+| devstral-2512 | 79.0 | 80.0 | 91.5 | 85.7 | **86.0** | 74 | $0.08 | 1,075 |
+| qwen3-vl-235b | 68.4 | 71.0 | 82.0 | 75.8 | **79.2** | 65 | $0.02 | 3,959 |
+| kimi-k2-0905 | 66.5 | 72.3 | 88.2 | 82.2 | **75.4** | 64 | $0.04 | 1,885 |
+| grok-4.1-fast | 58.5 | 75.7 | 66.0 | 77.1 | **74.9** | 52 | $0.01 | 7,489 |
+| gemini-3-flash | **80.3** | 62.2 | 84.0 | 77.5 | **73.8** | 67 | $0.05 | 1,477 |
+| gpt-oss-120b | 70.1 | 68.4 | 79.5 | 69.1 | **68.5** | 60 | $0.02 | 3,428 |
+| claude-haiku-4.5 | 45.1 | 57.8 | 35.6 | 57.3 | **49.3** | 50 | $0.22 | 224 |
+| llama-4-scout | 33.7 | 41.4 | 20.9 | 42.9 | **38.0** | 29 | $0.02 | 1,900 |
+
+Per-service scores are assertion-weighted (95% Bayesian CrI). In the no-docs baseline, agents receive no API documentation and must discover endpoints through exploration; 3 trials per task. Full methodology and the documentation-ablation results are in the [paper](https://arxiv.org/abs/2602.11224).
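Score/$ appears to be the Overall score divided by Cost/test (an inference from the columns, not stated in the table). Recomputing a few rows from the rounded display values, with a small tolerance for rounding residue:

```python
# Recompute Score/$ from the Overall and Cost/test columns (selected rows):
# model -> (Overall, Cost/test in $, listed Score/$)
rows = {
    "deepseek-v3.2": (88.1, 0.03, 2938),
    "devstral-2512": (86.0, 0.08, 1075),
    "grok-4.1-fast": (74.9, 0.01, 7489),
    "claude-haiku-4.5": (49.3, 0.22, 224),
}
for model, (overall, cost, listed) in rows.items():
    recomputed = overall / cost
    # within 1% of the listed value; the residue comes from rounding of
    # the displayed Overall and Cost/test columns
    assert abs(recomputed - listed) / listed < 0.01, model

print("Score/$ ~= Overall / Cost per test")
```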

 ## Evaluations & Test Suites
 
 Collections of test cases with assertions that you can run against agent runs using evaluations.
 
+- **[box_bench.json](examples/box/testsuites/box_bench.json)** - test cases covering file/folder operations, search, tags, comments, hubs, and content versioning
+- **[calendar_bench.json](examples/calendar/testsuites/calendar_bench.json)** - test cases covering event CRUD, recurring events, free/busy queries, ACL management, and calendar lifecycle
+- **[linear_bench.json](examples/linear/testsuites/linear_bench.json)** - test cases covering issue management, labels, comments, workflow states, and team operations
 - **[slack_bench.json](examples/slack/testsuites/slack_bench.json)** - test cases covering message sending, channel ops, reactions, threading
-- **[linear_bench.json](examples/linear/testsuites/linear_bench.json)** - test cases covering issue management, labels, comments, workflow states, and team operations. HF dataset: https://huggingface.co/datasets/hubertmarek/linear-bench-mini .
 
 <img width="2985" height="1966" alt="pass_rates_annotated" src="https://github.com/user-attachments/assets/f5c59c81-c3bd-427e-977c-a5c2c0695e86" />
 
 - **[Evaluation DSL](docs/evaluation-dsl.md)** - see the docs for how the assertion DSL works.
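To give a flavor of what such assertions look like in spirit (purely illustrative; the real syntax and schema live in the Evaluation DSL docs and will differ):

```python
# Illustrative-only assertion checker over a state diff; the real Evaluation
# DSL (docs/evaluation-dsl.md) has its own syntax and richer matchers.
def check(assertion: dict, run_state: dict) -> bool:
    node = run_state
    for key in assertion["path"].split("."):
        node = node[key]
    return node == assertion["expect"]

# Hypothetical run state produced by diffing replica snapshots.
run_state = {"diff": {"added": {"message:1": {"text": "standup at 10"}}}}

assert check({"path": "diff.added.message:1.text",
              "expect": "standup at 10"}, run_state)
```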
