Commit 2095c78

Initial commit
0 parents  commit 2095c78

30 files changed

Lines changed: 4265 additions & 0 deletions

.gitattributes

Lines changed: 2 additions & 0 deletions
@@ -0,0 +1,2 @@
# Auto detect text files and perform LF normalization
* text=auto

.github/workflows/test-and-publish.yml

Lines changed: 88 additions & 0 deletions
@@ -0,0 +1,88 @@
name: Test and Publish Agent

on:
  pull_request:
  push:
    branches:
      - main
    tags:
      - 'v*'  # Trigger on version tags like v1.0.0, v1.1.0

jobs:
  test-and-publish:
    runs-on: ubuntu-latest

    # These permissions are required for the workflow to:
    # - Read repository contents (checkout code)
    # - Write to GitHub Container Registry (push Docker images)
    permissions:
      contents: read
      packages: write

    steps:
      - name: Checkout repository
        uses: actions/checkout@v4

      - name: Extract metadata for Docker
        id: meta
        uses: docker/metadata-action@v5
        with:
          images: ghcr.io/${{ github.repository }}
          tags: |
            type=ref,event=pr
            type=semver,pattern={{version}}
            type=semver,pattern={{major}}
            type=raw,value=latest,enable={{is_default_branch}}

      - name: Build Docker image
        uses: docker/build-push-action@v5
        with:
          context: .
          push: false
          tags: ${{ steps.meta.outputs.tags }}
          labels: ${{ steps.meta.outputs.labels }}
          load: true
          platforms: linux/amd64

      - name: Start agent container
        env:
          SECRETS_JSON: ${{ toJson(secrets) }}
        run: |
          echo "$SECRETS_JSON" | jq -r 'to_entries[] | "\(.key)=\(.value)"' > .env
          docker run -d -p 9009:9009 --name agent-container --env-file .env $(echo "${{ steps.meta.outputs.tags }}" | head -n1) --host 0.0.0.0 --port 9009
          timeout 30 bash -c 'until curl -sf http://localhost:9009/.well-known/agent-card.json > /dev/null; do sleep 1; done'

      - name: Set up uv
        uses: astral-sh/setup-uv@v4

      - name: Install test dependencies
        run: uv sync --extra test

      - name: Run tests
        run: uv run pytest -v --agent-url http://localhost:9009

      - name: Stop container and show logs
        if: always()
        run: |
          echo "=== Agent Container Logs ==="
          docker logs agent-container || true
          docker stop agent-container || true

      - name: Log in to GitHub Container Registry
        if: success() && github.event_name != 'pull_request'
        uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}

      - name: Push Docker image
        if: success() && github.event_name != 'pull_request'
        run: docker push --all-tags ghcr.io/${GITHUB_REPOSITORY,,}

      - name: Output image digest
        if: success() && github.event_name != 'pull_request'
        run: |
          echo "## Docker Image Published :rocket:" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "**Tags:** ${{ steps.meta.outputs.tags }}" >> $GITHUB_STEP_SUMMARY
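
The "Start agent container" step above waits for the agent by polling its agent card for up to 30 seconds. A rough Python equivalent of that readiness loop is sketched below, for use in local scripts or test fixtures. The function names (`http_probe`, `wait_until_ready`) and the injectable `probe`/`sleep` parameters are illustrative assumptions and are not part of this repository; only the URL path and the 30-second/1-second timing come from the workflow.

```python
import time
import urllib.error
import urllib.request
from typing import Callable

# Path polled by the workflow's curl loop.
AGENT_CARD_PATH = "/.well-known/agent-card.json"


def http_probe(url: str) -> bool:
    """Return True if a GET on `url` answers with a 2xx status (roughly `curl -sf`)."""
    try:
        with urllib.request.urlopen(url, timeout=5) as resp:
            return 200 <= resp.status < 300
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status: not ready yet.
        return False


def wait_until_ready(
    base_url: str,
    timeout_s: float = 30.0,
    interval_s: float = 1.0,
    probe: Callable[[str], bool] = http_probe,   # hypothetical hook, eases testing
    sleep: Callable[[float], None] = time.sleep,
) -> bool:
    """Poll the agent card until it responds or `timeout_s` elapses."""
    url = base_url.rstrip("/") + AGENT_CARD_PATH
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)


# Example (requires a running agent):
#   wait_until_ready("http://localhost:9009")
```

Injecting `probe` and `sleep` keeps the loop testable without a network or real delays.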

.gitignore

Lines changed: 26 additions & 0 deletions
@@ -0,0 +1,26 @@
.DS_Store
.env
.python-version

# Python-generated files
__pycache__/
*.py[oc]
build/
dist/
wheels/
*.egg-info

# Virtual environments
.venv

# Test / tooling caches & reports
.pytest_cache/
.ruff_cache/
.mypy_cache/
.tox/
.nox/

# Coverage outputs
.coverage
coverage.xml
htmlcov/

Dockerfile

Lines changed: 17 additions & 0 deletions
@@ -0,0 +1,17 @@
FROM ghcr.io/astral-sh/uv:python3.13-bookworm

RUN adduser agent
USER agent
WORKDIR /home/agent

COPY pyproject.toml uv.lock README.md ./
COPY src src
COPY assets assets

RUN \
    --mount=type=cache,target=/home/agent/.cache/uv,uid=1000 \
    uv sync --locked

ENTRYPOINT ["uv", "run", "src/server.py"]
CMD ["--host", "0.0.0.0"]
EXPOSE 9009

README.md

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# Protocol Agent Benchmark

## Intro

Many cryptographic primitives could reshape human-to-human communication in business and personal life, but they rarely get used because cryptography is complex and the math is hard to “do in your head.”

Agents could take on that complexity. They can map mundane intents to primitives and instantiate protocols at runtime. Scheduling a meeting can use Private Set Intersection (PSI) instead of sharing calendars. “Prove you’re over 21” at the bar can use a Zero-Knowledge Proof (ZKP) (with nonce/challenge anti-replay) instead of photocopying an ID. Anonymous reporting with verifiable membership can use anonymous credentials, ring signatures, or group signatures to keep spam out. Tip tokens can use blind signatures so the issuer can’t link purchase to spend while still preventing double spending. Many more everyday interactions fit the same pattern.

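
To make the meeting-scheduling example concrete, here is a toy sketch of a Diffie-Hellman-style PSI between two parties. It is not taken from this repository or its benchmark: the function names, the slot strings, and the modulus are illustrative assumptions, and the parameters are far too weak for real use. It only shows the core idea that blinding with two secret exponents commutes, so shared items collide while other items stay hidden.

```python
import hashlib
import secrets

# Toy modulus (the Mersenne prime 2**127 - 1). An assumption for illustration
# only; a real deployment would use a vetted elliptic-curve group.
P = 2**127 - 1


def hash_to_group(item: str) -> int:
    """Hash an item to a group element (squared so it lands in the quadratic residues)."""
    digest = hashlib.sha256(item.encode("utf-8")).digest()
    return pow(int.from_bytes(digest, "big") % P, 2, P)


def blind(items: list[str], key: int) -> list[int]:
    """First pass: H(item)^key mod P for each item, order preserved."""
    return [pow(hash_to_group(item), key, P) for item in items]


def reblind(values: list[int], key: int) -> list[int]:
    """Second pass: raise already-blinded values to the other party's key."""
    return [pow(value, key, P) for value in values]


def private_set_intersection(alice_slots: list[str], bob_slots: list[str]) -> list[str]:
    """Simulate both parties in one process; returns what Alice would learn."""
    a = secrets.randbelow(P - 3) + 2  # Alice's secret exponent
    b = secrets.randbelow(P - 3) + 2  # Bob's secret exponent

    alice_blinded = blind(alice_slots, a)        # Alice -> Bob
    bob_blinded = blind(bob_slots, b)            # Bob -> Alice
    alice_double = reblind(alice_blinded, b)     # Bob -> Alice, same order as sent
    bob_double = set(reblind(bob_blinded, a))    # computed locally by Alice

    # H(x)^(a*b) == H(x)^(b*a), so only shared slots match.
    return [slot for slot, value in zip(alice_slots, alice_double) if value in bob_double]


if __name__ == "__main__":
    print(private_set_intersection(["mon-10", "tue-14", "wed-09"],
                                   ["tue-14", "fri-16", "wed-09"]))
```

Neither side sends its raw calendar; each only sees the other's values raised to a secret exponent. A production protocol would also need to handle malicious parties and small input domains, which this sketch ignores.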
Achieving this is a multidimensional challenge. Agents must: (1) "spot" and select the right primitive in an everyday context, (2) negotiate adoption with another agent, (3) implement the protocol correctly, (4) use crypto tools and computation competently, and (5) reason about threats and security strength. These are exactly the five judging dimensions of Protocol Agent, a benchmark that measures not “crypto knowledge” in the abstract (which has already been studied), but the practical ability to apply cryptography to improve daily life.

This benchmark is the first step in a larger effort (more coming in Q1 2026): post-training models that perform better on it.

## Challenges

- Human-readable: [Read here](assets/benchmark_challenges_diverse_v1.md)

## Leaderboard

[See here](https://github.com/MarcoMetaMask/protocol-agent-leaderboard)

## About this repo

An [A2A (Agent-to-Agent)](https://a2a-protocol.org/latest/) **green agent** compatible with the [AgentBeats](https://agentbeats.dev) platform.

Protocol Agent benchmarks a **single purple agent** via **self-play** on the crypto conversational challenges from `benchmark_challenges_diverse_v1.json`, scoring with the same rubric dimensions as the arena:

- Primitive Selection
- Negotiation Skills
- Implementation Correctness
- Computation / Tool Usage
- Security Strength

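
One way to picture how per-match judgments on these five dimensions could be rolled up is sketched below. This is a hypothetical illustration, not the logic in `src/scoring.py`: the dimension keys, the use of a plain mean, and the `overall` field are all assumptions, and the real score scale is not shown in this commit view.

```python
from statistics import mean

# Hypothetical snake_case keys for the five rubric dimensions listed above.
DIMENSIONS = (
    "primitive_selection",
    "negotiation_skills",
    "implementation_correctness",
    "computation_tool_usage",
    "security_strength",
)


def aggregate(matches: list[dict[str, float]]) -> dict[str, float]:
    """Average each dimension across matches, then average the dimensions."""
    if not matches:
        raise ValueError("need at least one match to aggregate")
    for index, match in enumerate(matches):
        missing = set(DIMENSIONS) - set(match)
        if missing:
            raise ValueError(f"match {index} is missing scores for: {sorted(missing)}")

    summary = {dim: mean(match[dim] for match in matches) for dim in DIMENSIONS}
    # Unweighted overall score; a real rubric might weight dimensions differently.
    summary["overall"] = mean(summary[dim] for dim in DIMENSIONS)
    return summary
```

For example, two matches scoring 4.0 and 2.0 on every dimension would average to 3.0 per dimension and 3.0 overall.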
This repo is **standalone** for local demo runs: it includes a local baseline purple agent (`baseline_purple/`) and a one-command runner that streams the multi-role conversation as it runs.

## Project Structure

```
src/
├─ server.py              # Server setup and agent card configuration
├─ executor.py            # A2A request handling
├─ agent.py               # Protocol Agent implementation (entrypoint)
├─ benchmark_schema.py    # Benchmark JSON loader + datamodel
├─ runner.py              # Self-play match runner
├─ judge_openai.py        # OpenAI judge wrapper
├─ scoring.py             # Outcome + aggregation (arena-aligned)
└─ messenger.py           # A2A messaging utilities
baseline_purple/
├─ src/                   # Local baseline purple agent (A2A server)
└─ requirements.txt
scripts/
├─ run_local.sh           # One-command local end-to-end runner
└─ run_client.py          # Local streaming client (prints turns + result artifact)
tests/
└─ test_agent.py          # Agent tests
Dockerfile                # Docker configuration
pyproject.toml            # Python dependencies
.github/
└─ workflows/
   └─ test-and-publish.yml   # CI workflow
```

## Quickstart (end-to-end, no manual intervention)

1) Set env vars:

```bash
export OPENAI_API_KEY="...your key..."
export OPENAI_MODEL_JUDGE="gpt-4.1-mini"
export OPENAI_MODEL_PARTICIPANT="gpt-4.1-mini"
```

2) Run:

```bash
./scripts/run_local.sh
```

You should see streamed lines like:

- `turn 1 | Alice: ...`
- `turn 2 | Bob: ...`

and then a final `Result` artifact (JSON + summary).

## Running Locally

```bash
python3 src/server.py --host 127.0.0.1 --port 9009
```

## Example EvalRequest

```json
{
  "participants": { "agent": "http://localhost:9019" },
  "config": {
    "benchmark_path": "assets/benchmark_challenges_diverse_v1.json",
    "limit_challenges": 1,
    "max_turns": 4,
    "repetitions": 1,
    "seed": 0,
    "include_transcripts": false,
    "timeout_s_per_turn": 300
  }
}
```

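
If you prefer to build the request programmatically, the sketch below produces the same JSON shape from Python. The field names and default values are copied from the example above; the `EvalConfig` class, the `build_eval_request` helper, and the "must be at least 1" checks are illustrative assumptions, not the repo's actual schema or validation.

```python
import json
from dataclasses import asdict, dataclass
from typing import Optional


@dataclass
class EvalConfig:
    """Mirrors the `config` object from the example EvalRequest."""
    benchmark_path: str = "assets/benchmark_challenges_diverse_v1.json"
    limit_challenges: int = 1
    max_turns: int = 4
    repetitions: int = 1
    seed: int = 0
    include_transcripts: bool = False
    timeout_s_per_turn: int = 300


def build_eval_request(agent_url: str, config: Optional[EvalConfig] = None) -> str:
    """Return the EvalRequest as a JSON string for the given purple agent URL."""
    config = config or EvalConfig()
    # Assumed sanity checks; the real agent may accept other ranges.
    for name in ("limit_challenges", "max_turns", "repetitions", "timeout_s_per_turn"):
        if getattr(config, name) < 1:
            raise ValueError(f"{name} must be at least 1")
    payload = {"participants": {"agent": agent_url}, "config": asdict(config)}
    return json.dumps(payload, indent=2)


if __name__ == "__main__":
    print(build_eval_request("http://localhost:9019", EvalConfig(max_turns=6)))
```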
## Environment Variables

- `OPENAI_API_KEY`: required for judging.
- `OPENAI_MODEL_JUDGE`: e.g. `gpt-4.1-mini`.
- `OPENAI_BASE_URL` (optional): defaults to `https://api.openai.com/v1/responses`.

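
A minimal sketch of reading these variables in one place is shown below. How the judge wrapper actually loads its settings is not visible here, so `JudgeSettings` and `load_judge_settings` are hypothetical names, and treating a missing `OPENAI_MODEL_JUDGE` as an error is an assumption; only the variable names and the default base URL come from the list above.

```python
import os
from dataclasses import dataclass
from typing import Mapping

# Default stated in the README's environment variable list.
DEFAULT_BASE_URL = "https://api.openai.com/v1/responses"


@dataclass(frozen=True)
class JudgeSettings:
    api_key: str
    model: str
    base_url: str


def load_judge_settings(env: Mapping[str, str] = os.environ) -> JudgeSettings:
    """Read judge configuration from the environment (or any mapping, for tests)."""
    api_key = env.get("OPENAI_API_KEY", "").strip()
    if not api_key:
        raise RuntimeError("OPENAI_API_KEY is required for judging")

    model = env.get("OPENAI_MODEL_JUDGE", "").strip()
    if not model:
        # Assumption: fail loudly rather than silently picking a model.
        raise RuntimeError("OPENAI_MODEL_JUDGE is not set (e.g. gpt-4.1-mini)")

    base_url = env.get("OPENAI_BASE_URL", DEFAULT_BASE_URL)
    return JudgeSettings(api_key=api_key, model=model, base_url=base_url)
```

Failing early on a missing key surfaces configuration problems before any challenge runs.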
## Running with Docker

Build:

```bash
docker build --platform linux/amd64 -t protocol-agent:local .
```

Run:

```bash
docker run -p 9009:9009 protocol-agent:local
```

## Publishing

The repository includes a GitHub Actions workflow that automatically builds, tests, and publishes a Docker image of your agent to GitHub Container Registry.

If your agent needs API keys or other secrets, add them in Settings → Secrets and variables → Actions → Repository secrets. They'll be available as environment variables during CI tests.

- **Push to `main`** → publishes `latest` tag:
  ```
  ghcr.io/<your-username>/<your-repo-name>:latest
  ```

- **Create a git tag** (e.g. `git tag v1.0.0 && git push origin v1.0.0`) → publishes version tags:
  ```
  ghcr.io/<your-username>/<your-repo-name>:1.0.0
  ghcr.io/<your-username>/<your-repo-name>:1
  ```

Once the workflow completes, find your Docker image in the Packages section (right sidebar of your repository). Configure the package visibility in package settings.

> **Note:** Organization repositories may need package write permissions enabled manually (Settings → Actions → General). Version tags must follow [semantic versioning](https://semver.org/) (e.g., `v1.0.0`).
