docs(benchmark): add AWS gateway benchmark and refresh benchmarks page #429
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,9 @@ | ||
| # Benchmark outputs and local Terraform state / secrets — never commit. | ||
| output/ | ||
| remote/results/ | ||
| terraform/.terraform/ | ||
| terraform/.terraform.lock.hcl | ||
| terraform/*.tfstate | ||
| terraform/*.tfstate.* | ||
| terraform/bench_key.pem | ||
| *.tar.gz | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,122 @@ | ||
| # AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost | ||
|
|
||
| A reproducible, one-command benchmark that provisions a free-tier AWS instance, | ||
|
The README describes this benchmark as provisioning "a free-tier AWS instance," but the default |
||
| runs four AI gateways through identical workloads against a deterministic mock | ||
| backend, measures latency and resource cost, and tears the infrastructure down. | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
|
|
||
| Because every gateway talks to the **same local mock backend**, the numbers | ||
| reflect *gateway overhead*, not upstream model latency or network jitter. | ||
|
|
||
| ## What it compares | ||
|
|
||
| Four OpenAI-compatible gateways, each pointed at the mock: | ||
|
|
||
| | Gateway | Image | How it reaches the mock | | ||
| |----------|--------------------------------------|-------------------------| | ||
| | GoModel | built from this repo (`Dockerfile`) | `OPENAI_BASE_URL` env | | ||
| | LiteLLM | `ghcr.io/berriai/litellm:main-stable`| `configs/litellm-config.yaml` | | ||
| | Portkey | `portkeyai/gateway:latest` | `x-portkey-custom-host` header (+ `TRUSTED_CUSTOM_HOSTS=mock`) | | ||
| | Bifrost | `maximhq/bifrost:latest` | `configs/bifrost-config.json` (`network_config.base_url` + `allow_private_network`) | | ||
|
|
||
| Per-gateway quirks the harness handles automatically (see `gw_model`/`gw_path` in | ||
| `run-on-instance.sh`): Bifrost needs an explicit `openai/`-prefixed model, serves | ||
| the Anthropic dialect at `/anthropic/v1/messages` (not `/v1/messages`), and must | ||
| allow private-network egress to reach the mock. | ||
|
|
||
| ### Workloads — 6 variants | ||
|
|
||
| The common denominator across OpenAI-compatible gateways, in both modes: | ||
|
|
||
| | Dialect | Endpoint | non-stream | stream | | ||
| |-----------|------------------------|:----------:|:------:| | ||
| | Chat | `/v1/chat/completions` | ✓ | ✓ | | ||
| | Responses | `/v1/responses` | ✓ | ✓ | | ||
| | Messages | `/v1/messages` (Anthropic) | ✓ | ✓ | | ||
|
|
||
| A **baseline** (load sent straight to the mock, no gateway) runs first as the | ||
| latency floor. Variants a gateway does not implement are reported as failures | ||
| rather than silently skipped — e.g. Portkey's OSS gateway does not serve the | ||
| Anthropic Messages dialect here, so its messages variants fail; that asymmetry is | ||
| the finding. Streaming uses terminal-marker **or idle-gap** end-of-stream | ||
| detection (`loadgen -idle`), so a gateway that streams content without sending a | ||
| terminal event (Bifrost) is still measured to last byte rather than hanging. | ||
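The terminal-marker-or-idle-gap rule described above can be sketched offline as follows. This is a minimal illustration in Python, not the `loadgen` implementation (a Go tool in this PR whose source is not shown here). The function name `end_of_stream`, the `[DONE]` marker, and the 0.5 s threshold are assumptions chosen for the example.

```python
from typing import Iterable, Optional, Tuple


def end_of_stream(
    chunks: Iterable[Tuple[float, str]],
    idle_gap: float = 0.5,      # assumed threshold, in seconds
    marker: str = "[DONE]",     # assumed terminal marker
) -> Tuple[Optional[float], str]:
    """Return (last_byte_time, reason) for (arrival_time, data) chunks.

    The stream ends at the first chunk containing `marker`. Without a
    marker, it ends at the last chunk seen before a silence longer than
    `idle_gap` seconds (or at the final chunk if the input runs out).
    """
    last: Optional[float] = None
    for arrival, data in chunks:
        if last is not None and arrival - last > idle_gap:
            # Silence exceeded the idle gap: the previous chunk was the last byte.
            return last, "idle"
        last = arrival
        if marker in data:
            return last, "marker"
    if last is None:
        return None, "empty"
    return last, "idle"
```

A live client would apply the same rule with a read timeout instead of recorded timestamps. The point of the sketch is that a stream with no terminal event still yields a last-byte time.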
|
|
||
| ### Metrics captured | ||
|
|
||
| - **Latency** — total-latency p50/p90/p95/p99, plus **TTFT** (time to first | ||
| token) for streaming, and throughput (RPS). Driven by the `loadgen` tool. | ||
| - **Docker image size** — `docker image inspect` size + repo digest per gateway. | ||
| - **Memory** — idle RSS after warmup and peak RSS under load (`docker stats`). | ||
| - **CPU** — average CPU% under load (`docker stats`). | ||
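As a rough illustration of the latency metrics listed above, the sketch below computes p50/p90/p95/p99 and RPS from a list of per-request latencies. The nearest-rank percentile method and the names `percentile` and `summarize` are assumptions for this example; the PR does not show which method `loadgen` uses.

```python
import math
from typing import Dict, List


def percentile(sorted_ms: List[float], p: int) -> float:
    """Nearest-rank percentile of an ascending list, for p in 1..100."""
    if not sorted_ms:
        raise ValueError("no samples")
    rank = math.ceil(p * len(sorted_ms) / 100)  # 1-based rank
    return sorted_ms[max(rank, 1) - 1]


def summarize(latencies_ms: List[float], wall_seconds: float) -> Dict[str, float]:
    """Latency percentiles plus throughput for one workload variant."""
    if wall_seconds <= 0:
        raise ValueError("wall_seconds must be positive")
    ordered = sorted(latencies_ms)
    out = {f"p{p}": percentile(ordered, p) for p in (50, 90, 95, 99)}
    out["rps"] = len(ordered) / wall_seconds
    return out
```

For streaming variants the same function could be applied twice, once to time-to-first-token samples and once to total latencies.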
|
|
||
| ## Layout | ||
|
|
||
| ``` | ||
|
Contributor
📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win
Add a language to the fenced layout block (markdownlint MD040). Use `text` as the fence language.
Tools: markdownlint-cli2 (0.22.1) [warning] 54-54: Fenced code blocks should have a language specified (MD040, fenced-code-language). Source: Linters/SAST tools |
||
| terraform/ free-tier EC2 + SSH key + security group (apply/destroy) | ||
| remote/ everything shipped to and run on the instance | ||
| bench-tools/ Go mock backend + loadgen (one small image) | ||
| configs/ litellm + bifrost configs | ||
| docker-compose.yml mock + one gateway per profile (benchnet network) | ||
| run-on-instance.sh builds images, runs 6 variants x N gateways, samples stats | ||
| scripts/summarize.py raw JSON -> latency + resource tables + summary.json | ||
| run.sh orchestrator: build -> apply -> run -> collect -> destroy | ||
| ``` | ||
|
|
||
| ## Prerequisites | ||
|
|
||
| - AWS credentials configured (`aws sts get-caller-identity` works) | ||
| - Terraform ≥ 1.6, Docker (with `buildx`), `rsync`, `ssh`, Python 3 | ||
| - An AWS account with default VPC in the chosen region | ||
|
|
||
| ## Run it | ||
|
|
||
| ```bash | ||
| cd docs/2026-06-25_aws_gateway_benchmark | ||
| ./run.sh # full run in us-east-1, then auto-destroy | ||
| N=1000 C=20 ./run.sh # heavier load | ||
| REGION=eu-west-1 ./run.sh # different region | ||
| GATEWAYS="gomodel litellm" ./run.sh # subset | ||
| KEEP=1 ./run.sh # leave the instance up for debugging | ||
| ``` | ||
|
|
||
| `run.sh` tears the instance down via an EXIT trap, even on failure, unless | ||
| `KEEP=1` is set. If a run is interrupted or the instance was kept, destroy it manually: | ||
|
|
||
| ```bash | ||
| cd terraform && terraform destroy -auto-approve | ||
| ``` | ||
|
|
||
| Results land in `output/<timestamp>/` (raw per-variant JSON, `summary.json`, | ||
| and the printed `summary.txt` table). | ||
|
|
||
| ## Local dry-run (no AWS) | ||
|
|
||
| The instance-side harness runs on any Docker host: | ||
|
|
||
| ```bash | ||
| cd remote && N=30 C=5 GATEWAYS="gomodel litellm portkey bifrost" ./run-on-instance.sh | ||
| ``` | ||
|
|
||
| (Build the GoModel image first: `docker build -t gomodel-bench:local ../../..`) | ||
|
|
||
| ## Reproducibility & caveats | ||
|
|
||
| - **Pinned**: gateway image refs (overridable via `*_IMAGE` env), the Compose | ||
| plugin version, instance type, and the deterministic mock payload. Exact image | ||
| **digests** are recorded in each `*_image.json` so a run is fully traceable. | ||
| - **AMI** resolves to the latest Amazon Linux 2023 via SSM (reproducible by | ||
| policy). Pin `var.ami_id` for a byte-identical OS. | ||
| - **Free tier**: defaults to **t2.micro** — the 12-month-free-tier instance in | ||
| us-east-1 — with `standard` CPU credits (no surprise burst charges), a 20 GiB | ||
| gp3 root volume (free tier allows 30 GiB), the default VPC (no paid NAT/EIP), | ||
| and an Amazon Linux 2023 AMI. Image pulls are inbound traffic (free). In | ||
| regions where t2.micro is unavailable, set `INSTANCE_TYPE=t3.micro` (the | ||
| free-tier instance there). Newer accounts on AWS's credit-based free plan stay | ||
| within credit for a single short run. | ||
| - **t2.micro is burstable** (1 vCPU, CPU-credit throttled). Treat absolute | ||
| latency as *indicative*; the value is the *relative* comparison on identical | ||
| hardware. Gateways run **one at a time** so they never contend, and the load is | ||
| kept modest (N=500, c=10) to stay within launch credits. For production-grade | ||
| absolute numbers, set `INSTANCE_TYPE=c7i.large` (not free tier). | ||
| - **Cost**: a single free-tier instance for ~15–30 min — $0 within free-tier | ||
| allowance, otherwise a few cents. | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,135 @@ | ||
| # Results — 2026-06-25 (AWS c7i.large run) | ||
|
|
||
| Reference run produced by `./run.sh` (raw data in `output/20260625-182538/`, | ||
| machine summary in that dir's `summary.md` / `summary.json`). Four gateways: | ||
| **GoModel, LiteLLM, Portkey, Bifrost**. | ||
|
|
||
| - **Host**: AWS EC2 **c7i.large** (2 vCPU, 4 GiB, **non-burstable** — no CPU-credit | ||
| drift, so the tail is stable), Amazon Linux 2023, us-east-1. | ||
| - **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials** | ||
| (latency = median across trials; p99 shown with its min–max), 200-request | ||
| process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s | ||
|
Contributor
📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value
Inconsistent hyphenation: "warmup" vs "warm-up". Line 11 uses "warmup" (no hyphen) while line 124 uses "Warm-up" (hyphen). Pick one style and apply it consistently throughout. |
||
| resource window, capacity sweep at c∈{1,16,128}. Shared in-process **mock** | ||
| backend, so every number is **gateway overhead**, not model latency. | ||
| - **Parity**: retries disabled on every gateway, GoModel's circuit breaker disabled | ||
| (so the sweep can't trip it), and **LiteLLM run at its recommended worker count — | ||
| one worker per CPU core (`num_workers=2` on this 2-vCPU box)** so it isn't pinned | ||
| to a single core while the Go gateways use both. | ||
|
coderabbitai[bot] marked this conversation as resolved.
|
||
| - Images (digests in `*_image.json`): GoModel (built from this repo), latest | ||
| `litellm:main-stable`, `portkeyai/gateway:latest`, `maximhq/bifrost:latest`. | ||
|
|
||
| > Fast reference run (N=8000 × 2 trials) sized to finish end-to-end in well under | ||
| > 20 minutes; the p99 min–max spreads are tight, so the medians are stable. Raise | ||
| > `N`/`REPEATS` for a heavier run. | ||
|
|
||
| ## Latency — non-streaming (ms, median of trials) | ||
|
|
||
| | Workload | metric | baseline | GoModel | Bifrost | Portkey | LiteLLM | | ||
| |-----------|--------|---------:|--------:|--------:|--------:|--------:| | ||
| | chat | p50 | 0.23 | **1.81** | 2.51 | 9.70 | 30.56 | | ||
| | chat | p99 | 2.77 | **6.88** | 18.27 | 30.54 | 39.26 | | ||
| | responses | p50 | 0.26 | **2.01** | 2.73 | 9.07 | 39.12 | | ||
| | responses | p99 | 2.33 | **7.28** | 16.55 | 26.92 | 48.60 | | ||
| | messages | p50 | 0.26 | **1.76** | 2.65 | ✗ | 61.06 | | ||
| | messages | p99 | 2.23 | **6.59** | 19.08 | ✗ | 98.12 | | ||
|
|
||
| **GoModel has the lowest p50 and the tightest p99** (~7 ms vs Bifrost ~18 ms, | ||
| Portkey ~31 ms, LiteLLM ~39 ms). `overhead p50` (gateway p50 − baseline p50): | ||
| GoModel ≈ 1.6 ms, Bifrost ≈ 2.3 ms, Portkey ≈ 9.5 ms, LiteLLM ≈ 30 ms. | ||
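The `overhead p50` figures quoted above are plain subtraction against the baseline. A small sketch that reproduces them from the chat row of the table (the dictionary and function names are illustrative, not taken from `scripts/summarize.py`):

```python
from typing import Dict

# Chat non-streaming p50 values (ms), copied from the table above.
BASELINE_CHAT_P50 = 0.23
GATEWAY_CHAT_P50: Dict[str, float] = {
    "GoModel": 1.81,
    "Bifrost": 2.51,
    "Portkey": 9.70,
    "LiteLLM": 30.56,
}


def overhead_ms(gateway_p50: float, baseline_p50: float) -> float:
    """Gateway overhead: gateway p50 minus the direct-to-mock baseline p50."""
    return round(gateway_p50 - baseline_p50, 2)


OVERHEAD = {
    name: overhead_ms(p50, BASELINE_CHAT_P50)
    for name, p50 in GATEWAY_CHAT_P50.items()
}

if __name__ == "__main__":
    for name, ms in OVERHEAD.items():
        print(f"{name}: {ms:.2f} ms over baseline")
```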
|
|
||
| ## Latency — streaming (ms, median of trials) | ||
|
|
||
| | Workload | metric | GoModel | Bifrost | Portkey | LiteLLM | | ||
| |-----------|--------|--------:|--------:|--------:|--------:| | ||
| | chat | TTFT p50 | **4.71** | 9.02 | 27.97 | 151.94 | | ||
| | chat | total p50 | **4.95** | 11.89 | 27.98 | 151.95 | | ||
| | responses | TTFT p50 | **4.69** | 12.87 | 27.90 | 47.53 | | ||
| | responses | total p50 | **5.00** | 14.94 | 27.93 | 47.55 | | ||
| | messages | TTFT p50 | **7.50** | † | ✗ | 48.86 | | ||
| | messages | total p50 | **8.38** | † | ✗ | 48.89 | | ||
|
|
||
| † **Bifrost messages-stream is an idle-bound artifact, not a throughput number** | ||
| (no terminal event over a non-native backend → 0 completions within the 10 s cap). | ||
|
|
||
| ## Throughput / capacity (chat non-stream, sustained req/s by concurrency) | ||
|
|
||
| | target | c=1 | c=16 | c=128 | peak | knee | | ||
| |--------|----:|-----:|------:|-----:|-----:| | ||
| | baseline | 15510 | 29701 | 30015 | **30015** | 16 | | ||
| | GoModel | 2745 | 4928 | 4567 | **4928** | 16 | | ||
| | Bifrost | 1885 | 3088 | 2904 | **3088** | 16 | | ||
| | Portkey | 636 | 946 | 900 | **946** | 16 | | ||
| | LiteLLM | 227 | 324 | 254 | **324** | 16 | | ||
|
|
||
| GoModel tops the gateways at **~4900 req/s**, ~1.6× Bifrost, ~5× Portkey, ~15× | ||
| LiteLLM. All saturate by c=16 on 2 vCPUs. | ||
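The peak column and the quoted ratios can be re-derived from the sweep numbers. The sketch below is illustrative only. It does not compute the `knee` column, because the PR does not state how the knee is defined.

```python
from typing import Dict, Tuple

# Sustained req/s by concurrency, copied from the capacity table above.
SWEEP: Dict[str, Dict[int, int]] = {
    "GoModel": {1: 2745, 16: 4928, 128: 4567},
    "Bifrost": {1: 1885, 16: 3088, 128: 2904},
    "Portkey": {1: 636, 16: 946, 128: 900},
    "LiteLLM": {1: 227, 16: 324, 128: 254},
}


def peak(by_concurrency: Dict[int, int]) -> Tuple[int, int]:
    """Return (concurrency, req/s) at the highest sustained throughput."""
    best_c = max(by_concurrency, key=lambda c: by_concurrency[c])
    return best_c, by_concurrency[best_c]


def ratio_to(reference: str, other: str) -> float:
    """How many times higher the reference gateway's peak is than the other's."""
    return round(peak(SWEEP[reference])[1] / peak(SWEEP[other])[1], 1)


if __name__ == "__main__":
    for name in ("Bifrost", "Portkey", "LiteLLM"):
        print(f"GoModel vs {name}: {ratio_to('GoModel', name)}x")
```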
|
|
||
| ## Resources | ||
|
|
||
| | Metric | GoModel | Portkey | Bifrost | LiteLLM | | ||
| |--------|--------:|--------:|--------:|--------:| | ||
| | Docker image, compressed pull (MB) | **16** | 59 | 77 | 372 | | ||
| | Docker image, on-disk (MB) | **47.2** | 177.4 | 230.7 | 1159.9 | | ||
| | Cold start to first 200 (s) | **0.56** | 1.05 | 7.07 | 25.49 | | ||
| | Peak RSS under load (MB)| **37.0** | 112.0 | 143.0 | 2272.3 | | ||
| | Avg CPU under load (%) | 92.6 | 116.9 | 117.6 | 101.1 | | ||
| | Sustained req/s (resource window) | **4824** | 960 | 2977 | 261 | | ||
| | Efficiency (req/s per CPU %) | **52.1** | 8.2 | 25.3 | 2.6 | | ||
|
|
||
| GoModel is the most CPU-efficient (**52 req/s per CPU-%**, ~2× Bifrost, ~6× | ||
| Portkey, ~20× LiteLLM) and has the smallest image (**47 MB**), the smallest memory | ||
| footprint (**37 MB** peak RSS), and the fastest cold start (**0.56 s**). | ||
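The efficiency row in the Resources table is sustained req/s divided by average CPU %. A short sketch that recomputes it from the table values (the names are illustrative). Note that `docker stats` can report more than 100% CPU on a multi-core host, which is why some gateways exceed 100 on this 2-vCPU instance.

```python
from typing import Dict, Tuple

# (sustained req/s, average CPU %) copied from the Resources table above.
RESOURCE_WINDOW: Dict[str, Tuple[float, float]] = {
    "GoModel": (4824, 92.6),
    "Portkey": (960, 116.9),
    "Bifrost": (2977, 117.6),
    "LiteLLM": (261, 101.1),
}


def efficiency(rps: float, cpu_percent: float) -> float:
    """Requests per second delivered per percentage point of CPU."""
    if cpu_percent <= 0:
        raise ValueError("cpu_percent must be positive")
    return round(rps / cpu_percent, 1)


EFFICIENCY = {
    name: efficiency(rps, cpu) for name, (rps, cpu) in RESOURCE_WINDOW.items()
}
```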
|
|
||
| > **LiteLLM at its recommended config.** With `num_workers=2` (one per core) LiteLLM | ||
| > is faster and higher-throughput than the earlier single-worker run (≈220 → 324 | ||
| > req/s; chat p50 ≈ 44 → 31 ms — a single worker was queuing the 10 concurrent | ||
| > requests), but its **memory doubled to ~2.3 GB** (two ~1 GB worker processes) and | ||
| > its **cold start rose to ~25 s**. Running LiteLLM "properly" widens the resource | ||
| > gap rather than narrowing it. | ||
|
|
||
| ## Feature coverage (6 variants) | ||
|
|
||
| | Gateway | chat | responses | messages | total | | ||
| |---------|:----:|:---------:|:--------:|:-----:| | ||
| | GoModel | ✓ | ✓ | ✓ | 6/6 | | ||
| | LiteLLM | ✓ | ✓ | ✓ | 6/6 | | ||
| | Bifrost | ✓ | ✓ | ✓† | 6/6 | | ||
| | Portkey | ✓ | ✓ | ✗ | 4/6 | | ||
|
|
||
| - **Portkey** errors on the Anthropic `/v1/messages` dialect in this single-provider | ||
| (openai → mock) setup; this is a setup limitation, not a hard capability gap. | ||
| - **Bifrost** serves Anthropic at `/anthropic/v1/messages`, needs an `openai/`-prefixed | ||
| model and `allow_private_network:true`; messages-streaming has the caveat above (†). | ||
|
|
||
| ## Takeaways | ||
|
|
||
| - **GoModel** — best all-rounder: lowest p50 and tightest p99 (~7 ms), highest | ||
| gateway throughput (~4900 req/s), best CPU efficiency (52 req/s per %), smallest | ||
| image (47 MB) and memory (37 MB), fastest cold start (0.56 s), full 6/6 coverage. | ||
| - **Bifrost** (Go) — second on throughput, low p50 but a heavier p99 tail; streaming | ||
| terminal-event gaps over a non-native backend. | ||
| - **Portkey** (Node) — middle tier; no Anthropic Messages in this setup. | ||
| - **LiteLLM** (Python) — full coverage, but even at its recommended 2-worker config | ||
| it is ~15× behind on throughput and carries a **1.16 GB image + ~2.3 GB RAM + ~25 s | ||
| cold start**. The cost of Python on the hot path. | ||
|
|
||
| ## Methodology notes | ||
|
|
||
| - **Repeats + spread** — 2 trials, randomized gateway order each trial; latency is | ||
| the median across trials, p99 carries its min–max. | ||
| - **Config parity** — retries off on all; GoModel's circuit breaker disabled (a few | ||
| transient errors under the c=128 sweep would otherwise trip it and blanket-503 its | ||
| own capacity); **LiteLLM at one worker per core (`num_workers`=vCPUs)**, its own | ||
| production recommendation, set automatically from `nproc`. | ||
| - **Warm-up** — 200 global + 50 per-variant requests; the per-variant warmup | ||
| neutralizes LiteLLM's lazy per-dialect imports and, with >1 worker, warms each | ||
| worker before measuring. | ||
| - **Throughput vs latency separated** — capacity comes from a time-boxed concurrency | ||
| sweep, not the latency-coupled rps in the latency tables. | ||
| - **Per-variant wall cap (10 s)** — bounds idle-bound streaming variants; cap-aborted | ||
| requests are reported as `capped`, not `failed`. | ||
| - **Resilient orchestration** — the remote benchmark runs detached (`setsid`) and the | ||
| orchestrator polls for the `meta.json` sentinel, so an SSH drop can't kill or hang | ||
| the run; the script uses `set -uo` without `-e`, so a flaky variant is skipped instead of aborting the run. | ||
| - Reproduce with `./run.sh`; pin `var.ami_id` and the `*_IMAGE` digests for a | ||
| byte-identical rerun. Heavier run: `N=20000 REPEATS=5 ./run.sh`. | ||
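The "repeats + spread" rule in the notes above (median across trials, p99 reported with its min and max) could look roughly like this. It is a sketch under assumptions: the per-trial dictionary shape and the name `aggregate_trials` are hypothetical, and the actual logic in `scripts/summarize.py` is not shown in this PR.

```python
from statistics import median
from typing import Dict, List


def aggregate_trials(trials: List[Dict[str, float]]) -> Dict[str, float]:
    """Combine per-trial latency summaries for one gateway and variant.

    Each metric becomes the median across trials; p99 additionally keeps
    its min and max so the spread between trials stays visible.
    """
    if not trials:
        raise ValueError("need at least one trial")
    out = {key: median(t[key] for t in trials) for key in trials[0]}
    p99s = [t["p99"] for t in trials]
    out["p99_min"] = min(p99s)
    out["p99_max"] = max(p99s)
    return out
```

With two trials the median is the mean of the two values, so a wide p99 min-max range is the signal to raise `REPEATS`.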
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,18 @@ | ||
| # Builds the mock backend and load generator into one small static image. | ||
| # Both binaries live in the final image; the compose service / docker run | ||
| # command selects which one to execute. | ||
| FROM golang:1.26-alpine AS build | ||
| WORKDIR /src | ||
| COPY go.mod ./ | ||
| COPY mock ./mock | ||
| COPY loadgen ./loadgen | ||
| RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/mock ./mock \ | ||
| && CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/loadgen ./loadgen | ||
|
|
||
| FROM gcr.io/distroless/static-debian12:nonroot | ||
|
Comment on lines +4 to +12
Contributor
🎯 Functional Correctness | 🟠 Major
Pin the bench-tools base images by digest. Floating tags (
Suggested shape:
-FROM golang:1.26-alpine AS build
+FROM golang:1.26-alpine@sha256:<digest> AS build
...
-FROM gcr.io/distroless/static-debian12:nonroot
+FROM gcr.io/distroless/static-debian12:nonroot@sha256:<digest> |
||
| COPY --from=build /out/mock /mock | ||
| COPY --from=build /out/loadgen /loadgen | ||
| # No ENTRYPOINT: each invocation picks the binary as its command, e.g. | ||
| # docker run img /mock (compose `command: ["/mock"]`) | ||
| # docker run img /loadgen -url … | ||
| CMD ["/mock"] | ||
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,3 @@ | ||
| module gomodel-bench-tools | ||
|
|
||
| go 1.26 |
📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win
Do not ignore the Terraform lockfile.
Ignoring `.terraform.lock.hcl` lets provider selections drift between runs, which undercuts the reproducible benchmark goal even when the Terraform sources are unchanged.
Suggested fix:
-terraform/.terraform.lock.hcl