9 changes: 9 additions & 0 deletions docs/2026-06-25_aws_gateway_benchmark/.gitignore
@@ -0,0 +1,9 @@
# Benchmark outputs and local Terraform state / secrets — never commit.
output/
remote/results/
terraform/.terraform/
terraform/.terraform.lock.hcl

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Do not ignore the Terraform lockfile.

Ignoring .terraform.lock.hcl lets provider selections drift between runs, which undercuts the reproducible benchmark goal even when the Terraform sources are unchanged.

Suggested fix
-terraform/.terraform.lock.hcl
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/.gitignore` at line 5, The .gitignore
entry is incorrectly excluding the Terraform lockfile, which can allow provider
versions to drift between benchmark runs. Remove the
terraform/.terraform.lock.hcl ignore rule so the lockfile stays tracked and
reproducible alongside the Terraform benchmark setup.

terraform/*.tfstate
terraform/*.tfstate.*
terraform/bench_key.pem
*.tar.gz
122 changes: 122 additions & 0 deletions docs/2026-06-25_aws_gateway_benchmark/README.md
@@ -0,0 +1,122 @@
# AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost

A reproducible, one-command benchmark that provisions a free-tier AWS instance,

P2 "free-tier" claim is inaccurate for the default instance

The README describes this benchmark as provisioning "a free-tier AWS instance," but the default instance_type is c7i.large, which is explicitly documented in variables.tf as "NOT free-tier eligible (~$0.09/hr on-demand in us-east-1)." A user who clones the repo and runs ./run.sh without overriding the instance type will be charged. The free-tier language should be removed or replaced with an explicit cost note.

runs four AI gateways through identical workloads against a deterministic mock
backend, measures latency and resource cost, and tears the infrastructure down.
Comment thread
coderabbitai[bot] marked this conversation as resolved.

Because every gateway talks to the **same local mock backend**, the numbers
reflect *gateway overhead*, not upstream model latency or network jitter.

## What it compares

Four OpenAI-compatible gateways, each pointed at the mock:

| Gateway | Image | How it reaches the mock |
|----------|--------------------------------------|-------------------------|
| GoModel | built from this repo (`Dockerfile`) | `OPENAI_BASE_URL` env |
| LiteLLM | `ghcr.io/berriai/litellm:main-stable`| `configs/litellm-config.yaml` |
| Portkey | `portkeyai/gateway:latest` | `x-portkey-custom-host` header (+ `TRUSTED_CUSTOM_HOSTS=mock`) |
| Bifrost | `maximhq/bifrost:latest` | `configs/bifrost-config.json` (`network_config.base_url` + `allow_private_network`) |

Per-gateway quirks the harness handles automatically (see `gw_model`/`gw_path` in
`run-on-instance.sh`): Bifrost needs an explicit `openai/`-prefixed model, serves
the Anthropic dialect at `/anthropic/v1/messages` (not `/v1/messages`), and must
allow private-network egress to reach the mock.
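
As a rough illustration of the quirks above, the mapping could look like the Python sketch below. The real `gw_model`/`gw_path` helpers are shell functions in `run-on-instance.sh` and are not shown in this diff, so the signatures, the `DEFAULT_PATHS` table, and the example model name `mock-model` are assumptions made for illustration only.

```python
# Hypothetical Python rendering of the per-gateway quirks described above.
# The real gw_model/gw_path live in run-on-instance.sh and may differ.

DEFAULT_PATHS = {
    "chat": "/v1/chat/completions",
    "responses": "/v1/responses",
    "messages": "/v1/messages",
}


def gw_model(gateway: str, model: str) -> str:
    """Return the model name a gateway expects for a given base model."""
    if gateway == "bifrost" and not model.startswith("openai/"):
        # Bifrost needs an explicit provider prefix.
        return f"openai/{model}"
    return model


def gw_path(gateway: str, dialect: str) -> str:
    """Return the request path for a dialect on a given gateway."""
    if gateway == "bifrost" and dialect == "messages":
        # Bifrost serves the Anthropic dialect under its own prefix.
        return "/anthropic/v1/messages"
    return DEFAULT_PATHS[dialect]


if __name__ == "__main__":
    for gw in ("gomodel", "litellm", "portkey", "bifrost"):
        print(gw, gw_model(gw, "mock-model"), gw_path(gw, "messages"))
```

Keeping the special cases in two small functions means the workload loop itself can stay identical for every gateway.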

### Workloads — 6 variants

The common denominator across OpenAI-compatible gateways, in both modes:

| Dialect | Endpoint | non-stream | stream |
|-----------|------------------------|:----------:|:------:|
| Chat | `/v1/chat/completions` | ✓ | ✓ |
| Responses | `/v1/responses` | ✓ | ✓ |
| Messages | `/v1/messages` (Anthropic) | ✓ | ✓ |

A **baseline** (load sent straight to the mock, no gateway) runs first as the
latency floor. Variants a gateway does not implement are reported as failures
rather than silently skipped — e.g. Portkey's OSS gateway does not serve the
Anthropic Messages dialect here, so its messages variants fail; that asymmetry is
the finding. Streaming uses a terminal-marker **or idle-gap** end-of-stream
detection (`loadgen -idle`), so a gateway that streams content without sending a
terminal event (Bifrost) is still measured to last-byte rather than hanging.
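
The terminal-marker-or-idle-gap rule can be sketched offline as below. This is not `loadgen`'s code (which is Go and not shown here); the function name `stream_end`, the event tuple layout, and the use of `[DONE]` as the marker are assumptions. `[DONE]` is the sentinel OpenAI-style chat streams send, but other dialects signal completion differently.

```python
from typing import Iterable, Optional, Tuple

# Assumed marker: OpenAI-style chat streams end with "data: [DONE]".
TERMINAL_MARKER = "[DONE]"


def stream_end(
    events: Iterable[Tuple[float, str]],
    idle_gap_s: float,
) -> Tuple[Optional[float], str]:
    """Return (end_time, reason) for a stream of (arrival_time_s, chunk) events.

    reason is "terminal" when a marker was seen, "idle" when the stream
    went quiet for longer than idle_gap_s (or simply ran out of chunks),
    and "empty" when no chunk arrived at all.
    """
    last_time: Optional[float] = None
    for arrival, chunk in events:
        if last_time is not None and arrival - last_time > idle_gap_s:
            # Quiet for too long: measure to the last byte seen before the gap.
            return last_time, "idle"
        last_time = arrival
        if TERMINAL_MARKER in chunk:
            return arrival, "terminal"
    if last_time is None:
        return None, "empty"
    return last_time, "idle"
```

With this rule a stream that never sends a terminal event is still timed to its last received chunk instead of blocking until a timeout.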

### Metrics captured

- **Latency** — total-latency p50/p90/p95/p99, plus **TTFT** (time to first
token) for streaming, and throughput (RPS). Driven by the `loadgen` tool.
- **Docker image size** — `docker image inspect` size + repo digest per gateway.
- **Memory** — idle RSS after warmup and peak RSS under load (`docker stats`).
- **CPU** — average CPU% under load (`docker stats`).
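
For readers unfamiliar with the latency metrics, here is a minimal sketch of how p50/p90/p95/p99 and RPS could be derived from raw samples. It assumes the nearest-rank percentile definition; the method `loadgen` actually uses is not shown in this diff, and the names `percentile` and `summarize` are hypothetical.

```python
from typing import Dict, List, Sequence


def percentile(sorted_ms: Sequence[float], p: int) -> float:
    """Nearest-rank percentile (integer p in 1..100) of an ascending sample."""
    if not sorted_ms:
        raise ValueError("no samples")
    # ceil(p * n / 100) in integer arithmetic to avoid float rounding surprises.
    rank = max(1, (p * len(sorted_ms) + 99) // 100)
    return sorted_ms[rank - 1]


def summarize(latencies_ms: List[float], wall_s: float) -> Dict[str, float]:
    """Latency percentiles plus throughput for one variant."""
    ordered = sorted(latencies_ms)
    out = {f"p{p}": percentile(ordered, p) for p in (50, 90, 95, 99)}
    out["rps"] = len(ordered) / wall_s  # completed requests per wall-clock second
    return out
```

Other percentile definitions (for example linear interpolation) give slightly different values on small samples, so comparisons are only meaningful when every gateway is summarized the same way.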

## Layout

```

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a language to the fenced layout block (markdownlint MD040). Use ```text.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 54-54: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/README.md` at line 54, The fenced
layout block in the README is missing a language tag and triggers markdownlint
MD040. Update the fenced block in the documentation to use a text language
identifier, and make sure the change is applied to the layout block near the
benchmark description so the markdown renders and lints correctly.

Source: Linters/SAST tools

terraform/            free-tier EC2 + SSH key + security group (apply/destroy)
remote/               everything shipped to and run on the instance
bench-tools/          Go mock backend + loadgen (one small image)
configs/              litellm config
docker-compose.yml    mock + one gateway per profile (benchnet network)
run-on-instance.sh    builds images, runs 6 variants x N gateways, samples stats
scripts/summarize.py  raw JSON -> latency + resource tables + summary.json
run.sh                orchestrator: build -> apply -> run -> collect -> destroy
```

## Prerequisites

- AWS credentials configured (`aws sts get-caller-identity` works)
- Terraform ≥ 1.6, Docker (with `buildx`), `rsync`, `ssh`, Python 3
- An AWS account with default VPC in the chosen region

## Run it

```bash
cd docs/2026-06-25_aws_gateway_benchmark
./run.sh # full run in us-east-1, then auto-destroy
N=1000 C=20 ./run.sh # heavier load
REGION=eu-west-1 ./run.sh # different region
GATEWAYS="gomodel litellm" ./run.sh # subset
KEEP=1 ./run.sh # leave the instance up for debugging
```

`run.sh` tears the instance down via an EXIT trap, even on failure, unless
`KEEP=1` is set. If a run is interrupted before the trap fires (or you used
`KEEP=1`), reconcile manually:

```bash
cd terraform && terraform destroy -auto-approve
```

Results land in `output/<timestamp>/` (raw per-variant JSON, `summary.json`,
and the printed `summary.txt` table).

## Local dry-run (no AWS)

The instance-side harness runs on any Docker host:

```bash
cd remote && N=30 C=5 GATEWAYS="gomodel litellm portkey bifrost" ./run-on-instance.sh
```

(Build the GoModel image first: `docker build -t gomodel-bench:local ../../..`)

## Reproducibility & caveats

- **Pinned**: gateway image refs (overridable via `*_IMAGE` env), the Compose
plugin version, instance type, and the deterministic mock payload. Exact image
**digests** are recorded in each `*_image.json` so a run is fully traceable.
- **AMI** resolves to the latest Amazon Linux 2023 via SSM (reproducible by
policy). Pin `var.ami_id` for a byte-identical OS.
- **Free tier**: defaults to **t2.micro** — the 12-month-free-tier instance in
us-east-1 — with `standard` CPU credits (no surprise burst charges), a 20 GiB
gp3 root volume (free tier allows 30 GiB), the default VPC (no paid NAT/EIP),
and an Amazon Linux 2023 AMI. Image pulls are inbound traffic (free). In
regions where t2.micro is unavailable, set `INSTANCE_TYPE=t3.micro` (the
free-tier instance there). Newer accounts on AWS's credit-based free plan stay
within credit for a single short run.
- **t2.micro is burstable** (1 vCPU, CPU-credit throttled). Treat absolute
latency as *indicative*; the value is the *relative* comparison on identical
hardware. Gateways run **one at a time** so they never contend, and the load is
kept modest (N=500, c=10) to stay within launch credits. For production-grade
absolute numbers, set `INSTANCE_TYPE=c7i.large` (not free tier).
- **Cost**: a single free-tier instance for ~15–30 min — $0 within free-tier
allowance, otherwise a few cents.
135 changes: 135 additions & 0 deletions docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
@@ -0,0 +1,135 @@
# Results — 2026-06-25 (AWS c7i.large run)

Reference run produced by `./run.sh` (raw data in `output/20260625-182538/`,
machine summary in that dir's `summary.md` / `summary.json`). Four gateways:
**GoModel, LiteLLM, Portkey, Bifrost**.

- **Host**: AWS EC2 **c7i.large** (2 vCPU, 4 GiB, **non-burstable** — no CPU-credit
drift, so the tail is stable), Amazon Linux 2023, us-east-1.
- **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials**
(latency = median across trials; p99 shown with its min–max), 200-request
process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s

📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Inconsistent hyphenation: "warmup" vs "warm-up".

Line 11 uses "warmup" (no hyphen) while line 124 uses "Warm-up" (hyphen). Pick one style and apply consistently throughout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/RESULTS.md` at line 11, The benchmark
RESULTS.md has inconsistent hyphenation for the warmup term, so standardize the
wording to one style everywhere. Update the text around the benchmark summary
and any other occurrences in RESULTS.md to use the same form consistently, and
keep the chosen style aligned with the existing terminology in the document.

resource window, capacity sweep at c∈{1,16,128}. Shared in-process **mock**
backend, so every number is **gateway overhead**, not model latency.
- **Parity**: retries disabled on every gateway, GoModel's circuit breaker disabled
(so the sweep can't trip it), and **LiteLLM run at its recommended worker count —
one worker per CPU core (`num_workers=2` on this 2-vCPU box)** so it isn't pinned
to a single core while the Go gateways use both.
Comment thread
coderabbitai[bot] marked this conversation as resolved.
- Images (digests in `*_image.json`): GoModel (built from this repo), latest
`litellm:main-stable`, `portkeyai/gateway:latest`, `maximhq/bifrost:latest`.

> Fast reference run (N=8000 × 2 trials) sized to finish end-to-end in well under
> 20 minutes; the p99 min–max spreads are tight, so the medians are stable. Raise
> `N`/`REPEATS` for a heavier run.
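
A minimal sketch of how per-trial results could be folded into the reported figures (median across trials, with p99 carrying its min and max). It assumes each trial yields a dict of metrics keyed by name; the function name `combine_trials` and that layout are hypothetical, and `scripts/summarize.py` may do this differently.

```python
from statistics import median
from typing import Dict, List


def combine_trials(trials: List[Dict[str, float]]) -> Dict[str, float]:
    """Median of each metric across trials, plus the p99 min-max spread."""
    if not trials:
        raise ValueError("need at least one trial")
    # Every trial is assumed to report the same metric keys, including "p99".
    combined = {key: median(t[key] for t in trials) for key in trials[0]}
    p99s = [t["p99"] for t in trials]
    combined["p99_min"] = min(p99s)
    combined["p99_max"] = max(p99s)
    return combined
```

Note that with an even number of trials, such as the 2 used here, `statistics.median` returns the mean of the two middle values, so the min and max spread is what shows how far the trials actually disagreed.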

## Latency — non-streaming (ms, median of trials)

| Workload | metric | baseline | GoModel | Bifrost | Portkey | LiteLLM |
|-----------|--------|---------:|--------:|--------:|--------:|--------:|
| chat | p50 | 0.23 | **1.81** | 2.51 | 9.70 | 30.56 |
| chat | p99 | 2.77 | **6.88** | 18.27 | 30.54 | 39.26 |
| responses | p50 | 0.26 | **2.01** | 2.73 | 9.07 | 39.12 |
| responses | p99 | 2.33 | **7.28** | 16.55 | 26.92 | 48.60 |
| messages | p50 | 0.26 | **1.76** | 2.65 | ✗ | 61.06 |
| messages | p99 | 2.23 | **6.59** | 19.08 | ✗ | 98.12 |

**GoModel has the lowest p50 and the tightest p99** in every workload (chat p99:
~7 ms vs Bifrost ~18 ms, Portkey ~31 ms, LiteLLM ~39 ms). Chat `overhead p50`
(gateway p50 − baseline p50): GoModel ≈ 1.6 ms, Bifrost ≈ 2.3 ms, Portkey ≈ 9.5 ms,
LiteLLM ≈ 30 ms.
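
The overhead figures can be rechecked from the chat p50 row of the table with a few lines of Python. The numbers are copied from the table above; the helper name `overhead_p50` is hypothetical.

```python
# Chat non-streaming p50 values (ms), copied from the latency table.
BASELINE_P50_MS = 0.23
CHAT_P50_MS = {
    "GoModel": 1.81,
    "Bifrost": 2.51,
    "Portkey": 9.70,
    "LiteLLM": 30.56,
}


def overhead_p50(gateway_p50_ms: float, baseline_p50_ms: float) -> float:
    """Gateway overhead: gateway p50 minus the no-gateway baseline p50."""
    return gateway_p50_ms - baseline_p50_ms


OVERHEAD_MS = {
    name: round(overhead_p50(p50, BASELINE_P50_MS), 2)
    for name, p50 in CHAT_P50_MS.items()
}

if __name__ == "__main__":
    for name, value in OVERHEAD_MS.items():
        print(f"{name}: {value:.2f} ms")
```

The differences come out at 1.58, 2.28, 9.47 and 30.33 ms, which round to the figures quoted in the text.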

## Latency — streaming (ms, median of trials)

| Workload | metric | GoModel | Bifrost | Portkey | LiteLLM |
|-----------|--------|--------:|--------:|--------:|--------:|
| chat | TTFT p50 | **4.71** | 9.02 | 27.97 | 151.94 |
| chat | total p50 | **4.95** | 11.89 | 27.98 | 151.95 |
| responses | TTFT p50 | **4.69** | 12.87 | 27.90 | 47.53 |
| responses | total p50 | **5.00** | 14.94 | 27.93 | 47.55 |
| messages | TTFT p50 | **7.50** | † | ✗ | 48.86 |
| messages | total p50 | **8.38** | † | ✗ | 48.89 |

† **Bifrost messages-stream is an idle-bound artifact, not a throughput number**
(no terminal event over a non-native backend → 0 completions within the 10 s cap).

## Throughput / capacity (chat non-stream, sustained req/s by concurrency)

| target | c=1 | c=16 | c=128 | peak | knee |
|--------|----:|-----:|------:|-----:|-----:|
| baseline | 15510 | 29701 | 30015 | **30015** | 16 |
| GoModel | 2745 | 4928 | 4567 | **4928** | 16 |
| Bifrost | 1885 | 3088 | 2904 | **3088** | 16 |
| Portkey | 636 | 946 | 900 | **946** | 16 |
| LiteLLM | 227 | 324 | 254 | **324** | 16 |

GoModel tops the gateways at **~4900 req/s**, ~1.6× Bifrost, ~5× Portkey, ~15×
LiteLLM. All saturate by c=16 on 2 vCPUs.
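
The `peak` and `knee` columns can be reproduced with the sketch below. The definition of "knee" is not stated in this document, so the rule used here (the lowest concurrency whose throughput is within 5% of the peak) is an assumption, as are the names `peak_and_knee` and `tolerance`. The sweep values are copied from the table.

```python
from typing import Dict, Tuple

# Sustained req/s by concurrency, copied from the capacity table.
SWEEP: Dict[str, Dict[int, int]] = {
    "baseline": {1: 15510, 16: 29701, 128: 30015},
    "GoModel": {1: 2745, 16: 4928, 128: 4567},
    "Bifrost": {1: 1885, 16: 3088, 128: 2904},
    "Portkey": {1: 636, 16: 946, 128: 900},
    "LiteLLM": {1: 227, 16: 324, 128: 254},
}


def peak_and_knee(by_concurrency: Dict[int, int], tolerance: float = 0.05) -> Tuple[int, int]:
    """Return (peak req/s, knee concurrency).

    Assumed definition: the knee is the lowest concurrency level whose
    throughput is within `tolerance` of the peak.
    """
    peak = max(by_concurrency.values())
    threshold = (1.0 - tolerance) * peak
    knee = min(c for c, rps in by_concurrency.items() if rps >= threshold)
    return peak, knee


if __name__ == "__main__":
    for target, sweep in SWEEP.items():
        print(target, peak_and_knee(sweep))
```

Under this assumed rule every target has its knee at c=16, matching the table, including the baseline, whose absolute peak is at c=128 but only about 1% above its c=16 value.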

## Resources

| Metric | GoModel | Portkey | Bifrost | LiteLLM |
|--------|--------:|--------:|--------:|--------:|
| Docker image, compressed pull (MB) | **16** | 59 | 77 | 372 |
| Docker image, on-disk (MB) | **47.2** | 177.4 | 230.7 | 1159.9 |
| Cold start to first 200 (s) | **0.56** | 1.05 | 7.07 | 25.49 |
| Peak RSS under load (MB) | **37.0** | 112.0 | 143.0 | 2272.3 |
| Avg CPU under load (%) | 92.6 | 116.9 | 117.6 | 101.1 |
| Sustained req/s (resource window) | **4824** | 960 | 2977 | 261 |
| Efficiency (req/s per CPU %) | **52.1** | 8.2 | 25.3 | 2.6 |

GoModel is the most CPU-efficient (**52 req/s per CPU-%**, ~2× Bifrost, ~6×
Portkey, ~20× LiteLLM), the smallest image (**47 MB**), the smallest footprint
(**37 MB** peak), and the fastest cold start (**0.56 s**).
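
The efficiency row is simply sustained req/s divided by average CPU%. A small sketch that recomputes it from the two rows above it in the table (values copied from the table; the helper name `efficiency` is hypothetical):

```python
from typing import Dict, Tuple

# (sustained req/s in the resource window, average CPU % under load)
RESOURCE_WINDOW: Dict[str, Tuple[float, float]] = {
    "GoModel": (4824, 92.6),
    "Portkey": (960, 116.9),
    "Bifrost": (2977, 117.6),
    "LiteLLM": (261, 101.1),
}


def efficiency(rps: float, avg_cpu_pct: float) -> float:
    """Requests per second delivered per percentage point of CPU."""
    if avg_cpu_pct <= 0:
        raise ValueError("avg_cpu_pct must be positive")
    return rps / avg_cpu_pct


EFFICIENCY = {
    name: round(efficiency(rps, cpu), 1)
    for name, (rps, cpu) in RESOURCE_WINDOW.items()
}

if __name__ == "__main__":
    for name, value in EFFICIENCY.items():
        print(f"{name}: {value} req/s per CPU %")
```

Because CPU% from `docker stats` can exceed 100 on a 2-vCPU host, this metric rewards a gateway that reaches high throughput without using the second core.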

> **LiteLLM at its recommended config.** With `num_workers=2` (one per core) LiteLLM
> is faster and higher-throughput than the earlier single-worker run (≈220 → 324
> req/s; chat p50 ≈ 44 → 31 ms — a single worker was queuing the 10 concurrent
> requests), but its **memory doubled to ~2.3 GB** (two ~1 GB worker processes) and
> its **cold start rose to ~25 s**. Running LiteLLM "properly" widens the resource
> gap, not narrows it.

## Feature coverage (6 variants)

| Gateway | chat | responses | messages | total |
|---------|:----:|:---------:|:--------:|:-----:|
| GoModel | ✓ | ✓ | ✓ | 6/6 |
| LiteLLM | ✓ | ✓ | ✓ | 6/6 |
| Bifrost | ✓ | ✓ | ✓† | 6/6 |
| Portkey | ✓ | ✓ | ✗ | 4/6 |

- **Portkey** errors on the Anthropic `/v1/messages` dialect in this single-provider
(openai → mock) setup; setup limitation, not a hard capability gap.
- **Bifrost** serves Anthropic at `/anthropic/v1/messages`, needs an `openai/`-prefixed
model and `allow_private_network:true`; messages-streaming has the caveat above (†).

## Takeaways

- **GoModel** — best all-rounder: lowest p50 and tightest p99 (~7 ms), highest
gateway throughput (~4900 req/s), best CPU efficiency (52 req/s per %), smallest
image (47 MB) and memory (37 MB), fastest cold start (0.56 s), full 6/6 coverage.
- **Bifrost** (Go) — second on throughput, low p50 but a heavier p99 tail; streaming
terminal-event gaps over a non-native backend.
- **Portkey** (Node) — middle tier; no Anthropic Messages in this setup.
- **LiteLLM** (Python) — full coverage, but even at its recommended 2-worker config
it is ~15× behind on throughput and carries a **1.16 GB image + ~2.3 GB RAM + ~25 s
cold start**. The cost of Python on the hot path.

## Methodology notes

- **Repeats + spread** — 2 trials, randomized gateway order each trial; latency is
the median across trials, p99 carries its min–max.
- **Config parity** — retries off on all; GoModel's circuit breaker disabled (a few
transient errors under the c=128 sweep would otherwise trip it and blanket-503 its
own capacity); **LiteLLM at one worker per core (`num_workers`=vCPUs)**, its own
production recommendation, set automatically from `nproc`.
- **Warm-up** — 200 global + 50 per-variant requests; the per-variant warmup
neutralizes LiteLLM's lazy per-dialect imports and, with >1 worker, warms each
worker before measuring.
- **Throughput vs latency separated** — capacity comes from a time-boxed concurrency
sweep, not the latency-coupled rps in the latency tables.
- **Per-variant wall cap (10 s)** — bounds idle-bound streaming variants; cap-aborted
requests are reported as `capped`, not `failed`.
- **Resilient orchestration** — the remote benchmark runs detached (`setsid`) and the
orchestrator polls for the `meta.json` sentinel, so an SSH drop can't kill or hang
the run; `set -uo` so one flaky variant skips instead of aborting.
- Reproduce with `./run.sh`; pin `var.ami_id` and the `*_IMAGE` digests for a
byte-identical rerun. Heavier run: `N=20000 REPEATS=5 ./run.sh`.
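
The sentinel-polling idea from the orchestration note can be sketched as follows. This is an illustration, not the orchestrator's code (`run.sh` is a shell script and its polling loop is not shown here); the function name `wait_for_sentinel` and its parameters are hypothetical. The `exists` callable stands in for whatever check the orchestrator performs, for example a short SSH command that tests for `meta.json`.

```python
import time
from typing import Callable


def wait_for_sentinel(
    exists: Callable[[], bool],
    timeout_s: float,
    interval_s: float = 5.0,
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> bool:
    """Poll until exists() is true; return False if timeout_s elapses first.

    Each poll is an independent check, so a dropped connection only costs
    one failed poll rather than the whole run.
    """
    deadline = clock() + timeout_s
    while True:
        if exists():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)
```

A local equivalent would be `wait_for_sentinel(lambda: os.path.exists("results/meta.json"), timeout_s=1800)` after `import os`; the path and timeout in that call are placeholders.
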
@@ -0,0 +1,18 @@
# Builds the mock backend and load generator into one small static image.
# Both binaries live in the final image; the compose service / docker run
# command selects which one to execute.
FROM golang:1.26-alpine AS build
WORKDIR /src
COPY go.mod ./
COPY mock ./mock
COPY loadgen ./loadgen
RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/mock ./mock \
&& CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/loadgen ./loadgen

FROM gcr.io/distroless/static-debian12:nonroot
Comment on lines +4 to +12

🎯 Functional Correctness | 🟠 Major

Pin the bench-tools base images by digest.

Floating tags (golang:1.26-alpine, gcr.io/distroless/static-debian12:nonroot) allow the build environment and runtime binaries to change on every docker build pull, even with identical source code. This undermines the reproducibility required for benchmarking. Pin both images to their immutable SHA256 digests.

Suggested shape
-FROM golang:1.26-alpine AS build
+FROM golang:1.26-alpine@sha256:<digest> AS build
...
-FROM gcr.io/distroless/static-debian12:nonroot
+FROM gcr.io/distroless/static-debian12:nonroot@sha256:<digest>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile` around
lines 4 - 12, The Dockerfile for bench-tools uses floating base image tags in
both the build stage and runtime stage, so update the FROM entries for
golang:1.26-alpine and gcr.io/distroless/static-debian12:nonroot to immutable
SHA256 digests. Keep the rest of the build steps in place and preserve the
existing multi-stage structure so the Dockerfile remains reproducible for
benchmarking.

COPY --from=build /out/mock /mock
COPY --from=build /out/loadgen /loadgen
# No ENTRYPOINT: each invocation picks the binary as its command, e.g.
# docker run img /mock (compose `command: ["/mock"]`)
# docker run img /loadgen -url …
CMD ["/mock"]
@@ -0,0 +1,3 @@
module gomodel-bench-tools

go 1.26