From 97afaadbe1687339837ccaf1c33dd21845d329d1 Mon Sep 17 00:00:00 2001
From: "Jakub A. W"
Date: Fri, 26 Jun 2026 14:29:28 +0200
Subject: [PATCH] docs(benchmark): add AWS gateway benchmark and refresh benchmarks page

Adds docs/2026-06-25_aws_gateway_benchmark/: a reproducible AWS benchmark
comparing GoModel against LiteLLM, Portkey, and Bifrost on latency, throughput,
memory, image size, and cold start, using a shared recording-mock backend so the
numbers reflect gateway overhead. Includes RESULTS.md and the one-command
Terraform/Docker harness (run.sh, remote/, terraform/, scripts/summarize.py).

Refreshes docs/about/benchmarks.mdx to this June 2026 run (with a paid-AWS note).

The benchmark write-up (ARTICLE.md, cover, charts) and the co-located QA and
translation tooling are in a separate draft PR. Terraform state, provider
binaries, the generated SSH key, and raw run output are gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context)
---
 .../.gitignore                                |   9 +
 .../README.md                                 | 122 +++++
 .../RESULTS.md                                | 135 ++++++
 .../remote/bench-tools/Dockerfile             |  18 +
 .../remote/bench-tools/go.mod                 |   3 +
 .../remote/bench-tools/loadgen/main.go        | 442 ++++++++++++++++++
 .../remote/bench-tools/mock/main.go           | 411 ++++++++++++++++
 .../remote/configs/bifrost-config.json        |  60 +++
 .../remote/configs/litellm-config.yaml        |  19 +
 .../remote/docker-compose.yml                 |  99 ++++
 .../remote/run-on-instance.sh                 | 379 +++++++++++++++
 docs/2026-06-25_aws_gateway_benchmark/run.sh  | 148 ++++++
 .../scripts/summarize.py                      | 296 ++++++++++++
 .../terraform/main.tf                         | 148 ++++++
 .../terraform/outputs.tf                      |  29 ++
 .../terraform/variables.tf                    |  50 ++
 docs/about/benchmarks.mdx                     | 157 +++----
 17 files changed, 2429 insertions(+), 96 deletions(-)
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/.gitignore
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/README.md
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
 create mode 100755 docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
 create mode 100755 docs/2026-06-25_aws_gateway_benchmark/run.sh
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf

diff --git a/docs/2026-06-25_aws_gateway_benchmark/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/.gitignore
new file mode 100644
index 00000000..39a18413
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/.gitignore
@@ -0,0 +1,9 @@
+# Benchmark outputs and local Terraform state / secrets — never commit.
+output/
+remote/results/
+terraform/.terraform/
+terraform/.terraform.lock.hcl
+terraform/*.tfstate
+terraform/*.tfstate.*
+terraform/bench_key.pem
+*.tar.gz
diff --git a/docs/2026-06-25_aws_gateway_benchmark/README.md b/docs/2026-06-25_aws_gateway_benchmark/README.md
new file mode 100644
index 00000000..0a4a2950
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/README.md
@@ -0,0 +1,122 @@
+# AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost
+
+A reproducible, one-command benchmark that provisions a free-tier AWS instance,
+runs four AI gateways through identical workloads against a deterministic mock
+backend, measures latency and resource cost, and tears the infrastructure down.
+
+Because every gateway talks to the **same local mock backend**, the numbers
+reflect *gateway overhead*, not upstream model latency or network jitter.
+
+## What it compares
+
+Four OpenAI-compatible gateways, each pointed at the mock:
+
+| Gateway  | Image                                | How it reaches the mock |
+|----------|--------------------------------------|-------------------------|
+| GoModel  | built from this repo (`Dockerfile`)  | `OPENAI_BASE_URL` env |
+| LiteLLM  | `ghcr.io/berriai/litellm:main-stable`| `configs/litellm-config.yaml` |
+| Portkey  | `portkeyai/gateway:latest`           | `x-portkey-custom-host` header (+ `TRUSTED_CUSTOM_HOSTS=mock`) |
+| Bifrost  | `maximhq/bifrost:latest`             | `configs/bifrost-config.json` (`network_config.base_url` + `allow_private_network`) |
+
+Per-gateway quirks the harness handles automatically (see `gw_model`/`gw_path` in
+`run-on-instance.sh`): Bifrost needs an explicit `openai/`-prefixed model, serves
+the Anthropic dialect at `/anthropic/v1/messages` (not `/v1/messages`), and must
+allow private-network egress to reach the mock.
+
+### Workloads — 6 variants
+
+The common denominator across OpenAI-compatible gateways, in both modes:
+
+| Dialect   | Endpoint               | non-stream | stream |
+|-----------|------------------------|:----------:|:------:|
+| Chat      | `/v1/chat/completions` | ✓ | ✓ |
+| Responses | `/v1/responses`        | ✓ | ✓ |
+| Messages  | `/v1/messages` (Anthropic) | ✓ | ✓ |
+
+A **baseline** (load sent straight to the mock, no gateway) runs first as the
+latency floor. Variants a gateway does not implement are reported as failures
+rather than silently skipped — e.g. Portkey's OSS gateway does not serve the
+Anthropic Messages dialect here, so its messages variants fail; that asymmetry is
+the finding. Streaming uses a terminal-marker **or idle-gap** end-of-stream
+detection (`loadgen -idle`), so a gateway that streams content without sending a
+terminal event (Bifrost) is still measured to last-byte rather than hanging.
+
+### Metrics captured
+
+- **Latency** — total-latency p50/p90/p95/p99, plus **TTFT** (time to first
+  token) for streaming, and throughput (RPS). Driven by the `loadgen` tool.
+- **Docker image size** — `docker image inspect` size + repo digest per gateway.
+- **Memory** — idle RSS after warmup and peak RSS under load (`docker stats`).
+- **CPU** — average CPU% under load (`docker stats`).
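The p50/p90/p95/p99 figures listed under "Metrics captured" are percentiles over per-request latencies. As a minimal sketch (not the harness itself), the block below follows the same linear-interpolation approach as the `pct` helper in `loadgen/main.go`. The function name `percentile` and the sample values are illustrative assumptions.

```go
package main

import (
	"fmt"
	"math"
	"sort"
)

// percentile returns the p-th percentile (0-100) of samples using linear
// interpolation between the two nearest ranks. It sorts a copy, so the
// caller's slice is left untouched. Hypothetical helper for illustration.
func percentile(samples []float64, p float64) float64 {
	if len(samples) == 0 {
		return 0
	}
	sorted := append([]float64(nil), samples...)
	sort.Float64s(sorted)
	idx := p / 100 * float64(len(sorted)-1)
	lo, hi := int(math.Floor(idx)), int(math.Ceil(idx))
	if lo == hi {
		return sorted[lo]
	}
	frac := idx - float64(lo)
	return sorted[lo]*(1-frac) + sorted[hi]*frac
}

func main() {
	// Made-up latencies in milliseconds, not benchmark data.
	latenciesMs := []float64{1.9, 1.7, 2.4, 1.8, 6.5}
	fmt.Printf("p50=%.2f ms p99=%.2f ms\n",
		percentile(latenciesMs, 50), percentile(latenciesMs, 99))
}
```

With few samples, a high percentile such as p99 is dominated by the single slowest request, which is one reason larger `N` values give more stable tails.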
+
+## Layout
+
+```
+terraform/            free-tier EC2 + SSH key + security group (apply/destroy)
+remote/               everything shipped to and run on the instance
+  bench-tools/        Go mock backend + loadgen (one small image)
+  configs/            litellm config
+  docker-compose.yml  mock + one gateway per profile (benchnet network)
+  run-on-instance.sh  builds images, runs 6 variants x N gateways, samples stats
+scripts/summarize.py  raw JSON -> latency + resource tables + summary.json
+run.sh                orchestrator: build -> apply -> run -> collect -> destroy
+```
+
+## Prerequisites
+
+- AWS credentials configured (`aws sts get-caller-identity` works)
+- Terraform ≥ 1.6, Docker (with `buildx`), `rsync`, `ssh`, Python 3
+- An AWS account with default VPC in the chosen region
+
+## Run it
+
+```bash
+cd docs/2026-06-25_aws_gateway_benchmark
+./run.sh                              # full run in us-east-1, then auto-destroy
+N=1000 C=20 ./run.sh                  # heavier load
+REGION=eu-west-1 ./run.sh             # different region
+GATEWAYS="gomodel litellm" ./run.sh   # subset
+KEEP=1 ./run.sh                       # leave the instance up for debugging
+```
+
+`run.sh` always tears the instance down via an EXIT trap, even on failure. If a
+run is interrupted, reconcile manually:
+
+```bash
+cd terraform && terraform destroy -auto-approve
+```
+
+Results land in `output//` (raw per-variant JSON, `summary.json`,
+and the printed `summary.txt` table).
+
+## Local dry-run (no AWS)
+
+The instance-side harness runs on any Docker host:
+
+```bash
+cd remote && N=30 C=5 GATEWAYS="gomodel litellm portkey bifrost" ./run-on-instance.sh
+```
+
+(Build the GoModel image first: `docker build -t gomodel-bench:local ../../..`)
+
+## Reproducibility & caveats
+
+- **Pinned**: gateway image refs (overridable via `*_IMAGE` env), the Compose
+  plugin version, instance type, and the deterministic mock payload. Exact image
+  **digests** are recorded in each `*_image.json` so a run is fully traceable.
+- **AMI** resolves to the latest Amazon Linux 2023 via SSM (reproducible by
+  policy). Pin `var.ami_id` for a byte-identical OS.
+- **Free tier**: defaults to **t2.micro** — the 12-month-free-tier instance in
+  us-east-1 — with `standard` CPU credits (no surprise burst charges), a 20 GiB
+  gp3 root volume (free tier allows 30 GiB), the default VPC (no paid NAT/EIP),
+  and an Amazon Linux 2023 AMI. Image pulls are inbound traffic (free). In
+  regions where t2.micro is unavailable, set `INSTANCE_TYPE=t3.micro` (the
+  free-tier instance there). Newer accounts on AWS's credit-based free plan stay
+  within credit for a single short run.
+- **t2.micro is burstable** (1 vCPU, CPU-credit throttled). Treat absolute
+  latency as *indicative*; the value is the *relative* comparison on identical
+  hardware. Gateways run **one at a time** so they never contend, and the load is
+  kept modest (N=500, c=10) to stay within launch credits. For production-grade
+  absolute numbers, set `INSTANCE_TYPE=c7i.large` (not free tier).
+- **Cost**: a single free-tier instance for ~15–30 min — $0 within free-tier
+  allowance, otherwise a few cents.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md b/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
new file mode 100644
index 00000000..b764ef05
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
@@ -0,0 +1,135 @@
+# Results — 2026-06-25 (AWS c7i.large run)
+
+Reference run produced by `./run.sh` (raw data in `output/20260625-182538/`,
+machine summary in that dir's `summary.md` / `summary.json`). Four gateways:
+**GoModel, LiteLLM, Portkey, Bifrost**.
+
+- **Host**: AWS EC2 **c7i.large** (2 vCPU, 4 GiB, **non-burstable** — no CPU-credit
+  drift, so the tail is stable), Amazon Linux 2023, us-east-1.
+- **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials**
+  (latency = median across trials; p99 shown with its min–max), 200-request
+  process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s
+  resource window, capacity sweep at c∈{1,16,128}. Shared in-process **mock**
+  backend, so every number is **gateway overhead**, not model latency.
+- **Parity**: retries disabled on every gateway, GoModel's circuit breaker disabled
+  (so the sweep can't trip it), and **LiteLLM run at its recommended worker count —
+  one worker per CPU core (`num_workers=2` on this 2-vCPU box)** so it isn't pinned
+  to a single core while the Go gateways use both.
+- Images (digests in `*_image.json`): GoModel (built from this repo), latest
+  `litellm:main-stable`, `portkeyai/gateway:latest`, `maximhq/bifrost:latest`.
+
+> Fast reference run (N=8000 × 2 trials) sized to finish end-to-end in well under
+> 20 minutes; the p99 min–max spreads are tight, so the medians are stable. Raise
+> `N`/`REPEATS` for a heavier run.
+
+## Latency — non-streaming (ms, median of trials)
+
+| Workload  | metric | baseline | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|---------:|--------:|--------:|--------:|--------:|
+| chat      | p50    | 0.23 | **1.81** | 2.51 | 9.70 | 30.56 |
+| chat      | p99    | 2.77 | **6.88** | 18.27 | 30.54 | 39.26 |
+| responses | p50    | 0.26 | **2.01** | 2.73 | 9.07 | 39.12 |
+| responses | p99    | 2.33 | **7.28** | 16.55 | 26.92 | 48.60 |
+| messages  | p50    | 0.26 | **1.76** | 2.65 | ✗ | 61.06 |
+| messages  | p99    | 2.23 | **6.59** | 19.08 | ✗ | 98.12 |
+
+**GoModel has the lowest p50 and the tightest p99** (~7 ms vs Bifrost ~18 ms,
+Portkey ~31 ms, LiteLLM ~39 ms). `overhead p50` (gateway p50 − baseline p50):
+GoModel ≈ 1.6 ms, Bifrost ≈ 2.3 ms, Portkey ≈ 9.5 ms, LiteLLM ≈ 30 ms.
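The `overhead p50` figures quoted above are a plain subtraction of the baseline p50 from each gateway's p50. A small sketch of that arithmetic, using the chat row of the non-streaming table as input, could look like this. The helper name `overheadMs` is an assumption for illustration and is not part of the harness.

```go
package main

import "fmt"

// overheadMs returns gateway p50 minus baseline p50, i.e. the latency the
// gateway adds on top of the mock backend. Hypothetical helper.
func overheadMs(gatewayP50, baselineP50 float64) float64 {
	return gatewayP50 - baselineP50
}

func main() {
	const baselineP50 = 0.23 // chat p50 of the baseline column, in ms

	// Chat p50 values copied from the non-streaming latency table.
	rows := []struct {
		name string
		p50  float64
	}{
		{"GoModel", 1.81},
		{"Bifrost", 2.51},
		{"Portkey", 9.70},
		{"LiteLLM", 30.56},
	}
	for _, r := range rows {
		fmt.Printf("%-8s overhead p50 = %.2f ms\n", r.name, overheadMs(r.p50, baselineP50))
	}
}
```

Subtracting the baseline removes the cost of the load generator and the mock themselves, so the remainder is attributable to the gateway.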
+
+## Latency — streaming (ms, median of trials)
+
+| Workload  | metric    | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|--------:|--------:|--------:|--------:|
+| chat      | TTFT p50  | **4.71** | 9.02 | 27.97 | 151.94 |
+| chat      | total p50 | **4.95** | 11.89 | 27.98 | 151.95 |
+| responses | TTFT p50  | **4.69** | 12.87 | 27.90 | 47.53 |
+| responses | total p50 | **5.00** | 14.94 | 27.93 | 47.55 |
+| messages  | TTFT p50  | **7.50** | † | ✗ | 48.86 |
+| messages  | total p50 | **8.38** | † | ✗ | 48.89 |
+
+† **Bifrost messages-stream is an idle-bound artifact, not a throughput number**
+(no terminal event over a non-native backend → 0 completions within the 10 s cap).
+
+## Throughput / capacity (chat non-stream, sustained req/s by concurrency)
+
+| target   | c=1 | c=16 | c=128 | peak | knee |
+|--------|----:|-----:|------:|-----:|-----:|
+| baseline | 15510 | 29701 | 30015 | **30015** | 16 |
+| GoModel  | 2745 | 4928 | 4567 | **4928** | 16 |
+| Bifrost  | 1885 | 3088 | 2904 | **3088** | 16 |
+| Portkey  | 636 | 946 | 900 | **946** | 16 |
+| LiteLLM  | 227 | 324 | 254 | **324** | 16 |
+
+GoModel tops the gateways at **~4900 req/s**, ~1.6× Bifrost, ~5× Portkey, ~15×
+LiteLLM. All saturate by c=16 on 2 vCPUs.
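The ratios in the paragraph above compare each gateway's peak sustained req/s across the three concurrency levels. A minimal sketch of that comparison, assuming "peak" simply means the maximum of the three sweep values in the table, might be written as follows. The `sweep` type and `peak` function are illustrative names, not harness code.

```go
package main

import "fmt"

// sweep holds sustained req/s for one target at each tested concurrency level.
type sweep struct {
	target string
	rps    []float64 // values at c=1, c=16, c=128, copied from the table
}

// peak returns the highest sustained req/s seen across the sweep.
func peak(rps []float64) float64 {
	best := 0.0
	for _, v := range rps {
		if v > best {
			best = v
		}
	}
	return best
}

func main() {
	sweeps := []sweep{
		{"GoModel", []float64{2745, 4928, 4567}},
		{"Bifrost", []float64{1885, 3088, 2904}},
		{"Portkey", []float64{636, 946, 900}},
		{"LiteLLM", []float64{227, 324, 254}},
	}
	ref := peak(sweeps[0].rps) // GoModel is the reference
	for _, s := range sweeps[1:] {
		fmt.Printf("GoModel vs %-8s %.1fx\n", s.target, ref/peak(s.rps))
	}
}
```

How the table's "knee" column is derived is not stated in this section, so the sketch deliberately leaves it out.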
+
+## Resources
+
+| Metric | GoModel | Portkey | Bifrost | LiteLLM |
+|--------|--------:|--------:|--------:|--------:|
+| Docker image, compressed pull (MB) | **16** | 59 | 77 | 372 |
+| Docker image, on-disk (MB) | **47.2** | 177.4 | 230.7 | 1159.9 |
+| Cold start to first 200 (s) | **0.56** | 1.05 | 7.07 | 25.49 |
+| Peak RSS under load (MB)| **37.0** | 112.0 | 143.0 | 2272.3 |
+| Avg CPU under load (%) | 92.6 | 116.9 | 117.6 | 101.1 |
+| Sustained req/s (resource window) | **4824** | 960 | 2977 | 261 |
+| Efficiency (req/s per CPU %) | **52.1** | 8.2 | 25.3 | 2.6 |
+
+GoModel is the most CPU-efficient (**52 req/s per CPU-%**, ~2× Bifrost, ~6×
+Portkey, ~20× LiteLLM), the smallest image (**47 MB**), the smallest footprint
+(**37 MB** peak), and the fastest cold start (**0.56 s**).
+
+> **LiteLLM at its recommended config.** With `num_workers=2` (one per core) LiteLLM
+> is faster and higher-throughput than the earlier single-worker run (≈220 → 324
+> req/s; chat p50 ≈ 44 → 31 ms — a single worker was queuing the 10 concurrent
+> requests), but its **memory doubled to ~2.3 GB** (two ~1 GB worker processes) and
+> its **cold start rose to ~25 s**. Running LiteLLM "properly" widens the resource
+> gap, not narrows it.
+
+## Feature coverage (6 variants)
+
+| Gateway | chat | responses | messages | total |
+|---------|:----:|:---------:|:--------:|:-----:|
+| GoModel | ✓ | ✓ | ✓ | 6/6 |
+| LiteLLM | ✓ | ✓ | ✓ | 6/6 |
+| Bifrost | ✓ | ✓ | ✓† | 6/6 |
+| Portkey | ✓ | ✓ | ✗ | 4/6 |
+
+- **Portkey** errors on the Anthropic `/v1/messages` dialect in this single-provider
+  (openai → mock) setup; setup limitation, not a hard capability gap.
+- **Bifrost** serves Anthropic at `/anthropic/v1/messages`, needs an `openai/`-prefixed
+  model and `allow_private_network:true`; messages-streaming has the caveat above (†).
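The "Efficiency (req/s per CPU %)" row in the Resources table is the sustained req/s of the resource window divided by the average CPU percentage under load. The sketch below reproduces that division from the two rows above it. The `efficiency` helper and its zero-CPU guard are assumptions for illustration.

```go
package main

import "fmt"

// efficiency returns sustained req/s per percentage point of average CPU.
// It returns 0 when no CPU usage was recorded, to avoid dividing by zero.
// Hypothetical helper, not taken from scripts/summarize.py.
func efficiency(reqPerSec, avgCPUPercent float64) float64 {
	if avgCPUPercent <= 0 {
		return 0
	}
	return reqPerSec / avgCPUPercent
}

func main() {
	// Values copied from the Resources table.
	rows := []struct {
		name string
		rps  float64
		cpu  float64
	}{
		{"GoModel", 4824, 92.6},
		{"Portkey", 960, 116.9},
		{"Bifrost", 2977, 117.6},
		{"LiteLLM", 261, 101.1},
	}
	for _, r := range rows {
		fmt.Printf("%-8s %.1f req/s per CPU %%\n", r.name, efficiency(r.rps, r.cpu))
	}
}
```

CPU percentages above 100 are possible here because the host has 2 vCPUs, so `docker stats` can report up to roughly 200%.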
+
+## Takeaways
+
+- **GoModel** — best all-rounder: lowest p50 and tightest p99 (~7 ms), highest
+  gateway throughput (~4900 req/s), best CPU efficiency (52 req/s per %), smallest
+  image (47 MB) and memory (37 MB), fastest cold start (0.56 s), full 6/6 coverage.
+- **Bifrost** (Go) — second on throughput, low p50 but a heavier p99 tail; streaming
+  terminal-event gaps over a non-native backend.
+- **Portkey** (Node) — middle tier; no Anthropic Messages in this setup.
+- **LiteLLM** (Python) — full coverage, but even at its recommended 2-worker config
+  it is ~15× behind on throughput and carries a **1.16 GB image + ~2.3 GB RAM + ~25 s
+  cold start**. The cost of Python on the hot path.
+
+## Methodology notes
+
+- **Repeats + spread** — 2 trials, randomized gateway order each trial; latency is
+  the median across trials, p99 carries its min–max.
+- **Config parity** — retries off on all; GoModel's circuit breaker disabled (a few
+  transient errors under the c=128 sweep would otherwise trip it and blanket-503 its
+  own capacity); **LiteLLM at one worker per core (`num_workers`=vCPUs)**, its own
+  production recommendation, set automatically from `nproc`.
+- **Warm-up** — 200 global + 50 per-variant requests; the per-variant warmup
+  neutralizes LiteLLM's lazy per-dialect imports and, with >1 worker, warms each
+  worker before measuring.
+- **Throughput vs latency separated** — capacity comes from a time-boxed concurrency
+  sweep, not the latency-coupled rps in the latency tables.
+- **Per-variant wall cap (10 s)** — bounds idle-bound streaming variants; cap-aborted
+  requests are reported as `capped`, not `failed`.
+- **Resilient orchestration** — the remote benchmark runs detached (`setsid`) and the
+  orchestrator polls for the `meta.json` sentinel, so an SSH drop can't kill or hang
+  the run; `set -uo` so one flaky variant skips instead of aborting.
+- Reproduce with `./run.sh`; pin `var.ami_id` and the `*_IMAGE` digests for a
+  byte-identical rerun. Heavier run: `N=20000 REPEATS=5 ./run.sh`.
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
new file mode 100644
index 00000000..8285027f
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
@@ -0,0 +1,18 @@
+# Builds the mock backend and load generator into one small static image.
+# Both binaries live in the final image; the compose service / docker run
+# command selects which one to execute.
+FROM golang:1.26-alpine AS build
+WORKDIR /src
+COPY go.mod ./
+COPY mock ./mock
+COPY loadgen ./loadgen
+RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/mock ./mock \
+ && CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/loadgen ./loadgen
+
+FROM gcr.io/distroless/static-debian12:nonroot
+COPY --from=build /out/mock /mock
+COPY --from=build /out/loadgen /loadgen
+# No ENTRYPOINT: each invocation picks the binary as its command, e.g.
+#   docker run img /mock        (compose `command: ["/mock"]`)
+#   docker run img /loadgen -url …
+CMD ["/mock"]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
new file mode 100644
index 00000000..2552d97b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
@@ -0,0 +1,3 @@
+module gomodel-bench-tools
+
+go 1.26
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
new file mode 100644
index 00000000..bcbcf6b8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
@@ -0,0 +1,442 @@
+// loadgen drives concurrent requests at one gateway endpoint and reports latency
+// percentiles. It speaks three request dialects (OpenAI chat, OpenAI responses,
+// Anthropic messages) in streaming and non-streaming modes, so a single binary
+// covers all six benchmark variants.
+//
+// Two closed-loop modes:
+//   - fixed count (-n N): send N requests at concurrency C, then stop.
+//   - time-boxed (-duration D): keep C workers busy for D, counting
+//     completions. Used by the capacity sweep to
+//     measure *sustained* throughput at each
+//     concurrency level (vs the latency-coupled
+//     "completed req/s @ c=N" the fixed mode reports).
+//
+// For streaming requests it records TTFT (time to first token/byte) separately
+// from total latency, plus inter-chunk gap percentiles (a pass-through gateway
+// relays each upstream chunk immediately; a buffering one clumps them). Output is
+// a JSON summary suitable for aggregation.
+package main
+
+import (
+	"bufio"
+	"bytes"
+	"context"
+	"encoding/json"
+	"flag"
+	"fmt"
+	"io"
+	"math"
+	"net/http"
+	"os"
+	"sort"
+	"strings"
+	"sync"
+	"time"
+)
+
+type headerList []string
+
+func (h *headerList) String() string { return strings.Join(*h, ",") }
+func (h *headerList) Set(v string) error {
+	*h = append(*h, v)
+	return nil
+}
+
+type result struct {
+	ttft   time.Duration
+	total  time.Duration
+	gaps   []time.Duration // streaming: time between consecutive received chunks
+	err    string
+	capped bool // aborted by the variant wall cap — excluded from ok/failed
+}
+
+func main() {
+	var (
+		url      = flag.String("url", "", "Target URL")
+		n        = flag.Int("n", 500, "Total requests (fixed-count mode; ignored when -duration > 0)")
+		c        = flag.Int("c", 10, "Concurrency")
+		duration = flag.Duration("duration", 0, "Time-boxed mode: keep -c workers busy for this long, count completions (capacity sweep)")
+		dialect  = flag.String("dialect", "chat", "Request dialect: chat | responses | messages")
+		stream   = flag.Bool("stream", false, "Stream the response")
+		model    = flag.String("model", "gpt-4o-mini", "Model name")
+		auth     = flag.String("auth", "sk-bench-test-key", "Bearer token for Authorization header")
+		jsonOut  = flag.String("json", "", "Write JSON summary to this file")
+		timeout  = flag.Duration("timeout", 30*time.Second, "Per-request hard timeout")
+		idle     = flag.Duration("idle", 1500*time.Millisecond, "Streaming: end the stream if no new data arrives for this long")
+		maxWall  = flag.Duration("max-wall", 0, "Fixed mode: stop launching new requests after this wall time (caps slow/idle-bound variants; 0 = no cap)")
+	)
+	var headers headerList
+	flag.Var(&headers, "H", "Extra header 'Key: Value' (repeatable)")
+	flag.Parse()
+
+	if *url == "" {
+		fmt.Fprintln(os.Stderr, "usage: loadgen -url [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
+		os.Exit(2)
+	}
+
+	body := buildBody(*dialect, *model, *stream)
+	// Tuned transport: keep a full set of hot keep-alive connections for the
+	// configured concurrency so the measured window isn't paying TCP setup. The
+	// stdlib default caps idle conns per host at 2, which would churn connections
+	// at c>2 and add noise to short windows.
+	tr := http.DefaultTransport.(*http.Transport).Clone()
+	tr.MaxIdleConns = *c * 2
+	tr.MaxIdleConnsPerHost = *c * 2
+	client := &http.Client{Transport: tr}
+
+	do := func(ctx context.Context) result {
+		return doRequest(ctx, client, *url, body, *dialect, *stream, *auth, headers, *timeout, *idle)
+	}
+
+	var results []result
+	var wall time.Duration
+	if *duration > 0 {
+		results, wall = driveTimeBoxed(do, *c, *duration)
+	} else {
+		results, wall = driveFixed(do, *n, *c, *maxWall)
+	}
+
+	report(*url, *dialect, *stream, len(results), *c, wall, *duration, results, *jsonOut)
+}
+
+// driveFixed sends up to n requests at concurrency c (closed loop). If maxWall>0
+// it caps the variant at that wall time: it stops launching new requests AND
+// cancels any in-flight ones via the shared context, so an idle-bound streaming
+// variant can't run for the full N at ~7 req/s (and can't stall the launch loop
+// on a full semaphore either). Fast variants reach N long before the cap; slow
+// ones return however many completed in the window. Requests aborted *by* the cap
+// are tagged capped (excluded from ok/failed) rather than counted as errors.
+func driveFixed(do func(context.Context) result, n, c int, maxWall time.Duration) ([]result, time.Duration) {
+	capCtx := context.Background()
+	if maxWall > 0 {
+		var cancel context.CancelFunc
+		capCtx, cancel = context.WithTimeout(capCtx, maxWall)
+		defer cancel()
+	}
+	var mu sync.Mutex
+	results := make([]result, 0, n)
+	sem := make(chan struct{}, c)
+	var wg sync.WaitGroup
+	start := time.Now()
+loop:
+	for i := 0; i < n; i++ {
+		select {
+		case <-capCtx.Done(): // cap reached — stop launching even if the sem is full
+			break loop
+		case sem <- struct{}{}:
+		}
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			defer func() { <-sem }()
+			r := do(capCtx)
+			if r.err != "" && capCtx.Err() != nil {
+				r.capped = true // aborted by the cap, not a genuine failure
+			}
+			mu.Lock()
+			results = append(results, r)
+			mu.Unlock()
+		}()
+	}
+	wg.Wait()
+	return results, time.Since(start)
+}
+
+// driveTimeBoxed keeps c workers continuously busy for d, returning every result
+// completed in the window. This measures sustained throughput at concurrency c
+// (the capacity sweep), independent of any fixed request count. The shared
+// context cancels each worker's final in-flight request at the deadline so the
+// tail can't overrun by a full per-request timeout.
+func driveTimeBoxed(do func(context.Context) result, c int, d time.Duration) ([]result, time.Duration) {
+	ctx, cancel := context.WithTimeout(context.Background(), d)
+	defer cancel()
+	var mu sync.Mutex
+	var results []result
+	var wg sync.WaitGroup
+	start := time.Now()
+	for range c {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for ctx.Err() == nil {
+				r := do(ctx)
+				if r.err != "" && ctx.Err() != nil {
+					r.capped = true // tail request cut off at the window edge
+				}
+				mu.Lock()
+				results = append(results, r)
+				mu.Unlock()
+			}
+		}()
+	}
+	wg.Wait()
+	return results, time.Since(start)
+}
+
+func buildBody(dialect, model string, stream bool) []byte {
+	const prompt = "Say hello for a benchmark test."
+	var req map[string]any
+	switch dialect {
+	case "responses":
+		req = map[string]any{"model": model, "stream": stream, "input": prompt}
+	case "messages":
+		req = map[string]any{
+			"model": model, "stream": stream, "max_tokens": 256,
+			"messages": []map[string]any{{"role": "user", "content": prompt}},
+		}
+	default: // chat
+		req = map[string]any{
+			"model": model, "stream": stream,
+			"messages": []map[string]any{{"role": "user", "content": prompt}},
+		}
+	}
+	b, _ := json.Marshal(req)
+	return b
+}
+
+func doRequest(parent context.Context, client *http.Client, url string, body []byte, dialect string, stream bool, auth string, headers headerList, timeout, idle time.Duration) result {
+	ctx, cancel := context.WithTimeout(parent, timeout)
+	defer cancel()
+
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
+	if err != nil {
+		return result{err: err.Error()}
+	}
+	req.Header.Set("Content-Type", "application/json")
+	if auth != "" {
+		req.Header.Set("Authorization", "Bearer "+auth)
+	}
+	if dialect == "messages" {
+		req.Header.Set("anthropic-version", "2023-06-01")
+	}
+	for _, h := range headers {
+		k, v, ok := strings.Cut(h, ":")
+		if ok {
+			req.Header.Set(strings.TrimSpace(k), strings.TrimSpace(v))
+		}
+	}
+
+	startReq := time.Now()
+	resp, err := client.Do(req)
+	if err != nil {
+		return result{err: err.Error()}
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		b, _ := io.ReadAll(io.LimitReader(resp.Body, 300))
+		return result{err: fmt.Sprintf("HTTP %d: %s", resp.StatusCode, strings.TrimSpace(string(b)))}
+	}
+
+	if !stream {
+		if _, err := io.Copy(io.Discard, resp.Body); err != nil {
+			return result{err: "read body: " + err.Error()}
+		}
+		d := time.Since(startReq)
+		return result{ttft: d, total: d}
+	}
+	return readStream(resp.Body, startReq, dialect, idle)
+}
+
+// readStream consumes an SSE body, recording TTFT at the first data line and
+// total latency at the last received chunk. A stream ends when any of these
+// occurs: the dialect's terminal marker is seen, the connection closes, or no
+// new data arrives for `idle` (a fallback for gateways that stream content but
+// never send a terminal event nor close — e.g. Bifrost's responses/messages
+// streams over an OpenAI-backed provider). Total latency is always measured to
+// the last byte, so the idle wait never inflates the reported latency.
+func readStream(r io.Reader, startReq time.Time, dialect string, idle time.Duration) result {
+	lines := make(chan string, 128)
+	go func() {
+		scanner := bufio.NewScanner(r)
+		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
+		for scanner.Scan() {
+			lines <- scanner.Text()
+		}
+		close(lines)
+	}()
+
+	var ttft, total, prev time.Duration
+	var gaps []time.Duration
+	gotFirst := false
+	timer := time.NewTimer(idle)
+	defer timer.Stop()
+
+	for {
+		select {
+		case line, ok := <-lines:
+			if !ok { // connection closed
+				if !gotFirst {
+					return result{err: "empty stream"}
+				}
+				return result{ttft: ttft, total: total, gaps: gaps}
+			}
+			if !strings.HasPrefix(line, "data: ") {
+				continue
+			}
+			now := time.Since(startReq)
+			if !gotFirst {
+				ttft = now
+				gotFirst = true
+			} else {
+				gaps = append(gaps, now-prev) // time since the previous chunk
+			}
+			prev = now
+			total = now // advance to the most recent chunk
+			if isTerminal(dialect, line[len("data: "):]) {
+				return result{ttft: ttft, total: total, gaps: gaps}
+			}
+			if !timer.Stop() {
+				select {
+				case <-timer.C:
+				default:
+				}
+			}
+			timer.Reset(idle)
+		case <-timer.C: // idle gap: treat as end-of-stream at the last chunk
+			if !gotFirst {
+				return result{err: "no data before idle timeout"}
+			}
+			return result{ttft: ttft, total: total, gaps: gaps}
+		}
+	}
+}
+
+// isTerminal recognizes each dialect's end-of-stream markers, including the
+// content-complete events some gateways send instead of a final wrapper event.
+func isTerminal(dialect, payload string) bool {
+	switch dialect {
+	case "responses":
+		return strings.Contains(payload, `"response.completed"`) ||
+			strings.Contains(payload, `"response.output_text.done"`)
+	case "messages":
+		return strings.Contains(payload, `"message_stop"`)
+	default: // chat
+		return payload == "[DONE]"
+	}
+}
+
+func report(url, dialect string, stream bool, n, c int, wall, duration time.Duration, results []result, jsonOut string) {
+	var ttfts, totals, gaps []float64
+	errs := map[string]int{}
+	errCount := 0
+	cappedCount := 0
+	for _, r := range results {
+		if r.capped { // cut off by the wall cap — neither a success nor a failure
+			cappedCount++
+			continue
+		}
+		if r.err != "" {
+			errCount++
+			errs[r.err]++
+			continue
+		}
+		ttfts = append(ttfts, float64(r.ttft.Microseconds()))
+		totals = append(totals, float64(r.total.Microseconds()))
+		for _, g := range r.gaps {
+			gaps = append(gaps, float64(g.Microseconds()))
+		}
+	}
+	ok := len(totals)
+	sort.Float64s(ttfts)
+	sort.Float64s(totals)
+	sort.Float64s(gaps)
+
+	mode := "nonstream"
+	if stream {
+		mode = "stream"
+	}
+	rps := 0.0
+	if wall > 0 {
+		rps = float64(ok) / wall.Seconds()
+	}
+	// measure documents how rps should be read: throughput = sustained capacity
+	// at concurrency c; latency = completed req/s @ c (coupled to per-req latency).
+	measure := "latency"
+	if duration > 0 {
+		measure = "throughput"
+	}
+
+	// "-" means machine-readable only: emit JSON to stdout, skip the human report.
+	if jsonOut == "-" {
+		writeSummary("-", url, dialect, mode, measure, n, ok, errCount, cappedCount, c, wall, rps, ttfts, totals, gaps, errs)
+		return
+	}
+
+	fmt.Printf("\n=== %s/%s %s (%s) ===\n", dialect, mode, url, measure)
+	fmt.Printf("requests: %d ok: %d failed: %d capped: %d concurrency: %d wall: %s rps: %.1f\n",
+		n, ok, errCount, cappedCount, c, wall.Round(time.Millisecond), rps)
+	if ok > 0 {
+		fmt.Printf("total latency ms p50=%.2f p90=%.2f p95=%.2f p99=%.2f max=%.2f\n",
+			ms(pct(totals, 50)), ms(pct(totals, 90)), ms(pct(totals, 95)), ms(pct(totals, 99)), ms(totals[ok-1]))
+		if stream {
+			fmt.Printf("ttft ms p50=%.2f p90=%.2f p95=%.2f p99=%.2f\n",
+				ms(pct(ttfts, 50)), ms(pct(ttfts, 90)), ms(pct(ttfts, 95)), ms(pct(ttfts, 99)))
+			if len(gaps) > 0 {
+				fmt.Printf("inter-chunk ms p50=%.2f p90=%.2f p99=%.2f\n",
+					ms(pct(gaps, 50)), ms(pct(gaps, 90)), ms(pct(gaps, 99)))
+			}
+		}
+	}
+	for e, ct := range errs {
+		fmt.Printf(" error x%d: %s\n", ct, e)
+	}
+
+	if jsonOut != "" {
+		writeSummary(jsonOut, url, dialect, mode, measure, n, ok, errCount, cappedCount, c, wall, rps, ttfts, totals, gaps, errs)
+	}
+}
+
+func writeSummary(path, url, dialect, mode, measure string, n, ok, errCount, cappedCount, c int, wall time.Duration, rps float64, ttfts, totals, gaps []float64, errs map[string]int) {
+	sample := func(s []float64) map[string]any {
+		if len(s) == 0 {
+			return map[string]any{}
+		}
+		return map[string]any{
+			"p50_ms": ms(pct(s, 50)), "p90_ms": ms(pct(s, 90)),
+			"p95_ms": ms(pct(s, 95)), "p99_ms": ms(pct(s, 99)),
+			"min_ms": ms(s[0]), "max_ms": ms(s[len(s)-1]), "avg_ms": ms(avg(s)),
+		}
+	}
+	out := map[string]any{
+		"url": url, "dialect": dialect, "mode": mode, "measure": measure,
+		"requests": n, "ok": ok, "failed": errCount, "capped": cappedCount, "concurrency": c,
+		"wall_ms": wall.Milliseconds(), "rps": rps,
+		"total_latency": sample(totals),
+		"ttft":          sample(ttfts),
+		"inter_chunk":   sample(gaps),
+		"errors":        errs,
+	}
+	b, _ := json.MarshalIndent(out, "", " ")
+	if path == "-" {
+		fmt.Println(string(b))
+		return
+	}
+	if err := os.WriteFile(path, b, 0o644); err != nil {
+		fmt.Fprintf(os.Stderr, "write %s: %v\n", path, err)
+		os.Exit(1)
+	}
+}
+
+func pct(sorted []float64, p float64) float64 {
+	if len(sorted) == 0 {
+		return 0
+	}
+	idx := p / 100 * float64(len(sorted)-1)
+	lo, hi := int(math.Floor(idx)), int(math.Ceil(idx))
+	if lo == hi {
+		return sorted[lo]
+	}
+	frac := idx - float64(lo)
+	return sorted[lo]*(1-frac) + sorted[hi]*frac
+}
+
+func avg(s []float64) float64 {
+	sum := 0.0
+	for _, v := range s {
+		sum += v
+	}
+	return sum / float64(len(s))
+}
+
+func ms(us float64) float64 { return math.Round(us/10) / 100 } // microseconds -> ms, 2dp
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
new file mode 100644
index 00000000..8930cf8a
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
@@ -0,0 +1,411 @@
+// Mock OpenAI/Anthropic-compatible backend for gateway benchmarking.
+//
+// It answers instantly with deterministic payloads so the benchmark measures
+// pure gateway overhead, not upstream model latency. Three dialects are served
+// so every gateway can be exercised through its own native translation path:
+//
+//	/v1/chat/completions — OpenAI Chat Completions (stream + non-stream)
+//	/v1/responses        — OpenAI Responses (stream + non-stream)
+//	/v1/messages         — Anthropic Messages (stream + non-stream)
+//
+// Each path is also exposed without the /v1 prefix because some gateways strip
+// it before forwarding upstream.
+// +// Recording mode (MOCK_RECORD=1): every upstream request is captured (method, +// path, headers with secrets redacted, body) along with the canned response the +// mock returned, and exposed via control endpoints so a harness can inspect how +// each gateway *translated* a client request: +// +// POST /__reset clear the capture log +// GET /__log {"entries":[...]} all captured exchanges since reset +// GET /__last the most recent captured exchange +// +// Recording also enriches responses with provider-specific extras +// (system_fingerprint, service_tier, x_provider_note) so response-normalization +// fidelity is observable. Both behaviors are gated so the latency benchmark +// stays byte-identical when MOCK_RECORD is unset. +package main + +import ( + "encoding/json" + "fmt" + "io" + "log" + "net/http" + "os" + "strings" + "sync" + "time" +) + +func main() { + port := "9999" + if p := os.Getenv("MOCK_PORT"); p != "" { + port = p + } + + mux := http.NewServeMux() + register(mux, "/chat/completions", handleChatCompletions) + register(mux, "/responses", handleResponses) + register(mux, "/messages", handleMessages) + mux.HandleFunc("/v1/models", handleModels) + mux.HandleFunc("/models", handleModels) + mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) { + writeJSONBytes(w, http.StatusOK, []byte(`{"status":"ok"}`)) + }) + mux.HandleFunc("/__reset", handleReset) + mux.HandleFunc("/__log", handleLog) + mux.HandleFunc("/__last", handleLast) + + log.Printf("Mock backend (openai+anthropic) listening on :%s (record=%v)", port, recording()) + if err := http.ListenAndServe(":"+port, mux); err != nil { + log.Fatal(err) + } +} + +// register binds a handler at both the canonical and /v1-prefixed path. 
+func register(mux *http.ServeMux, path string, h http.HandlerFunc) { + mux.HandleFunc(path, h) + mux.HandleFunc("/v1"+path, h) +} + +// ---------- request/response capture ---------- + +func recording() bool { return os.Getenv("MOCK_RECORD") == "1" } + +// entry is one captured upstream exchange: the request a gateway sent and the +// response the mock returned for it. +type entry struct { + Seq int `json:"seq"` + Time string `json:"time"` + Method string `json:"method"` + Path string `json:"path"` + Query string `json:"query,omitempty"` + Headers map[string]string `json:"headers"` + Body json.RawMessage `json:"body,omitempty"` + BodyText string `json:"body_text,omitempty"` // set when body is not valid JSON + Stream bool `json:"stream"` + Response any `json:"response,omitempty"` +} + +var rec struct { + mu sync.Mutex + entries []*entry + seq int +} + +var sensitiveHeaders = map[string]bool{ + "authorization": true, "x-api-key": true, "api-key": true, + "x-portkey-api-key": true, "x-goog-api-key": true, +} + +// begin reads and (in recording mode) captures the request, returning the entry +// so the handler can attach the response it produces. Returns ok=false if the +// request was already rejected. 
+func begin(w http.ResponseWriter, r *http.Request) (*entry, bool) { + if r.Method != http.MethodPost { + http.Error(w, "method not allowed", http.StatusMethodNotAllowed) + return nil, false + } + raw, _ := io.ReadAll(r.Body) + var sr streamReq + if err := json.Unmarshal(raw, &sr); err != nil { + http.Error(w, "invalid request body", http.StatusBadRequest) + return nil, false + } + e := &entry{ + Time: time.Now().UTC().Format(time.RFC3339Nano), Method: r.Method, + Path: r.URL.Path, Query: r.URL.RawQuery, Headers: captureHeaders(r), + Stream: sr.Stream, + } + if json.Valid(raw) { + e.Body = json.RawMessage(raw) + } else { + e.BodyText = string(raw) + } + if recording() { + rec.mu.Lock() + rec.seq++ + e.Seq = rec.seq + rec.entries = append(rec.entries, e) + rec.mu.Unlock() + } + return e, true +} + +func captureHeaders(r *http.Request) map[string]string { + h := make(map[string]string, len(r.Header)) + for k, v := range r.Header { + val := strings.Join(v, ", ") + if sensitiveHeaders[strings.ToLower(k)] { + val = fmt.Sprintf("redacted(len=%d)", len(val)) + } + h[k] = val + } + return h +} + +func handleReset(w http.ResponseWriter, _ *http.Request) { + rec.mu.Lock() + rec.entries = nil + rec.seq = 0 + rec.mu.Unlock() + writeJSONBytes(w, http.StatusOK, []byte(`{"ok":true}`)) +} + +func handleLog(w http.ResponseWriter, _ *http.Request) { + rec.mu.Lock() + defer rec.mu.Unlock() + writeJSON(w, map[string]any{"entries": rec.entries}) +} + +func handleLast(w http.ResponseWriter, _ *http.Request) { + rec.mu.Lock() + defer rec.mu.Unlock() + if len(rec.entries) == 0 { + writeJSONBytes(w, http.StatusNotFound, []byte(`{"error":"no entries"}`)) + return + } + writeJSON(w, rec.entries[len(rec.entries)-1]) +} + +// streamTokens is the deterministic body streamed token-by-token. Kept short and +// fixed so every run transfers identical bytes. +var streamTokens = []string{ + "This ", "is ", "a ", "benchmark ", "response ", "from ", "the ", "mock ", + "backend ", "server. 
", "It ", "contains ", "enough ", "text ", "to ", "be ", + "representative ", "of ", "a ", "typical ", "short ", "AI ", "response ", + "that ", "would ", "be ", "returned ", "in ", "production ", "use ", "cases.", +} + +func fullText() string { return strings.Join(streamTokens, "") } + +// providerExtras returns provider-specific fields (only in recording mode) so +// response-normalization fidelity is observable across gateways. +func providerExtras() map[string]any { + if !recording() { + return nil + } + return map[string]any{ + "system_fingerprint": "fp_mock_0001", + "service_tier": "default", + "x_provider_note": "mock-extra-field", + } +} + +func merge(base map[string]any, extra map[string]any) map[string]any { + for k, v := range extra { + base[k] = v + } + return base +} + +// ---------- OpenAI Chat Completions ---------- + +func handleChatCompletions(w http.ResponseWriter, r *http.Request) { + e, ok := begin(w, r) + if !ok { + return + } + if e.Stream { + streamChatCompletion(w, e) + } else { + nonStreamChatCompletion(w, e) + } +} + +func nonStreamChatCompletion(w http.ResponseWriter, e *entry) { + resp := merge(map[string]any{ + "id": "chatcmpl-bench-001", + "object": "chat.completion", + "created": time.Now().Unix(), + "model": "gpt-4o-mini", + "choices": []map[string]any{{ + "index": 0, + "message": map[string]any{"role": "assistant", "content": fullText()}, + "finish_reason": "stop", + }}, + "usage": map[string]any{"prompt_tokens": 25, "completion_tokens": 35, "total_tokens": 60}, + }, providerExtras()) + respond(w, e, resp) +} + +func streamChatCompletion(w http.ResponseWriter, e *entry) { + flusher := beginSSE(w) + if flusher == nil { + return + } + setStreamResp(e, "chat.completion.chunk") + now := time.Now().Unix() + send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}`, now)) + for _, 
tok := range streamTokens { + send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":%q},"finish_reason":null}]}`, now, tok)) + } + send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":35,"total_tokens":60}}`, now)) + send(w, flusher, "", "[DONE]") +} + +// ---------- OpenAI Responses ---------- + +func handleResponses(w http.ResponseWriter, r *http.Request) { + e, ok := begin(w, r) + if !ok { + return + } + if e.Stream { + streamResponses(w, e) + } else { + nonStreamResponses(w, e) + } +} + +func nonStreamResponses(w http.ResponseWriter, e *entry) { + resp := merge(map[string]any{ + "id": "resp-bench-001", "object": "response", "created_at": time.Now().Unix(), + "model": "gpt-4o-mini", "status": "completed", + "output": []map[string]any{{ + "type": "message", "id": "msg-bench-001", "role": "assistant", + "content": []map[string]any{{"type": "output_text", "text": fullText()}}, + }}, + "usage": map[string]any{"input_tokens": 25, "output_tokens": 35, "total_tokens": 60}, + }, providerExtras()) + respond(w, e, resp) +} + +func streamResponses(w http.ResponseWriter, e *entry) { + flusher := beginSSE(w) + if flusher == nil { + return + } + setStreamResp(e, "response.*") + now := time.Now().Unix() + send(w, flusher, "response.created", mustJSON(map[string]any{"id": "resp-bench-001", "object": "response", "created_at": now, "model": "gpt-4o-mini", "status": "in_progress", "output": []any{}})) + send(w, flusher, "response.output_item.added", mustJSON(map[string]any{"type": "message", "id": "msg-bench-001", "role": "assistant", "content": []any{}})) + send(w, flusher, "response.content_part.added", mustJSON(map[string]any{"type": "output_text", "text": ""})) + for _, tok := range 
streamTokens { + send(w, flusher, "response.output_text.delta", mustJSON(map[string]any{"type": "response.output_text.delta", "delta": tok})) + } + send(w, flusher, "response.output_text.done", mustJSON(map[string]any{"type": "response.output_text.done", "text": fullText()})) + send(w, flusher, "response.completed", mustJSON(map[string]any{ + "id": "resp-bench-001", "object": "response", "status": "completed", + "output": []map[string]any{{"type": "message", "id": "msg-bench-001", "role": "assistant", + "content": []map[string]any{{"type": "output_text", "text": fullText()}}}}, + "usage": map[string]any{"input_tokens": 25, "output_tokens": 35, "total_tokens": 60}, + })) +} + +// ---------- Anthropic Messages ---------- + +func handleMessages(w http.ResponseWriter, r *http.Request) { + e, ok := begin(w, r) + if !ok { + return + } + if e.Stream { + streamMessages(w, e) + } else { + nonStreamMessages(w, e) + } +} + +func nonStreamMessages(w http.ResponseWriter, e *entry) { + resp := merge(map[string]any{ + "id": "msg-bench-001", "type": "message", "role": "assistant", + "model": "claude-3-5-sonnet", + "content": []map[string]any{{"type": "text", "text": fullText()}}, + "stop_reason": "end_turn", "stop_sequence": nil, + "usage": map[string]any{"input_tokens": 25, "output_tokens": 35}, + }, providerExtras()) + respond(w, e, resp) +} + +func streamMessages(w http.ResponseWriter, e *entry) { + flusher := beginSSE(w) + if flusher == nil { + return + } + setStreamResp(e, "message_*") + send(w, flusher, "message_start", mustJSON(map[string]any{"type": "message_start", "message": map[string]any{ + "id": "msg-bench-001", "type": "message", "role": "assistant", "model": "claude-3-5-sonnet", + "content": []any{}, "stop_reason": nil, "usage": map[string]any{"input_tokens": 25, "output_tokens": 1}, + }})) + send(w, flusher, "content_block_start", mustJSON(map[string]any{"type": "content_block_start", "index": 0, "content_block": map[string]any{"type": "text", "text": ""}})) + for 
_, tok := range streamTokens { + send(w, flusher, "content_block_delta", mustJSON(map[string]any{"type": "content_block_delta", "index": 0, "delta": map[string]any{"type": "text_delta", "text": tok}})) + } + send(w, flusher, "content_block_stop", mustJSON(map[string]any{"type": "content_block_stop", "index": 0})) + send(w, flusher, "message_delta", mustJSON(map[string]any{"type": "message_delta", "delta": map[string]any{"stop_reason": "end_turn", "stop_sequence": nil}, "usage": map[string]any{"output_tokens": 35}})) + send(w, flusher, "message_stop", mustJSON(map[string]any{"type": "message_stop"})) +} + +// ---------- Models ---------- + +func handleModels(w http.ResponseWriter, _ *http.Request) { + writeJSON(w, map[string]any{ + "object": "list", + "data": []map[string]any{ + {"id": "gpt-4o-mini", "object": "model", "owned_by": "openai", "created": time.Now().Unix()}, + {"id": "claude-3-5-sonnet", "object": "model", "owned_by": "anthropic", "created": time.Now().Unix()}, + }, + }) +} + +// ---------- Shared helpers ---------- + +// streamReq is the only field the mock needs to decode from any request body. +type streamReq struct { + Stream bool `json:"stream"` +} + +// respond writes a non-stream JSON response and records it on the entry. +func respond(w http.ResponseWriter, e *entry, v map[string]any) { + e.Response = v + writeJSON(w, v) +} + +// setStreamResp records a compact description of a streamed canned response. +func setStreamResp(e *entry, kind string) { + e.Response = map[string]any{"stream": true, "event_kind": kind, "text": fullText()} +} + +func beginSSE(w http.ResponseWriter) http.Flusher { + flusher, ok := w.(http.Flusher) + if !ok { + http.Error(w, "streaming not supported", http.StatusInternalServerError) + return nil + } + w.Header().Set("Content-Type", "text/event-stream") + w.Header().Set("Cache-Control", "no-cache") + w.Header().Set("Connection", "keep-alive") + return flusher +} + +// send writes one SSE frame. 
An empty event name omits the event: line (OpenAI +// chat style); a name emits "event: " (Responses / Anthropic style). +func send(w http.ResponseWriter, flusher http.Flusher, event, data string) { + if event != "" { + fmt.Fprintf(w, "event: %s\n", event) + } + fmt.Fprintf(w, "data: %s\n\n", data) + flusher.Flush() +} + +func writeJSON(w http.ResponseWriter, v any) { + w.Header().Set("Content-Type", "application/json") + if err := json.NewEncoder(w).Encode(v); err != nil { + log.Printf("encode response: %v", err) + } +} + +func writeJSONBytes(w http.ResponseWriter, status int, payload []byte) { + w.Header().Set("Content-Type", "application/json") + w.WriteHeader(status) + if _, err := w.Write(payload); err != nil { + log.Printf("write response: %v", err) + } +} + +func mustJSON(v any) string { + b, _ := json.Marshal(v) + return string(b) +} diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json new file mode 100644 index 00000000..0adbef1c --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json @@ -0,0 +1,60 @@ +{ + "$schema": "https://www.getbifrost.ai/schema", + "client": { + "drop_excess_requests": false, + "initial_pool_size": 5000, + "enable_logging": false, + "enforce_auth_on_inference": false, + "allowed_origins": [ + "*" + ] + }, + "config_store": { + "enabled": true, + "type": "sqlite", + "config": { + "path": "/app/data/config.db" + } + }, + "logs_store": { + "enabled": false + }, + "providers": { + "openai": { + "keys": [ + { + "name": "mock", + "value": "sk-bench-test-key", + "models": [ + "gpt-4o-mini" + ], + "weight": 1 + } + ], + "network_config": { + "base_url": "http://mock:9999", + "default_request_timeout_in_seconds": 60, + "max_retries": 0, + "allow_private_network": true + } + }, + "anthropic": { + "keys": [ + { + "name": "mock", + "value": "sk-bench-test-key", + "models": [ + "claude-3-5-sonnet" + ], + 
"weight": 1 + } + ], + "network_config": { + "base_url": "http://mock:9999", + "default_request_timeout_in_seconds": 60, + "max_retries": 0, + "allow_private_network": true + } + } + } +} diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml new file mode 100644 index 00000000..fafffa5c --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml @@ -0,0 +1,19 @@ +# LiteLLM proxy config for the benchmark. One model, routed at the mock backend +# over the shared docker network. Retries/spend-logging disabled so we measure +# routing overhead, not bookkeeping. +model_list: + - model_name: gpt-4o-mini + litellm_params: + model: openai/gpt-4o-mini + api_key: sk-bench-test-key + api_base: http://mock:9999/v1 + +general_settings: + master_key: null + disable_spend_logs: true + +litellm_settings: + num_retries: 0 + request_timeout: 60 + drop_params: true + set_verbose: false diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml b/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml new file mode 100644 index 00000000..73d169f1 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml @@ -0,0 +1,99 @@ +# Benchmark topology: a shared mock backend plus exactly one gateway at a time. +# +# Each gateway lives in its own profile so we can bring them up sequentially and +# never let them contend for the (single-vCPU, free-tier) instance: +# +# docker compose --profile gomodel up -d # mock + gomodel +# docker compose --profile litellm up -d # mock + litellm +# docker compose --profile portkey up -d # mock + portkey +# +# The mock has no profile, so it starts alongside whichever gateway is selected. +# The load generator is run separately via `docker run --network benchnet`. +# +# Image refs are overridable via env so exact versions/digests stay pinnable. 
+ +networks: + default: + name: benchnet + +services: + mock: + image: ${BENCH_TOOLS_IMAGE:-bench-tools:local} + command: ["/mock"] + environment: + - MOCK_PORT=9999 + ports: + - "9999:9999" # published so the runner can record a no-gateway baseline + restart: "no" + + gomodel: + profiles: ["gomodel"] + image: ${GOMODEL_IMAGE:-gomodel-bench:local} + depends_on: [mock] + ports: + - "8080:8080" + environment: + - PORT=8080 + - GOMODEL_MASTER_KEY= + - OPENAI_API_KEY=sk-bench-test-key + - OPENAI_BASE_URL=http://mock:9999/v1 + - LOGGING_ENABLED=false + - USAGE_ENABLED=false + - METRICS_ENABLED=false + - SWAGGER_ENABLED=false + - PPROF_ENABLED=false + - ENABLE_PASSTHROUGH_ROUTES=false + # Config parity with the no-retry / no-breaker peers (LiteLLM, Bifrost): + # - retries off (default is 3). + # - circuit breaker effectively disabled. It has no on/off env, so set an + # unreachable failure threshold. This matters under load: a few transient + # upstream errors at high concurrency were tripping the default breaker + # (threshold 5), which then blanket-503'd every request ("circuit breaker + # is open") and made GoModel's own capacity/resource numbers read as ~0. + # No other gateway has a breaker, so disabling it keeps the test fair. + - RETRY_MAX_RETRIES=0 + - CIRCUIT_BREAKER_FAILURE_THRESHOLD=1000000000 + - STORAGE_TYPE=sqlite + - SQLITE_PATH=/app/data/gomodel-bench.db + - GOMODEL_CACHE_DIR=/app/.cache + restart: "no" + + litellm: + profiles: ["litellm"] + image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable} + depends_on: [mock] + ports: + - "4000:4000" + volumes: + - ./configs/litellm-config.yaml:/app/config.yaml:ro + # LiteLLM's own recommendation is one worker per CPU core. run-on-instance.sh + # sets LITELLM_NUM_WORKERS=$(nproc) so this matches the box (2 on c7i.large), + # giving LiteLLM the same multi-core access the Go gateways get for free. 
+ command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "${LITELLM_NUM_WORKERS:-1}"] + restart: "no" + + portkey: + profiles: ["portkey"] + image: ${PORTKEY_IMAGE:-portkeyai/gateway:latest} + depends_on: [mock] + ports: + - "8787:8787" + environment: + # Allow the mock service hostname through Portkey's SSRF/private-IP filter. + - TRUSTED_CUSTOM_HOSTS=mock + restart: "no" + + bifrost: + profiles: ["bifrost"] + image: ${BIFROST_IMAGE:-maximhq/bifrost:latest} + depends_on: [mock] + ports: + - "8089:8089" + environment: + - APP_PORT=8089 + - APP_HOST=0.0.0.0 # bind all interfaces (default binds localhost) + volumes: + # Provider config (openai + anthropic -> mock); /app/data stays writable + # for Bifrost's sqlite config store. + - ./configs/bifrost-config.json:/app/data/config.json:ro + restart: "no" diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh b/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh new file mode 100755 index 00000000..bfb384b4 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh @@ -0,0 +1,379 @@ +#!/usr/bin/env bash +# Runs the gateway latency + capacity + resource benchmark on the local docker host. +# +# Designed to run ON the provisioned EC2 instance (invoked by ../run.sh over +# SSH), but it works on any docker host. Two passes: +# +# Pass A — latency: REPEATS independent trials. Each trial brings up exactly one +# gateway at a time (no contention), warms it, drives all six request +# variants, tears it down. Gateway *order is randomized every trial* so +# no gateway is pinned to the most-favorable slot; results land in +# results/run/. Aggregation (median + spread across trials) is left +# to scripts/summarize.py. 
+# +# Pass B — capacity + footprint (once): per gateway, measure cold-start latency, +# image size, a throughput-vs-concurrency sweep (sustained req/s at each +# concurrency level — true capacity, not latency-coupled), and CPU/mem +# under sustained load. +# +# Results are written as JSON to ./results/ for the orchestrator to collect. +# +# NOTE: deliberately NOT `set -e`. This is a resilient benchmark harness — a +# single flaky docker/compose/curl on one variant must not abort the whole run; +# it should skip to the next variant and still reach the final meta.json sentinel +# the orchestrator polls for. Failures are visible in each variant's ok/failed. +set -uo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +cd "$SCRIPT_DIR" +RESULTS_DIR="$SCRIPT_DIR/results" +COMPOSE=(docker compose -p bench) + +# Load knobs. Defaults target a non-burstable box (c7i.large); see ../run.sh. +N="${N:-20000}" # requests per variant (large enough for a stable p99) +C="${C:-10}" # reference concurrency for the latency pass +REPEATS="${REPEATS:-5}" # independent latency trials (median + spread) +WARMUP="${WARMUP:-100}" # global chat warmup after a gateway starts (process/connection init) +WARMUP_VARIANT="${WARMUP_VARIANT:-30}" # per-variant warmup (per-dialect lazy-import cold start) +RESOURCE_SECONDS="${RESOURCE_SECONDS:-15}" # sustained-load window for CPU/mem sampling +REST_SECONDS="${REST_SECONDS:-5}" # settle gap between targets (cooldown) +# Per-variant wall cap: fast variants hit full N in seconds; this only bites the +# idle-bound streaming variants (e.g. Bifrost streams over a non-native backend +# fall back to the 1.5s idle timeout → ~7 req/s, which would take ~50 min for N). 
+MAX_VARIANT_SECONDS="${MAX_VARIANT_SECONDS:-60}" +SWEEP_CONCURRENCY="${SWEEP_CONCURRENCY:-1 2 4 8 16 32 64 128 256}" # capacity-sweep points +SWEEP_DURATION="${SWEEP_DURATION:-8}" # seconds of sustained load per sweep point +GATEWAYS="${GATEWAYS:-gomodel litellm portkey bifrost}" +# LiteLLM recommends one worker per CPU core; match the box so it isn't pinned to a +# single core while the Go gateways use all of them. Exported for docker-compose's +# ${LITELLM_NUM_WORKERS} substitution. (Per-variant warmup already warms each +# dialect; with >1 worker the warmup also spreads across workers.) +export LITELLM_NUM_WORKERS="${LITELLM_NUM_WORKERS:-$(nproc 2>/dev/null || echo 1)}" + +AUTH="sk-bench-test-key" + +log() { printf '\n\033[1;34m>>> %s\033[0m\n' "$*"; } + +rm -rf "$RESULTS_DIR"; mkdir -p "$RESULTS_DIR" + +# ── helpers ──────────────────────────────────────────────────────── +# epoch as a float second (python3 is present on AL2023 + macOS; coarse fallback). +epoch() { python3 -c 'import time;print(time.time())' 2>/dev/null || date +%s; } + +# shuffle a space-separated list; seed varies per call so trials differ in order. +shuffle() { + printf '%s\n' $1 | awk -v seed="${2:-$RANDOM}" 'BEGIN{srand(seed)} {print rand()"\t"$0}' \ + | sort -k1,1n | cut -f2- | tr '\n' ' ' +} + +# svc:internal port + any extra loadgen headers (Portkey routing). +gw_port() { case "$1" in gomodel) echo 8080;; litellm) echo 4000;; portkey) echo 8787;; bifrost) echo 8089;; mock) echo 9999;; esac; } + +# gw_headers fills the global HDRS array with loadgen -H args for the target. +HDRS=() +gw_headers() { + HDRS=() + case "$1" in + portkey) + HDRS=(-H 'x-portkey-provider: openai' -H 'x-portkey-custom-host: http://mock:9999/v1') + ;; + esac +} + +# Model name per gateway. Bifrost routes by an explicit "provider/model" prefix. +gw_model() { case "$1" in bifrost) echo "openai/gpt-4o-mini";; *) echo "gpt-4o-mini";; esac; } + +# Path per (gateway, dialect). 
Bifrost exposes Anthropic Messages under /anthropic/v1/messages. +gw_path() { # target dialect default_path + if [[ "$1" == "bifrost" && "$2" == "messages" ]]; then echo "/anthropic/v1/messages"; else echo "$3"; fi +} + +# The six benchmark variants: dialect | mode | path +VARIANTS=( + "chat|nonstream|/v1/chat/completions" + "chat|stream|/v1/chat/completions" + "responses|nonstream|/v1/responses" + "responses|stream|/v1/responses" + "messages|nonstream|/v1/messages" + "messages|stream|/v1/messages" +) + +BENCH_TOOLS_IMAGE="${BENCH_TOOLS_IMAGE:-bench-tools:local}" + +# loadgen runs in a throwaway container on the shared benchnet network so it can +# reach gateways/mock by service name. JSON summary comes back on stdout. +run_variant() { + local target="$1" svc="$2" spec="$3" outfile="$4" + local dialect mode path + IFS='|' read -r dialect mode path <<< "$spec" + path="$(gw_path "$target" "$dialect" "$path")" + local port; port="$(gw_port "$svc")" + local url="http://${svc}:${port}${path}" + + local base=(-url "$url" -dialect "$dialect" -model "$(gw_model "$target")" -c "$C" -auth "$AUTH" -json -) + [[ "$mode" == "stream" ]] && base+=(-stream) + [[ "$MAX_VARIANT_SECONDS" -gt 0 ]] && base+=(-max-wall "${MAX_VARIANT_SECONDS}s") + gw_headers "$target" + if [[ ${#HDRS[@]} -gt 0 ]]; then base+=("${HDRS[@]}"); fi + + # Per-variant warmup: warm THIS exact dialect+mode before measuring. Python + # gateways (LiteLLM) lazily import per-dialect translation modules on first use, + # so a chat-only warmup leaves responses/messages cold and inflates their tails. + if [[ "$WARMUP_VARIANT" -gt 0 ]]; then + docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen \ + "${base[@]}" -n "$WARMUP_VARIANT" >/dev/null 2>&1 || true + fi + + docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen \ + "${base[@]}" -n "$N" > "$outfile" 2>/dev/null || true + + # `|| true`: a single empty/missing summary must never abort the whole run. 
+ local ok fail + ok="$(grep -o '"ok": *[0-9]*' "$outfile" 2>/dev/null | head -1 | grep -o '[0-9]*' || true)" + fail="$(grep -o '"failed": *[0-9]*' "$outfile" 2>/dev/null | head -1 | grep -o '[0-9]*' || true)" + printf ' %-8s %-10s %-9s ok=%-6s failed=%s\n' "$target" "$dialect" "$mode" "${ok:-?}" "${fail:-?}" +} + +# run_sweep drives a throughput-vs-concurrency sweep (chat, non-stream) so we can +# read each gateway's saturation point — sustained req/s at each concurrency, via +# loadgen's time-boxed mode (not the latency pass's fixed-N, latency-coupled rps). +run_sweep() { + local label="$1" svc="$2" port; port="$(gw_port "$svc")" + local url="http://${svc}:${port}/v1/chat/completions" + mkdir -p "$RESULTS_DIR/sweep" + gw_headers "$label"; local hdr=(); [[ ${#HDRS[@]} -gt 0 ]] && hdr=("${HDRS[@]}") + for cc in $SWEEP_CONCURRENCY; do + local args=(-url "$url" -dialect chat -model "$(gw_model "$label")" -c "$cc" -duration "${SWEEP_DURATION}s" -auth "$AUTH" -json -) + [[ ${#hdr[@]} -gt 0 ]] && args+=("${hdr[@]}") + docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${args[@]}" \ + > "$RESULTS_DIR/sweep/${label}_c${cc}.json" 2>/dev/null || true + local rps; rps="$(grep -o '"rps": *[0-9.]*' "$RESULTS_DIR/sweep/${label}_c${cc}.json" 2>/dev/null | head -1 | grep -o '[0-9.]*' || true)" + printf ' sweep %-8s c=%-4s rps=%s\n' "$label" "$cc" "${rps:-?}" + done +} + +# awk program that normalizes a docker-stats MemUsage field to MiB, then prints +# "mem_mb,cpu_pct". 
+STAT_AWK=' +function tomib(s, v){ v=s; gsub(/[^0-9.]/,"",v); v=v+0; + if (s ~ /GiB|GB/) return v*1024; + if (s ~ /MiB|MB/) return v; + if (s ~ /KiB|kB/) return v/1024; + if (s ~ /[0-9]B/) return v/1048576; + return v } +{ split($0,a,";"); mem=a[1]; sub(/ ?\/.*/,"",mem); + cpu=a[2]; gsub(/[^0-9.]/,"",cpu); + m=tomib(mem); if (m>0) printf "%.2f,%s\n", m, cpu }' + +SAMPLER_PID="" +start_sampler() { + local cname="$1" csv="$2" + echo "mem_mb,cpu_pct" > "$csv" + ( + while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do + docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \ + | awk "$STAT_AWK" >> "$csv" || true + done + ) & + SAMPLER_PID=$! +} + +# Drive sustained chat load at a gateway for ~RESOURCE_SECONDS so the sampler +# captures the container under genuine pressure. Writes loadgen's summary to +# $3 so the achieved rps shares the exact window the CPU sample covers (lets +# summarize.py compute a self-consistent rps-per-CPU% efficiency). +sustained_load() { + local gw="$1" hostport="$2" outfile="$3" + local args=(-url "http://${gw}:${hostport}/v1/chat/completions" -dialect chat -model "$(gw_model "$gw")" -duration "${RESOURCE_SECONDS}s" -c "$C" -auth "$AUTH" -json -) + gw_headers "$gw"; if [[ ${#HDRS[@]} -gt 0 ]]; then args+=("${HDRS[@]}"); fi + docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${args[@]}" > "$outfile" 2>/dev/null || true +} + +stop_sampler() { + [[ -n "$SAMPLER_PID" ]] && kill "$SAMPLER_PID" 2>/dev/null || true + [[ -n "$SAMPLER_PID" ]] && wait "$SAMPLER_PID" 2>/dev/null || true + SAMPLER_PID="" +} + +summarize_resources() { # csv -> json {peak_mem_mb, avg_mem_mb, avg_cpu_pct, samples} + [[ -f "$1" ]] || { printf '{"peak_mem_mb":0,"avg_mem_mb":0,"avg_cpu_pct":0,"samples":0}'; return 0; } + awk -F, 'NR>1 && $1>0 { n++; s_mem+=$1; s_cpu+=$2; if($1>peak)peak=$1 } + END { + if(n>0) printf "{\"peak_mem_mb\":%.1f,\"avg_mem_mb\":%.1f,\"avg_cpu_pct\":%.1f,\"samples\":%d}", peak, s_mem/n, s_cpu/n, 
n; + else printf "{\"peak_mem_mb\":0,\"avg_mem_mb\":0,\"avg_cpu_pct\":0,\"samples\":0}" + }' "$1" +} + +record_image() { # gateway image_ref -> results/${gw}_image.json + local gw="$1" ref="$2" + local size digest compressed + size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)" + digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)" + # Compressed size = what you actually pull/store: gzip the saved image (uniform + # across the locally-built gomodel image and the pulled competitor images). + compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d ' ' || echo 0)" + printf '{"gateway":"%s","image":"%s","size_bytes":%s,"size_mb":%.1f,"compressed_bytes":%s,"compressed_mb":%.1f,"digest":"%s"}\n' \ + "$gw" "$ref" "${size:-0}" "$(awk "BEGIN{print ${size:-0}/1048576}")" \ + "${compressed:-0}" "$(awk "BEGIN{print ${compressed:-0}/1048576}")" "$digest" \ + > "$RESULTS_DIR/${gw}_image.json" +} + +wait_ready() { # gateway host_port -> poll a real chat request until HTTP 200 + local target="$1" hostport="$2" tries="${3:-60}" + gw_headers "$target" + local hdr=(); if [[ ${#HDRS[@]} -gt 0 ]]; then hdr=("${HDRS[@]}"); fi + local code + for ((i=0;i<tries;i++)); do + code="$(curl -s -o /dev/null -w '%{http_code}' -m 5 -X POST \ + "http://localhost:${hostport}/v1/chat/completions" \ + -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTH" ${hdr[@]+"${hdr[@]}"} \ + -d "{\"model\":\"$(gw_model "$target")\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}]}" 2>/dev/null || echo 000)" + [[ "$code" == "200" ]] && return 0 + sleep 2 + done + echo " WARN: $target did not return 200 within $((tries*2))s (last code: ${code:-?})" >&2 + return 1 +} + +# bring a gateway up and time cold-start latency (compose up -> first HTTP 200). +# Leaves the gateway running. Writes results/${gw}_startup.json. 
+measure_startup() { + local gw="$1" hostport; hostport="$(gw_port "$gw")" + local t0 t1 code ready=0 + gw_headers "$gw"; local hdr=(); [[ ${#HDRS[@]} -gt 0 ]] && hdr=("${HDRS[@]}") + t0="$(epoch)" + GOMODEL_IMAGE="${GOMODEL_IMAGE:-gomodel-bench:local}" \ + "${COMPOSE[@]}" --profile "$gw" up -d "$gw" >/dev/null 2>&1 || true + for ((i=0;i<600;i++)); do # up to ~120s, 0.2s resolution + code="$(curl -s -o /dev/null -w '%{http_code}' -m 5 -X POST \ + "http://localhost:${hostport}/v1/chat/completions" \ + -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTH" ${hdr[@]+"${hdr[@]}"} \ + -d "{\"model\":\"$(gw_model "$gw")\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}]}" 2>/dev/null || echo 000)" + [[ "$code" == "200" ]] && { ready=1; break; } + sleep 0.2 + done + t1="$(epoch)" + local elapsed; elapsed="$(awk -v a="$t0" -v b="$t1" 'BEGIN{printf "%.3f", b-a}')" + printf '{"gateway":"%s","startup_s":%s,"ready":%s}\n' "$gw" "$elapsed" "$ready" \ + > "$RESULTS_DIR/${gw}_startup.json" + echo " startup: ${gw} ${elapsed}s (ready=$ready)" +} + +warmup_gateway() { + local gw="$1" hostport; hostport="$(gw_port "$gw")" + local warm_args=(-url "http://${gw}:${hostport}/v1/chat/completions" -dialect chat -model "$(gw_model "$gw")" -n "$WARMUP" -c "$C" -auth "$AUTH" -json -) + gw_headers "$gw"; if [[ ${#HDRS[@]} -gt 0 ]]; then warm_args+=("${HDRS[@]}"); fi + docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${warm_args[@]}" >/dev/null 2>&1 || true +} + +image_ref() { case "$1" in + gomodel) echo "${GOMODEL_IMAGE:-gomodel-bench:local}";; + litellm) echo "${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable}";; + portkey) echo "${PORTKEY_IMAGE:-portkeyai/gateway:latest}";; + bifrost) echo "${BIFROST_IMAGE:-maximhq/bifrost:latest}";; +esac; } + +# ── Build the bench-tools image ──────────────────────────────────── +log "Building bench-tools image" +docker build -q -t "$BENCH_TOOLS_IMAGE" ./bench-tools >/dev/null + +# ── Pull latest competitor images 
up front (digests recorded per gateway) ── +for gw in $GATEWAYS; do + [[ "$gw" == "gomodel" ]] && continue + docker pull -q "$(image_ref "$gw")" 2>/dev/null || true +done + +# ── Clean any leftover state, then bring up the shared mock ──────── +"${COMPOSE[@]}" --profile gomodel --profile litellm --profile portkey --profile bifrost down -v >/dev/null 2>&1 || true +log "Starting mock backend" +"${COMPOSE[@]}" up -d mock +sleep 2 + +# ── PASS A: latency, REPEATS trials, randomized target order ─────── +for r in $(seq 1 "$REPEATS"); do + RUN_DIR="$RESULTS_DIR/run${r}"; mkdir -p "$RUN_DIR" + "${COMPOSE[@]}" up -d mock >/dev/null 2>&1 || true # ensure the shared mock is up + ORDER="$(shuffle "baseline $GATEWAYS" "$((r * 7919 + RANDOM))")" + log "Latency trial ${r}/${REPEATS} (order: ${ORDER})" + for t in $ORDER; do + if [[ "$t" == "baseline" ]]; then + for spec in "${VARIANTS[@]}"; do + IFS='|' read -r dialect mode _ <<< "$spec" + run_variant "baseline" "mock" "$spec" "$RUN_DIR/baseline_${dialect}_${mode}.json" + done + else + GOMODEL_IMAGE="${GOMODEL_IMAGE:-gomodel-bench:local}" \ + "${COMPOSE[@]}" --profile "$t" up -d "$t" >/dev/null 2>&1 || true + wait_ready "$t" "$(gw_port "$t")" || true + warmup_gateway "$t" + for spec in "${VARIANTS[@]}"; do + IFS='|' read -r dialect mode _ <<< "$spec" + run_variant "$t" "$t" "$spec" "$RUN_DIR/${t}_${dialect}_${mode}.json" + done + # Remove only this gateway's container — NOT `compose down`, which would + # also tear down the profile-less mock and break the next baseline. + "${COMPOSE[@]}" --profile "$t" rm -sf "$t" >/dev/null 2>&1 || true + fi + sleep "$REST_SECONDS" + done +done + +# ── PASS B: capacity sweep + startup + footprint, once, randomized ─ +log "Capacity + footprint pass" +"${COMPOSE[@]}" up -d mock >/dev/null 2>&1 || true # ensure the shared mock is up +# Baseline capacity ceiling first (mock is already up, no gateway lifecycle). 
+run_sweep "baseline" "mock" + +for gw in $(shuffle "$GATEWAYS"); do + ref="$(image_ref "$gw")" + log "Capacity: $gw (image: $ref)" + measure_startup "$gw" # brings the gateway up + times cold start + record_image "$gw" "$ref" + warmup_gateway "$gw" + run_sweep "$gw" "$gw" + + cname="bench-${gw}-1" + csv="$RESULTS_DIR/${gw}_resources.csv" + load_json="$RESULTS_DIR/${gw}_sustained.json" + idle_mem="$(docker stats --no-stream --format '{{.MemUsage}};0' "$cname" 2>/dev/null | awk "$STAT_AWK" | cut -d, -f1 || true)" + start_sampler "$cname" "$csv" + sustained_load "$gw" "$(gw_port "$gw")" "$load_json" + stop_sampler + + res="$(summarize_resources "$csv")" + load_rps="$(grep -o '"rps": *[0-9.]*' "$load_json" 2>/dev/null | head -1 | grep -o '[0-9.]*' || true)" + printf '{"gateway":"%s","idle_mem_mb":%s,"load_rps":%s,"under_load":%s}\n' \ + "$gw" "${idle_mem:-0}" "${load_rps:-0}" "$res" > "$RESULTS_DIR/${gw}_resources.json" + echo " resources: idle=${idle_mem:-0}MiB load_rps=${load_rps:-0} $res" + + "${COMPOSE[@]}" --profile "$gw" rm -sf "$gw" >/dev/null 2>&1 || true + sleep "$REST_SECONDS" +done + +"${COMPOSE[@]}" down -v >/dev/null 2>&1 || true + +# ── Run metadata ─────────────────────────────────────────────────── +IMDS_TOKEN="$(curl -s -m 2 -X PUT 'http://169.254.169.254/latest/api/token' -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' 2>/dev/null || true)" +INSTANCE_TYPE_META="$(curl -s -m 2 -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/meta-data/instance-type 2>/dev/null || true)" +[[ "$INSTANCE_TYPE_META" == *"<"* || -z "$INSTANCE_TYPE_META" ]] && INSTANCE_TYPE_META="unknown" +cat > "$RESULTS_DIR/meta.json" </dev/null || echo 1), + "kernel": "$(uname -r)" +} +JSON + +log "Done. 
Results in $RESULTS_DIR" +ls -1 "$RESULTS_DIR" diff --git a/docs/2026-06-25_aws_gateway_benchmark/run.sh b/docs/2026-06-25_aws_gateway_benchmark/run.sh new file mode 100755 index 00000000..42b24ad1 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/run.sh @@ -0,0 +1,148 @@ +#!/usr/bin/env bash +# End-to-end AWS gateway benchmark orchestrator. +# +# 1. build the GoModel image (linux/amd64) and save it for transfer +# 2. terraform apply -> EC2 instance (default c7i.large; NOT free tier) +# 3. wait for SSH + docker, ship the harness, load the GoModel image +# 4. run the containerized benchmark (6 variants x 4 gateways + baseline, +# REPEATS latency trials + a capacity sweep) +# 5. pull results back and summarize +# 6. terraform destroy -> guaranteed teardown (runs even on failure) +# +# Teardown is wired to an EXIT trap so the instance is always destroyed. Set +# KEEP=1 to leave it running for debugging. +# +# Usage: ./run.sh # full run, then destroy +# N=20000 C=10 REPEATS=5 ./run.sh +# INSTANCE_TYPE=t2.micro ./run.sh # cheaper/burstable (free tier) +# KEEP=1 ./run.sh # don't destroy at the end +set -euo pipefail + +SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)" +TF_DIR="$SCRIPT_DIR/terraform" +REMOTE_DIR="$SCRIPT_DIR/remote" +PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)" + +TF="${TERRAFORM:-terraform}" +REGION="${REGION:-us-east-1}" +INSTANCE_TYPE="${INSTANCE_TYPE:-c7i.large}" # 2 vCPU, non-burstable (stable tail); NOT free tier +N="${N:-20000}" +C="${C:-10}" +REPEATS="${REPEATS:-5}" +GATEWAYS="${GATEWAYS:-gomodel litellm portkey bifrost}" +GOMODEL_IMAGE_TAG="gomodel-bench:local" +IMAGE_TAR="/tmp/gomodel-bench-amd64.tar.gz" + +STAMP="$(date -u +%Y%m%d-%H%M%S)" +OUT_DIR="$SCRIPT_DIR/output/$STAMP" + +log() { printf '\n\033[1;34m>>> %s\033[0m\n' "$*"; } +err() { printf '\033[1;31m!!! %s\033[0m\n' "$*" >&2; } + +destroy() { + if [[ "${KEEP:-0}" == "1" ]]; then + err "KEEP=1 set — leaving instance up. 
Destroy later with: (cd $TF_DIR && $TF destroy -auto-approve)" + return + fi + log "Destroying AWS resources (terraform destroy)" + (cd "$TF_DIR" && $TF destroy -auto-approve -var "region=$REGION" >/dev/null 2>&1) \ + && echo " teardown complete" || err "TEARDOWN FAILED — check: (cd $TF_DIR && $TF destroy)" +} +trap destroy EXIT + +command -v "$TF" >/dev/null || { err "terraform not found (set TERRAFORM=/path/to/terraform)"; exit 1; } +command -v docker >/dev/null || { err "docker required to build the GoModel image"; exit 1; } + +# ── 1. Build + save GoModel image for amd64 ──────────────────────── +log "Building GoModel image (linux/amd64)" +docker buildx build --platform linux/amd64 -t "$GOMODEL_IMAGE_TAG" --load "$PROJECT_ROOT" +log "Saving image -> $IMAGE_TAR" +docker save "$GOMODEL_IMAGE_TAG" | gzip > "$IMAGE_TAR" + +# ── 2. Provision ─────────────────────────────────────────────────── +MY_IP="$(curl -s https://checkip.amazonaws.com | tr -d '[:space:]')" +log "Provisioning $INSTANCE_TYPE in $REGION (SSH locked to ${MY_IP}/32)" +(cd "$TF_DIR" && $TF init -input=false >/dev/null && \ + $TF apply -auto-approve -input=false \ + -var "region=$REGION" -var "instance_type=$INSTANCE_TYPE" \ + -var "ssh_ingress_cidr=${MY_IP}/32") + +IP="$(cd "$TF_DIR" && $TF output -raw public_ip)" +KEY="$(cd "$TF_DIR" && $TF output -raw ssh_private_key_path)" +USER="$(cd "$TF_DIR" && $TF output -raw ssh_user)" +SSH_OPTS=(-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 -i "$KEY") +echo " instance: $USER@$IP (key: $KEY)" +[[ -f "$KEY" ]] || { err "private key not found at $KEY"; exit 1; } + +# ── 3. 
Wait for SSH + bootstrap ──────────────────────────────────── +log "Waiting for SSH" +for i in $(seq 1 60); do + ssh "${SSH_OPTS[@]}" "$USER@$IP" true 2>/dev/null && break + if [[ $i == 60 ]]; then + err "SSH never came up — last attempt error:" + ssh "${SSH_OPTS[@]}" "$USER@$IP" true 2>&1 | sed 's/^/ /' || true + exit 1 + fi + sleep 5 +done +log "Waiting for docker bootstrap (user-data)" +for i in $(seq 1 60); do + ssh "${SSH_OPTS[@]}" "$USER@$IP" 'test -f ~/.bootstrap-done && docker info >/dev/null 2>&1' && break + sleep 5 + [[ $i == 60 ]] && { err "docker bootstrap never finished"; exit 1; } +done +echo " ready" + +# ── 4. Ship harness + image, run ─────────────────────────────────── +log "Shipping harness to instance" +ssh "${SSH_OPTS[@]}" "$USER@$IP" 'rm -rf ~/bench && mkdir -p ~/bench' +rsync -az -e "ssh ${SSH_OPTS[*]}" --exclude results "$REMOTE_DIR/" "$USER@$IP:~/bench/" +scp "${SSH_OPTS[@]}" "$IMAGE_TAR" "$USER@$IP:~/gomodel-bench-amd64.tar.gz" + +log "Loading GoModel image on instance" +ssh "${SSH_OPTS[@]}" "$USER@$IP" 'gunzip -c ~/gomodel-bench-amd64.tar.gz | docker load' + +# Forward all benchmark knobs to the instance (only the ones that are set). +REMOTE_ENV="N=$N C=$C REPEATS=$REPEATS GATEWAYS='$GATEWAYS' GOMODEL_IMAGE=$GOMODEL_IMAGE_TAG" +for v in MAX_VARIANT_SECONDS SWEEP_CONCURRENCY SWEEP_DURATION RESOURCE_SECONDS REST_SECONDS WARMUP WARMUP_VARIANT; do + if [[ -n "${!v:-}" ]]; then REMOTE_ENV="$REMOTE_ENV $v='${!v}'"; fi +done + +# Launch DETACHED with setsid so the benchmark survives any SSH drop or hang — +# the controlling session no longer owns the process. We then poll for the +# terminal sentinel (results/meta.json, written only at the very end). This is +# the fix for the earlier run dying with the SSH session still half-open. 
+log "Launching benchmark detached (N=$N C=$C REPEATS=$REPEATS gateways: $GATEWAYS)" +ssh "${SSH_OPTS[@]}" "$USER@$IP" \ + "cd ~/bench && chmod +x run-on-instance.sh && rm -f run.log && \ + setsid env $REMOTE_ENV bash run-on-instance.sh > run.log 2>&1 < /dev/null & echo launched" + +log "Waiting for benchmark (polling every 15s; survives SSH drops)" +POLL_MAX="${POLL_MAX:-160}" # 160 * 15s = 40 min ceiling +done_ok=0 +for ((i=0; i/dev/null; then + done_ok=1; echo " benchmark complete (meta.json present)"; break + fi + # After warmup, a missing run-on-instance process + no meta = it died; collect partial. + if (( i > 3 )) && ! ssh "${SSH_OPTS[@]}" "$USER@$IP" 'pgrep -f "[r]un-on-instance.sh" >/dev/null' 2>/dev/null; then + err "remote benchmark ended without meta.json — collecting partial results"; break + fi + if (( i % 4 == 0 )); then + ssh "${SSH_OPTS[@]}" "$USER@$IP" 'sed "s/\x1b\[[0-9;]*m//g" ~/bench/run.log 2>/dev/null | grep -E ">>>|trial [0-9]/" | tail -2' 2>/dev/null || true + fi +done +(( done_ok == 1 )) || err "polling ended (timeout or early exit) — proceeding to collect whatever exists" +ssh "${SSH_OPTS[@]}" "$USER@$IP" 'echo "--- tail of remote run.log ---"; tail -25 ~/bench/run.log' 2>/dev/null || true + +# ── 5. 
Collect + summarize ───────────────────────────────────────── +log "Collecting results -> $OUT_DIR" +mkdir -p "$OUT_DIR" +rsync -az -e "ssh ${SSH_OPTS[*]}" "$USER@$IP:~/bench/results/" "$OUT_DIR/" + +if command -v python3 >/dev/null; then + python3 "$SCRIPT_DIR/scripts/summarize.py" --results-dir "$OUT_DIR" | tee "$OUT_DIR/summary.txt" +fi +log "Raw + summarized results in: $OUT_DIR" +# destroy() runs on EXIT diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py new file mode 100644 index 00000000..8097ac0c --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py @@ -0,0 +1,296 @@ +#!/usr/bin/env python3 +"""Normalize the raw benchmark JSON into human tables + one summary.json. + +Reads a results directory produced by run-on-instance.sh: + + results/ + run1/ … runN/ latency trials (per target+variant JSON) <- aggregated here + sweep/ throughput-vs-concurrency capacity points + _image.json _startup.json _resources.json + meta.json + +Latency is reported as the MEDIAN across trials with a min–max spread on the +noisy tail (p99) and on rps, so single-window jitter no longer drives the story. +Also emits overhead-vs-baseline, a capacity-sweep table (sustained req/s and the +saturation knee), startup latency, and rps-per-CPU% efficiency. + +Back-compat: a flat results dir (no run* subdirs) is treated as a single trial. +Stdlib only. 
+""" +import argparse +import glob +import json +import os +import re +import statistics + +TARGETS = ["baseline", "gomodel", "litellm", "portkey", "bifrost"] +VARIANTS = [ + ("chat", "nonstream"), ("chat", "stream"), + ("responses", "nonstream"), ("responses", "stream"), + ("messages", "nonstream"), ("messages", "stream"), +] + + +def load(path): + try: + with open(path) as f: + return json.load(f) + except (OSError, ValueError): + return None + + +def run_dirs(rd): + """Trial dirs: run* subdirs if present, else the flat dir (single trial).""" + runs = sorted(glob.glob(os.path.join(rd, "run*")), + key=lambda p: int(re.sub(r"\D", "", os.path.basename(p)) or 0)) + return runs or [rd] + + +def med(xs): + xs = [x for x in xs if isinstance(x, (int, float))] + return statistics.median(xs) if xs else None + + +def spread(xs): + xs = [x for x in xs if isinstance(x, (int, float))] + return (min(xs), max(xs)) if xs else (None, None) + + +def fnum(v, dp=2): + try: + return f"{float(v):.{dp}f}" + except (TypeError, ValueError): + return "—" + + +# ── latency aggregation ─────────────────────────────────────────────────────── +def collect(runs, target, dialect, mode): + """All trial summaries for one (target, variant).""" + out = [] + for r in runs: + d = load(os.path.join(r, f"{target}_{dialect}_{mode}.json")) + if d: + out.append(d) + return out + + +def agg_variant(trials): + """Median (+ spread) of the metrics we care about across trials.""" + def field(path): + vals = [] + for t in trials: + cur = t + for k in path: + cur = (cur or {}).get(k) if isinstance(cur, dict) else None + vals.append(cur) + return vals + + p99s = field(["total_latency", "p99_ms"]) + rpss = field(["rps"]) + return { + "trials": len(trials), + "ok": sum(t.get("ok", 0) for t in trials), + "failed": sum(t.get("failed", 0) for t in trials), + "rps": med(rpss), "rps_spread": spread(rpss), + "p50": med(field(["total_latency", "p50_ms"])), + "p90": med(field(["total_latency", "p90_ms"])), + "p99": 
med(p99s), "p99_spread": spread(p99s), + "ttft_p50": med(field(["ttft", "p50_ms"])), + "gap_p50": med(field(["inter_chunk", "p50_ms"])), + "gap_p99": med(field(["inter_chunk", "p99_ms"])), + } + + +# ── capacity sweep ──────────────────────────────────────────────────────────── +def sweep_curve(rd, target): + """{concurrency: rps} for a target, read from results/sweep/_c.json.""" + curve = {} + for p in glob.glob(os.path.join(rd, "sweep", f"{target}_c*.json")): + m = re.search(r"_c(\d+)\.json$", p) + d = load(p) + if m and d and isinstance(d.get("rps"), (int, float)): + curve[int(m.group(1))] = d["rps"] + return dict(sorted(curve.items())) + + +def sweep_stats(curve): + if not curve: + return {} + peak_c = max(curve, key=curve.get) + peak = curve[peak_c] + # saturation knee: lowest concurrency reaching >=95% of peak rps. + knee = next((c for c in sorted(curve) if curve[c] >= 0.95 * peak), peak_c) + return {"peak_rps": peak, "peak_c": peak_c, "knee_c": knee, "curve": curve} + + +def main(): + ap = argparse.ArgumentParser() + ap.add_argument("--results-dir", required=True) + args = ap.parse_args() + rd = args.results_dir + + meta = load(os.path.join(rd, "meta.json")) or {} + runs = run_dirs(rd) + present = sorted({os.path.basename(p).split("_")[0] + for p in glob.glob(os.path.join(runs[0], "*_*_*.json"))}) + targets = [t for t in TARGETS if t in present] + + summary = {"meta": meta, "trials": len(runs), "latency": {}, "capacity": {}, "resources": {}} + + print("\n" + "=" * 86) + print("GATEWAY BENCHMARK SUMMARY") + print("=" * 86) + if meta: + print(f"instance={meta.get('instance_type')} cpus={meta.get('cpus')} " + f"N={meta.get('n_requests')} c={meta.get('concurrency')} " + f"trials={meta.get('repeats', len(runs))}") + print(f"(latency = median across {len(runs)} trial(s); p99/rps show [min–max])") + + # ── Latency (median across trials) ───────────────────────────────────────── + base_p50 = {} + for dialect, mode in VARIANTS: + b = agg_variant(collect(runs, 
"baseline", dialect, mode)) + base_p50[(dialect, mode)] = b.get("p50") + + print("\nLATENCY (ms; rps = completed req/s @ c={})".format(meta.get("concurrency", "?"))) + hdr = (f"{'target':9} {'variant':18} {'ok/fail':>11} {'rps':>7} {'p50':>7} " + f"{'p99':>7} {'p99 range':>15} {'ttft':>7} {'ovhd':>7}") + print(hdr); print("-" * len(hdr)) + for t in targets: + summary["latency"][t] = {} + for dialect, mode in VARIANTS: + a = agg_variant(collect(runs, t, dialect, mode)) + if not a["trials"]: + continue + key = f"{dialect}/{mode}" + ovhd = (a["p50"] - base_p50[(dialect, mode)] + if a["p50"] is not None and base_p50.get((dialect, mode)) is not None else None) + lo, hi = a["p99_spread"] + rng = f"{fnum(lo)}–{fnum(hi)}" if lo is not None else "—" + print(f"{t:9} {key:18} {str(a['ok'])+'/'+str(a['failed']):>11} " + f"{fnum(a['rps'],0):>7} {fnum(a['p50']):>7} {fnum(a['p99']):>7} " + f"{rng:>15} {fnum(a['ttft_p50']):>7} {fnum(ovhd):>7}") + a["overhead_p50"] = ovhd + summary["latency"][t][key] = a + print() + + # ── Capacity sweep ───────────────────────────────────────────────────────── + print("CAPACITY (chat non-stream; sustained req/s by concurrency)") + sweep_targets = [t for t in TARGETS if sweep_curve(rd, t)] + if sweep_targets: + concs = sorted({c for t in sweep_targets for c in sweep_curve(rd, t)}) + hdrc = f"{'target':9} " + " ".join(f"c{c:>6}" for c in concs) + f" {'peak':>8} {'@c':>4} {'knee':>5}" + print(hdrc); print("-" * len(hdrc)) + for t in sweep_targets: + curve = sweep_curve(rd, t) + s = sweep_stats(curve) + row = f"{t:9} " + " ".join(f"{fnum(curve.get(c), 0):>7}" for c in concs) + row += f" {fnum(s['peak_rps'],0):>8} {s['peak_c']:>4} {s['knee_c']:>5}" + print(row) + summary["capacity"][t] = s + else: + print(" (no sweep data)") + print() + + # ── Resources / footprint ────────────────────────────────────────────────── + print("RESOURCES (per gateway; img_zip = compressed pull size)") + hdr2 = (f"{'gateway':9} {'img_zip':>8} {'img_disk':>9} 
{'startup_s':>10} {'idle_mb':>9} " + f"{'peak_mb':>9} {'avg_cpu%':>9} {'load_rps':>9} {'rps/cpu%':>9}") + print(hdr2); print("-" * len(hdr2)) + for t in [x for x in targets if x != "baseline"]: + img = load(os.path.join(rd, f"{t}_image.json")) or {} + res = load(os.path.join(rd, f"{t}_resources.json")) or {} + startup = load(os.path.join(rd, f"{t}_startup.json")) or {} + ul = res.get("under_load", {}) + load_rps = res.get("load_rps") or 0 + cpu = ul.get("avg_cpu_pct") or 0 + eff = (load_rps / cpu) if cpu else None + print(f"{t:9} {fnum(img.get('compressed_mb'),1):>8} {fnum(img.get('size_mb'),1):>9} " + f"{fnum(startup.get('startup_s'),2):>10} " + f"{fnum(res.get('idle_mem_mb'),1):>9} {fnum(ul.get('peak_mem_mb'),1):>9} " + f"{fnum(cpu,1):>9} {fnum(load_rps,0):>9} {fnum(eff,1):>9}") + summary["resources"][t] = {"image": img, "resources": res, "startup": startup, + "rps_per_cpu_pct": eff} + print() + + out = os.path.join(rd, "summary.json") + with open(out, "w") as f: + json.dump(summary, f, indent=2) + md = write_markdown(rd, meta, runs, targets) + print(f"wrote {out}\nwrote {md}") + + +def write_markdown(rd, meta, runs, targets): + """Emit clean GitHub-flavored Markdown tables.""" + L = ["# Gateway Benchmark Summary\n"] + if meta: + L.append(f"`instance={meta.get('instance_type')} cpus={meta.get('cpus')} " + f"N={meta.get('n_requests')} c={meta.get('concurrency')} " + f"trials={meta.get('repeats', len(runs))}`\n") + L.append(f"_Latency = median across {len(runs)} trial(s); p99 shows the min–max " + "across trials. 
rps in the latency table is completed req/s at the " + "fixed concurrency (latency-coupled); see the capacity table for " + "sustained throughput._\n") + + base_p50 = {(d, m): agg_variant(collect(runs, "baseline", d, m)).get("p50") + for d, m in VARIANTS} + + L.append("## Latency (ms, median of trials)\n") + L.append("| target | variant | ok/fail | rps | p50 | p90 | p99 | p99 min–max | ttft p50 | gap p50 | overhead p50 |") + L.append("|---|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|") + for t in targets: + for dialect, mode in VARIANTS: + a = agg_variant(collect(runs, t, dialect, mode)) + if not a["trials"]: + continue + ovhd = (a["p50"] - base_p50[(dialect, mode)] + if a["p50"] is not None and base_p50.get((dialect, mode)) is not None else None) + lo, hi = a["p99_spread"] + rng = f"{fnum(lo)}–{fnum(hi)}" if lo is not None else "—" + gap = fnum(a["gap_p50"]) if mode == "stream" else "" + ttft = fnum(a["ttft_p50"]) if mode == "stream" else "" + L.append(f"| {t} | {dialect}/{mode} | {a['ok']}/{a['failed']} | {fnum(a['rps'],0)} | " + f"{fnum(a['p50'])} | {fnum(a['p90'])} | {fnum(a['p99'])} | {rng} | " + f"{ttft} | {gap} | {fnum(ovhd)} |") + L.append("") + + # capacity + sweep_targets = [t for t in TARGETS if sweep_curve(rd, t)] + if sweep_targets: + concs = sorted({c for t in sweep_targets for c in sweep_curve(rd, t)}) + L.append("## Capacity (chat non-stream, sustained req/s by concurrency)\n") + L.append("| target | " + " | ".join(f"c={c}" for c in concs) + " | peak rps | @c | knee c |") + L.append("|---|" + "--:|" * (len(concs) + 3)) + for t in sweep_targets: + curve = sweep_curve(rd, t) + s = sweep_stats(curve) + cells = " | ".join(fnum(curve.get(c), 0) for c in concs) + L.append(f"| {t} | {cells} | {fnum(s['peak_rps'],0)} | {s['peak_c']} | {s['knee_c']} |") + L.append("") + + L.append("## Resources\n") + L.append("| gateway | image MB (compressed) | image MB (on-disk) | startup s | idle MB | peak MB | avg CPU % | load rps | rps/CPU% |") + 
L.append("|---|--:|--:|--:|--:|--:|--:|--:|--:|") + for t in [x for x in targets if x != "baseline"]: + img = load(os.path.join(rd, f"{t}_image.json")) or {} + res = load(os.path.join(rd, f"{t}_resources.json")) or {} + startup = load(os.path.join(rd, f"{t}_startup.json")) or {} + ul = res.get("under_load", {}) + load_rps = res.get("load_rps") or 0 + cpu = ul.get("avg_cpu_pct") or 0 + eff = (load_rps / cpu) if cpu else None + L.append(f"| {t} | {fnum(img.get('compressed_mb'),1)} | {fnum(img.get('size_mb'),1)} | " + f"{fnum(startup.get('startup_s'),2)} | " + f"{fnum(res.get('idle_mem_mb'),1)} | {fnum(ul.get('peak_mem_mb'),1)} | " + f"{fnum(cpu,1)} | {fnum(load_rps,0)} | {fnum(eff,1)} |") + L.append("") + + path = os.path.join(rd, "summary.md") + with open(path, "w") as f: + f.write("\n".join(L)) + return path + + +if __name__ == "__main__": + main() diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf new file mode 100644 index 00000000..cb1e5ae7 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf @@ -0,0 +1,148 @@ +terraform { + required_version = ">= 1.6" + required_providers { + aws = { + source = "hashicorp/aws" + version = "~> 5.60" + } + tls = { + source = "hashicorp/tls" + version = "~> 4.0" + } + local = { + source = "hashicorp/local" + version = "~> 2.5" + } + } +} + +provider "aws" { + region = var.region +} + +# ── AMI: latest Amazon Linux 2023 x86_64 (override via var.ami_id) ── +data "aws_ssm_parameter" "al2023" { + name = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64" +} + +locals { + ami_id = var.ami_id != "" ? var.ami_id : data.aws_ssm_parameter.al2023.value + # credit_specification is only valid for burstable T-family instances; on a + # fixed-performance type (c7i.large default) it must be omitted entirely. 
+ is_burstable = can(regex("^t[0-9]", var.instance_type)) +} + +# ── Default VPC / subnet (free-tier friendly, no NAT) ────────────── +data "aws_vpc" "default" { + default = true +} + +data "aws_subnets" "default" { + filter { + name = "vpc-id" + values = [data.aws_vpc.default.id] + } +} + +# ── SSH keypair generated locally, written to disk for the runner ── +resource "tls_private_key" "bench" { + algorithm = "ED25519" +} + +resource "local_sensitive_file" "private_key" { + content = tls_private_key.bench.private_key_openssh + filename = "${path.module}/bench_key.pem" + file_permission = "0600" +} + +resource "aws_key_pair" "bench" { + key_name_prefix = "gomodel-bench-" + public_key = tls_private_key.bench.public_key_openssh + tags = var.tags +} + +# ── Security group: SSH only, from the operator's IP ─────────────── +resource "aws_security_group" "bench" { + name_prefix = "gomodel-bench-" + description = "SSH access for the gateway benchmark instance" + vpc_id = data.aws_vpc.default.id + + ingress { + description = "SSH" + from_port = 22 + to_port = 22 + protocol = "tcp" + cidr_blocks = [var.ssh_ingress_cidr] + } + + egress { + description = "All outbound" + from_port = 0 + to_port = 0 + protocol = "-1" + cidr_blocks = ["0.0.0.0/0"] + } + + tags = var.tags +} + +# ── Instance bootstrap: install docker + compose plugin ──────────── +locals { + user_data = <<-EOF + #!/bin/bash + set -euxo pipefail + + # 2 GiB swap: a 1 GiB free-tier instance can't hold memory-heavy gateways + # (LiteLLM idles near ~1 GiB). Swap lets every gateway run so the memory + # comparison is complete; the reported RSS still exposes the difference. + if [ ! 
-f /swapfile ]; then + fallocate -l 2G /swapfile || dd if=/dev/zero of=/swapfile bs=1M count=2048 + chmod 600 /swapfile + mkswap /swapfile + swapon /swapfile + echo '/swapfile none swap sw 0 0' >> /etc/fstab + fi + + dnf update -y + dnf install -y docker git + systemctl enable --now docker + usermod -aG docker ec2-user + + # Docker Compose v2 plugin (pinned). + mkdir -p /usr/libexec/docker/cli-plugins + curl -fsSL -o /usr/libexec/docker/cli-plugins/docker-compose \ + "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64" + chmod +x /usr/libexec/docker/cli-plugins/docker-compose + + # Readiness marker the orchestrator polls for. + touch /home/ec2-user/.bootstrap-done + EOF +} + +resource "aws_instance" "bench" { + ami = local.ami_id + instance_type = var.instance_type + key_name = aws_key_pair.bench.key_name + vpc_security_group_ids = [aws_security_group.bench.id] + subnet_id = tolist(data.aws_subnets.default.ids)[0] + associate_public_ip_address = true + user_data = local.user_data + + # Only burstable (T-family) instances accept a credit specification. Standard + # credits avoid surprise burst charges there; fixed-performance types (the + # c7i.large default) omit this block entirely — and have no credit drift, which + # is exactly why they make the better latency reference. + dynamic "credit_specification" { + for_each = local.is_burstable ? [1] : [] + content { + cpu_credits = "standard" + } + } + + root_block_device { + volume_type = "gp3" + volume_size = var.root_volume_gb + } + + tags = merge(var.tags, { Name = "gomodel-gateway-benchmark" }) +} diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf new file mode 100644 index 00000000..bee4ba70 --- /dev/null +++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf @@ -0,0 +1,29 @@ +output "public_ip" { + description = "Public IPv4 of the benchmark instance." 
+  value       = aws_instance.bench.public_ip
+}
+
+output "public_dns" {
+  description = "Public DNS of the benchmark instance."
+  value       = aws_instance.bench.public_dns
+}
+
+output "ssh_user" {
+  description = "SSH login user for Amazon Linux 2023."
+  value       = "ec2-user"
+}
+
+output "ssh_private_key_path" {
+  description = "Absolute path to the generated private key."
+  value       = abspath(local_sensitive_file.private_key.filename)
+}
+
+output "instance_id" {
+  value = aws_instance.bench.id
+}
+
+output "ami_id" {
+  description = "Resolved AMI id used (record for reproducibility)."
+  # SSM-resolved public AMI alias is not secret; unwrap so it can be recorded.
+  value = nonsensitive(local.ami_id)
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
new file mode 100644
index 00000000..afbc3381
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
@@ -0,0 +1,50 @@
+variable "region" {
+  description = "AWS region to provision the benchmark instance in."
+  type        = string
+  default     = "us-east-1"
+}
+
+variable "instance_type" {
+  description = <<-EOT
+    EC2 instance type. Default c7i.large (2 vCPU, 4 GiB, non-burstable) gives a
+    stable tail with no CPU-credit drift, making it the right reference for latency/p99.
+    It is NOT free-tier eligible (~$0.09/hr on-demand in us-east-1). For a
+    free-tier run set instance_type=t2.micro (1 vCPU, burstable) explicitly;
+    treat its absolute latencies as indicative only.
+  EOT
+  type        = string
+  default     = "c7i.large"
+}
+
+variable "ssh_ingress_cidr" {
+  description = "CIDR allowed to SSH in. Set to /32. The default (0.0.0.0/0) is fully open and applies whenever this is not overridden (NOT recommended)."
+  type        = string
+  default     = "0.0.0.0/0"
+}
+
+variable "ami_id" {
+  description = "Override the AMI. Empty = latest Amazon Linux 2023 x86_64 via SSM (reproducible by policy, not by digest)."
+ type = string + default = "" +} + +variable "root_volume_gb" { + description = "Root EBS volume size (GiB). Free tier allows up to 30 GiB." + type = number + default = 20 +} + +variable "compose_plugin_version" { + description = "Pinned Docker Compose v2 plugin version installed via user-data." + type = string + default = "v2.29.7" +} + +variable "tags" { + description = "Tags applied to all resources." + type = map(string) + default = { + Project = "gomodel-gateway-benchmark" + Owner = "benchmark" + } +} diff --git a/docs/about/benchmarks.mdx b/docs/about/benchmarks.mdx index 612b1925..cca49c8c 100644 --- a/docs/about/benchmarks.mdx +++ b/docs/about/benchmarks.mdx @@ -1,134 +1,99 @@ --- title: "Benchmarks" -description: "A summary of GoModel benchmark results, with links to full write-ups and methodology." +description: "A short, up-to-date summary of GoModel benchmark results, with a link to the full write-up and the tooling to reproduce it." icon: "gauge" --- ## Benchmark snapshot -This page is a short reference for one public benchmark run comparing GoModel -and LiteLLM on OpenAI-compatible traffic. +This page is a short reference for our latest public benchmark: GoModel against +**LiteLLM, Portkey, and Bifrost**, all pointed at the same instant mock backend so +the numbers reflect gateway overhead, not model latency. -The full article contains the complete write-up, all charts, and the original -discussion: -[GoModel vs LiteLLM Benchmark: Speed, Throughput, and Resource Usage](https://enterpilot.io/blog/gomodel-vs-litellm-benchmark/). +The full article has the complete write-up, all the context, and the charts: +[AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost](https://enterpilot.io/blog/gomodel-vs-litellm-portkey-bifrost-june-2026/). - This benchmark is a point-in-time snapshot published on March 5, 2026. Treat - it as data, not dogma. Gateway performance depends on workload, provider mix, - deployment setup, and tuning. 
+ This is a point-in-time snapshot from a June 2026 run on AWS. Treat it as data, + not dogma. Gateway performance depends on your workload, provider mix, deployment + setup, and tuning. Older runs (March 2026, LiteLLM only, on localhost) are still + on the blog for history. -## Visual snapshot +## What we tested -![Benchmark dashboard from the original blog post](./images/benchmark-dashboard.png) +A simple, like-for-like setup: -Chart source and full context: -[Original benchmark post](https://enterpilot.io/blog/gomodel-vs-litellm-benchmark/). +- One gateway at a time, in Docker, on an AWS `c7i.large` (2 vCPU, 4 GiB). +- The same shared mock backend for everyone, so we measure only gateway overhead. +- Six workloads: chat completions, the Responses API, and Anthropic messages, + each streaming and non-streaming. +- `8,000` requests per workload at concurrency `10`, across two randomized-order + trials (latency is the median across them). +- Fair config: retries off for everyone, GoModel's circuit breaker off, and + LiteLLM run at its recommended one worker per CPU core. ## At a glance -In this benchmark run, GoModel came out ahead on the main operational signals -most teams care about: +In this run, GoModel came out ahead on every metric in the table below: +the tightest latency tail, the highest sustained throughput, the smallest image +and memory, and the fastest cold start.
-- Added latency -- Throughput under concurrency -- CPU overhead -- Memory overhead +| Gateway | p50 (ms) | p99 (ms) | Throughput (req/s) | Peak RAM | Image (compressed) | Cold start | +| --- | --- | --- | --- | --- | --- | --- | +| **GoModel** | **`1.8`** | **`6.9`** | **`4,900`** | **`37 MB`** | **`16 MB`** | **`0.56 s`** | +| Bifrost | `2.5` | `18.3` | `3,100` | `143 MB` | `77 MB` | `7.1 s` | +| Portkey | `9.7` | `30.5` | `950` | `112 MB` | `59 MB` | `1.1 s` | +| LiteLLM | `30.6` | `39.3` | `324` | `2.3 GB` | `372 MB` | `25.5 s` | -## Test shape - -The comparison used a simple like-for-like setup: - -- OpenAI-compatible `/v1/chat/completions` -- The same prompt and request shape on both sides -- Concurrency levels of `1`, `4`, and `8` -- A focus on clean runs with `0%` errors -- Metrics including req/s, latency percentiles, CPU usage, and RSS memory - -This docs page keeps only the primary comparison matrix from the blog post. - -## Reference table - -| Gateway | Concurrency | Success | Error % | Req/s | p50 ms | p95 ms | p99 ms | CPU avg % | RSS avg MB | -| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | -| GoModel | `1` | `12/12` | `0.00` | `9.61` | `86.4` | `141.1` | `144.4` | `0.81` | `45.4` | -| GoModel | `4` | `12/12` | `0.00` | `44.66` | `56.1` | `139.5` | `139.5` | `0.23` | `46.0` | -| GoModel | `8` | `12/12` | `0.00` | `52.75` | `98.4` | `130.6` | `131.1` | `1.13` | `46.0` | -| LiteLLM | `1` | `12/12` | `0.00` | `8.64` | `96.2` | `190.3` | `213.9` | `9.21` | `320.3` | -| LiteLLM | `4` | `12/12` | `0.00` | `36.82` | `104.7` | `149.5` | `149.5` | `5.20` | `320.8` | -| LiteLLM | `8` | `12/12` | `0.00` | `35.81` | `188.7` | `244.4` | `244.9` | `5.95` | `321.5` | +Latency is chat completions, non-streaming (representative). Throughput is the +sustained rate from a separate concurrency sweep. Image size is the compressed +pull size. 
## Key readouts -Some useful reads from that March 5, 2026 run: - -- Lower p95 latency at every tested concurrency level. -- Higher throughput across the benchmark matrix. -- `45-46 MB` RSS, while LiteLLM stayed near `320-321 MB`. -- Less CPU in these runs. - -At the highest tested concurrency, GoModel reached `52.75 req/s` versus -LiteLLM at `35.81 req/s`. +- GoModel has both the lowest median (`1.8 ms`) and the tightest tail (`6.9 ms`). +- It pushes the most traffic per box (`~4,900 req/s`) and the most per CPU core. +- It is the smallest to ship and run: a `16 MB` compressed image and `37 MB` of + RAM under load, ready to serve `0.56 s` after launch. +- LiteLLM, even at its recommended multi-worker config, uses `~2.3 GB` of RAM and + takes `~25 s` to start - the cost of Python on the hot path. +- Portkey did not serve the Anthropic messages dialect in this single-provider + setup, so it covers 4 of the 6 workloads. ## Reproduce it yourself -All the tooling used in the published benchmark is available in this repository. - -### Prerequisites - -- Go 1.26.4+ -- Python 3.10+ with `matplotlib` and `numpy` -- `jq`, `curl` -- A Groq API key (or any OpenAI-compatible provider — adjust the script) -- `litellm[proxy]` (`pip install "litellm[proxy]"`) - -### Scripts +The whole thing is one command. It provisions a small AWS box, runs all four +gateways against the same mock backend, prints the tables, and tears the +infrastructure back down on its own. -The benchmark suite lives in [`docs/about/benchmark-tools/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/about/benchmark-tools): + + This runs on **paid** AWS infrastructure, not the free tier. A `c7i.large` is + about $0.09/hour and the run self-destructs within an hour or two, so budget + **under $1** per run to be safe. If you pass `KEEP=1` or a teardown fails, you + keep paying until you destroy the box - so confirm it is gone. 
+ -| File | Purpose | -| --- | --- | -| [`compare.sh`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/compare.sh) | Builds GoModel, starts both gateways, runs the full benchmark matrix, and writes a `REPORT.md` | -| [`bench_main.go`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/bench_main.go) | Source for the `bench` CLI that sends requests and collects latency + process metrics | -| [`plot_benchmark_charts.py`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/plot_benchmark_charts.py) | Generates per-metric charts and a combined dashboard from the JSON results | - -### Quick start +The harness lives in the repo at +[`docs/2026-06-25_aws_gateway_benchmark/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark): ```bash -# 1. Clone GoModel and set up your .env with GROQ_API_KEY +# Needs Docker, Terraform, and AWS credentials git clone https://github.com/ENTERPILOT/GoModel.git -cd gomodel -echo "GROQ_API_KEY=gsk_..." > .env - -# 2. Run the full comparison (builds GoModel, starts LiteLLM, benchmarks both) -bash docs/about/benchmark-tools/compare.sh - -# 3. Generate charts from the latest result -pip install matplotlib numpy -python3 docs/about/benchmark-tools/plot_benchmark_charts.py benchmark-results/ -``` - -The script creates a timestamped directory under `benchmark-results/` containing -JSON result files, gateway logs, and a `REPORT.md` with the results table. - -### Tuning - -You can override defaults via environment variables: - -```bash -REQUESTS=100 CONCURRENCIES="1 4 8 16" MAX_TOKENS=16 bash docs/about/benchmark-tools/compare.sh +cd GoModel/docs/2026-06-25_aws_gateway_benchmark +./run.sh ``` -See the top of `compare.sh` for the full list of knobs. +Knobs like `N` (requests per workload) and `REPEATS` (trials) are env vars, e.g. +`N=20000 REPEATS=5 ./run.sh` for a heavier run.
For a quick local check against +just LiteLLM, the older localhost harness is still in +[`docs/about/benchmark-tools/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/about/benchmark-tools). ## Why this page is short -This page is intentionally shorter and more operational than the blog version. - -It exists so docs readers can see the benchmark result quickly without reading a -full article inside the product docs. If you want the full narrative, more -charts, and the original context, use the source post. +It is meant to give you the result fast, inside the product docs, without a full +article. For the narrative, the charts, and the methodology details, read the +[full post](https://enterpilot.io/blog/gomodel-vs-litellm-portkey-bifrost-june-2026/). No single benchmark settles the question for every environment. If you are evaluating gateways seriously, reproduce the test against your own traffic and