From 97afaadbe1687339837ccaf1c33dd21845d329d1 Mon Sep 17 00:00:00 2001
From: "Jakub A. W" <jakubwasek@gmail.com>
Date: Fri, 26 Jun 2026 14:29:28 +0200
Subject: [PATCH] docs(benchmark): add AWS gateway benchmark and refresh
 benchmarks page

Adds docs/2026-06-25_aws_gateway_benchmark/: a reproducible AWS benchmark
comparing GoModel against LiteLLM, Portkey, and Bifrost on latency, throughput,
memory, image size, and cold start, using a shared recording-mock backend so the
numbers reflect gateway overhead. Includes RESULTS.md and the one-command
Terraform/Docker harness (run.sh, remote/, terraform/, scripts/summarize.py).

Refreshes docs/about/benchmarks.mdx to this June 2026 run (with a paid-AWS note).

The benchmark write-up (ARTICLE.md, cover, charts) and the co-located QA and
translation tooling are in a separate draft PR.

Terraform state, provider binaries, the generated SSH key, and raw run output are
gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
---
 .../.gitignore                                |   9 +
 .../README.md                                 | 122 +++++
 .../RESULTS.md                                | 135 ++++++
 .../remote/bench-tools/Dockerfile             |  18 +
 .../remote/bench-tools/go.mod                 |   3 +
 .../remote/bench-tools/loadgen/main.go        | 442 ++++++++++++++++++
 .../remote/bench-tools/mock/main.go           | 411 ++++++++++++++++
 .../remote/configs/bifrost-config.json        |  60 +++
 .../remote/configs/litellm-config.yaml        |  19 +
 .../remote/docker-compose.yml                 |  99 ++++
 .../remote/run-on-instance.sh                 | 379 +++++++++++++++
 docs/2026-06-25_aws_gateway_benchmark/run.sh  | 148 ++++++
 .../scripts/summarize.py                      | 296 ++++++++++++
 .../terraform/main.tf                         | 148 ++++++
 .../terraform/outputs.tf                      |  29 ++
 .../terraform/variables.tf                    |  50 ++
 docs/about/benchmarks.mdx                     | 157 +++----
 17 files changed, 2429 insertions(+), 96 deletions(-)
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/.gitignore
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/README.md
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
 create mode 100755 docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
 create mode 100755 docs/2026-06-25_aws_gateway_benchmark/run.sh
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
 create mode 100644 docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf

diff --git a/docs/2026-06-25_aws_gateway_benchmark/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/.gitignore
new file mode 100644
index 00000000..39a18413
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/.gitignore
@@ -0,0 +1,9 @@
+# Benchmark outputs and local Terraform state / secrets — never commit.
+output/
+remote/results/
+terraform/.terraform/
+terraform/.terraform.lock.hcl
+terraform/*.tfstate
+terraform/*.tfstate.*
+terraform/bench_key.pem
+*.tar.gz
diff --git a/docs/2026-06-25_aws_gateway_benchmark/README.md b/docs/2026-06-25_aws_gateway_benchmark/README.md
new file mode 100644
index 00000000..0a4a2950
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/README.md
@@ -0,0 +1,122 @@
+# AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost
+
+A reproducible, one-command benchmark that provisions a free-tier AWS instance,
+runs four AI gateways through identical workloads against a deterministic mock
+backend, measures latency and resource cost, and tears the infrastructure down.
+
+Because every gateway talks to the **same local mock backend**, the numbers
+reflect *gateway overhead*, not upstream model latency or network jitter.
+
+## What it compares
+
+Four OpenAI-compatible gateways, each pointed at the mock:
+
+| Gateway  | Image                                | How it reaches the mock |
+|----------|--------------------------------------|-------------------------|
+| GoModel  | built from this repo (`Dockerfile`)  | `OPENAI_BASE_URL` env   |
+| LiteLLM  | `ghcr.io/berriai/litellm:main-stable`| `configs/litellm-config.yaml` |
+| Portkey  | `portkeyai/gateway:latest`           | `x-portkey-custom-host` header (+ `TRUSTED_CUSTOM_HOSTS=mock`) |
+| Bifrost  | `maximhq/bifrost:latest`             | `configs/bifrost-config.json` (`network_config.base_url` + `allow_private_network`) |
+
+Per-gateway quirks the harness handles automatically (see `gw_model`/`gw_path` in
+`run-on-instance.sh`): Bifrost needs an explicit `openai/`-prefixed model, serves
+the Anthropic dialect at `/anthropic/v1/messages` (not `/v1/messages`), and must
+allow private-network egress to reach the mock.
+
+### Workloads — 6 variants
+
+The common denominator across OpenAI-compatible gateways, in both modes:
+
+| Dialect   | Endpoint               | non-stream | stream |
+|-----------|------------------------|:----------:|:------:|
+| Chat      | `/v1/chat/completions` | ✓ | ✓ |
+| Responses | `/v1/responses`        | ✓ | ✓ |
+| Messages  | `/v1/messages` (Anthropic) | ✓ | ✓ |
+
+A **baseline** (load sent straight to the mock, no gateway) runs first as the
+latency floor. Variants a gateway does not implement are reported as failures
+rather than silently skipped — e.g. Portkey's OSS gateway does not serve the
+Anthropic Messages dialect here, so its messages variants fail; that asymmetry is
+the finding. End-of-stream detection accepts a terminal marker **or an idle
+gap** (`loadgen -idle`), so a gateway that streams content without sending a
+terminal event (Bifrost) is still measured to its last byte instead of hanging.
+
+### Metrics captured
+
+- **Latency** — total-latency p50/p90/p95/p99, plus **TTFT** (time to first
+  token) for streaming, and throughput (RPS). Driven by the `loadgen` tool.
+- **Docker image size** — `docker image inspect` size + repo digest per gateway.
+- **Memory** — idle RSS after warmup and peak RSS under load (`docker stats`).
+- **CPU** — average CPU% under load (`docker stats`).
+
+## Layout
+
+```
+terraform/            free-tier EC2 + SSH key + security group (apply/destroy)
+remote/               everything shipped to and run on the instance
+  bench-tools/        Go mock backend + loadgen (one small image)
+  configs/            LiteLLM + Bifrost gateway configs
+  docker-compose.yml  mock + one gateway per profile (benchnet network)
+  run-on-instance.sh  builds images, runs 6 variants x N gateways, samples stats
+scripts/summarize.py  raw JSON -> latency + resource tables + summary.json
+run.sh                orchestrator: build -> apply -> run -> collect -> destroy
+```
+
+## Prerequisites
+
+- AWS credentials configured (`aws sts get-caller-identity` works)
+- Terraform ≥ 1.6, Docker (with `buildx`), `rsync`, `ssh`, Python 3
+- An AWS account with default VPC in the chosen region
+
+## Run it
+
+```bash
+cd docs/2026-06-25_aws_gateway_benchmark
+./run.sh                         # full run in us-east-1, then auto-destroy
+N=1000 C=20 ./run.sh             # heavier load
+REGION=eu-west-1 ./run.sh        # different region
+GATEWAYS="gomodel litellm" ./run.sh   # subset
+KEEP=1 ./run.sh                  # leave the instance up for debugging
+```
+
+`run.sh` tears the instance down via an EXIT trap, even on failure, unless
+`KEEP=1` is set. If a run is interrupted, reconcile manually:
+
+```bash
+cd terraform && terraform destroy -auto-approve
+```
+
+Results land in `output/<timestamp>/` (raw per-variant JSON, `summary.json`,
+and the printed `summary.txt` table).
+
+## Local dry-run (no AWS)
+
+The instance-side harness runs on any Docker host:
+
+```bash
+cd remote && N=30 C=5 GATEWAYS="gomodel litellm portkey bifrost" ./run-on-instance.sh
+```
+
+(Build the GoModel image first: `docker build -t gomodel-bench:local ../../..`)
+
+## Reproducibility & caveats
+
+- **Pinned**: gateway image refs (overridable via `*_IMAGE` env), the Compose
+  plugin version, instance type, and the deterministic mock payload. Exact image
+  **digests** are recorded in each `*_image.json` so a run is fully traceable.
+- **AMI** resolves to the latest Amazon Linux 2023 via SSM (reproducible by
+  policy). Pin `var.ami_id` for a byte-identical OS.
+- **Free tier**: defaults to **t2.micro** — the 12-month-free-tier instance in
+  us-east-1 — with `standard` CPU credits (no surprise burst charges), a 20 GiB
+  gp3 root volume (free tier allows 30 GiB), the default VPC (no paid NAT/EIP),
+  and an Amazon Linux 2023 AMI. Image pulls are inbound traffic (free). In
+  regions where t2.micro is unavailable, set `INSTANCE_TYPE=t3.micro` (the
+  free-tier instance there). Newer accounts on AWS's credit-based free plan stay
+  within credit for a single short run.
+- **t2.micro is burstable** (1 vCPU, CPU-credit throttled). Treat absolute
+  latency as *indicative*; the value is the *relative* comparison on identical
+  hardware. Gateways run **one at a time** so they never contend, and the load is
+  kept modest (N=500, c=10) to stay within launch credits. For production-grade
+  absolute numbers, set `INSTANCE_TYPE=c7i.large` (not free tier).
+- **Cost**: a single free-tier instance for ~15–30 min — $0 within free-tier
+  allowance, otherwise a few cents.
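Annotation between file diffs (not part of README.md): the README's "terminal marker or idle gap" rule can be illustrated with a small replay. This is a minimal sketch, assuming a recorded list of `(arrival_ms, line)` SSE events instead of a live connection; `measure_stream` and `is_terminal` are hypothetical names, and the idle timer is simplified to start at request time. The markers and the 1500 ms default mirror `loadgen`'s `isTerminal` and `-idle` flag.

```python
from typing import Iterable, Tuple

# Terminal markers per dialect, mirroring loadgen's isTerminal.
SUBSTRING_MARKERS = {
    "responses": ('"response.completed"', '"response.output_text.done"'),
    "messages": ('"message_stop"',),
}


def is_terminal(dialect: str, payload: str) -> bool:
    """True when the SSE payload is the dialect's end-of-stream marker."""
    if dialect in SUBSTRING_MARKERS:
        return any(m in payload for m in SUBSTRING_MARKERS[dialect])
    return payload == "[DONE]"  # chat


def measure_stream(events: Iterable[Tuple[float, str]], dialect: str,
                   idle_ms: float = 1500.0) -> Tuple[float, float, str]:
    """Replay (arrival_ms, line) events; return (ttft_ms, total_ms, how_it_ended).

    total_ms is the arrival time of the last data line, so the time spent
    waiting for the idle gap is never added to the reported latency.
    """
    ttft = None
    total = None
    timer_start = 0.0  # simplification: idle timer starts with the request
    ended = "closed"   # running out of events models a closed connection
    for arrival_ms, line in events:
        if arrival_ms - timer_start >= idle_ms:
            ended = "idle"  # nothing arrived in time: stop at the last chunk
            break
        if not line.startswith("data: "):
            continue  # non-data lines are ignored and do not reset the timer
        if ttft is None:
            ttft = arrival_ms
        total = arrival_ms
        if is_terminal(dialect, line[len("data: "):]):
            ended = "terminal"
            break
        timer_start = arrival_ms
    if ttft is None:
        raise ValueError("no data before end of stream")
    return ttft, total, ended


if __name__ == "__main__":
    chat = [(4.7, 'data: {"delta":"hi"}'), (4.95, "data: [DONE]")]
    print(measure_stream(chat, "chat"))
    # A stream that never sends message_stop: the late line trips the idle gap.
    silent = [(7.5, 'data: {"delta":"hi"}'), (8.4, 'data: {"delta":"!"}'),
              (5000.0, ": keep-alive")]
    print(measure_stream(silent, "messages"))
```

In the second case the stream is reported as ending at the last data line (8.4 ms in this made-up trace), not at the moment the idle gap expired.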
diff --git a/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md b/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
new file mode 100644
index 00000000..b764ef05
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
@@ -0,0 +1,135 @@
+# Results — 2026-06-25 (AWS c7i.large run)
+
+Reference run produced by `./run.sh` (raw data in `output/20260625-182538/`,
+machine summary in that dir's `summary.md` / `summary.json`). Four gateways:
+**GoModel, LiteLLM, Portkey, Bifrost**.
+
+- **Host**: AWS EC2 **c7i.large** (2 vCPU, 4 GiB, **non-burstable** — no CPU-credit
+  drift, so the tail is stable), Amazon Linux 2023, us-east-1.
+- **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials**
+  (latency = median across trials; p99 shown with its min–max), 200-request
+  process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s
+  resource window, capacity sweep at c∈{1,16,128}. Shared local **mock**
+  backend, so every number is **gateway overhead**, not model latency.
+- **Parity**: retries disabled on every gateway, GoModel's circuit breaker disabled
+  (so the sweep can't trip it), and **LiteLLM run at its recommended worker count —
+  one worker per CPU core (`num_workers=2` on this 2-vCPU box)** so it isn't pinned
+  to a single core while the Go gateways use both.
+- Images (digests in `*_image.json`): GoModel (built from this repo), latest
+  `litellm:main-stable`, `portkeyai/gateway:latest`, `maximhq/bifrost:latest`.
+
+> Fast reference run (N=8000 × 2 trials) sized to finish end-to-end in well under
+> 20 minutes; the p99 min–max spreads are tight, so the medians are stable. Raise
+> `N`/`REPEATS` for a heavier run.
+
+## Latency — non-streaming (ms, median of trials)
+
+| Workload  | metric | baseline | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|---------:|--------:|--------:|--------:|--------:|
+| chat      | p50 | 0.23 | **1.81** | 2.51 | 9.70 | 30.56 |
+| chat      | p99 | 2.77 | **6.88** | 18.27 | 30.54 | 39.26 |
+| responses | p50 | 0.26 | **2.01** | 2.73 | 9.07 | 39.12 |
+| responses | p99 | 2.33 | **7.28** | 16.55 | 26.92 | 48.60 |
+| messages  | p50 | 0.26 | **1.76** | 2.65 | ✗ | 61.06 |
+| messages  | p99 | 2.23 | **6.59** | 19.08 | ✗ | 98.12 |
+
+**GoModel has the lowest p50 and the tightest p99** (chat: ~7 ms vs Bifrost ~18 ms,
+Portkey ~31 ms, LiteLLM ~39 ms). Chat `overhead p50` (gateway p50 − baseline p50):
+GoModel ≈ 1.6 ms, Bifrost ≈ 2.3 ms, Portkey ≈ 9.5 ms, LiteLLM ≈ 30 ms.
+
+## Latency — streaming (ms, median of trials)
+
+| Workload  | metric | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|--------:|--------:|--------:|--------:|
+| chat      | TTFT p50 | **4.71** | 9.02 | 27.97 | 151.94 |
+| chat      | total p50 | **4.95** | 11.89 | 27.98 | 151.95 |
+| responses | TTFT p50 | **4.69** | 12.87 | 27.90 | 47.53 |
+| responses | total p50 | **5.00** | 14.94 | 27.93 | 47.55 |
+| messages  | TTFT p50 | **7.50** | †      | ✗ | 48.86 |
+| messages  | total p50 | **8.38** | †      | ✗ | 48.89 |
+
+† **Bifrost messages-stream is an idle-bound artifact, not a throughput number**
+(no terminal event over a non-native backend → 0 completions within the 10 s cap).
+
+## Throughput / capacity (chat non-stream, sustained req/s by concurrency)
+
+| target | c=1 | c=16 | c=128 | peak | knee |
+|--------|----:|-----:|------:|-----:|-----:|
+| baseline | 15510 | 29701 | 30015 | **30015** | 16 |
+| GoModel  | 2745 | 4928 | 4567 | **4928** | 16 |
+| Bifrost  | 1885 | 3088 | 2904 | **3088** | 16 |
+| Portkey  | 636 | 946 | 900 | **946** | 16 |
+| LiteLLM  | 227 | 324 | 254 | **324** | 16 |
+
+GoModel tops the gateways at **~4900 req/s**, ~1.6× Bifrost, ~5× Portkey, ~15×
+LiteLLM. All saturate by c=16 on 2 vCPUs.
+
+## Resources
+
+| Metric | GoModel | Portkey | Bifrost | LiteLLM |
+|--------|--------:|--------:|--------:|--------:|
+| Docker image, compressed pull (MB) | **16** | 59 | 77 | 372 |
+| Docker image, on-disk (MB)      | **47.2** | 177.4 | 230.7 | 1159.9 |
+| Cold start to first 200 (s) | **0.56** | 1.05 | 7.07 | 25.49 |
+| Peak RSS under load (MB)| **37.0** | 112.0 | 143.0 | 2272.3 |
+| Avg CPU under load (%) | 92.6 | 116.9 | 117.6 | 101.1 |
+| Sustained req/s (resource window) | **4824** | 960 | 2977 | 261 |
+| Efficiency (req/s per CPU %) | **52.1** | 8.2 | 25.3 | 2.6 |
+
+GoModel is the most CPU-efficient (**52 req/s per CPU-%**, ~2× Bifrost, ~6×
+Portkey, ~20× LiteLLM), with the smallest image (**47 MB**), the lowest peak RSS
+(**37 MB**), and the fastest cold start (**0.56 s**).
+
+> **LiteLLM at its recommended config.** With `num_workers=2` (one per core) LiteLLM
+> is faster and higher-throughput than the earlier single-worker run (≈220 → 324
+> req/s; chat p50 ≈ 44 → 31 ms — a single worker was queuing the 10 concurrent
+> requests), but its **memory doubled to ~2.3 GB** (two ~1 GB worker processes) and
+> its **cold start rose to ~25 s**. Running LiteLLM "properly" widens the resource
+> gap, not narrows it.
+
+## Feature coverage (6 variants)
+
+| Gateway | chat | responses | messages | total |
+|---------|:----:|:---------:|:--------:|:-----:|
+| GoModel | ✓ | ✓ | ✓ | 6/6 |
+| LiteLLM | ✓ | ✓ | ✓ | 6/6 |
+| Bifrost | ✓ | ✓ | ✓† | 6/6 |
+| Portkey | ✓ | ✓ | ✗ | 4/6 |
+
+- **Portkey** errors on the Anthropic `/v1/messages` dialect in this single-provider
+  (openai → mock) setup; this is a setup limitation, not a hard capability gap.
+- **Bifrost** serves Anthropic at `/anthropic/v1/messages`, needs an `openai/`-prefixed
+  model and `allow_private_network:true`; messages-streaming has the caveat above (†).
+
+## Takeaways
+
+- **GoModel** — best all-rounder: lowest p50 and tightest p99 (~7 ms), highest
+  gateway throughput (~4900 req/s), best CPU efficiency (52 req/s per %), smallest
+  image (47 MB) and memory (37 MB), fastest cold start (0.56 s), full 6/6 coverage.
+- **Bifrost** (Go) — second on throughput, low p50 but a heavier p99 tail; streaming
+  terminal-event gaps over a non-native backend.
+- **Portkey** (Node) — middle tier; no Anthropic Messages in this setup.
+- **LiteLLM** (Python) — full coverage, but even at its recommended 2-worker config
+  it is ~15× behind on throughput and carries a **1.16 GB image + ~2.3 GB RAM + ~25 s
+  cold start**. The cost of Python on the hot path.
+
+## Methodology notes
+
+- **Repeats + spread** — 2 trials, randomized gateway order each trial; latency is
+  the median across trials, p99 carries its min–max.
+- **Config parity** — retries off on all; GoModel's circuit breaker disabled (a few
+  transient errors under the c=128 sweep would otherwise trip it and blanket-503 its
+  own capacity); **LiteLLM at one worker per core (`num_workers`=vCPUs)**, its own
+  production recommendation, set automatically from `nproc`.
+- **Warm-up** — 200 global + 50 per-variant requests; the per-variant warmup
+  neutralizes LiteLLM's lazy per-dialect imports and, with >1 worker, warms each
+  worker before measuring.
+- **Throughput vs latency separated** — capacity comes from a time-boxed concurrency
+  sweep, not the latency-coupled rps in the latency tables.
+- **Per-variant wall cap (10 s)** — bounds idle-bound streaming variants; cap-aborted
+  requests are reported as `capped`, not `failed`.
+- **Resilient orchestration** — the remote benchmark runs detached (`setsid`) and the
+  orchestrator polls for the `meta.json` sentinel, so an SSH drop can't kill or hang
+  the run; `set -uo` so one flaky variant skips instead of aborting.
+- Reproduce with `./run.sh`; pin `var.ami_id` and the `*_IMAGE` digests for a
+  byte-identical rerun. Heavier run: `N=20000 REPEATS=5 ./run.sh`.
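Annotation between file diffs (not part of RESULTS.md): the derived figures quoted in RESULTS.md (chat overhead p50, throughput ratios, req/s per CPU %) can be rechecked from its own tables. This is a minimal sketch with the table values copied by hand; the dictionary and function names are illustrative and are not taken from `scripts/summarize.py`.

```python
# Recompute the derived figures in RESULTS.md from the values in its tables.
# All names are illustrative; this is not scripts/summarize.py.

# Non-streaming chat p50 latency in ms.
CHAT_P50_MS = {"baseline": 0.23, "GoModel": 1.81, "Bifrost": 2.51,
               "Portkey": 9.70, "LiteLLM": 30.56}

# Peak sustained req/s from the capacity sweep.
PEAK_RPS = {"GoModel": 4928, "Bifrost": 3088, "Portkey": 946, "LiteLLM": 324}

# (sustained req/s in the resource window, average CPU % under load)
RESOURCE = {"GoModel": (4824, 92.6), "Bifrost": (2977, 117.6),
            "Portkey": (960, 116.9), "LiteLLM": (261, 101.1)}


def overhead_p50(gateway: str) -> float:
    """Gateway chat p50 minus baseline chat p50, in ms."""
    return round(CHAT_P50_MS[gateway] - CHAT_P50_MS["baseline"], 2)


def throughput_ratio(gateway: str, reference: str = "GoModel") -> float:
    """How many times higher the reference's peak req/s is than the gateway's."""
    return round(PEAK_RPS[reference] / PEAK_RPS[gateway], 1)


def efficiency(gateway: str) -> float:
    """Sustained req/s per CPU percentage point."""
    rps, cpu_pct = RESOURCE[gateway]
    return round(rps / cpu_pct, 1)


if __name__ == "__main__":
    print(f"{'gateway':<8} {'overhead ms':>11} {'GoModel x':>9} {'req/s per CPU%':>14}")
    for gw in PEAK_RPS:
        print(f"{gw:<8} {overhead_p50(gw):>11} {throughput_ratio(gw):>9} {efficiency(gw):>14}")
```

The recomputed values agree with the rounded figures in the prose and with the efficiency row of the resources table.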
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
new file mode 100644
index 00000000..8285027f
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
@@ -0,0 +1,18 @@
+# Builds the mock backend and load generator into one small static image.
+# Both binaries live in the final image; the compose service / docker run
+# command selects which one to execute.
+FROM golang:1.26-alpine AS build
+WORKDIR /src
+COPY go.mod ./
+COPY mock ./mock
+COPY loadgen ./loadgen
+RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/mock ./mock \
+ && CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/loadgen ./loadgen
+
+FROM gcr.io/distroless/static-debian12:nonroot
+COPY --from=build /out/mock /mock
+COPY --from=build /out/loadgen /loadgen
+# No ENTRYPOINT: each invocation picks the binary as its command, e.g.
+#   docker run img /mock          (compose `command: ["/mock"]`)
+#   docker run img /loadgen -url …
+CMD ["/mock"]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
new file mode 100644
index 00000000..2552d97b
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
@@ -0,0 +1,3 @@
+module gomodel-bench-tools
+
+go 1.26
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
new file mode 100644
index 00000000..bcbcf6b8
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
@@ -0,0 +1,442 @@
+// loadgen drives concurrent requests at one gateway endpoint and reports latency
+// percentiles. It speaks three request dialects (OpenAI chat, OpenAI responses,
+// Anthropic messages) in streaming and non-streaming modes, so a single binary
+// covers all six benchmark variants.
+//
+// Two closed-loop modes:
+//   - fixed count   (-n N):          send N requests at concurrency C, then stop.
+//   - time-boxed    (-duration D):   keep C workers busy for D, counting
+//                                    completions. Used by the capacity sweep to
+//                                    measure *sustained* throughput at each
+//                                    concurrency level (vs the latency-coupled
+//                                    "completed req/s @ c=N" the fixed mode reports).
+//
+// For streaming requests it records TTFT (time to first token/byte) separately
+// from total latency, plus inter-chunk gap percentiles (a pass-through gateway
+// relays each upstream chunk immediately; a buffering one clumps them). Output is
+// a JSON summary suitable for aggregation.
+package main
+
+import (
+	"bufio"
+	"bytes"
+	"context"
+	"encoding/json"
+	"flag"
+	"fmt"
+	"io"
+	"math"
+	"net/http"
+	"os"
+	"sort"
+	"strings"
+	"sync"
+	"time"
+)
+
+type headerList []string
+
+func (h *headerList) String() string { return strings.Join(*h, ",") }
+func (h *headerList) Set(v string) error {
+	*h = append(*h, v)
+	return nil
+}
+
+type result struct {
+	ttft   time.Duration
+	total  time.Duration
+	gaps   []time.Duration // streaming: time between consecutive received chunks
+	err    string
+	capped bool // aborted by the variant wall cap — excluded from ok/failed
+}
+
+func main() {
+	var (
+		url      = flag.String("url", "", "Target URL")
+		n        = flag.Int("n", 500, "Total requests (fixed-count mode; ignored when -duration > 0)")
+		c        = flag.Int("c", 10, "Concurrency")
+		duration = flag.Duration("duration", 0, "Time-boxed mode: keep -c workers busy for this long, count completions (capacity sweep)")
+		dialect  = flag.String("dialect", "chat", "Request dialect: chat | responses | messages")
+		stream   = flag.Bool("stream", false, "Stream the response")
+		model    = flag.String("model", "gpt-4o-mini", "Model name")
+		auth     = flag.String("auth", "sk-bench-test-key", "Bearer token for Authorization header")
+		jsonOut  = flag.String("json", "", "Write JSON summary to this file")
+		timeout  = flag.Duration("timeout", 30*time.Second, "Per-request hard timeout")
+		idle     = flag.Duration("idle", 1500*time.Millisecond, "Streaming: end the stream if no new data arrives for this long")
+		maxWall  = flag.Duration("max-wall", 0, "Fixed mode: stop launching new requests after this wall time (caps slow/idle-bound variants; 0 = no cap)")
+	)
+	var headers headerList
+	flag.Var(&headers, "H", "Extra header 'Key: Value' (repeatable)")
+	flag.Parse()
+
+	if *url == "" {
+		fmt.Fprintln(os.Stderr, "usage: loadgen -url <url> [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
+		os.Exit(2)
+	}
+
+	body := buildBody(*dialect, *model, *stream)
+	// Tuned transport: keep a full set of hot keep-alive connections for the
+	// configured concurrency so the measured window isn't paying TCP setup. The
+	// stdlib default caps idle conns per host at 2, which would churn connections
+	// at c>2 and add noise to short windows.
+	tr := http.DefaultTransport.(*http.Transport).Clone()
+	tr.MaxIdleConns = *c * 2
+	tr.MaxIdleConnsPerHost = *c * 2
+	client := &http.Client{Transport: tr}
+
+	do := func(ctx context.Context) result {
+		return doRequest(ctx, client, *url, body, *dialect, *stream, *auth, headers, *timeout, *idle)
+	}
+
+	var results []result
+	var wall time.Duration
+	if *duration > 0 {
+		results, wall = driveTimeBoxed(do, *c, *duration)
+	} else {
+		results, wall = driveFixed(do, *n, *c, *maxWall)
+	}
+
+	report(*url, *dialect, *stream, len(results), *c, wall, *duration, results, *jsonOut)
+}
+
+// driveFixed sends up to n requests at concurrency c (closed loop). If maxWall>0
+// it caps the variant at that wall time: it stops launching new requests AND
+// cancels any in-flight ones via the shared context, so an idle-bound streaming
+// variant can't run for the full N at ~7 req/s (and can't stall the launch loop
+// on a full semaphore either). Fast variants reach N long before the cap; slow
+// ones return however many completed in the window. Requests aborted *by* the cap
+// are tagged capped (excluded from ok/failed) rather than counted as errors.
+func driveFixed(do func(context.Context) result, n, c int, maxWall time.Duration) ([]result, time.Duration) {
+	capCtx := context.Background()
+	if maxWall > 0 {
+		var cancel context.CancelFunc
+		capCtx, cancel = context.WithTimeout(capCtx, maxWall)
+		defer cancel()
+	}
+	var mu sync.Mutex
+	results := make([]result, 0, n)
+	sem := make(chan struct{}, c)
+	var wg sync.WaitGroup
+	start := time.Now()
+loop:
+	for i := 0; i < n; i++ {
+		select {
+		case <-capCtx.Done(): // cap reached — stop launching even if the sem is full
+			break loop
+		case sem <- struct{}{}:
+		}
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			defer func() { <-sem }()
+			r := do(capCtx)
+			if r.err != "" && capCtx.Err() != nil {
+				r.capped = true // aborted by the cap, not a genuine failure
+			}
+			mu.Lock()
+			results = append(results, r)
+			mu.Unlock()
+		}()
+	}
+	wg.Wait()
+	return results, time.Since(start)
+}
+
+// driveTimeBoxed keeps c workers continuously busy for d, returning every result
+// completed in the window. This measures sustained throughput at concurrency c
+// (the capacity sweep), independent of any fixed request count. The shared
+// context cancels each worker's final in-flight request at the deadline so the
+// tail can't overrun by a full per-request timeout.
+func driveTimeBoxed(do func(context.Context) result, c int, d time.Duration) ([]result, time.Duration) {
+	ctx, cancel := context.WithTimeout(context.Background(), d)
+	defer cancel()
+	var mu sync.Mutex
+	var results []result
+	var wg sync.WaitGroup
+	start := time.Now()
+	for range c {
+		wg.Add(1)
+		go func() {
+			defer wg.Done()
+			for ctx.Err() == nil {
+				r := do(ctx)
+				if r.err != "" && ctx.Err() != nil {
+					r.capped = true // tail request cut off at the window edge
+				}
+				mu.Lock()
+				results = append(results, r)
+				mu.Unlock()
+			}
+		}()
+	}
+	wg.Wait()
+	return results, time.Since(start)
+}
+
+func buildBody(dialect, model string, stream bool) []byte {
+	const prompt = "Say hello for a benchmark test."
+	var req map[string]any
+	switch dialect {
+	case "responses":
+		req = map[string]any{"model": model, "stream": stream, "input": prompt}
+	case "messages":
+		req = map[string]any{
+			"model": model, "stream": stream, "max_tokens": 256,
+			"messages": []map[string]any{{"role": "user", "content": prompt}},
+		}
+	default: // chat
+		req = map[string]any{
+			"model": model, "stream": stream,
+			"messages": []map[string]any{{"role": "user", "content": prompt}},
+		}
+	}
+	b, _ := json.Marshal(req)
+	return b
+}
+
+func doRequest(parent context.Context, client *http.Client, url string, body []byte, dialect string, stream bool, auth string, headers headerList, timeout, idle time.Duration) result {
+	ctx, cancel := context.WithTimeout(parent, timeout)
+	defer cancel()
+
+	req, err := http.NewRequestWithContext(ctx, http.MethodPost, url, bytes.NewReader(body))
+	if err != nil {
+		return result{err: err.Error()}
+	}
+	req.Header.Set("Content-Type", "application/json")
+	if auth != "" {
+		req.Header.Set("Authorization", "Bearer "+auth)
+	}
+	if dialect == "messages" {
+		req.Header.Set("anthropic-version", "2023-06-01")
+	}
+	for _, h := range headers {
+		k, v, ok := strings.Cut(h, ":")
+		if ok {
+			req.Header.Set(strings.TrimSpace(k), strings.TrimSpace(v))
+		}
+	}
+
+	startReq := time.Now()
+	resp, err := client.Do(req)
+	if err != nil {
+		return result{err: err.Error()}
+	}
+	defer resp.Body.Close()
+
+	if resp.StatusCode != http.StatusOK {
+		b, _ := io.ReadAll(io.LimitReader(resp.Body, 300))
+		return result{err: fmt.Sprintf("HTTP %d: %s", resp.StatusCode, strings.TrimSpace(string(b)))}
+	}
+
+	if !stream {
+		if _, err := io.Copy(io.Discard, resp.Body); err != nil {
+			return result{err: "read body: " + err.Error()}
+		}
+		d := time.Since(startReq)
+		return result{ttft: d, total: d}
+	}
+	return readStream(resp.Body, startReq, dialect, idle)
+}
+
+// readStream consumes an SSE body, recording TTFT at the first data line and
+// total latency at the last received chunk. A stream ends when any of these
+// occurs: the dialect's terminal marker is seen, the connection closes, or no
+// new data arrives for `idle` (a fallback for gateways that stream content but
+// neither send a terminal event nor close the connection, e.g. Bifrost's
+// responses/messages streams over an OpenAI-backed provider). Total latency is
+// always measured to the last chunk, so the idle wait never inflates it.
+func readStream(r io.Reader, startReq time.Time, dialect string, idle time.Duration) result {
+	lines := make(chan string, 128)
+	go func() {
+		scanner := bufio.NewScanner(r)
+		scanner.Buffer(make([]byte, 0, 64*1024), 1024*1024)
+		for scanner.Scan() {
+			lines <- scanner.Text()
+		}
+		close(lines)
+	}()
+
+	var ttft, total, prev time.Duration
+	var gaps []time.Duration
+	gotFirst := false
+	timer := time.NewTimer(idle)
+	defer timer.Stop()
+
+	for {
+		select {
+		case line, ok := <-lines:
+			if !ok { // connection closed
+				if !gotFirst {
+					return result{err: "empty stream"}
+				}
+				return result{ttft: ttft, total: total, gaps: gaps}
+			}
+			if !strings.HasPrefix(line, "data: ") {
+				continue
+			}
+			now := time.Since(startReq)
+			if !gotFirst {
+				ttft = now
+				gotFirst = true
+			} else {
+				gaps = append(gaps, now-prev) // time since the previous chunk
+			}
+			prev = now
+			total = now // advance to the most recent chunk
+			if isTerminal(dialect, line[len("data: "):]) {
+				return result{ttft: ttft, total: total, gaps: gaps}
+			}
+			if !timer.Stop() {
+				select {
+				case <-timer.C:
+				default:
+				}
+			}
+			timer.Reset(idle)
+		case <-timer.C: // idle gap: treat as end-of-stream at the last chunk
+			if !gotFirst {
+				return result{err: "no data before idle timeout"}
+			}
+			return result{ttft: ttft, total: total, gaps: gaps}
+		}
+	}
+}
+
+// isTerminal recognizes each dialect's end-of-stream markers, including the
+// content-complete events some gateways send instead of a final wrapper event.
+func isTerminal(dialect, payload string) bool {
+	switch dialect {
+	case "responses":
+		return strings.Contains(payload, `"response.completed"`) ||
+			strings.Contains(payload, `"response.output_text.done"`)
+	case "messages":
+		return strings.Contains(payload, `"message_stop"`)
+	default: // chat
+		return payload == "[DONE]"
+	}
+}
+
+func report(url, dialect string, stream bool, n, c int, wall, duration time.Duration, results []result, jsonOut string) {
+	var ttfts, totals, gaps []float64
+	errs := map[string]int{}
+	errCount := 0
+	cappedCount := 0
+	for _, r := range results {
+		if r.capped { // cut off by the wall cap — neither a success nor a failure
+			cappedCount++
+			continue
+		}
+		if r.err != "" {
+			errCount++
+			errs[r.err]++
+			continue
+		}
+		ttfts = append(ttfts, float64(r.ttft.Microseconds()))
+		totals = append(totals, float64(r.total.Microseconds()))
+		for _, g := range r.gaps {
+			gaps = append(gaps, float64(g.Microseconds()))
+		}
+	}
+	ok := len(totals)
+	sort.Float64s(ttfts)
+	sort.Float64s(totals)
+	sort.Float64s(gaps)
+
+	mode := "nonstream"
+	if stream {
+		mode = "stream"
+	}
+	rps := 0.0
+	if wall > 0 {
+		rps = float64(ok) / wall.Seconds()
+	}
+	// measure documents how rps should be read: throughput = sustained capacity
+	// at concurrency c; latency = completed req/s @ c (coupled to per-req latency).
+	measure := "latency"
+	if duration > 0 {
+		measure = "throughput"
+	}
+
+	// "-" means machine-readable only: emit JSON to stdout, skip the human report.
+	if jsonOut == "-" {
+		writeSummary("-", url, dialect, mode, measure, n, ok, errCount, cappedCount, c, wall, rps, ttfts, totals, gaps, errs)
+		return
+	}
+
+	fmt.Printf("\n=== %s/%s  %s  (%s) ===\n", dialect, mode, url, measure)
+	fmt.Printf("requests: %d  ok: %d  failed: %d  capped: %d  concurrency: %d  wall: %s  rps: %.1f\n",
+		n, ok, errCount, cappedCount, c, wall.Round(time.Millisecond), rps)
+	if ok > 0 {
+		fmt.Printf("total latency ms  p50=%.2f p90=%.2f p95=%.2f p99=%.2f max=%.2f\n",
+			ms(pct(totals, 50)), ms(pct(totals, 90)), ms(pct(totals, 95)), ms(pct(totals, 99)), ms(totals[ok-1]))
+		if stream {
+			fmt.Printf("ttft ms           p50=%.2f p90=%.2f p95=%.2f p99=%.2f\n",
+				ms(pct(ttfts, 50)), ms(pct(ttfts, 90)), ms(pct(ttfts, 95)), ms(pct(ttfts, 99)))
+			if len(gaps) > 0 {
+				fmt.Printf("inter-chunk ms    p50=%.2f p90=%.2f p99=%.2f\n",
+					ms(pct(gaps, 50)), ms(pct(gaps, 90)), ms(pct(gaps, 99)))
+			}
+		}
+	}
+	for e, ct := range errs {
+		fmt.Printf("  error x%d: %s\n", ct, e)
+	}
+
+	if jsonOut != "" {
+		writeSummary(jsonOut, url, dialect, mode, measure, n, ok, errCount, cappedCount, c, wall, rps, ttfts, totals, gaps, errs)
+	}
+}
+
+func writeSummary(path, url, dialect, mode, measure string, n, ok, errCount, cappedCount, c int, wall time.Duration, rps float64, ttfts, totals, gaps []float64, errs map[string]int) {
+	sample := func(s []float64) map[string]any {
+		if len(s) == 0 {
+			return map[string]any{}
+		}
+		return map[string]any{
+			"p50_ms": ms(pct(s, 50)), "p90_ms": ms(pct(s, 90)),
+			"p95_ms": ms(pct(s, 95)), "p99_ms": ms(pct(s, 99)),
+			"min_ms": ms(s[0]), "max_ms": ms(s[len(s)-1]), "avg_ms": ms(avg(s)),
+		}
+	}
+	out := map[string]any{
+		"url": url, "dialect": dialect, "mode": mode, "measure": measure,
+		"requests": n, "ok": ok, "failed": errCount, "capped": cappedCount, "concurrency": c,
+		"wall_ms": wall.Milliseconds(), "rps": rps,
+		"total_latency": sample(totals),
+		"ttft":          sample(ttfts),
+		"inter_chunk":   sample(gaps),
+		"errors":        errs,
+	}
+	b, _ := json.MarshalIndent(out, "", "  ")
+	if path == "-" {
+		fmt.Println(string(b))
+		return
+	}
+	if err := os.WriteFile(path, b, 0o644); err != nil {
+		fmt.Fprintf(os.Stderr, "write %s: %v\n", path, err)
+		os.Exit(1)
+	}
+}
+
+func pct(sorted []float64, p float64) float64 {
+	if len(sorted) == 0 {
+		return 0
+	}
+	idx := p / 100 * float64(len(sorted)-1)
+	lo, hi := int(math.Floor(idx)), int(math.Ceil(idx))
+	if lo == hi {
+		return sorted[lo]
+	}
+	frac := idx - float64(lo)
+	return sorted[lo]*(1-frac) + sorted[hi]*frac
+}
+
+func avg(s []float64) float64 {
+	sum := 0.0
+	for _, v := range s {
+		sum += v
+	}
+	return sum / float64(len(s))
+}
+
+func ms(us float64) float64 { return math.Round(us/10) / 100 } // microseconds -> ms, 2dp
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
new file mode 100644
index 00000000..8930cf8a
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
@@ -0,0 +1,411 @@
+// Mock OpenAI/Anthropic-compatible backend for gateway benchmarking.
+//
+// It answers instantly with deterministic payloads so the benchmark measures
+// pure gateway overhead, not upstream model latency. Three dialects are served
+// so every gateway can be exercised through its own native translation path:
+//
+//	/v1/chat/completions  — OpenAI Chat Completions (stream + non-stream)
+//	/v1/responses         — OpenAI Responses        (stream + non-stream)
+//	/v1/messages          — Anthropic Messages      (stream + non-stream)
+//
+// Each path is also exposed without the /v1 prefix because some gateways strip
+// it before forwarding upstream.
+//
+// Recording mode (MOCK_RECORD=1): every upstream request is captured (method,
+// path, headers with secrets redacted, body) along with the canned response the
+// mock returned, and exposed via control endpoints so a harness can inspect how
+// each gateway *translated* a client request:
+//
+//	POST /__reset   clear the capture log
+//	GET  /__log     {"entries":[...]} all captured exchanges since reset
+//	GET  /__last    the most recent captured exchange
+//
+// Recording also enriches responses with provider-specific extras
+// (system_fingerprint, service_tier, x_provider_note) so response-normalization
+// fidelity is observable. Both behaviors are gated so the latency benchmark
+// stays byte-identical when MOCK_RECORD is unset.
+package main
+
+import (
+	"encoding/json"
+	"fmt"
+	"io"
+	"log"
+	"net/http"
+	"os"
+	"strings"
+	"sync"
+	"time"
+)
+
+func main() {
+	port := "9999"
+	if p := os.Getenv("MOCK_PORT"); p != "" {
+		port = p
+	}
+
+	mux := http.NewServeMux()
+	register(mux, "/chat/completions", handleChatCompletions)
+	register(mux, "/responses", handleResponses)
+	register(mux, "/messages", handleMessages)
+	mux.HandleFunc("/v1/models", handleModels)
+	mux.HandleFunc("/models", handleModels)
+	mux.HandleFunc("/health", func(w http.ResponseWriter, _ *http.Request) {
+		writeJSONBytes(w, http.StatusOK, []byte(`{"status":"ok"}`))
+	})
+	mux.HandleFunc("/__reset", handleReset)
+	mux.HandleFunc("/__log", handleLog)
+	mux.HandleFunc("/__last", handleLast)
+
+	log.Printf("Mock backend (openai+anthropic) listening on :%s (record=%v)", port, recording())
+	if err := http.ListenAndServe(":"+port, mux); err != nil {
+		log.Fatal(err)
+	}
+}
+
+// register binds a handler at both the bare path and its /v1-prefixed form.
+func register(mux *http.ServeMux, path string, h http.HandlerFunc) {
+	mux.HandleFunc(path, h)
+	mux.HandleFunc("/v1"+path, h)
+}
+
+// ---------- request/response capture ----------
+
+func recording() bool { return os.Getenv("MOCK_RECORD") == "1" }
+
+// entry is one captured upstream exchange: the request a gateway sent and the
+// response the mock returned for it.
+type entry struct {
+	Seq      int               `json:"seq"`
+	Time     string            `json:"time"`
+	Method   string            `json:"method"`
+	Path     string            `json:"path"`
+	Query    string            `json:"query,omitempty"`
+	Headers  map[string]string `json:"headers"`
+	Body     json.RawMessage   `json:"body,omitempty"`
+	BodyText string            `json:"body_text,omitempty"` // set when body is not valid JSON
+	Stream   bool              `json:"stream"`
+	Response any               `json:"response,omitempty"`
+}
+
+var rec struct {
+	mu      sync.Mutex
+	entries []*entry
+	seq     int
+}
+
+var sensitiveHeaders = map[string]bool{
+	"authorization": true, "x-api-key": true, "api-key": true,
+	"x-portkey-api-key": true, "x-goog-api-key": true,
+}
+
+// begin reads and (in recording mode) captures the request, returning the entry
+// so the handler can attach the response it produces. Returns ok=false if the
+// request was already rejected.
+func begin(w http.ResponseWriter, r *http.Request) (*entry, bool) {
+	if r.Method != http.MethodPost {
+		http.Error(w, "method not allowed", http.StatusMethodNotAllowed)
+		return nil, false
+	}
+	raw, err := io.ReadAll(r.Body)
+	var sr streamReq
+	if err != nil || json.Unmarshal(raw, &sr) != nil {
+		http.Error(w, "invalid request body", http.StatusBadRequest)
+		return nil, false
+	}
+	e := &entry{
+		Time: time.Now().UTC().Format(time.RFC3339Nano), Method: r.Method,
+		Path: r.URL.Path, Query: r.URL.RawQuery, Headers: captureHeaders(r),
+		Stream: sr.Stream,
+	}
+	if json.Valid(raw) {
+		e.Body = json.RawMessage(raw)
+	} else {
+		e.BodyText = string(raw)
+	}
+	if recording() {
+		rec.mu.Lock()
+		rec.seq++
+		e.Seq = rec.seq
+		rec.entries = append(rec.entries, e)
+		rec.mu.Unlock()
+	}
+	return e, true
+}
+
+func captureHeaders(r *http.Request) map[string]string {
+	h := make(map[string]string, len(r.Header))
+	for k, v := range r.Header {
+		val := strings.Join(v, ", ")
+		if sensitiveHeaders[strings.ToLower(k)] {
+			val = fmt.Sprintf("redacted(len=%d)", len(val))
+		}
+		h[k] = val
+	}
+	return h
+}
+
+func handleReset(w http.ResponseWriter, _ *http.Request) {
+	rec.mu.Lock()
+	rec.entries = nil
+	rec.seq = 0
+	rec.mu.Unlock()
+	writeJSONBytes(w, http.StatusOK, []byte(`{"ok":true}`))
+}
+
+func handleLog(w http.ResponseWriter, _ *http.Request) {
+	rec.mu.Lock()
+	defer rec.mu.Unlock()
+	writeJSON(w, map[string]any{"entries": rec.entries})
+}
+
+func handleLast(w http.ResponseWriter, _ *http.Request) {
+	rec.mu.Lock()
+	defer rec.mu.Unlock()
+	if len(rec.entries) == 0 {
+		writeJSONBytes(w, http.StatusNotFound, []byte(`{"error":"no entries"}`))
+		return
+	}
+	writeJSON(w, rec.entries[len(rec.entries)-1])
+}
+
+// streamTokens is the deterministic body streamed token-by-token. Kept short and
+// fixed so every run transfers identical bytes.
+var streamTokens = []string{
+	"This ", "is ", "a ", "benchmark ", "response ", "from ", "the ", "mock ",
+	"backend ", "server. ", "It ", "contains ", "enough ", "text ", "to ", "be ",
+	"representative ", "of ", "a ", "typical ", "short ", "AI ", "response ",
+	"that ", "would ", "be ", "returned ", "in ", "production ", "use ", "cases.",
+}
+
+func fullText() string { return strings.Join(streamTokens, "") }
+
+// providerExtras returns provider-specific fields (only in recording mode) so
+// response-normalization fidelity is observable across gateways.
+func providerExtras() map[string]any {
+	if !recording() {
+		return nil
+	}
+	return map[string]any{
+		"system_fingerprint": "fp_mock_0001",
+		"service_tier":       "default",
+		"x_provider_note":    "mock-extra-field",
+	}
+}
+
+func merge(base map[string]any, extra map[string]any) map[string]any {
+	for k, v := range extra {
+		base[k] = v
+	}
+	return base
+}
+
+// ---------- OpenAI Chat Completions ----------
+
+func handleChatCompletions(w http.ResponseWriter, r *http.Request) {
+	e, ok := begin(w, r)
+	if !ok {
+		return
+	}
+	if e.Stream {
+		streamChatCompletion(w, e)
+	} else {
+		nonStreamChatCompletion(w, e)
+	}
+}
+
+func nonStreamChatCompletion(w http.ResponseWriter, e *entry) {
+	resp := merge(map[string]any{
+		"id":      "chatcmpl-bench-001",
+		"object":  "chat.completion",
+		"created": time.Now().Unix(),
+		"model":   "gpt-4o-mini",
+		"choices": []map[string]any{{
+			"index":         0,
+			"message":       map[string]any{"role": "assistant", "content": fullText()},
+			"finish_reason": "stop",
+		}},
+		"usage": map[string]any{"prompt_tokens": 25, "completion_tokens": 35, "total_tokens": 60},
+	}, providerExtras())
+	respond(w, e, resp)
+}
+
+func streamChatCompletion(w http.ResponseWriter, e *entry) {
+	flusher := beginSSE(w)
+	if flusher == nil {
+		return
+	}
+	setStreamResp(e, "chat.completion.chunk")
+	now := time.Now().Unix()
+	send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"role":"assistant","content":""},"finish_reason":null}]}`, now))
+	for _, tok := range streamTokens {
+		send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{"content":%q},"finish_reason":null}]}`, now, tok))
+	}
+	send(w, flusher, "", fmt.Sprintf(`{"id":"chatcmpl-bench-001","object":"chat.completion.chunk","created":%d,"model":"gpt-4o-mini","choices":[{"index":0,"delta":{},"finish_reason":"stop"}],"usage":{"prompt_tokens":25,"completion_tokens":35,"total_tokens":60}}`, now))
+	send(w, flusher, "", "[DONE]")
+}
+
+// ---------- OpenAI Responses ----------
+
+func handleResponses(w http.ResponseWriter, r *http.Request) {
+	e, ok := begin(w, r)
+	if !ok {
+		return
+	}
+	if e.Stream {
+		streamResponses(w, e)
+	} else {
+		nonStreamResponses(w, e)
+	}
+}
+
+func nonStreamResponses(w http.ResponseWriter, e *entry) {
+	resp := merge(map[string]any{
+		"id": "resp-bench-001", "object": "response", "created_at": time.Now().Unix(),
+		"model": "gpt-4o-mini", "status": "completed",
+		"output": []map[string]any{{
+			"type": "message", "id": "msg-bench-001", "role": "assistant",
+			"content": []map[string]any{{"type": "output_text", "text": fullText()}},
+		}},
+		"usage": map[string]any{"input_tokens": 25, "output_tokens": 35, "total_tokens": 60},
+	}, providerExtras())
+	respond(w, e, resp)
+}
+
+func streamResponses(w http.ResponseWriter, e *entry) {
+	flusher := beginSSE(w)
+	if flusher == nil {
+		return
+	}
+	setStreamResp(e, "response.*")
+	now := time.Now().Unix()
+	send(w, flusher, "response.created", mustJSON(map[string]any{"id": "resp-bench-001", "object": "response", "created_at": now, "model": "gpt-4o-mini", "status": "in_progress", "output": []any{}}))
+	send(w, flusher, "response.output_item.added", mustJSON(map[string]any{"type": "message", "id": "msg-bench-001", "role": "assistant", "content": []any{}}))
+	send(w, flusher, "response.content_part.added", mustJSON(map[string]any{"type": "output_text", "text": ""}))
+	for _, tok := range streamTokens {
+		send(w, flusher, "response.output_text.delta", mustJSON(map[string]any{"type": "response.output_text.delta", "delta": tok}))
+	}
+	send(w, flusher, "response.output_text.done", mustJSON(map[string]any{"type": "response.output_text.done", "text": fullText()}))
+	send(w, flusher, "response.completed", mustJSON(map[string]any{
+		"id": "resp-bench-001", "object": "response", "status": "completed",
+		"output": []map[string]any{{"type": "message", "id": "msg-bench-001", "role": "assistant",
+			"content": []map[string]any{{"type": "output_text", "text": fullText()}}}},
+		"usage": map[string]any{"input_tokens": 25, "output_tokens": 35, "total_tokens": 60},
+	}))
+}
+
+// ---------- Anthropic Messages ----------
+
+func handleMessages(w http.ResponseWriter, r *http.Request) {
+	e, ok := begin(w, r)
+	if !ok {
+		return
+	}
+	if e.Stream {
+		streamMessages(w, e)
+	} else {
+		nonStreamMessages(w, e)
+	}
+}
+
+func nonStreamMessages(w http.ResponseWriter, e *entry) {
+	resp := merge(map[string]any{
+		"id": "msg-bench-001", "type": "message", "role": "assistant",
+		"model":   "claude-3-5-sonnet",
+		"content": []map[string]any{{"type": "text", "text": fullText()}},
+		"stop_reason": "end_turn", "stop_sequence": nil,
+		"usage": map[string]any{"input_tokens": 25, "output_tokens": 35},
+	}, providerExtras())
+	respond(w, e, resp)
+}
+
+func streamMessages(w http.ResponseWriter, e *entry) {
+	flusher := beginSSE(w)
+	if flusher == nil {
+		return
+	}
+	setStreamResp(e, "message_*")
+	send(w, flusher, "message_start", mustJSON(map[string]any{"type": "message_start", "message": map[string]any{
+		"id": "msg-bench-001", "type": "message", "role": "assistant", "model": "claude-3-5-sonnet",
+		"content": []any{}, "stop_reason": nil, "usage": map[string]any{"input_tokens": 25, "output_tokens": 1},
+	}}))
+	send(w, flusher, "content_block_start", mustJSON(map[string]any{"type": "content_block_start", "index": 0, "content_block": map[string]any{"type": "text", "text": ""}}))
+	for _, tok := range streamTokens {
+		send(w, flusher, "content_block_delta", mustJSON(map[string]any{"type": "content_block_delta", "index": 0, "delta": map[string]any{"type": "text_delta", "text": tok}}))
+	}
+	send(w, flusher, "content_block_stop", mustJSON(map[string]any{"type": "content_block_stop", "index": 0}))
+	send(w, flusher, "message_delta", mustJSON(map[string]any{"type": "message_delta", "delta": map[string]any{"stop_reason": "end_turn", "stop_sequence": nil}, "usage": map[string]any{"output_tokens": 35}}))
+	send(w, flusher, "message_stop", mustJSON(map[string]any{"type": "message_stop"}))
+}
+
+// ---------- Models ----------
+
+func handleModels(w http.ResponseWriter, _ *http.Request) {
+	writeJSON(w, map[string]any{
+		"object": "list",
+		"data": []map[string]any{
+			{"id": "gpt-4o-mini", "object": "model", "owned_by": "openai", "created": time.Now().Unix()},
+			{"id": "claude-3-5-sonnet", "object": "model", "owned_by": "anthropic", "created": time.Now().Unix()},
+		},
+	})
+}
+
+// ---------- Shared helpers ----------
+
+// streamReq holds the only field the mock needs to decode from any request body.
+type streamReq struct {
+	Stream bool `json:"stream"`
+}
+
+// respond writes a non-stream JSON response and records it on the entry.
+func respond(w http.ResponseWriter, e *entry, v map[string]any) {
+	e.Response = v
+	writeJSON(w, v)
+}
+
+// setStreamResp records a compact description of a streamed canned response.
+func setStreamResp(e *entry, kind string) {
+	e.Response = map[string]any{"stream": true, "event_kind": kind, "text": fullText()}
+}
+
+func beginSSE(w http.ResponseWriter) http.Flusher {
+	flusher, ok := w.(http.Flusher)
+	if !ok {
+		http.Error(w, "streaming not supported", http.StatusInternalServerError)
+		return nil
+	}
+	w.Header().Set("Content-Type", "text/event-stream")
+	w.Header().Set("Cache-Control", "no-cache")
+	w.Header().Set("Connection", "keep-alive")
+	return flusher
+}
+
+// send writes one SSE frame. An empty event name omits the event: line (OpenAI
+// chat style); a name emits "event: <name>" (Responses / Anthropic style).
+func send(w http.ResponseWriter, flusher http.Flusher, event, data string) {
+	if event != "" {
+		fmt.Fprintf(w, "event: %s\n", event)
+	}
+	fmt.Fprintf(w, "data: %s\n\n", data)
+	flusher.Flush()
+}
+
+func writeJSON(w http.ResponseWriter, v any) {
+	w.Header().Set("Content-Type", "application/json")
+	if err := json.NewEncoder(w).Encode(v); err != nil {
+		log.Printf("encode response: %v", err)
+	}
+}
+
+func writeJSONBytes(w http.ResponseWriter, status int, payload []byte) {
+	w.Header().Set("Content-Type", "application/json")
+	w.WriteHeader(status)
+	if _, err := w.Write(payload); err != nil {
+		log.Printf("write response: %v", err)
+	}
+}
+
+func mustJSON(v any) string {
+	b, _ := json.Marshal(v)
+	return string(b)
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
new file mode 100644
index 00000000..0adbef1c
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
@@ -0,0 +1,60 @@
+{
+  "$schema": "https://www.getbifrost.ai/schema",
+  "client": {
+    "drop_excess_requests": false,
+    "initial_pool_size": 5000,
+    "enable_logging": false,
+    "enforce_auth_on_inference": false,
+    "allowed_origins": [
+      "*"
+    ]
+  },
+  "config_store": {
+    "enabled": true,
+    "type": "sqlite",
+    "config": {
+      "path": "/app/data/config.db"
+    }
+  },
+  "logs_store": {
+    "enabled": false
+  },
+  "providers": {
+    "openai": {
+      "keys": [
+        {
+          "name": "mock",
+          "value": "sk-bench-test-key",
+          "models": [
+            "gpt-4o-mini"
+          ],
+          "weight": 1
+        }
+      ],
+      "network_config": {
+        "base_url": "http://mock:9999",
+        "default_request_timeout_in_seconds": 60,
+        "max_retries": 0,
+        "allow_private_network": true
+      }
+    },
+    "anthropic": {
+      "keys": [
+        {
+          "name": "mock",
+          "value": "sk-bench-test-key",
+          "models": [
+            "claude-3-5-sonnet"
+          ],
+          "weight": 1
+        }
+      ],
+      "network_config": {
+        "base_url": "http://mock:9999",
+        "default_request_timeout_in_seconds": 60,
+        "max_retries": 0,
+        "allow_private_network": true
+      }
+    }
+  }
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
new file mode 100644
index 00000000..fafffa5c
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
@@ -0,0 +1,19 @@
+# LiteLLM proxy config for the benchmark. One model, routed to the mock backend
+# over the shared docker network. Retries/spend-logging disabled so we measure
+# routing overhead, not bookkeeping.
+model_list:
+  - model_name: gpt-4o-mini
+    litellm_params:
+      model: openai/gpt-4o-mini
+      api_key: sk-bench-test-key
+      api_base: http://mock:9999/v1
+
+general_settings:
+  master_key: null
+  disable_spend_logs: true
+
+litellm_settings:
+  num_retries: 0
+  request_timeout: 60
+  drop_params: true
+  set_verbose: false
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml b/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
new file mode 100644
index 00000000..73d169f1
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
@@ -0,0 +1,99 @@
+# Benchmark topology: a shared mock backend plus exactly one gateway at a time.
+#
+# Each gateway has its own profile so only one runs at a time (no CPU contention):
+#
+#   docker compose --profile gomodel up -d   # mock + gomodel
+#   docker compose --profile litellm up -d   # mock + litellm
+#   docker compose --profile portkey up -d   # mock + portkey
+#   docker compose --profile bifrost up -d   # mock + bifrost
+#
+# The mock has no profile, so it starts alongside whichever gateway is selected.
+# The load generator is run separately via `docker run --network benchnet`.
+#
+# Image refs are overridable via env so exact versions/digests stay pinnable.
+
+networks:
+  default:
+    name: benchnet
+
+services:
+  mock:
+    image: ${BENCH_TOOLS_IMAGE:-bench-tools:local}
+    command: ["/mock"]
+    environment:
+      - MOCK_PORT=9999
+    ports:
+      - "9999:9999"   # published so the runner can record a no-gateway baseline
+    restart: "no"
+
+  gomodel:
+    profiles: ["gomodel"]
+    image: ${GOMODEL_IMAGE:-gomodel-bench:local}
+    depends_on: [mock]
+    ports:
+      - "8080:8080"
+    environment:
+      - PORT=8080
+      - GOMODEL_MASTER_KEY=
+      - OPENAI_API_KEY=sk-bench-test-key
+      - OPENAI_BASE_URL=http://mock:9999/v1
+      - LOGGING_ENABLED=false
+      - USAGE_ENABLED=false
+      - METRICS_ENABLED=false
+      - SWAGGER_ENABLED=false
+      - PPROF_ENABLED=false
+      - ENABLE_PASSTHROUGH_ROUTES=false
+      # Config parity with the no-retry / no-breaker peers (LiteLLM, Bifrost):
+      #  - retries off (default is 3).
+      #  - circuit breaker effectively disabled. It has no on/off env, so set an
+      #    unreachable failure threshold. This matters under load: a few transient
+      #    upstream errors at high concurrency were tripping the default breaker
+      #    (threshold 5), which then blanket-503'd every request ("circuit breaker
+      #    is open") and made GoModel's own capacity/resource numbers read as ~0.
+      #    No other gateway has a breaker, so disabling it keeps the test fair.
+      - RETRY_MAX_RETRIES=0
+      - CIRCUIT_BREAKER_FAILURE_THRESHOLD=1000000000
+      - STORAGE_TYPE=sqlite
+      - SQLITE_PATH=/app/data/gomodel-bench.db
+      - GOMODEL_CACHE_DIR=/app/.cache
+    restart: "no"
+
+  litellm:
+    profiles: ["litellm"]
+    image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable}
+    depends_on: [mock]
+    ports:
+      - "4000:4000"
+    volumes:
+      - ./configs/litellm-config.yaml:/app/config.yaml:ro
+    # LiteLLM's own recommendation is one worker per CPU core. run-on-instance.sh
+    # sets LITELLM_NUM_WORKERS=$(nproc) so this matches the box (2 on c7i.large),
+    # giving LiteLLM the same multi-core access the Go gateways get for free.
+    command: ["--config", "/app/config.yaml", "--port", "4000", "--num_workers", "${LITELLM_NUM_WORKERS:-1}"]
+    restart: "no"
+
+  portkey:
+    profiles: ["portkey"]
+    image: ${PORTKEY_IMAGE:-portkeyai/gateway:latest}
+    depends_on: [mock]
+    ports:
+      - "8787:8787"
+    environment:
+      # Allow the mock service hostname through Portkey's SSRF/private-IP filter.
+      - TRUSTED_CUSTOM_HOSTS=mock
+    restart: "no"
+
+  bifrost:
+    profiles: ["bifrost"]
+    image: ${BIFROST_IMAGE:-maximhq/bifrost:latest}
+    depends_on: [mock]
+    ports:
+      - "8089:8089"
+    environment:
+      - APP_PORT=8089
+      - APP_HOST=0.0.0.0       # bind all interfaces (default binds localhost)
+    volumes:
+      # Provider config (openai + anthropic -> mock); /app/data stays writable
+      # for Bifrost's sqlite config store.
+      - ./configs/bifrost-config.json:/app/data/config.json:ro
+    restart: "no"
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh b/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
new file mode 100755
index 00000000..bfb384b4
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
@@ -0,0 +1,379 @@
+#!/usr/bin/env bash
+# Runs the gateway latency + capacity + resource benchmark on the local docker host.
+#
+# Designed to run ON the provisioned EC2 instance (invoked by ../run.sh over
+# SSH), but it works on any docker host. Two passes:
+#
+#   Pass A — latency: REPEATS independent trials. Each trial brings up exactly one
+#            gateway at a time (no contention), warms it, drives all six request
+#            variants, tears it down. Gateway *order is randomized every trial* so
+#            no gateway is pinned to the most-favorable slot; results land in
+#            results/run<k>/. Aggregation (median + spread across trials) is left
+#            to scripts/summarize.py.
+#
+#   Pass B — capacity + footprint (once): per gateway, measure cold-start latency,
+#            image size, a throughput-vs-concurrency sweep (sustained req/s at each
+#            concurrency level — true capacity, not latency-coupled), and CPU/mem
+#            under sustained load.
+#
+# Results are written as JSON to ./results/ for the orchestrator to collect.
+#
+# NOTE: deliberately NOT `set -e`. This is a resilient benchmark harness — a
+# single flaky docker/compose/curl on one variant must not abort the whole run;
+# it should skip to the next variant and still reach the final meta.json sentinel
+# the orchestrator polls for. Failures are visible in each variant's ok/failed.
+set -uo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+cd "$SCRIPT_DIR"
+RESULTS_DIR="$SCRIPT_DIR/results"
+COMPOSE=(docker compose -p bench)
+
+# Load knobs. Defaults target a non-burstable box (c7i.large); see ../run.sh.
+N="${N:-20000}"          # requests per variant (large enough for a stable p99)
+C="${C:-10}"             # reference concurrency for the latency pass
+REPEATS="${REPEATS:-5}"  # independent latency trials (median + spread)
+WARMUP="${WARMUP:-100}"  # global chat warmup after a gateway starts (process/connection init)
+WARMUP_VARIANT="${WARMUP_VARIANT:-30}"  # per-variant warmup (per-dialect lazy-import cold start)
+RESOURCE_SECONDS="${RESOURCE_SECONDS:-15}"  # sustained-load window for CPU/mem sampling
+REST_SECONDS="${REST_SECONDS:-5}"           # settle gap between targets (cooldown)
+# Per-variant wall cap: fast variants hit full N in seconds; this only bites the
+# idle-bound streaming variants (e.g. Bifrost streams over a non-native backend
+# fall back to the 1.5s idle timeout → ~7 req/s, which would take ~50 min for N).
+MAX_VARIANT_SECONDS="${MAX_VARIANT_SECONDS:-60}"
+SWEEP_CONCURRENCY="${SWEEP_CONCURRENCY:-1 2 4 8 16 32 64 128 256}"  # capacity-sweep points
+SWEEP_DURATION="${SWEEP_DURATION:-8}"       # seconds of sustained load per sweep point
+GATEWAYS="${GATEWAYS:-gomodel litellm portkey bifrost}"
+# LiteLLM recommends one worker per CPU core; match the box so it isn't pinned to a
+# single core while the Go gateways use all of them. Exported for docker-compose's
+# ${LITELLM_NUM_WORKERS} substitution. (Per-variant warmup already warms each
+# dialect; with >1 worker the warmup also spreads across workers.)
+export LITELLM_NUM_WORKERS="${LITELLM_NUM_WORKERS:-$(nproc 2>/dev/null || echo 1)}"
+
+AUTH="sk-bench-test-key"
+
+log() { printf '\n\033[1;34m>>> %s\033[0m\n' "$*"; }
+
+rm -rf "$RESULTS_DIR"; mkdir -p "$RESULTS_DIR"
+
+# ── helpers ────────────────────────────────────────────────────────
+# Epoch time in fractional seconds (python3 ships on AL2023 and macOS; falls back to whole seconds via date).
+epoch() { python3 -c 'import time;print(time.time())' 2>/dev/null || date +%s; }
+
+# shuffle a space-separated list; seed varies per call so trials differ in order.
+shuffle() {
+  printf '%s\n' $1 | awk -v seed="${2:-$RANDOM}" 'BEGIN{srand(seed)} {print rand()"\t"$0}' \
+    | sort -k1,1n | cut -f2- | tr '\n' ' '
+}
+
+# Internal port per service. Extra loadgen headers (Portkey routing) come from gw_headers below.
+gw_port() { case "$1" in gomodel) echo 8080;; litellm) echo 4000;; portkey) echo 8787;; bifrost) echo 8089;; mock) echo 9999;; esac; }
+
+# gw_headers fills the global HDRS array with loadgen -H args for the target.
+HDRS=()
+gw_headers() {
+  HDRS=()
+  case "$1" in
+    portkey)
+      HDRS=(-H 'x-portkey-provider: openai' -H 'x-portkey-custom-host: http://mock:9999/v1')
+      ;;
+  esac
+}
+
+# Model name per gateway. Bifrost routes by an explicit "provider/model" prefix.
+gw_model() { case "$1" in bifrost) echo "openai/gpt-4o-mini";; *) echo "gpt-4o-mini";; esac; }
+
+# Path per (gateway, dialect). Bifrost exposes Anthropic Messages under /anthropic/v1/messages.
+gw_path() {  # target dialect default_path
+  if [[ "$1" == "bifrost" && "$2" == "messages" ]]; then echo "/anthropic/v1/messages"; else echo "$3"; fi
+}
+
+# The six benchmark variants: dialect | mode | path
+VARIANTS=(
+  "chat|nonstream|/v1/chat/completions"
+  "chat|stream|/v1/chat/completions"
+  "responses|nonstream|/v1/responses"
+  "responses|stream|/v1/responses"
+  "messages|nonstream|/v1/messages"
+  "messages|stream|/v1/messages"
+)
+
+BENCH_TOOLS_IMAGE="${BENCH_TOOLS_IMAGE:-bench-tools:local}"
+
+# loadgen runs in a throwaway container on the shared benchnet network so it can
+# reach gateways/mock by service name. JSON summary comes back on stdout.
+run_variant() {
+  local target="$1" svc="$2" spec="$3" outfile="$4"
+  local dialect mode path
+  IFS='|' read -r dialect mode path <<< "$spec"
+  path="$(gw_path "$target" "$dialect" "$path")"
+  local port; port="$(gw_port "$svc")"
+  local url="http://${svc}:${port}${path}"
+
+  local base=(-url "$url" -dialect "$dialect" -model "$(gw_model "$target")" -c "$C" -auth "$AUTH" -json -)
+  [[ "$mode" == "stream" ]] && base+=(-stream)
+  [[ "$MAX_VARIANT_SECONDS" -gt 0 ]] && base+=(-max-wall "${MAX_VARIANT_SECONDS}s")
+  gw_headers "$target"
+  if [[ ${#HDRS[@]} -gt 0 ]]; then base+=("${HDRS[@]}"); fi
+
+  # Per-variant warmup: warm THIS exact dialect+mode before measuring. Python
+  # gateways (LiteLLM) lazily import per-dialect translation modules on first use,
+  # so a chat-only warmup leaves responses/messages cold and inflates their tails.
+  if [[ "$WARMUP_VARIANT" -gt 0 ]]; then
+    docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen \
+      "${base[@]}" -n "$WARMUP_VARIANT" >/dev/null 2>&1 || true
+  fi
+
+  docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen \
+    "${base[@]}" -n "$N" > "$outfile" 2>/dev/null || true
+
+  # `|| true`: a single empty/missing summary must never abort the whole run.
+  local ok fail
+  ok="$(grep -o '"ok": *[0-9]*' "$outfile" 2>/dev/null | head -1 | grep -o '[0-9]*' || true)"
+  fail="$(grep -o '"failed": *[0-9]*' "$outfile" 2>/dev/null | head -1 | grep -o '[0-9]*' || true)"
+  printf '    %-8s %-10s %-9s ok=%-6s failed=%s\n' "$target" "$dialect" "$mode" "${ok:-?}" "${fail:-?}"
+}
+
+# run_sweep drives a throughput-vs-concurrency sweep (chat, non-stream) so we can
+# read each gateway's saturation point — sustained req/s at each concurrency, via
+# loadgen's time-boxed mode (not the latency pass's fixed-N, latency-coupled rps).
+run_sweep() {
+  local label="$1" svc="$2" port; port="$(gw_port "$svc")"
+  local url="http://${svc}:${port}/v1/chat/completions"
+  mkdir -p "$RESULTS_DIR/sweep"
+  gw_headers "$label"; local hdr=(); [[ ${#HDRS[@]} -gt 0 ]] && hdr=("${HDRS[@]}")
+  for cc in $SWEEP_CONCURRENCY; do
+    local args=(-url "$url" -dialect chat -model "$(gw_model "$label")" -c "$cc" -duration "${SWEEP_DURATION}s" -auth "$AUTH" -json -)
+    [[ ${#hdr[@]} -gt 0 ]] && args+=("${hdr[@]}")
+    docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${args[@]}" \
+      > "$RESULTS_DIR/sweep/${label}_c${cc}.json" 2>/dev/null || true
+    local rps; rps="$(grep -o '"rps": *[0-9.]*' "$RESULTS_DIR/sweep/${label}_c${cc}.json" 2>/dev/null | head -1 | grep -o '[0-9.]*' || true)"
+    printf '    sweep %-8s c=%-4s rps=%s\n' "$label" "$cc" "${rps:-?}"
+  done
+}
+
+# awk program that normalizes a docker-stats MemUsage field to MiB, then prints
+# "mem_mb,cpu_pct".
+STAT_AWK='
+function tomib(s,  v){ v=s; gsub(/[^0-9.]/,"",v); v=v+0;
+  if (s ~ /GiB|GB/) return v*1024;
+  if (s ~ /MiB|MB/) return v;
+  if (s ~ /KiB|kB/) return v/1024;
+  if (s ~ /[0-9]B/) return v/1048576;
+  return v }
+{ split($0,a,";"); mem=a[1]; sub(/ ?\/.*/,"",mem);
+  cpu=a[2]; gsub(/[^0-9.]/,"",cpu);
+  m=tomib(mem); if (m>0) printf "%.2f,%s\n", m, cpu }'
+
+SAMPLER_PID=""
+start_sampler() {
+  local cname="$1" csv="$2"
+  echo "mem_mb,cpu_pct" > "$csv"
+  (
+    while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do
+      docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \
+        | awk "$STAT_AWK" >> "$csv" || true
+    done
+  ) &
+  SAMPLER_PID=$!
+}
+
+# Drive sustained chat load at a gateway for ~RESOURCE_SECONDS so the sampler
+# captures the container under genuine pressure. Writes loadgen's summary to
+# $3 so the achieved rps shares the exact window the CPU sample covers (lets
+# summarize.py compute a self-consistent rps-per-CPU% efficiency).
+sustained_load() {
+  local gw="$1" hostport="$2" outfile="$3"
+  local args=(-url "http://${gw}:${hostport}/v1/chat/completions" -dialect chat -model "$(gw_model "$gw")" -duration "${RESOURCE_SECONDS}s" -c "$C" -auth "$AUTH" -json -)
+  gw_headers "$gw"; if [[ ${#HDRS[@]} -gt 0 ]]; then args+=("${HDRS[@]}"); fi
+  docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${args[@]}" > "$outfile" 2>/dev/null || true
+}
+
+stop_sampler() {
+  [[ -n "$SAMPLER_PID" ]] && kill "$SAMPLER_PID" 2>/dev/null || true
+  [[ -n "$SAMPLER_PID" ]] && wait "$SAMPLER_PID" 2>/dev/null || true
+  SAMPLER_PID=""
+}
+
+summarize_resources() {  # csv -> json {peak_mem_mb, avg_mem_mb, avg_cpu_pct, samples}
+  [[ -f "$1" ]] || { printf '{"peak_mem_mb":0,"avg_mem_mb":0,"avg_cpu_pct":0,"samples":0}'; return 0; }
+  awk -F, 'NR>1 && $1>0 { n++; s_mem+=$1; s_cpu+=$2; if($1>peak)peak=$1 }
+    END {
+      if(n>0) printf "{\"peak_mem_mb\":%.1f,\"avg_mem_mb\":%.1f,\"avg_cpu_pct\":%.1f,\"samples\":%d}", peak, s_mem/n, s_cpu/n, n;
+      else printf "{\"peak_mem_mb\":0,\"avg_mem_mb\":0,\"avg_cpu_pct\":0,\"samples\":0}"
+    }' "$1"
+}
+
+record_image() {  # gateway image_ref -> results/<gw>_image.json
+  local gw="$1" ref="$2"
+  local size digest compressed
+  size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)"
+  digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)"
+  # Compressed size = what you actually pull/store: gzip the saved image (uniform
+  # across the locally-built gomodel image and the pulled competitor images).
+  compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d ' ')" || compressed=0
+  printf '{"gateway":"%s","image":"%s","size_bytes":%s,"size_mb":%.1f,"compressed_bytes":%s,"compressed_mb":%.1f,"digest":"%s"}\n' \
+    "$gw" "$ref" "${size:-0}" "$(awk "BEGIN{print ${size:-0}/1048576}")" \
+    "${compressed:-0}" "$(awk "BEGIN{print ${compressed:-0}/1048576}")" "$digest" \
+    > "$RESULTS_DIR/${gw}_image.json"
+}
+
+wait_ready() {  # gateway host_port -> poll a real chat request until HTTP 200
+  local target="$1" hostport="$2" tries="${3:-60}"
+  gw_headers "$target"
+  local hdr=(); if [[ ${#HDRS[@]} -gt 0 ]]; then hdr=("${HDRS[@]}"); fi
+  local code
+  for ((i=0;i<tries;i++)); do
+    code="$(curl -s -o /dev/null -w '%{http_code}' -m 5 -X POST \
+      "http://localhost:${hostport}/v1/chat/completions" \
+      -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTH" ${hdr[@]+"${hdr[@]}"} \
+      -d "{\"model\":\"$(gw_model "$target")\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}]}" 2>/dev/null || echo 000)"
+    [[ "$code" == "200" ]] && return 0
+    sleep 2
+  done
+  echo "  WARN: $target did not return 200 within $((tries*2))s (last code: ${code:-?})" >&2
+  return 1
+}
+
+# bring a gateway up and time cold-start latency (compose up -> first HTTP 200).
+# Leaves the gateway running. Writes results/<gw>_startup.json.
+measure_startup() {
+  local gw="$1" hostport; hostport="$(gw_port "$gw")"
+  local t0 t1 code ready=0
+  gw_headers "$gw"; local hdr=(); [[ ${#HDRS[@]} -gt 0 ]] && hdr=("${HDRS[@]}")
+  t0="$(epoch)"
+  GOMODEL_IMAGE="${GOMODEL_IMAGE:-gomodel-bench:local}" \
+    "${COMPOSE[@]}" --profile "$gw" up -d "$gw" >/dev/null 2>&1 || true
+  for ((i=0;i<600;i++)); do  # up to ~120s, 0.2s resolution
+    code="$(curl -s -o /dev/null -w '%{http_code}' -m 5 -X POST \
+      "http://localhost:${hostport}/v1/chat/completions" \
+      -H 'Content-Type: application/json' -H "Authorization: Bearer $AUTH" ${hdr[@]+"${hdr[@]}"} \
+      -d "{\"model\":\"$(gw_model "$gw")\",\"messages\":[{\"role\":\"user\",\"content\":\"ping\"}]}" 2>/dev/null || echo 000)"
+    [[ "$code" == "200" ]] && { ready=1; break; }
+    sleep 0.2
+  done
+  t1="$(epoch)"
+  local elapsed; elapsed="$(awk -v a="$t0" -v b="$t1" 'BEGIN{printf "%.3f", b-a}')"
+  printf '{"gateway":"%s","startup_s":%s,"ready":%s}\n' "$gw" "$elapsed" "$ready" \
+    > "$RESULTS_DIR/${gw}_startup.json"
+  echo "    startup: ${gw} ${elapsed}s (ready=$ready)"
+}
+
+warmup_gateway() {
+  local gw="$1" hostport; hostport="$(gw_port "$gw")"
+  local warm_args=(-url "http://${gw}:${hostport}/v1/chat/completions" -dialect chat -model "$(gw_model "$gw")" -n "$WARMUP" -c "$C" -auth "$AUTH" -json -)
+  gw_headers "$gw"; if [[ ${#HDRS[@]} -gt 0 ]]; then warm_args+=("${HDRS[@]}"); fi
+  docker run --rm --network benchnet "$BENCH_TOOLS_IMAGE" /loadgen "${warm_args[@]}" >/dev/null 2>&1 || true
+}
+
+image_ref() { case "$1" in
+  gomodel) echo "${GOMODEL_IMAGE:-gomodel-bench:local}";;
+  litellm) echo "${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable}";;
+  portkey) echo "${PORTKEY_IMAGE:-portkeyai/gateway:latest}";;
+  bifrost) echo "${BIFROST_IMAGE:-maximhq/bifrost:latest}";;
+esac; }
+
+# ── Build the bench-tools image ────────────────────────────────────
+log "Building bench-tools image"
+docker build -q -t "$BENCH_TOOLS_IMAGE" ./bench-tools >/dev/null
+
+# ── Pull latest competitor images up front (digests recorded per gateway) ──
+for gw in $GATEWAYS; do
+  [[ "$gw" == "gomodel" ]] && continue
+  docker pull -q "$(image_ref "$gw")" 2>/dev/null || true
+done
+
+# ── Clean any leftover state, then bring up the shared mock ────────
+"${COMPOSE[@]}" --profile gomodel --profile litellm --profile portkey --profile bifrost down -v >/dev/null 2>&1 || true
+log "Starting mock backend"
+"${COMPOSE[@]}" up -d mock
+sleep 2
+
+# ── PASS A: latency, REPEATS trials, randomized target order ───────
+for r in $(seq 1 "$REPEATS"); do
+  RUN_DIR="$RESULTS_DIR/run${r}"; mkdir -p "$RUN_DIR"
+  "${COMPOSE[@]}" up -d mock >/dev/null 2>&1 || true  # ensure the shared mock is up
+  ORDER="$(shuffle "baseline $GATEWAYS" "$((r * 7919 + RANDOM))")"
+  log "Latency trial ${r}/${REPEATS}  (order: ${ORDER})"
+  for t in $ORDER; do
+    if [[ "$t" == "baseline" ]]; then
+      for spec in "${VARIANTS[@]}"; do
+        IFS='|' read -r dialect mode _ <<< "$spec"
+        run_variant "baseline" "mock" "$spec" "$RUN_DIR/baseline_${dialect}_${mode}.json"
+      done
+    else
+      GOMODEL_IMAGE="${GOMODEL_IMAGE:-gomodel-bench:local}" \
+        "${COMPOSE[@]}" --profile "$t" up -d "$t" >/dev/null 2>&1 || true
+      wait_ready "$t" "$(gw_port "$t")" || true
+      warmup_gateway "$t"
+      for spec in "${VARIANTS[@]}"; do
+        IFS='|' read -r dialect mode _ <<< "$spec"
+        run_variant "$t" "$t" "$spec" "$RUN_DIR/${t}_${dialect}_${mode}.json"
+      done
+      # Remove only this gateway's container — NOT `compose down`, which would
+      # also tear down the profile-less mock and break the next baseline.
+      "${COMPOSE[@]}" --profile "$t" rm -sf "$t" >/dev/null 2>&1 || true
+    fi
+    sleep "$REST_SECONDS"
+  done
+done
+
+# ── PASS B: capacity sweep + startup + footprint, once, randomized ─
+log "Capacity + footprint pass"
+"${COMPOSE[@]}" up -d mock >/dev/null 2>&1 || true  # ensure the shared mock is up
+# Baseline capacity ceiling first (mock is already up, no gateway lifecycle).
+run_sweep "baseline" "mock"
+
+for gw in $(shuffle "$GATEWAYS"); do
+  ref="$(image_ref "$gw")"
+  log "Capacity: $gw  (image: $ref)"
+  measure_startup "$gw"          # brings the gateway up + times cold start
+  record_image "$gw" "$ref"
+  warmup_gateway "$gw"
+  run_sweep "$gw" "$gw"
+
+  cname="bench-${gw}-1"
+  csv="$RESULTS_DIR/${gw}_resources.csv"
+  load_json="$RESULTS_DIR/${gw}_sustained.json"
+  idle_mem="$(docker stats --no-stream --format '{{.MemUsage}};0' "$cname" 2>/dev/null | awk "$STAT_AWK" | cut -d, -f1 || true)"
+  start_sampler "$cname" "$csv"
+  sustained_load "$gw" "$(gw_port "$gw")" "$load_json"
+  stop_sampler
+
+  res="$(summarize_resources "$csv")"
+  load_rps="$(grep -o '"rps": *[0-9.]*' "$load_json" 2>/dev/null | head -1 | grep -o '[0-9.]*' || true)"
+  printf '{"gateway":"%s","idle_mem_mb":%s,"load_rps":%s,"under_load":%s}\n' \
+    "$gw" "${idle_mem:-0}" "${load_rps:-0}" "$res" > "$RESULTS_DIR/${gw}_resources.json"
+  echo "    resources: idle=${idle_mem:-0}MiB load_rps=${load_rps:-0} $res"
+
+  "${COMPOSE[@]}" --profile "$gw" rm -sf "$gw" >/dev/null 2>&1 || true
+  sleep "$REST_SECONDS"
+done
+
+"${COMPOSE[@]}" down -v >/dev/null 2>&1 || true
+
+# ── Run metadata ───────────────────────────────────────────────────
+IMDS_TOKEN="$(curl -s -m 2 -X PUT 'http://169.254.169.254/latest/api/token' -H 'X-aws-ec2-metadata-token-ttl-seconds: 60' 2>/dev/null || true)"
+INSTANCE_TYPE_META="$(curl -s -m 2 -H "X-aws-ec2-metadata-token: $IMDS_TOKEN" http://169.254.169.254/latest/meta-data/instance-type 2>/dev/null || true)"
+[[ "$INSTANCE_TYPE_META" == *"<"* || -z "$INSTANCE_TYPE_META" ]] && INSTANCE_TYPE_META="unknown"
+cat > "$RESULTS_DIR/meta.json" <<JSON
+{
+  "n_requests": $N,
+  "max_variant_seconds": $MAX_VARIANT_SECONDS,
+  "concurrency": $C,
+  "repeats": $REPEATS,
+  "litellm_num_workers": $LITELLM_NUM_WORKERS,
+  "warmup": $WARMUP,
+  "resource_seconds": $RESOURCE_SECONDS,
+  "rest_seconds": $REST_SECONDS,
+  "sweep_concurrency": "$(echo "$SWEEP_CONCURRENCY" | tr ' ' ',')",
+  "sweep_duration_s": $SWEEP_DURATION,
+  "gateways": "$(echo "$GATEWAYS" | tr ' ' ',')",
+  "instance_type": "$INSTANCE_TYPE_META",
+  "cpus": $(nproc 2>/dev/null || echo 1),
+  "kernel": "$(uname -r)"
+}
+JSON
+
+log "Done. Results in $RESULTS_DIR"
+ls -1 "$RESULTS_DIR"
diff --git a/docs/2026-06-25_aws_gateway_benchmark/run.sh b/docs/2026-06-25_aws_gateway_benchmark/run.sh
new file mode 100755
index 00000000..42b24ad1
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/run.sh
@@ -0,0 +1,148 @@
+#!/usr/bin/env bash
+# End-to-end AWS gateway benchmark orchestrator.
+#
+#   1. build the GoModel image (linux/amd64) and save it for transfer
+#   2. terraform apply  -> EC2 instance (default c7i.large; NOT free tier)
+#   3. wait for SSH + docker, ship the harness, load the GoModel image
+#   4. run the containerized benchmark (6 variants x 4 gateways + baseline,
+#      REPEATS latency trials + a capacity sweep)
+#   5. pull results back and summarize
+#   6. terraform destroy  -> teardown via EXIT trap (runs even on failure)
+#
+# Teardown is wired to an EXIT trap, so the instance is destroyed whether the run
+# succeeds or fails; a failed destroy is reported. Set KEEP=1 to keep it for debugging.
+#
+# Usage:  ./run.sh                       # full run, then destroy
+#         N=20000 C=10 REPEATS=5 ./run.sh
+#         INSTANCE_TYPE=t2.micro ./run.sh   # cheaper/burstable (free tier)
+#         KEEP=1 ./run.sh                # don't destroy at the end
+set -euo pipefail
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+TF_DIR="$SCRIPT_DIR/terraform"
+REMOTE_DIR="$SCRIPT_DIR/remote"
+PROJECT_ROOT="$(cd "$SCRIPT_DIR/../.." && pwd)"
+
+TF="${TERRAFORM:-terraform}"
+REGION="${REGION:-us-east-1}"
+INSTANCE_TYPE="${INSTANCE_TYPE:-c7i.large}"  # 2 vCPU, non-burstable (stable tail); NOT free tier
+N="${N:-20000}"
+C="${C:-10}"
+REPEATS="${REPEATS:-5}"
+GATEWAYS="${GATEWAYS:-gomodel litellm portkey bifrost}"
+GOMODEL_IMAGE_TAG="gomodel-bench:local"
+IMAGE_TAR="/tmp/gomodel-bench-amd64.tar.gz"
+
+STAMP="$(date -u +%Y%m%d-%H%M%S)"
+OUT_DIR="$SCRIPT_DIR/output/$STAMP"
+
+log()  { printf '\n\033[1;34m>>> %s\033[0m\n' "$*"; }
+err()  { printf '\033[1;31m!!! %s\033[0m\n' "$*" >&2; }
+
+destroy() {
+  if [[ "${KEEP:-0}" == "1" ]]; then
+    err "KEEP=1 set — leaving instance up. Destroy later with: (cd $TF_DIR && $TF destroy -auto-approve)"
+    return
+  fi
+  log "Destroying AWS resources (terraform destroy)"
+  (cd "$TF_DIR" && $TF destroy -auto-approve -var "region=$REGION" >/dev/null 2>&1) \
+    && echo "  teardown complete" || err "TEARDOWN FAILED — check: (cd $TF_DIR && $TF destroy)"
+}
+command -v "$TF" >/dev/null || { err "terraform not found (set TERRAFORM=/path/to/terraform)"; exit 1; }
+command -v docker >/dev/null || { err "docker required to build the GoModel image"; exit 1; }
+
+trap destroy EXIT  # armed after the tool checks so a missing binary doesn't trigger a destroy
+
+# ── 1. Build + save GoModel image for amd64 ────────────────────────
+log "Building GoModel image (linux/amd64)"
+docker buildx build --platform linux/amd64 -t "$GOMODEL_IMAGE_TAG" --load "$PROJECT_ROOT"
+log "Saving image -> $IMAGE_TAR"
+docker save "$GOMODEL_IMAGE_TAG" | gzip > "$IMAGE_TAR"
+
+# ── 2. Provision ───────────────────────────────────────────────────
+MY_IP="$(curl -s https://checkip.amazonaws.com | tr -d '[:space:]')"
+log "Provisioning $INSTANCE_TYPE in $REGION (SSH locked to ${MY_IP}/32)"
+(cd "$TF_DIR" && $TF init -input=false >/dev/null && \
+  $TF apply -auto-approve -input=false \
+    -var "region=$REGION" -var "instance_type=$INSTANCE_TYPE" \
+    -var "ssh_ingress_cidr=${MY_IP}/32")
+
+IP="$(cd "$TF_DIR" && $TF output -raw public_ip)"
+KEY="$(cd "$TF_DIR" && $TF output -raw ssh_private_key_path)"
+USER="$(cd "$TF_DIR" && $TF output -raw ssh_user)"
+SSH_OPTS=(-o StrictHostKeyChecking=no -o UserKnownHostsFile=/dev/null -o ConnectTimeout=10 -i "$KEY")
+echo "  instance: $USER@$IP  (key: $KEY)"
+[[ -f "$KEY" ]] || { err "private key not found at $KEY"; exit 1; }
+
+# ── 3. Wait for SSH + bootstrap ────────────────────────────────────
+log "Waiting for SSH"
+for i in $(seq 1 60); do
+  ssh "${SSH_OPTS[@]}" "$USER@$IP" true 2>/dev/null && break
+  if [[ $i == 60 ]]; then
+    err "SSH never came up — last attempt error:"
+    ssh "${SSH_OPTS[@]}" "$USER@$IP" true 2>&1 | sed 's/^/    /' || true
+    exit 1
+  fi
+  sleep 5
+done
+log "Waiting for docker bootstrap (user-data)"
+for i in $(seq 1 60); do
+  ssh "${SSH_OPTS[@]}" "$USER@$IP" 'test -f ~/.bootstrap-done && docker info >/dev/null 2>&1' && break
+  sleep 5
+  [[ $i == 60 ]] && { err "docker bootstrap never finished"; exit 1; }
+done
+echo "  ready"
+
+# ── 4. Ship harness + image, run ───────────────────────────────────
+log "Shipping harness to instance"
+ssh "${SSH_OPTS[@]}" "$USER@$IP" 'rm -rf ~/bench && mkdir -p ~/bench'
+rsync -az -e "ssh ${SSH_OPTS[*]}" --exclude results "$REMOTE_DIR/" "$USER@$IP:~/bench/"
+scp "${SSH_OPTS[@]}" "$IMAGE_TAR" "$USER@$IP:~/gomodel-bench-amd64.tar.gz"
+
+log "Loading GoModel image on instance"
+ssh "${SSH_OPTS[@]}" "$USER@$IP" 'gunzip -c ~/gomodel-bench-amd64.tar.gz | docker load'
+
+# Forward all benchmark knobs to the instance (only the ones that are set).
+REMOTE_ENV="N=$N C=$C REPEATS=$REPEATS GATEWAYS='$GATEWAYS' GOMODEL_IMAGE=$GOMODEL_IMAGE_TAG"
+for v in MAX_VARIANT_SECONDS SWEEP_CONCURRENCY SWEEP_DURATION RESOURCE_SECONDS REST_SECONDS WARMUP WARMUP_VARIANT; do
+  if [[ -n "${!v:-}" ]]; then REMOTE_ENV="$REMOTE_ENV $v='${!v}'"; fi
+done
+
+# Launch DETACHED with setsid so the benchmark survives any SSH drop or hang —
+# the controlling session no longer owns the process. We then poll for the
+# terminal sentinel (results/meta.json, written only at the very end). This is
+# the fix for the earlier run dying with the SSH session still half-open.
+log "Launching benchmark detached (N=$N C=$C REPEATS=$REPEATS gateways: $GATEWAYS)"
+ssh "${SSH_OPTS[@]}" "$USER@$IP" \
+  "cd ~/bench && chmod +x run-on-instance.sh && rm -f run.log && \
+   setsid env $REMOTE_ENV bash run-on-instance.sh > run.log 2>&1 < /dev/null & echo launched"
+
+log "Waiting for benchmark (polling every 15s; survives SSH drops)"
+POLL_MAX="${POLL_MAX:-160}"   # 160 * 15s = 40 min ceiling
+done_ok=0
+for ((i=0; i<POLL_MAX; i++)); do
+  sleep 15
+  if ssh "${SSH_OPTS[@]}" "$USER@$IP" 'test -f ~/bench/results/meta.json' 2>/dev/null; then
+    done_ok=1; echo "  benchmark complete (meta.json present)"; break
+  fi
+  # After warmup, a missing run-on-instance process + no meta = it died; collect partial.
+  if (( i > 3 )) && ! ssh "${SSH_OPTS[@]}" "$USER@$IP" 'pgrep -f "[r]un-on-instance.sh" >/dev/null' 2>/dev/null; then
+    err "remote benchmark ended without meta.json — collecting partial results"; break
+  fi
+  if (( i % 4 == 0 )); then
+    ssh "${SSH_OPTS[@]}" "$USER@$IP" 'sed "s/\x1b\[[0-9;]*m//g" ~/bench/run.log 2>/dev/null | grep -E ">>>|trial [0-9]+/" | tail -2' 2>/dev/null || true
+  fi
+done
+(( done_ok == 1 )) || err "polling ended (timeout or early exit) — proceeding to collect whatever exists"
+ssh "${SSH_OPTS[@]}" "$USER@$IP" 'echo "--- tail of remote run.log ---"; tail -25 ~/bench/run.log' 2>/dev/null || true
+
+# ── 5. Collect + summarize ─────────────────────────────────────────
+log "Collecting results -> $OUT_DIR"
+mkdir -p "$OUT_DIR"
+rsync -az -e "ssh ${SSH_OPTS[*]}" "$USER@$IP:~/bench/results/" "$OUT_DIR/"
+
+if command -v python3 >/dev/null; then
+  python3 "$SCRIPT_DIR/scripts/summarize.py" --results-dir "$OUT_DIR" | tee "$OUT_DIR/summary.txt"
+fi
+log "Raw + summarized results in: $OUT_DIR"
+# destroy() runs on EXIT
diff --git a/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py b/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
new file mode 100644
index 00000000..8097ac0c
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
@@ -0,0 +1,296 @@
+#!/usr/bin/env python3
+"""Normalize the raw benchmark JSON into human tables + one summary.json.
+
+Reads a results directory produced by run-on-instance.sh:
+
+  results/
+    run1/ … runN/   latency trials (per target+variant JSON)   <- aggregated here
+    sweep/          throughput-vs-concurrency capacity points
+    <gw>_image.json <gw>_startup.json <gw>_resources.json
+    meta.json
+
+Latency is reported as the MEDIAN across trials with a min–max spread on the
+noisy tail (p99) and on rps, so single-window jitter no longer drives the story.
+Also emits overhead-vs-baseline, a capacity-sweep table (sustained req/s and the
+saturation knee), startup latency, and rps-per-CPU% efficiency.
+
+Back-compat: a flat results dir (no run* subdirs) is treated as a single trial.
+Stdlib only.
+"""
+import argparse
+import glob
+import json
+import os
+import re
+import statistics
+
+TARGETS = ["baseline", "gomodel", "litellm", "portkey", "bifrost"]
+VARIANTS = [
+    ("chat", "nonstream"), ("chat", "stream"),
+    ("responses", "nonstream"), ("responses", "stream"),
+    ("messages", "nonstream"), ("messages", "stream"),
+]
+
+
+def load(path):
+    try:
+        with open(path) as f:
+            return json.load(f)
+    except (OSError, ValueError):
+        return None
+
+
+def run_dirs(rd):
+    """Trial dirs: run* subdirs if present, else the flat dir (single trial)."""
+    runs = sorted((p for p in glob.glob(os.path.join(rd, "run*")) if os.path.isdir(p)),
+                  key=lambda p: int(re.sub(r"\D", "", os.path.basename(p)) or 0))
+    return runs or [rd]
+
+
+def med(xs):
+    xs = [x for x in xs if isinstance(x, (int, float))]
+    return statistics.median(xs) if xs else None
+
+
+def spread(xs):
+    xs = [x for x in xs if isinstance(x, (int, float))]
+    return (min(xs), max(xs)) if xs else (None, None)
+
+
+def fnum(v, dp=2):
+    try:
+        return f"{float(v):.{dp}f}"
+    except (TypeError, ValueError):
+        return "—"
+
+
+# ── latency aggregation ───────────────────────────────────────────────────────
+def collect(runs, target, dialect, mode):
+    """All trial summaries for one (target, variant)."""
+    out = []
+    for r in runs:
+        d = load(os.path.join(r, f"{target}_{dialect}_{mode}.json"))
+        if d:
+            out.append(d)
+    return out
+
+
+def agg_variant(trials):
+    """Median (+ spread) of the metrics we care about across trials."""
+    def field(path):
+        vals = []
+        for t in trials:
+            cur = t
+            for k in path:
+                cur = (cur or {}).get(k) if isinstance(cur, dict) else None
+            vals.append(cur)
+        return vals
+
+    p99s = field(["total_latency", "p99_ms"])
+    rpss = field(["rps"])
+    return {
+        "trials": len(trials),
+        "ok": sum(t.get("ok", 0) for t in trials),
+        "failed": sum(t.get("failed", 0) for t in trials),
+        "rps": med(rpss), "rps_spread": spread(rpss),
+        "p50": med(field(["total_latency", "p50_ms"])),
+        "p90": med(field(["total_latency", "p90_ms"])),
+        "p99": med(p99s), "p99_spread": spread(p99s),
+        "ttft_p50": med(field(["ttft", "p50_ms"])),
+        "gap_p50": med(field(["inter_chunk", "p50_ms"])),
+        "gap_p99": med(field(["inter_chunk", "p99_ms"])),
+    }
+
+
+# ── capacity sweep ────────────────────────────────────────────────────────────
+def sweep_curve(rd, target):
+    """{concurrency: rps} for a target, read from results/sweep/<t>_c<cc>.json."""
+    curve = {}
+    for p in glob.glob(os.path.join(rd, "sweep", f"{target}_c*.json")):
+        m = re.search(r"_c(\d+)\.json$", p)
+        d = load(p)
+        if m and d and isinstance(d.get("rps"), (int, float)):
+            curve[int(m.group(1))] = d["rps"]
+    return dict(sorted(curve.items()))
+
+
+def sweep_stats(curve):
+    if not curve:
+        return {}
+    peak_c = max(curve, key=curve.get)
+    peak = curve[peak_c]
+    # saturation knee: lowest concurrency reaching >=95% of peak rps.
+    knee = next((c for c in sorted(curve) if curve[c] >= 0.95 * peak), peak_c)
+    return {"peak_rps": peak, "peak_c": peak_c, "knee_c": knee, "curve": curve}
+
+
+def main():
+    ap = argparse.ArgumentParser()
+    ap.add_argument("--results-dir", required=True)
+    args = ap.parse_args()
+    rd = args.results_dir
+
+    meta = load(os.path.join(rd, "meta.json")) or {}
+    runs = run_dirs(rd)
+    present = sorted({os.path.basename(p).split("_")[0]
+                      for p in glob.glob(os.path.join(runs[0], "*_*_*.json"))})
+    targets = [t for t in TARGETS if t in present]
+
+    summary = {"meta": meta, "trials": len(runs), "latency": {}, "capacity": {}, "resources": {}}
+
+    print("\n" + "=" * 86)
+    print("GATEWAY BENCHMARK SUMMARY")
+    print("=" * 86)
+    if meta:
+        print(f"instance={meta.get('instance_type')} cpus={meta.get('cpus')} "
+              f"N={meta.get('n_requests')} c={meta.get('concurrency')} "
+              f"trials={meta.get('repeats', len(runs))}")
+    print(f"(latency = median across {len(runs)} trial(s); 'p99 range' = min–max across trials)")
+
+    # ── Latency (median across trials) ─────────────────────────────────────────
+    base_p50 = {}
+    for dialect, mode in VARIANTS:
+        b = agg_variant(collect(runs, "baseline", dialect, mode))
+        base_p50[(dialect, mode)] = b.get("p50")
+
+    print("\nLATENCY  (ms; rps = completed req/s @ c={})".format(meta.get("concurrency", "?")))
+    hdr = (f"{'target':9} {'variant':18} {'ok/fail':>11} {'rps':>7} {'p50':>7} "
+           f"{'p99':>7} {'p99 range':>15} {'ttft':>7} {'ovhd':>7}")
+    print(hdr); print("-" * len(hdr))
+    for t in targets:
+        summary["latency"][t] = {}
+        for dialect, mode in VARIANTS:
+            a = agg_variant(collect(runs, t, dialect, mode))
+            if not a["trials"]:
+                continue
+            key = f"{dialect}/{mode}"
+            ovhd = (a["p50"] - base_p50[(dialect, mode)]
+                    if a["p50"] is not None and base_p50.get((dialect, mode)) is not None else None)
+            lo, hi = a["p99_spread"]
+            rng = f"{fnum(lo)}–{fnum(hi)}" if lo is not None else "—"
+            print(f"{t:9} {key:18} {str(a['ok'])+'/'+str(a['failed']):>11} "
+                  f"{fnum(a['rps'],0):>7} {fnum(a['p50']):>7} {fnum(a['p99']):>7} "
+                  f"{rng:>15} {fnum(a['ttft_p50']):>7} {fnum(ovhd):>7}")
+            a["overhead_p50"] = ovhd
+            summary["latency"][t][key] = a
+        print()
+
+    # ── Capacity sweep ─────────────────────────────────────────────────────────
+    print("CAPACITY  (chat non-stream; sustained req/s by concurrency)")
+    sweep_targets = [t for t in TARGETS if sweep_curve(rd, t)]
+    if sweep_targets:
+        concs = sorted({c for t in sweep_targets for c in sweep_curve(rd, t)})
+        hdrc = f"{'target':9} " + " ".join(f"c{c:>6}" for c in concs) + f" {'peak':>8} {'@c':>4} {'knee':>5}"
+        print(hdrc); print("-" * len(hdrc))
+        for t in sweep_targets:
+            curve = sweep_curve(rd, t)
+            s = sweep_stats(curve)
+            row = f"{t:9} " + " ".join(f"{fnum(curve.get(c), 0):>7}" for c in concs)
+            row += f" {fnum(s['peak_rps'],0):>8} {s['peak_c']:>4} {s['knee_c']:>5}"
+            print(row)
+            summary["capacity"][t] = s
+    else:
+        print("  (no sweep data)")
+    print()
+
+    # ── Resources / footprint ──────────────────────────────────────────────────
+    print("RESOURCES  (per gateway; img_zip = compressed pull size)")
+    hdr2 = (f"{'gateway':9} {'img_zip':>8} {'img_disk':>9} {'startup_s':>10} {'idle_mb':>9} "
+            f"{'peak_mb':>9} {'avg_cpu%':>9} {'load_rps':>9} {'rps/cpu%':>9}")
+    print(hdr2); print("-" * len(hdr2))
+    for t in [x for x in targets if x != "baseline"]:
+        img = load(os.path.join(rd, f"{t}_image.json")) or {}
+        res = load(os.path.join(rd, f"{t}_resources.json")) or {}
+        startup = load(os.path.join(rd, f"{t}_startup.json")) or {}
+        ul = res.get("under_load", {})
+        load_rps = res.get("load_rps") or 0
+        cpu = ul.get("avg_cpu_pct") or 0
+        eff = (load_rps / cpu) if cpu else None
+        print(f"{t:9} {fnum(img.get('compressed_mb'),1):>8} {fnum(img.get('size_mb'),1):>9} "
+              f"{fnum(startup.get('startup_s'),2):>10} "
+              f"{fnum(res.get('idle_mem_mb'),1):>9} {fnum(ul.get('peak_mem_mb'),1):>9} "
+              f"{fnum(cpu,1):>9} {fnum(load_rps,0):>9} {fnum(eff,1):>9}")
+        summary["resources"][t] = {"image": img, "resources": res, "startup": startup,
+                                   "rps_per_cpu_pct": eff}
+    print()
+
+    out = os.path.join(rd, "summary.json")
+    with open(out, "w") as f:
+        json.dump(summary, f, indent=2)
+    md = write_markdown(rd, meta, runs, targets)
+    print(f"wrote {out}\nwrote {md}")
+
+
+def write_markdown(rd, meta, runs, targets):
+    """Emit clean GitHub-flavored Markdown tables."""
+    L = ["# Gateway Benchmark Summary\n"]
+    if meta:
+        L.append(f"`instance={meta.get('instance_type')} cpus={meta.get('cpus')} "
+                 f"N={meta.get('n_requests')} c={meta.get('concurrency')} "
+                 f"trials={meta.get('repeats', len(runs))}`\n")
+    L.append(f"_Latency = median across {len(runs)} trial(s); p99 shows the min–max "
+             "across trials. rps in the latency table is completed req/s at the "
+             "fixed concurrency (latency-coupled); see the capacity table for "
+             "sustained throughput._\n")
+
+    base_p50 = {(d, m): agg_variant(collect(runs, "baseline", d, m)).get("p50")
+                for d, m in VARIANTS}
+
+    L.append("## Latency (ms, median of trials)\n")
+    L.append("| target | variant | ok/fail | rps | p50 | p90 | p99 | p99 min–max | ttft p50 | gap p50 | overhead p50 |")
+    L.append("|---|---|--:|--:|--:|--:|--:|--:|--:|--:|--:|")
+    for t in targets:
+        for dialect, mode in VARIANTS:
+            a = agg_variant(collect(runs, t, dialect, mode))
+            if not a["trials"]:
+                continue
+            ovhd = (a["p50"] - base_p50[(dialect, mode)]
+                    if a["p50"] is not None and base_p50.get((dialect, mode)) is not None else None)
+            lo, hi = a["p99_spread"]
+            rng = f"{fnum(lo)}–{fnum(hi)}" if lo is not None else "—"
+            gap = fnum(a["gap_p50"]) if mode == "stream" else ""
+            ttft = fnum(a["ttft_p50"]) if mode == "stream" else ""
+            L.append(f"| {t} | {dialect}/{mode} | {a['ok']}/{a['failed']} | {fnum(a['rps'],0)} | "
+                     f"{fnum(a['p50'])} | {fnum(a['p90'])} | {fnum(a['p99'])} | {rng} | "
+                     f"{ttft} | {gap} | {fnum(ovhd)} |")
+    L.append("")
+
+    # capacity
+    sweep_targets = [t for t in TARGETS if sweep_curve(rd, t)]
+    if sweep_targets:
+        concs = sorted({c for t in sweep_targets for c in sweep_curve(rd, t)})
+        L.append("## Capacity (chat non-stream, sustained req/s by concurrency)\n")
+        L.append("| target | " + " | ".join(f"c={c}" for c in concs) + " | peak rps | @c | knee c |")
+        L.append("|---|" + "--:|" * (len(concs) + 3))
+        for t in sweep_targets:
+            curve = sweep_curve(rd, t)
+            s = sweep_stats(curve)
+            cells = " | ".join(fnum(curve.get(c), 0) for c in concs)
+            L.append(f"| {t} | {cells} | {fnum(s['peak_rps'],0)} | {s['peak_c']} | {s['knee_c']} |")
+        L.append("")
+
+    L.append("## Resources\n")
+    L.append("| gateway | image MB (compressed) | image MB (on-disk) | startup s | idle MB | peak MB | avg CPU % | load rps | rps/CPU% |")
+    L.append("|---|--:|--:|--:|--:|--:|--:|--:|--:|")
+    for t in [x for x in targets if x != "baseline"]:
+        img = load(os.path.join(rd, f"{t}_image.json")) or {}
+        res = load(os.path.join(rd, f"{t}_resources.json")) or {}
+        startup = load(os.path.join(rd, f"{t}_startup.json")) or {}
+        ul = res.get("under_load", {})
+        load_rps = res.get("load_rps") or 0
+        cpu = ul.get("avg_cpu_pct") or 0
+        eff = (load_rps / cpu) if cpu else None
+        L.append(f"| {t} | {fnum(img.get('compressed_mb'),1)} | {fnum(img.get('size_mb'),1)} | "
+                 f"{fnum(startup.get('startup_s'),2)} | "
+                 f"{fnum(res.get('idle_mem_mb'),1)} | {fnum(ul.get('peak_mem_mb'),1)} | "
+                 f"{fnum(cpu,1)} | {fnum(load_rps,0)} | {fnum(eff,1)} |")
+    L.append("")
+
+    path = os.path.join(rd, "summary.md")
+    with open(path, "w", encoding="utf-8") as f:
+        f.write("\n".join(L))
+    return path
+
+
+if __name__ == "__main__":
+    main()
diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
new file mode 100644
index 00000000..cb1e5ae7
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
@@ -0,0 +1,148 @@
+terraform {
+  required_version = ">= 1.6"
+  required_providers {
+    aws = {
+      source  = "hashicorp/aws"
+      version = "~> 5.60"
+    }
+    tls = {
+      source  = "hashicorp/tls"
+      version = "~> 4.0"
+    }
+    local = {
+      source  = "hashicorp/local"
+      version = "~> 2.5"
+    }
+  }
+}
+
+provider "aws" {
+  region = var.region
+}
+
+# ── AMI: latest Amazon Linux 2023 x86_64 (override via var.ami_id) ──
+data "aws_ssm_parameter" "al2023" {
+  name = "/aws/service/ami-amazon-linux-latest/al2023-ami-kernel-default-x86_64"
+}
+
+locals {
+  ami_id = var.ami_id != "" ? var.ami_id : data.aws_ssm_parameter.al2023.value
+  # credit_specification is only valid for burstable T-family instances; on a
+  # fixed-performance type (c7i.large default) it must be omitted entirely.
+  is_burstable = can(regex("^t[0-9]", var.instance_type))
+}
+
+# ── Default VPC / subnet (free-tier friendly, no NAT) ──────────────
+data "aws_vpc" "default" {
+  default = true
+}
+
+data "aws_subnets" "default" {
+  filter {
+    name   = "vpc-id"
+    values = [data.aws_vpc.default.id]
+  }
+}
+
+# ── SSH keypair generated locally, written to disk for the runner ──
+resource "tls_private_key" "bench" {
+  algorithm = "ED25519"
+}
+
+resource "local_sensitive_file" "private_key" {
+  content         = tls_private_key.bench.private_key_openssh
+  filename        = "${path.module}/bench_key.pem"
+  file_permission = "0600"
+}
+
+resource "aws_key_pair" "bench" {
+  key_name_prefix = "gomodel-bench-"
+  public_key      = tls_private_key.bench.public_key_openssh
+  tags            = var.tags
+}
+
+# ── Security group: SSH only, from the operator's IP ───────────────
+resource "aws_security_group" "bench" {
+  name_prefix = "gomodel-bench-"
+  description = "SSH access for the gateway benchmark instance"
+  vpc_id      = data.aws_vpc.default.id
+
+  ingress {
+    description = "SSH"
+    from_port   = 22
+    to_port     = 22
+    protocol    = "tcp"
+    cidr_blocks = [var.ssh_ingress_cidr]
+  }
+
+  egress {
+    description = "All outbound"
+    from_port   = 0
+    to_port     = 0
+    protocol    = "-1"
+    cidr_blocks = ["0.0.0.0/0"]
+  }
+
+  tags = var.tags
+}
+
+# ── Instance bootstrap: install docker + compose plugin ────────────
+locals {
+  user_data = <<-EOF
+    #!/bin/bash
+    set -euxo pipefail
+
+    # 2 GiB swap: a 1 GiB free-tier instance can't hold memory-heavy gateways
+    # (LiteLLM idles near 1 GiB). Swap lets every gateway run so the memory
+    # comparison is complete; the reported RSS still exposes the difference.
+    if [ ! -f /swapfile ]; then
+      fallocate -l 2G /swapfile || dd if=/dev/zero of=/swapfile bs=1M count=2048
+      chmod 600 /swapfile
+      mkswap /swapfile
+      swapon /swapfile
+      echo '/swapfile none swap sw 0 0' >> /etc/fstab
+    fi
+
+    dnf update -y
+    dnf install -y docker git
+    systemctl enable --now docker
+    usermod -aG docker ec2-user
+
+    # Docker Compose v2 plugin (pinned).
+    mkdir -p /usr/libexec/docker/cli-plugins
+    curl -fsSL -o /usr/libexec/docker/cli-plugins/docker-compose \
+      "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64"
+    chmod +x /usr/libexec/docker/cli-plugins/docker-compose
+
+    # Readiness marker the orchestrator polls for.
+    touch /home/ec2-user/.bootstrap-done
+  EOF
+}
+
+resource "aws_instance" "bench" {
+  ami                         = local.ami_id
+  instance_type               = var.instance_type
+  key_name                    = aws_key_pair.bench.key_name
+  vpc_security_group_ids      = [aws_security_group.bench.id]
+  subnet_id                   = tolist(data.aws_subnets.default.ids)[0]
+  associate_public_ip_address = true
+  user_data                   = local.user_data
+
+  # Only burstable (T-family) instances accept a credit specification. Standard
+  # credits avoid surprise burst charges there; fixed-performance types (the
+  # c7i.large default) omit this block entirely — and have no credit drift, which
+  # is exactly why they make the better latency reference.
+  dynamic "credit_specification" {
+    for_each = local.is_burstable ? [1] : []
+    content {
+      cpu_credits = "standard"
+    }
+  }
+
+  root_block_device {
+    volume_type = "gp3"
+    volume_size = var.root_volume_gb
+  }
+
+  tags = merge(var.tags, { Name = "gomodel-gateway-benchmark" })
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
new file mode 100644
index 00000000..bee4ba70
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
@@ -0,0 +1,29 @@
+output "public_ip" {
+  description = "Public IPv4 of the benchmark instance."
+  value       = aws_instance.bench.public_ip
+}
+
+output "public_dns" {
+  description = "Public DNS of the benchmark instance."
+  value       = aws_instance.bench.public_dns
+}
+
+output "ssh_user" {
+  description = "SSH login user for Amazon Linux 2023."
+  value       = "ec2-user"
+}
+
+output "ssh_private_key_path" {
+  description = "Absolute path to the generated private key."
+  value       = abspath(local_sensitive_file.private_key.filename)
+}
+
+output "instance_id" {
+  value = aws_instance.bench.id
+}
+
+output "ami_id" {
+  description = "Resolved AMI id used (record for reproducibility)."
+  # SSM-resolved public AMI alias is not secret; unwrap so it can be recorded.
+  value = nonsensitive(local.ami_id)
+}
diff --git a/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf b/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
new file mode 100644
index 00000000..afbc3381
--- /dev/null
+++ b/docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
@@ -0,0 +1,50 @@
+variable "region" {
+  description = "AWS region to provision the benchmark instance in."
+  type        = string
+  default     = "us-east-1"
+}
+
+variable "instance_type" {
+  description = <<-EOT
+    EC2 instance type. Default c7i.large (2 vCPU, 4 GiB, non-burstable) gives a
+    stable tail with no CPU-credit drift — the right reference for latency/p99.
+    It is NOT free-tier eligible (~$0.09/hr on-demand in us-east-1). For a
+    free-tier run set instance_type=t2.micro (1 vCPU, burstable) explicitly;
+    treat its absolute latencies as indicative only.
+  EOT
+  type        = string
+  default     = "c7i.large"
+}
+
+variable "ssh_ingress_cidr" {
+  description = "CIDR allowed to SSH in. Set to <your-ip>/32. The default, 0.0.0.0/0, is fully open (NOT recommended)."
+  type        = string
+  default     = "0.0.0.0/0"
+}
+
+variable "ami_id" {
+  description = "Override the AMI. Empty = latest Amazon Linux 2023 x86_64 via SSM (reproducible by policy, not by digest)."
+  type        = string
+  default     = ""
+}
+
+variable "root_volume_gb" {
+  description = "Root EBS volume size (GiB). Free tier allows up to 30 GiB."
+  type        = number
+  default     = 20
+}
+
+variable "compose_plugin_version" {
+  description = "Pinned Docker Compose v2 plugin version installed via user-data."
+  type        = string
+  default     = "v2.29.7"
+}
+
+variable "tags" {
+  description = "Tags applied to all resources."
+  type        = map(string)
+  default = {
+    Project = "gomodel-gateway-benchmark"
+    Owner   = "benchmark"
+  }
+}
diff --git a/docs/about/benchmarks.mdx b/docs/about/benchmarks.mdx
index 612b1925..cca49c8c 100644
--- a/docs/about/benchmarks.mdx
+++ b/docs/about/benchmarks.mdx
@@ -1,134 +1,99 @@
 ---
 title: "Benchmarks"
-description: "A summary of GoModel benchmark results, with links to full write-ups and methodology."
+description: "A short, up-to-date summary of GoModel benchmark results, with a link to the full write-up and the tooling to reproduce it."
 icon: "gauge"
 ---
 
 ## Benchmark snapshot
 
-This page is a short reference for one public benchmark run comparing GoModel
-and LiteLLM on OpenAI-compatible traffic.
+This page is a short reference for our latest public benchmark: GoModel against
+**LiteLLM, Portkey, and Bifrost**, all pointed at the same instant mock backend so
+the numbers reflect gateway overhead, not model latency.
 
-The full article contains the complete write-up, all charts, and the original
-discussion:
-[GoModel vs LiteLLM Benchmark: Speed, Throughput, and Resource Usage](https://enterpilot.io/blog/gomodel-vs-litellm-benchmark/).
+The full article has the complete write-up, all the context, and the charts:
+[AI Gateway Benchmark 2026: GoModel vs LiteLLM, Portkey & Bifrost](https://enterpilot.io/blog/gomodel-vs-litellm-portkey-bifrost-june-2026/).
 
 <Note>
-  This benchmark is a point-in-time snapshot published on March 5, 2026. Treat
-  it as data, not dogma. Gateway performance depends on workload, provider mix,
-  deployment setup, and tuning.
+  This is a point-in-time snapshot from a June 2026 run on AWS. Treat it as data,
+  not dogma. Gateway performance depends on your workload, provider mix, deployment
+  setup, and tuning. Older runs (March 2026, LiteLLM only, on localhost) are still
+  on the blog for history.
 </Note>
 
-## Visual snapshot
+## What we tested
 
-![Benchmark dashboard from the original blog post](./images/benchmark-dashboard.png)
+A simple, like-for-like setup:
 
-Chart source and full context:
-[Original benchmark post](https://enterpilot.io/blog/gomodel-vs-litellm-benchmark/).
+- One gateway at a time, in Docker, on an AWS `c7i.large` (2 vCPU, 4 GiB).
+- The same shared mock backend for everyone, so we measure only gateway overhead.
+- Six workloads: chat completions, the Responses API, and Anthropic messages -
+  each streaming and non-streaming.
+- `8,000` requests per workload at concurrency `10`, across two randomized-order
+  trials (latency is the median across them).
+- Fair config: retries off for everyone, GoModel's circuit breaker off, and
+  LiteLLM run at its recommended one worker per CPU core.
 
 ## At a glance
 
-In this benchmark run, GoModel came out ahead on the main operational signals
-most teams care about:
+GoModel came out ahead on every operational signal most teams care about:
+the tightest latency tail, the highest sustained throughput, the smallest image
+and memory, and the fastest cold start.
 
-- Added latency
-- Throughput under concurrency
-- CPU overhead
-- Memory overhead
+| Gateway | p50 (ms) | p99 (ms) | Throughput (req/s) | Peak RAM | Image (compressed) | Cold start |
+| --- | --- | --- | --- | --- | --- | --- |
+| **GoModel** | **`1.8`** | **`6.9`** | **`4,900`** | **`37 MB`** | **`16 MB`** | **`0.56 s`** |
+| Bifrost | `2.5` | `18.3` | `3,100` | `143 MB` | `77 MB` | `7.1 s` |
+| Portkey | `9.7` | `30.5` | `950` | `112 MB` | `59 MB` | `1.1 s` |
+| LiteLLM | `30.6` | `39.3` | `324` | `2.3 GB` | `372 MB` | `25.5 s` |
 
-## Test shape
-
-The comparison used a simple like-for-like setup:
-
-- OpenAI-compatible `/v1/chat/completions`
-- The same prompt and request shape on both sides
-- Concurrency levels of `1`, `4`, and `8`
-- A focus on clean runs with `0%` errors
-- Metrics including req/s, latency percentiles, CPU usage, and RSS memory
-
-This docs page keeps only the primary comparison matrix from the blog post.
-
-## Reference table
-
-| Gateway | Concurrency | Success | Error % | Req/s | p50 ms | p95 ms | p99 ms | CPU avg % | RSS avg MB |
-| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
-| GoModel | `1` | `12/12` | `0.00` | `9.61` | `86.4` | `141.1` | `144.4` | `0.81` | `45.4` |
-| GoModel | `4` | `12/12` | `0.00` | `44.66` | `56.1` | `139.5` | `139.5` | `0.23` | `46.0` |
-| GoModel | `8` | `12/12` | `0.00` | `52.75` | `98.4` | `130.6` | `131.1` | `1.13` | `46.0` |
-| LiteLLM | `1` | `12/12` | `0.00` | `8.64` | `96.2` | `190.3` | `213.9` | `9.21` | `320.3` |
-| LiteLLM | `4` | `12/12` | `0.00` | `36.82` | `104.7` | `149.5` | `149.5` | `5.20` | `320.8` |
-| LiteLLM | `8` | `12/12` | `0.00` | `35.81` | `188.7` | `244.4` | `244.9` | `5.95` | `321.5` |
+Latency is for non-streaming chat completions, the representative workload.
+Throughput is the sustained rate from a separate concurrency sweep. Image size
+is the compressed pull size.
 
 ## Key readouts
 
-Some useful reads from that March 5, 2026 run:
-
-- Lower p95 latency at every tested concurrency level.
-- Higher throughput across the benchmark matrix.
-- `45-46 MB` RSS, while LiteLLM stayed near `320-321 MB`.
-- Less CPU in these runs.
-
-At the highest tested concurrency, GoModel reached `52.75 req/s` versus
-LiteLLM at `35.81 req/s`.
+- GoModel has both the lowest median (`1.8 ms`) and the tightest tail (`6.9 ms`).
+- It pushes the most traffic per box (`~4,900 req/s`) and the most per CPU core.
+- It is the smallest to ship and run: a `16 MB` compressed image and `37 MB` of
+  RAM under load, ready to serve `0.56 s` after launch.
+- LiteLLM, even at its recommended multi-worker config, uses `~2.3 GB` of RAM and
+  takes `~25 s` to start - the cost of Python on the hot path.
+- Portkey did not serve the Anthropic messages dialect in this single-provider
+  setup, so it covers 4 of the 6 workloads.
 
 ## Reproduce it yourself
 
-All the tooling used in the published benchmark is available in this repository.
-
-### Prerequisites
-
-- Go 1.26.4+
-- Python 3.10+ with `matplotlib` and `numpy`
-- `jq`, `curl`
-- A Groq API key (or any OpenAI-compatible provider — adjust the script)
-- `litellm[proxy]` (`pip install "litellm[proxy]"`)
-
-### Scripts
+The whole thing is one command. It provisions a small AWS box, runs all four
+gateways against the same mock backend, prints the tables, and tears the
+infrastructure back down on its own.
 
-The benchmark suite lives in [`docs/about/benchmark-tools/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/about/benchmark-tools):
+<Warning>
+  This runs on **paid** AWS infrastructure, not the free tier. A `c7i.large` is
+  about $0.09/hour and the run self-destructs within an hour or two, so budget
+  **under $1** per run to be safe. If you pass `KEEP=1` or a teardown fails, you
+  keep paying until you destroy the box - so confirm it is gone.
+</Warning>
 
-| File | Purpose |
-| --- | --- |
-| [`compare.sh`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/compare.sh) | Builds GoModel, starts both gateways, runs the full benchmark matrix, and writes a `REPORT.md` |
-| [`bench_main.go`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/bench_main.go) | Source for the `bench` CLI that sends requests and collects latency + process metrics |
-| [`plot_benchmark_charts.py`](https://github.com/ENTERPILOT/GoModel/blob/main/docs/about/benchmark-tools/plot_benchmark_charts.py) | Generates per-metric charts and a combined dashboard from the JSON results |
-
-### Quick start
+The harness lives in the repo at
+[`docs/2026-06-25_aws_gateway_benchmark/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/2026-06-25_aws_gateway_benchmark):
 
 ```bash
-# 1. Clone GoModel and set up your .env with GROQ_API_KEY
+# Needs Docker, Terraform, and AWS credentials
 git clone https://github.com/ENTERPILOT/GoModel.git
-cd gomodel
-echo "GROQ_API_KEY=gsk_..." > .env
-
-# 2. Run the full comparison (builds GoModel, starts LiteLLM, benchmarks both)
-bash docs/about/benchmark-tools/compare.sh
-
-# 3. Generate charts from the latest result
-pip install matplotlib numpy
-python3 docs/about/benchmark-tools/plot_benchmark_charts.py benchmark-results/<timestamp>
-```
-
-The script creates a timestamped directory under `benchmark-results/` containing
-JSON result files, gateway logs, and a `REPORT.md` with the results table.
-
-### Tuning
-
-You can override defaults via environment variables:
-
-```bash
-REQUESTS=100 CONCURRENCIES="1 4 8 16" MAX_TOKENS=16 bash docs/about/benchmark-tools/compare.sh
+cd GoModel/docs/2026-06-25_aws_gateway_benchmark
+./run.sh
 ```
 
-See the top of `compare.sh` for the full list of knobs.
+Knobs like `N` (requests per workload) and `REPEATS` (trials) are env vars, e.g.
+`N=20000 REPEATS=5 ./run.sh` for a heavier run. For a quick local check against
+just LiteLLM, the older localhost harness is still in
+[`docs/about/benchmark-tools/`](https://github.com/ENTERPILOT/GoModel/tree/main/docs/about/benchmark-tools).
 
 ## Why this page is short
 
-This page is intentionally shorter and more operational than the blog version.
-
-It exists so docs readers can see the benchmark result quickly without reading a
-full article inside the product docs. If you want the full narrative, more
-charts, and the original context, use the source post.
+It is meant to give you the result fast, inside the product docs, without a full
+article. For the narrative, the charts, and the methodology details, read the
+[full post](https://enterpilot.io/blog/gomodel-vs-litellm-portkey-bifrost-june-2026/).
 
 No single benchmark settles the question for every environment. If you are
 evaluating gateways seriously, reproduce the test against your own traffic and