ENTERPILOT · SantiagoDePolonia · Jun 26, 2026 · Jun 26, 2026 · coderabbitai · Jun 26, 2026
diff --git a/docs/2026-06-25_aws_gateway_benchmark/.gitignore b/docs/2026-06-25_aws_gateway_benchmark/.gitignore
@@ -0,0 +1,9 @@
+# Benchmark outputs and local Terraform state / secrets — never commit.
+output/
+remote/results/
+terraform/.terraform/
+terraform/.terraform.lock.hcl
+terraform/*.tfstate
+terraform/*.tfstate.*
+terraform/bench_key.pem
+*.tar.gz
diff --git a/docs/2026-06-25_aws_gateway_benchmark/README.md b/docs/2026-06-25_aws_gateway_benchmark/README.md
@@ -0,0 +1,122 @@
+# AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost
+
+A reproducible, one-command benchmark that provisions a free-tier AWS instance,
+runs four AI gateways through identical workloads against a deterministic mock
+backend, measures latency and resource cost, and tears the infrastructure down.
+
+Because every gateway talks to the **same local mock backend**, the numbers
+reflect *gateway overhead*, not upstream model latency or network jitter.
+
+## What it compares
+
+Four OpenAI-compatible gateways, each pointed at the mock:
+
+| Gateway  | Image                                | How it reaches the mock |
+|----------|--------------------------------------|-------------------------|
+| GoModel  | built from this repo (`Dockerfile`)  | `OPENAI_BASE_URL` env   |
+| LiteLLM  | `ghcr.io/berriai/litellm:main-stable`| `configs/litellm-config.yaml` |
+| Portkey  | `portkeyai/gateway:latest`           | `x-portkey-custom-host` header (+ `TRUSTED_CUSTOM_HOSTS=mock`) |
+| Bifrost  | `maximhq/bifrost:latest`             | `configs/bifrost-config.json` (`network_config.base_url` + `allow_private_network`) |
+
+Per-gateway quirks the harness handles automatically (see `gw_model`/`gw_path` in
+`run-on-instance.sh`): Bifrost needs an explicit `openai/`-prefixed model, serves
+the Anthropic dialect at `/anthropic/v1/messages` (not `/v1/messages`), and must
+allow private-network egress to reach the mock.
+
+### Workloads — 6 variants
+
+The common denominator across OpenAI-compatible gateways, in both modes:
+
+| Dialect   | Endpoint               | non-stream | stream |
+|-----------|------------------------|:----------:|:------:|
+| Chat      | `/v1/chat/completions` | ✓ | ✓ |
+| Responses | `/v1/responses`        | ✓ | ✓ |
+| Messages  | `/v1/messages` (Anthropic) | ✓ | ✓ |
+
+A **baseline** (load sent straight to the mock, no gateway) runs first as the
+latency floor. Variants a gateway does not implement are reported as failures
+rather than silently skipped — e.g. Portkey's OSS gateway does not serve the
+Anthropic Messages dialect here, so its messages variants fail; that asymmetry is
+the finding. Streaming uses a terminal-marker **or idle-gap** end-of-stream
+detection (`loadgen -idle`), so a gateway that streams content without sending a
+terminal event (Bifrost) is still measured to last-byte rather than hanging.
+
+### Metrics captured
+
+- **Latency** — total-latency p50/p90/p95/p99, plus **TTFT** (time to first
+  token) for streaming, and throughput (RPS). Driven by the `loadgen` tool.
+- **Docker image size** — `docker image inspect` size + repo digest per gateway.
+- **Memory** — idle RSS after warmup and peak RSS under load (`docker stats`).
+- **CPU** — average CPU% under load (`docker stats`).
+
+## Layout
+
+```
+terraform/            free-tier EC2 + SSH key + security group (apply/destroy)
+remote/               everything shipped to and run on the instance
+  bench-tools/        Go mock backend + loadgen (one small image)
+  configs/            litellm and bifrost configs
+  docker-compose.yml  mock + one gateway per profile (benchnet network)
+  run-on-instance.sh  builds images, runs 6 variants x N gateways, samples stats
+scripts/summarize.py  raw JSON -> latency + resource tables + summary.json
+run.sh                orchestrator: build -> apply -> run -> collect -> destroy
+```
+
+## Prerequisites
+
+- AWS credentials configured (`aws sts get-caller-identity` works)
+- Terraform ≥ 1.6, Docker (with `buildx`), `rsync`, `ssh`, Python 3
+- An AWS account with default VPC in the chosen region
+
+## Run it
+
+```bash
+cd docs/2026-06-25_aws_gateway_benchmark
+./run.sh                         # full run in us-east-1, then auto-destroy
+N=1000 C=20 ./run.sh             # heavier load
+REGION=eu-west-1 ./run.sh        # different region
+GATEWAYS="gomodel litellm" ./run.sh   # subset
+KEEP=1 ./run.sh                  # leave the instance up for debugging
+```
+
+`run.sh` tears the instance down via an EXIT trap, even on failure, unless
+`KEEP=1` is set. If a run is interrupted before teardown, reconcile manually:
+
+```bash
+cd terraform && terraform destroy -auto-approve
+```
+
+Results land in `output/<timestamp>/` (raw per-variant JSON, `summary.json`,
+and the printed `summary.txt` table).
+
+## Local dry-run (no AWS)
+
+The instance-side harness runs on any Docker host:
+
+```bash
+cd remote && N=30 C=5 GATEWAYS="gomodel litellm portkey bifrost" ./run-on-instance.sh
+```
+
+(Build the GoModel image first: `docker build -t gomodel-bench:local ../../..`)
+
+## Reproducibility & caveats
+
+- **Pinned**: gateway image refs (by tag; override via `*_IMAGE` env), the Compose
+  plugin version, instance type, and the deterministic mock payload. Tags such as
+  `latest` can move, so exact **digests** are recorded in each `*_image.json` for traceability.
+- **AMI** resolves to the latest Amazon Linux 2023 via SSM (reproducible by
+  policy). Pin `var.ami_id` for a byte-identical OS.
+- **Free tier**: defaults to **t2.micro** — the 12-month-free-tier instance in
+  us-east-1 — with `standard` CPU credits (no surprise burst charges), a 20 GiB
+  gp3 root volume (free tier allows 30 GiB), the default VPC (no paid NAT/EIP),
+  and an Amazon Linux 2023 AMI. Image pulls are inbound traffic (free). In
+  regions where t2.micro is unavailable, set `INSTANCE_TYPE=t3.micro` (the
+  free-tier instance there). Newer accounts on AWS's credit-based free plan stay
+  within credit for a single short run.
+- **t2.micro is burstable** (1 vCPU, CPU-credit throttled). Treat absolute
+  latency as *indicative*; the value is the *relative* comparison on identical
+  hardware. Gateways run **one at a time** so they never contend, and the load is
+  kept modest (N=500, c=10) to stay within launch credits. For production-grade
+  absolute numbers, set `INSTANCE_TYPE=c7i.large` (not free tier).
+- **Cost**: a single free-tier instance for ~15–30 min — $0 within free-tier
+  allowance, otherwise a few cents.
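
The README's end-of-stream rule (terminal marker **or** idle gap, `loadgen -idle`) could look roughly like the sketch below. It is written in Python for illustration only and is not the Go `loadgen` code. `measure_stream`, `StreamResult`, `FakeStream`, and the marker list are hypothetical names and assumptions, and a real client would read from a socket with a read timeout instead of a scripted fake.

```python
import time
from dataclasses import dataclass
from typing import Optional

# Assumed terminal markers, one per dialect; the real loadgen may check others.
TERMINAL_MARKERS = (
    b"data: [DONE]",        # OpenAI chat completions
    b"response.completed",  # OpenAI responses
    b"message_stop",        # Anthropic messages
)


@dataclass
class StreamResult:
    ttft: Optional[float]  # seconds to first chunk, None if nothing arrived
    total: float           # seconds to the last byte actually received
    ended_by: str          # "marker", "idle", or "eof"


def measure_stream(read_chunk, idle_gap, clock=time.monotonic):
    """read_chunk(timeout) returns bytes, b"" on EOF, or raises TimeoutError."""
    start = clock()
    first = None
    last = start
    while True:
        try:
            chunk = read_chunk(idle_gap)
        except TimeoutError:
            ended_by = "idle"  # silence longer than the idle gap ends the stream
            break
        if not chunk:
            ended_by = "eof"
            break
        now = clock()
        if first is None:
            first = now
        last = now
        # Simplification: a marker split across two chunks would be missed.
        if any(marker in chunk for marker in TERMINAL_MARKERS):
            ended_by = "marker"
            break
    ttft = None if first is None else first - start
    # Total is measured to the last byte, so the idle wait itself is excluded.
    return StreamResult(ttft, last - start, ended_by)


class FakeStream:
    """Scripted stand-in for a socket: (delay_seconds, chunk); chunk None = stall."""

    def __init__(self, script):
        self.script = list(script)
        self.now = 0.0

    def clock(self):
        return self.now

    def read(self, timeout):
        if not self.script:
            return b""
        delay, chunk = self.script.pop(0)
        if chunk is None or delay > timeout:
            self.now += timeout
            raise TimeoutError("no data within idle gap")
        self.now += delay
        return chunk


# A stream that ends with a terminal marker.
clean = FakeStream([(0.25, b"data: {}\n\n"), (0.25, b"data: [DONE]\n\n")])
clean_result = measure_stream(clean.read, idle_gap=1.0, clock=clean.clock)

# A stream that sends content and then goes silent without a terminal event.
stalled = FakeStream([(0.25, b"data: {}\n\n"), (0.5, b"data: {}\n\n"), (0.0, None)])
stalled_result = measure_stream(stalled.read, idle_gap=1.0, clock=stalled.clock)

print(clean_result)
print(stalled_result)
```

The second case mirrors the Bifrost behaviour described in the README: the request is closed out by the idle gap, and its total latency is taken at the last received byte rather than at the moment the timeout fires.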
diff --git a/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md b/docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
@@ -0,0 +1,135 @@
+# Results — 2026-06-25 (AWS c7i.large run)
+
+Reference run produced by `./run.sh` (raw data in `output/20260625-182538/`,
+machine summary in that dir's `summary.md` / `summary.json`). Four gateways:
+**GoModel, LiteLLM, Portkey, Bifrost**.
+
+- **Host**: AWS EC2 **c7i.large** (2 vCPU, 4 GiB, **non-burstable** — no CPU-credit
+  drift, so the tail is stable), Amazon Linux 2023, us-east-1.
+- **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials**
+  (latency = median across trials; p99 shown with its min–max), 200-request
+  process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s
+  resource window, capacity sweep at c∈{1,16,128}. Shared in-process **mock**
+  backend, so every number is **gateway overhead**, not model latency.
+- **Parity**: retries disabled on every gateway, GoModel's circuit breaker disabled
+  (so the sweep can't trip it), and **LiteLLM run at its recommended worker count —
+  one worker per CPU core (`num_workers=2` on this 2-vCPU box)** so it isn't pinned
+  to a single core while the Go gateways use both.
+- Images (digests in `*_image.json`): GoModel (built from this repo), latest
+  `litellm:main-stable`, `portkeyai/gateway:latest`, `maximhq/bifrost:latest`.
+
+> Fast reference run (N=8000 × 2 trials) sized to finish end-to-end in well under
+> 20 minutes; the p99 min–max spreads are tight, so the medians are stable. Raise
+> `N`/`REPEATS` for a heavier run.
+
+## Latency — non-streaming (ms, median of trials)
+
+| Workload  | metric | baseline | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|---------:|--------:|--------:|--------:|--------:|
+| chat      | p50 | 0.23 | **1.81** | 2.51 | 9.70 | 30.56 |
+| chat      | p99 | 2.77 | **6.88** | 18.27 | 30.54 | 39.26 |
+| responses | p50 | 0.26 | **2.01** | 2.73 | 9.07 | 39.12 |
+| responses | p99 | 2.33 | **7.28** | 16.55 | 26.92 | 48.60 |
+| messages  | p50 | 0.26 | **1.76** | 2.65 | ✗ | 61.06 |
+| messages  | p99 | 2.23 | **6.59** | 19.08 | ✗ | 98.12 |
+
+**GoModel has the lowest p50 and the tightest p99** on every workload (chat p99:
+~7 ms vs Bifrost ~18 ms, Portkey ~31 ms, LiteLLM ~39 ms). Chat `overhead p50`
+(gateway p50 − baseline p50): GoModel ≈ 1.6 ms, Bifrost ≈ 2.3 ms, Portkey ≈ 9.5 ms, LiteLLM ≈ 30 ms.
+
+## Latency — streaming (ms, median of trials)
+
+| Workload  | metric | GoModel | Bifrost | Portkey | LiteLLM |
+|-----------|--------|--------:|--------:|--------:|--------:|
+| chat      | TTFT p50 | **4.71** | 9.02 | 27.97 | 151.94 |
+| chat      | total p50 | **4.95** | 11.89 | 27.98 | 151.95 |
+| responses | TTFT p50 | **4.69** | 12.87 | 27.90 | 47.53 |
+| responses | total p50 | **5.00** | 14.94 | 27.93 | 47.55 |
+| messages  | TTFT p50 | **7.50** | †      | ✗ | 48.86 |
+| messages  | total p50 | **8.38** | †      | ✗ | 48.89 |
+
+† **Bifrost messages-stream is an idle-bound artifact, not a throughput number**
+(no terminal event over a non-native backend → 0 completions within the 10 s cap).
+
+## Throughput / capacity (chat non-stream, sustained req/s by concurrency)
+
+| target | c=1 | c=16 | c=128 | peak | knee |
+|--------|----:|-----:|------:|-----:|-----:|
+| baseline | 15510 | 29701 | 30015 | **30015** | 16 |
+| GoModel  | 2745 | 4928 | 4567 | **4928** | 16 |
+| Bifrost  | 1885 | 3088 | 2904 | **3088** | 16 |
+| Portkey  | 636 | 946 | 900 | **946** | 16 |
+| LiteLLM  | 227 | 324 | 254 | **324** | 16 |
+
+GoModel tops the gateways at **~4900 req/s**, ~1.6× Bifrost, ~5× Portkey, ~15×
+LiteLLM. All saturate by c=16 on 2 vCPUs.
+
+## Resources
+
+| Metric | GoModel | Portkey | Bifrost | LiteLLM |
+|--------|--------:|--------:|--------:|--------:|
+| Docker image, compressed pull (MB) | **16** | 59 | 77 | 372 |
+| Docker image, on-disk (MB)      | **47.2** | 177.4 | 230.7 | 1159.9 |
+| Cold start to first 200 (s) | **0.56** | 1.05 | 7.07 | 25.49 |
+| Peak RSS under load (MB)| **37.0** | 112.0 | 143.0 | 2272.3 |
+| Avg CPU under load (%) | 92.6 | 116.9 | 117.6 | 101.1 |
+| Sustained req/s (resource window) | **4824** | 960 | 2977 | 261 |
+| Efficiency (req/s per CPU %) | **52.1** | 8.2 | 25.3 | 2.6 |
+
+GoModel is the most CPU-efficient (**52 req/s per CPU-%**, ~2× Bifrost, ~6×
+Portkey, ~20× LiteLLM), the smallest image (**47 MB**), the smallest footprint
+(**37 MB** peak), and the fastest cold start (**0.56 s**).
+
+> **LiteLLM at its recommended config.** With `num_workers=2` (one per core) LiteLLM
+> is faster and higher-throughput than the earlier single-worker run (≈220 → 324
+> req/s; chat p50 ≈ 44 → 31 ms — a single worker was queuing the 10 concurrent
+> requests), but its **memory doubled to ~2.3 GB** (two ~1 GB worker processes) and
+> its **cold start rose to ~25 s**. Running LiteLLM "properly" widens the resource
+> gap rather than narrowing it.
+
+## Feature coverage (6 variants)
+
+| Gateway | chat | responses | messages | total |
+|---------|:----:|:---------:|:--------:|:-----:|
+| GoModel | ✓ | ✓ | ✓ | 6/6 |
+| LiteLLM | ✓ | ✓ | ✓ | 6/6 |
+| Bifrost | ✓ | ✓ | ✓† | 6/6 |
+| Portkey | ✓ | ✓ | ✗ | 4/6 |
+
+- **Portkey** errors on the Anthropic `/v1/messages` dialect in this single-provider
+  (openai → mock) setup; this is a setup limitation, not a hard capability gap.
+- **Bifrost** serves Anthropic at `/anthropic/v1/messages`, needs an `openai/`-prefixed
+  model and `allow_private_network:true`; messages-streaming has the caveat above (†).
+
+## Takeaways
+
+- **GoModel** — best all-rounder: lowest p50 and tightest p99 (~7 ms), highest
+  gateway throughput (~4900 req/s), best CPU efficiency (52 req/s per %), smallest
+  image (47 MB) and memory (37 MB), fastest cold start (0.56 s), full 6/6 coverage.
+- **Bifrost** (Go) — second on throughput, low p50 but a heavier p99 tail; streaming
+  terminal-event gaps over a non-native backend.
+- **Portkey** (Node) — middle tier; no Anthropic Messages in this setup.
+- **LiteLLM** (Python) — full coverage, but even at its recommended 2-worker config
+  it is ~15× behind on throughput and carries a **1.16 GB image + ~2.3 GB RAM + ~25 s
+  cold start**. The cost of Python on the hot path.
+
+## Methodology notes
+
+- **Repeats + spread** — 2 trials, randomized gateway order each trial; latency is
+  the median across trials, p99 carries its min–max.
+- **Config parity** — retries off on all; GoModel's circuit breaker disabled (a few
+  transient errors under the c=128 sweep would otherwise trip it and blanket-503 its
+  own capacity); **LiteLLM at one worker per core (`num_workers`=vCPUs)**, its own
+  production recommendation, set automatically from `nproc`.
+- **Warm-up** — 200 global + 50 per-variant requests; the per-variant warmup
+  neutralizes LiteLLM's lazy per-dialect imports and, with >1 worker, warms each
+  worker before measuring.
+- **Throughput vs latency separated** — capacity comes from a time-boxed concurrency
+  sweep, not the latency-coupled rps in the latency tables.
+- **Per-variant wall cap (10 s)** — bounds idle-bound streaming variants; cap-aborted
+  requests are reported as `capped`, not `failed`.
+- **Resilient orchestration** — the remote benchmark runs detached (`setsid`) and the
+  orchestrator polls for the `meta.json` sentinel, so an SSH drop can't kill or hang
+  the run; the script uses `set -uo` without `-e`, so one flaky variant is skipped rather than aborting the run.
+- Reproduce with `./run.sh`; pin `var.ami_id` and the `*_IMAGE` digests for a
+  byte-identical rerun. Heavier run: `N=20000 REPEATS=5 ./run.sh`.
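
The derived figures in RESULTS.md (`overhead p50` and req/s per CPU %) can be rechecked from the tables with a few lines of Python. This is a hedged sketch, not the logic of `scripts/summarize.py`: the function and variable names are hypothetical, and the inputs are simply copied from the chat non-stream and resource tables above.

```python
# Figures copied from the RESULTS.md tables (chat non-stream p50, resource window).
BASELINE_CHAT_P50_MS = 0.23
CHAT_P50_MS = {"GoModel": 1.81, "Bifrost": 2.51, "Portkey": 9.70, "LiteLLM": 30.56}
RESOURCE_WINDOW = {  # name -> (sustained req/s, average CPU % under load)
    "GoModel": (4824, 92.6),
    "Portkey": (960, 116.9),
    "Bifrost": (2977, 117.6),
    "LiteLLM": (261, 101.1),
}


def overhead_ms(gateway_p50_ms, baseline_p50_ms):
    """Gateway overhead: gateway p50 minus the no-gateway baseline p50."""
    return gateway_p50_ms - baseline_p50_ms


def efficiency(rps, cpu_pct):
    """Requests per second delivered per CPU percentage point."""
    return rps / cpu_pct


overhead = {
    name: round(overhead_ms(p50, BASELINE_CHAT_P50_MS), 2)
    for name, p50 in CHAT_P50_MS.items()
}
eff = {
    name: round(efficiency(rps, cpu), 1)
    for name, (rps, cpu) in RESOURCE_WINDOW.items()
}
# How many times more CPU-efficient GoModel is than each other gateway.
ratio_vs_gomodel = {
    name: round(eff["GoModel"] / value, 1)
    for name, value in eff.items()
    if name != "GoModel"
}

for name in CHAT_P50_MS:
    print(f"{name}: overhead p50 = {overhead[name]} ms, efficiency = {eff[name]} req/s per CPU %")
print(ratio_vs_gomodel)
```

Run against the published numbers, the efficiencies agree with the Resources table to one decimal place, and the ratios land close to the quoted ~2×, ~6×, and ~20×.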
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
@@ -0,0 +1,18 @@
+# Builds the mock backend and load generator into one small static image.
+# Both binaries live in the final image; the compose service / docker run
+# command selects which one to execute.
+FROM golang:1.26-alpine AS build
+WORKDIR /src
+COPY go.mod ./
+COPY mock ./mock
+COPY loadgen ./loadgen
+RUN CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/mock ./mock \
+ && CGO_ENABLED=0 go build -ldflags="-s -w" -o /out/loadgen ./loadgen
+
+FROM gcr.io/distroless/static-debian12:nonroot
+COPY --from=build /out/mock /mock
+COPY --from=build /out/loadgen /loadgen
+# No ENTRYPOINT: each invocation picks the binary as its command, e.g.
+#   docker run img /mock          (compose `command: ["/mock"]`)
+#   docker run img /loadgen -url …
+CMD ["/mock"]
diff --git a/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod b/docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
@@ -0,0 +1,3 @@
+module gomodel-bench-tools
+
+go 1.26