Skip to content

docs(benchmark): add AWS gateway benchmark and refresh benchmarks page#429

Merged
SantiagoDePolonia merged 1 commit into
mainfrom
docs/aws-gateway-benchmark
Jun 26, 2026
Merged

docs(benchmark): add AWS gateway benchmark and refresh benchmarks page#429
SantiagoDePolonia merged 1 commit into
mainfrom
docs/aws-gateway-benchmark

Conversation

@SantiagoDePolonia

@SantiagoDePolonia SantiagoDePolonia commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

What

A reproducible AWS benchmark comparing GoModel against LiteLLM, Portkey, and Bifrost, plus the refreshed docs benchmarks page. This PR is the benchmark core only - the write-up and co-located tooling are split into draft #430.

docs/2026-06-25_aws_gateway_benchmark/:

  • RESULTS.md - the documented results.
  • Harness - one command (run.sh): builds the GoModel image, provisions a small AWS box (Terraform), runs all four gateways against a shared instant mock backend (so numbers reflect gateway overhead, not model latency), collects results, and tears the infra back down. Tooling under remote/ (Go loadgen + mock, gateway configs, compose), terraform/, scripts/summarize.py.
  • Updated docs/about/benchmarks.mdx to this June 2026 run (with a paid-AWS cost note).

Results (c7i.large, shared mock backend)

Gateway p99 (ms) Throughput Peak RAM Image (compressed) Cold start
GoModel 6.9 4,900 req/s 37 MB 16 MB 0.56 s
Bifrost 18.3 3,100 143 MB 77 MB 7.1 s
Portkey 30.5 950 112 MB 59 MB 1.1 s
LiteLLM 39.3 324 2.3 GB 372 MB 25.5 s

Split out → draft #430

The benchmark write-up (ARTICLE.md, cover.png + make_cover.py, charts/) and the co-located qa/ and translation/ tooling are in #430 to keep this PR focused.

Not committed

Terraform state, the provider cache, the generated SSH key, and raw run output are gitignored.

🤖 Generated with Claude Code

@coderabbitai

coderabbitai Bot commented Jun 26, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

The PR adds an AWS gateway benchmark harness under docs/2026-06-25_aws_gateway_benchmark, including Terraform provisioning, Docker Compose services, Go benchmarking tools, local and remote run scripts, result summarization, and updated benchmark documentation and published results.

Changes

AWS Gateway Benchmark Harness

Layer / File(s) Summary
Infrastructure and topology
docs/2026-06-25_aws_gateway_benchmark/.gitignore, docs/2026-06-25_aws_gateway_benchmark/remote/configs/*, docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml, docs/2026-06-25_aws_gateway_benchmark/terraform/*
Terraform provisions the EC2 benchmark host, Compose defines the shared mock and gateway profiles, and local ignore/config files set the deployment inputs.
Benchmark tools image
docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile, docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod, docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go, docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
A Go module and distroless image package mock and loadgen; the mock serves deterministic OpenAI/Anthropic-compatible endpoints and loadgen measures fixed-count and time-boxed request flows.
Execution and summarization
docs/2026-06-25_aws_gateway_benchmark/run.sh, docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh, docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
The local launcher provisions AWS infrastructure, transfers the harness and image, starts the remote run, collects results, and writes summary JSON/Markdown from the generated benchmark artifacts.
Benchmark docs and results
docs/2026-06-25_aws_gateway_benchmark/README.md, docs/2026-06-25_aws_gateway_benchmark/RESULTS.md, docs/about/benchmarks.mdx
The benchmark README, results note, and public benchmarks page describe the June 2026 harness, outputs, reproduction steps, and published comparisons.

Sequence Diagram(s)

sequenceDiagram
  participant RunSh as "run.sh"
  participant Terraform
  participant EC2Instance
  participant RunOnInstance as "run-on-instance.sh"
  participant Loadgen as "loadgen"
  participant SummarizePy as "scripts/summarize.py"
  RunSh->>Terraform: apply benchmark infrastructure
  RunSh->>EC2Instance: rsync harness and upload image tarball
  RunSh->>RunOnInstance: start detached benchmark run
  RunOnInstance->>Loadgen: run latency variants and sweeps
  RunSh->>SummarizePy: generate summary.txt from results
Loading

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

Poem

🐇 I hopped through clouds of code today,
With mock and loadgen on the way.
The sweep ran fast, the tables shone,
And results came back all neatly grown.
I left a carrot by the gate.

🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 33.08% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly summarizes the main change: adding an AWS gateway benchmark and updating the benchmarks page.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.
Description Check ✅ Passed Check skipped - CodeRabbit’s high-level summary is enabled.
✨ Finishing Touches
📝 Generate docstrings
  • Create stacked PR
  • Commit on current branch
🧪 Generate unit tests (beta)
  • Create PR with unit tests
  • Commit unit tests in branch docs/aws-gateway-benchmark

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands.

@mintlify

mintlify Bot commented Jun 26, 2026

Copy link
Copy Markdown

Preview deployment for your docs. Learn more about Mintlify Previews.

Project Status Preview Updated (UTC)
gomodel 🟢 Ready View Preview Jun 26, 2026, 10:35 AM

💡 Tip: Enable Workflows to automatically generate PRs for you.

@codecov-commenter

Copy link
Copy Markdown

⚠️ Please install the 'codecov app svg image' to ensure uploads and comments are reliably processed by Codecov.

Codecov Report

✅ All modified and coverable lines are covered by tests.

📢 Thoughts on this report? Let us know!

@greptile-apps

greptile-apps Bot commented Jun 26, 2026

Copy link
Copy Markdown

Confidence Score: 4/5

Safe to merge for the docs and harness additions; one infrastructure default should be tightened before the terraform module is used independently.

The terraform ssh_ingress_cidr variable defaults to 0.0.0.0/0, which leaves SSH world-open whenever someone runs terraform apply directly rather than through run.sh. The orchestrator always overrides it with the caller's IP, but the module in isolation is insecure and lives in a public repository. The README also incorrectly advertises a 'free-tier' instance while the default is c7i.large (~$0.09/hr). No changes to production Go code.

docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf (ssh_ingress_cidr default), docs/2026-06-25_aws_gateway_benchmark/README.md (free-tier claim)

Security Review

  • World-open SSH default (terraform/variables.tf): ssh_ingress_cidr defaults to 0.0.0.0/0. Any user who runs terraform apply directly without run.sh provisions an EC2 instance with SSH exposed to the entire internet. run.sh always overrides this with MY_IP/32, but the terraform module itself is insecure in isolation.

Sequence Diagram

%%{init: {'theme': 'neutral'}}%%
sequenceDiagram
    participant User
    participant run.sh
    participant Terraform
    participant AWS
    participant EC2
    participant run-on-instance.sh
    participant Docker

    User->>run.sh: ./run.sh
    run.sh->>run.sh: docker buildx build (GoModel linux/amd64)
    run.sh->>Terraform: "terraform apply (region, instance_type, ssh_ingress_cidr=MY_IP/32)"
    Terraform->>AWS: provision c7i.large + SG + ED25519 keypair
    AWS-->>Terraform: public IP, key path
    run.sh->>EC2: poll SSH until ready
    run.sh->>EC2: rsync harness + scp GoModel image tar
    run.sh->>EC2: setsid bash run-on-instance.sh (detached)

    loop REPEATS latency trials (randomized gateway order)
        run-on-instance.sh->>Docker: compose up mock
        run-on-instance.sh->>Docker: compose up gateway (one at a time)
        run-on-instance.sh->>Docker: loadgen -n N (6 variants x gateway)
        Docker-->>run-on-instance.sh: JSON latency summary per variant
        run-on-instance.sh->>Docker: compose rm gateway
    end

    run-on-instance.sh->>Docker: capacity sweep + cold-start + peak RSS
    run-on-instance.sh->>run-on-instance.sh: write results/meta.json (sentinel)
    run.sh->>EC2: poll for meta.json every 15s
    run.sh->>EC2: rsync results/
    run.sh->>run.sh: python3 summarize.py
    run.sh->>Terraform: terraform destroy (EXIT trap)
    Terraform->>AWS: terminate instance + delete SG + keypair
Loading
%%{init: {'theme': 'base', 'themeVariables': {"darkMode": true, "background": "#0d1117", "primaryColor": "#21262d", "primaryTextColor": "#e6edf3", "primaryBorderColor": "#8b949e", "lineColor": "#8b949e", "textColor": "#e6edf3", "edgeLabelBackground": "#161b22", "actorBkg": "#21262d", "actorBorder": "#8b949e", "actorTextColor": "#e6edf3", "actorLineColor": "#8b949e", "signalColor": "#8b949e", "signalTextColor": "#e6edf3", "noteBkgColor": "#373320", "noteBorderColor": "#d4a72c", "noteTextColor": "#f0e6c0", "labelBoxBkgColor": "#21262d", "labelBoxBorderColor": "#8b949e", "labelTextColor": "#e6edf3", "loopTextColor": "#e6edf3", "activationBkgColor": "#30363d", "activationBorderColor": "#8b949e"}}}%%
sequenceDiagram
    participant User
    participant run.sh
    participant Terraform
    participant AWS
    participant EC2
    participant run-on-instance.sh
    participant Docker

    User->>run.sh: ./run.sh
    run.sh->>run.sh: docker buildx build (GoModel linux/amd64)
    run.sh->>Terraform: "terraform apply (region, instance_type, ssh_ingress_cidr=MY_IP/32)"
    Terraform->>AWS: provision c7i.large + SG + ED25519 keypair
    AWS-->>Terraform: public IP, key path
    run.sh->>EC2: poll SSH until ready
    run.sh->>EC2: rsync harness + scp GoModel image tar
    run.sh->>EC2: setsid bash run-on-instance.sh (detached)

    loop REPEATS latency trials (randomized gateway order)
        run-on-instance.sh->>Docker: compose up mock
        run-on-instance.sh->>Docker: compose up gateway (one at a time)
        run-on-instance.sh->>Docker: loadgen -n N (6 variants x gateway)
        Docker-->>run-on-instance.sh: JSON latency summary per variant
        run-on-instance.sh->>Docker: compose rm gateway
    end

    run-on-instance.sh->>Docker: capacity sweep + cold-start + peak RSS
    run-on-instance.sh->>run-on-instance.sh: write results/meta.json (sentinel)
    run.sh->>EC2: poll for meta.json every 15s
    run.sh->>EC2: rsync results/
    run.sh->>run.sh: python3 summarize.py
    run.sh->>Terraform: terraform destroy (EXIT trap)
    Terraform->>AWS: terminate instance + delete SG + keypair
Loading

Reviews (1): Last reviewed commit: "docs(benchmark): add AWS gateway benchma..." | Re-trigger Greptile

Comment on lines +19 to +23
variable "ssh_ingress_cidr" {
description = "CIDR allowed to SSH in. Set to <your-ip>/32. Defaults to fully open if left empty (NOT recommended)."
type = string
default = "0.0.0.0/0"
}

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P1 security World-open SSH default

The ssh_ingress_cidr variable defaults to "0.0.0.0/0", exposing port 22 to the public internet for any user who runs terraform apply directly (rather than through run.sh). The description warns this is not recommended, but the default still enables it. A safer default would be a validation-required empty string or a non-routable placeholder that forces the caller to supply a real CIDR — run.sh always passes ${MY_IP}/32, but the module in isolation is world-open.

@@ -0,0 +1,122 @@
# AWS gateway latency & resource benchmark — GoModel vs LiteLLM vs Portkey vs Bifrost

A reproducible, one-command benchmark that provisions a free-tier AWS instance,

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 "free-tier" claim is inaccurate for the default instance

The README describes this benchmark as provisioning "a free-tier AWS instance," but the default instance_type is c7i.large, which is explicitly documented in variables.tf as "NOT free-tier eligible (~$0.09/hr on-demand in us-east-1)." A user who clones the repo and runs ./run.sh without overriding the instance type will be charged. The free-tier language should be removed or replaced with an explicit cost note.

# ── 4. Ship harness + image, run ───────────────────────────────────
log "Shipping harness to instance"
ssh "${SSH_OPTS[@]}" "$USER@$IP" 'rm -rf ~/bench && mkdir -p ~/bench'
rsync -az -e "ssh ${SSH_OPTS[*]}" --exclude results "$REMOTE_DIR/" "$USER@$IP:~/bench/"

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

P2 Array expansion loses quoting in rsync -e argument

"ssh ${SSH_OPTS[*]}" joins all array elements with a space and embeds them into a single string. If $KEY (the private key path) ever contains a space, the rsync -e value will be split and misinterpreted. The same pattern appears on line 142. Using ${SSH_OPTS[*]} is fragile; printf '%q ' "${SSH_OPTS[@]}" or an explicit quoted expansion is safer. In practice the key path is terraform/bench_key.pem so no space exists today, but it is worth hardening before this becomes a template others copy.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 22

🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/2026-06-25_aws_gateway_benchmark/.gitignore`:
- Line 5: The .gitignore entry is incorrectly excluding the Terraform lockfile,
which can allow provider versions to drift between benchmark runs. Remove the
terraform/.terraform.lock.hcl ignore rule so the lockfile stays tracked and
reproducible alongside the Terraform benchmark setup.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/README.md`:
- Around line 103-109: The README layout block is missing a fenced code language
tag, which triggers markdownlint MD040. Update the fenced block in the
documentation so the snippet is marked as plain text by adding the appropriate
language specifier, keeping the content unchanged. Locate the markdown fence
around the repository layout section and adjust that block only.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py`:
- Around line 296-319: The per-case error guard in the main loop is too narrow
because `evaluate()` can still raise and abort the whole run. Extend the
existing `try`/`except` around the `for case in cases` block in `run_qa.py` so
it also covers `evaluate(case, res, audit, attempted)` and the
`_record(...)`/printing path that depends on it. Keep the current `ERROR`
handling pattern used for `run_case`, and make sure any exception from
`evaluate` or `apply_operator` is captured with the case id and recorded instead
of stopping the loop.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json`:
- Around line 54-55: The oracle for chat.json currently allows an invalid
primary color; update the quality rule so the accepted colors no longer include
green. Adjust the stream validation entry under the target "stream" to keep the
same structure, but remove green from the contains_any list and leave only the
valid colors referenced by this prompt.
- Around line 75-80: The chat streaming spec currently only checks for a
non-empty stream and a terminal [DONE], so it can pass even if usage data is
dropped. Update the expect block in the chat fixture to assert that the streamed
output contains a chunk with usage, using the existing stream/quality assertion
style. Keep the stream_options translation note intact and make the new usage
assertion specific enough to catch missing provider-shaped usage chunks.
- Around line 175-180: The current chat QA check only verifies non-empty content
and a keyword, so it does not validate the `json_schema` contract. Update the
expectation around `message.content` in the chat benchmark spec to assert that
the response is valid JSON and matches the required schema with exactly the
expected fields, using the existing `expect`/`quality` structure in this test.
This should ensure the `response_format` handling is actually exercised and a
plain-text response will fail.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json`:
- Around line 30-36: The stateful conversation flow is only checking the payload
shape, so `get`, `update`, and `delete` can pass even if they operate on the
wrong record. Update the expectations in the conversation benchmark flow to
assert that the returned `$.id` matches the captured `${conversation_id}`
wherever these steps are used, alongside the existing `$.object` checks. Make
the same equality validation in the later conversation assertions so path
interpolation and object identity are verified end to end.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json`:
- Around line 31-38: The batch embedding spec currently checks only the result
count and embedding shape, so it does not prove output order is preserved.
Update the `expect` block in the embeddings batch case to assert each returned
item is tied to its original input position, using the returned `index` field or
equivalent per-item position checks in `$.data`, so the test fails if
`embeddings.json` responses are reordered.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json`:
- Around line 56-65: The test in the QA spec is locking in a bug by requiring a
400 response for unknown top-level fields. Update the expectation in the error
case so it asserts the stable behavior instead: no 5xx/crash, audit still
records $.data.request_body.x_qa_marker, and the response can be either
forwarded 400 or sanitized 200. Adjust the status and response assertions in the
spec’s expect block to allow both outcomes while keeping the existing
field-capture checks.

In `@docs/2026-06-25_aws_gateway_benchmark/README.md`:
- Line 54: The fenced layout block in the README is missing a language tag and
triggers markdownlint MD040. Update the fenced block in the documentation to use
a text language identifier, and make sure the change is applied to the layout
block near the benchmark description so the markdown renders and lints
correctly.
- Around line 3-5: The README summary is inconsistent with the actual defaults
in run.sh, so update the documentation or the script to match. In the README
section describing the benchmark defaults, align the stated instance type and
load settings with the values used by run.sh (INSTANCE_TYPE and N), and make
sure the “free-tier” wording is only used if the default configuration truly is
free-tier. Use the run.sh defaults and the README benchmark overview text as the
key references.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile`:
- Around line 4-12: The Dockerfile for bench-tools uses floating base image tags
in both the build stage and runtime stage, so update the FROM entries for
golang:1.26-alpine and gcr.io/distroless/static-debian12:nonroot to immutable
SHA256 digests. Keep the rest of the build steps in place and preserve the
existing multi-stage structure so the Dockerfile remains reproducible for
benchmarking.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go`:
- Around line 286-303: The early return in readStream leaves the SSE response
body partially unread, which prevents doRequest’s deferred resp.Body.Close from
enabling keep-alive reuse. Update readStream so that when isTerminal matches it
either drains the remaining response body before returning or signals the
goroutine reading lines to stop cleanly, using the existing readStream/doRequest
flow and the terminal-frame handling around isTerminal. Verify that streaming
requests can reuse connections under load after the fix.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go`:
- Around line 126-134: The recording path has a data race because `begin`
appends `e` to `rec.entries` before `e.Response` is populated, while `handleLog`
and `handleLast` read `entry.Response` under `rec.mu` concurrently. Fix this in
the `begin`/`respond`/`setStreamResp` flow by either moving the append to
`rec.entries` until after `Response` is fully set, or by protecting the
`Response` write with `rec.mu` so reads and writes are synchronized.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml`:
- Around line 3-4: Update the stale host description comment in
docker-compose.yml so it matches the benchmark environment used elsewhere; the
note near the top still says “single-vCPU, free-tier” even though the
docker-compose topology and the benchmark assumptions in the same file are for a
c7i.large-class, two-core setup. Adjust that comment text to describe the actual
instance class and core count so the guidance stays consistent with the
benchmark configuration.
- Around line 61-88: The peer gateway service image defaults in the docker
compose setup are still using mutable tags, which can cause benchmark results to
drift over time. Update the default image references for the relevant service
definitions in docker-compose.yml, such as the litellm and portkey entries, to
immutable digests while keeping the existing environment-variable overrides
intact for flexibility. Use the service names and image fields in this compose
block to locate and change the defaults.

In `@docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py`:
- Around line 41-42: The cover caption in make_cover.py still claims the
benchmark ran on a free-tier AWS box, but it should match the actual published
setup. Update the fig.text caption text to reflect the c7i.large benchmark
environment referenced by ARTICLE.md, using the same caption-building code path
so the cover copy stays consistent with the article.
- Around line 25-27: The cover chart data is still using the old June 2026
t2.micro benchmark values, so it no longer matches the published article
figures. Update the IMG and RAM constants in make_cover.py to the final
c7i.large numbers from ARTICLE.md, and revise the nearby comment so it describes
the correct run and units. Keep the ordering and labels intact so the generated
hero visual aligns with the table below it.

In `@docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py`:
- Around line 135-137: The target discovery in summarize.py is only scanning the
first run directory, so later successful trials are ignored when the first trial
is partial or empty. Update the present/targets logic near the globbing code to
aggregate matching JSON basenames across all entries in runs, not just runs[0],
so TARGETS reflects any target that appears in any trial. Use the existing runs,
present, and TARGETS symbols to keep the fix localized.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf`:
- Around line 122-148: Harden aws_instance.bench by adding metadata_options to
require IMDSv2 and disable IMDSv1, and update the root_block_device in the same
resource to explicitly enable EBS encryption for the gp3 root volume. Make the
changes directly in the aws_instance "bench" block so the benchmark host keeps
the same behavior while satisfying the static analysis findings.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/README.md`:
- Line 69: The GoModel entry in the addressing table is using the container port
instead of the host port, which makes it inconsistent with the runner
configuration. Update the GoModel row in the README table to reflect the host
mapping used by capture.py and docker-compose.yml, and keep it aligned with the
other rows that list host-facing ports. Locate the table row containing GoModel
and correct the port value so it matches the actual accessible address.

In `@docs/about/benchmarks.mdx`:
- Around line 75-76: The reproduce snippet has a casing mismatch: the repository
is cloned into the GoModel directory, but the follow-up cd command uses a
lowercase path that will fail on case-sensitive systems. Update the benchmark
instructions so the directory name used after git clone matches the actual
checkout folder exactly, keeping the reproduction steps in
docs/about/benchmarks.mdx consistent.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: 1d00ad02-ff89-4cbc-a129-81d6b8bcbd3b

📥 Commits

Reviewing files that changed from the base of the PR and between 2662963 and cfe0d9c.

⛔ Files ignored due to path filters (5)
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/cover.png is excluded by !**/*.png
📒 Files selected for processing (42)
  • docs/2026-06-25_aws_gateway_benchmark/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
  • docs/2026-06-25_aws_gateway_benchmark/README.md
  • docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/qa/README.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/models.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
  • docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
  • docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
  • docs/2026-06-25_aws_gateway_benchmark/run.sh
  • docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
  • docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
  • docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
  • docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/translation/README.md
  • docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
  • docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
  • docs/about/benchmarks.mdx

output/
remote/results/
terraform/.terraform/
terraform/.terraform.lock.hcl

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Do not ignore the Terraform lockfile.

Ignoring .terraform.lock.hcl lets provider selections drift between runs, which undercuts the reproducible benchmark goal even when the Terraform sources are unchanged.

Suggested fix
-terraform/.terraform.lock.hcl
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
terraform/.terraform.lock.hcl
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/.gitignore` at line 5, The .gitignore
entry is incorrectly excluding the Terraform lockfile, which can allow provider
versions to drift between benchmark runs. Remove the
terraform/.terraform.lock.hcl ignore rule so the lockfile stays tracked and
reproducible alongside the Terraform benchmark setup.

Comment thread docs/2026-06-25_aws_gateway_benchmark/qa/README.md Outdated
Comment on lines +296 to +319
for case in cases:
t0 = time.time()
try:
res, audit, attempted, skip = run_case(case, client, models, variables,
do_audit=not args.no_audit)
except Exception as e: # noqa: BLE001 — never let one case abort the run
res, audit, attempted, skip = None, None, False, None
err = f"{type(e).__name__}: {e}"
results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0))
print(f"ERR {case['id']}: {err}")
continue

if skip:
results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
print(f"skip {case['id']}: {skip}")
continue

if audit and (audit.get("data") or {}).get("request_body") is not None:
audit_bodies_seen = True

status, checks, detail = evaluate(case, res, audit, attempted)
rec = _record(case, status, checks, detail, res, audit, time.time() - t0)
results.append(rec)
print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}")

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟡 Minor | ⚡ Quick win

evaluate() runs outside the per-case try, so a bad assertion can abort the whole run.

The try (Lines 298-306) wraps only run_case, but the comment states the goal is to "never let one case abort the run." evaluate() on Line 316 is outside it, and apply_operator can raise on a malformed spec operand — e.g. re.error for a bad regex, or AttributeError from expected.lower() when a contains/contains_any operand isn't a string. One such case would crash the entire run instead of being recorded as ERROR.

🛡️ Extend the guard to cover evaluation
         t0 = time.time()
         try:
             res, audit, attempted, skip = run_case(case, client, models, variables,
                                                     do_audit=not args.no_audit)
+            if skip:
+                results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
+                print(f"skip {case['id']}: {skip}")
+                continue
+
+            if audit and (audit.get("data") or {}).get("request_body") is not None:
+                audit_bodies_seen = True
+
+            status, checks, detail = evaluate(case, res, audit, attempted)
         except Exception as e:  # noqa: BLE001 — never let one case abort the run
             res, audit, attempted, skip = None, None, False, None
             err = f"{type(e).__name__}: {e}"
             results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0))
             print(f"ERR  {case['id']}: {err}")
             continue
-
-        if skip:
-            results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
-            print(f"skip {case['id']}: {skip}")
-            continue
-
-        if audit and (audit.get("data") or {}).get("request_body") is not None:
-            audit_bodies_seen = True
-
-        status, checks, detail = evaluate(case, res, audit, attempted)
         rec = _record(case, status, checks, detail, res, audit, time.time() - t0)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
for case in cases:
t0 = time.time()
try:
res, audit, attempted, skip = run_case(case, client, models, variables,
do_audit=not args.no_audit)
except Exception as e: # noqa: BLE001 — never let one case abort the run
res, audit, attempted, skip = None, None, False, None
err = f"{type(e).__name__}: {e}"
results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0))
print(f"ERR {case['id']}: {err}")
continue
if skip:
results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
print(f"skip {case['id']}: {skip}")
continue
if audit and (audit.get("data") or {}).get("request_body") is not None:
audit_bodies_seen = True
status, checks, detail = evaluate(case, res, audit, attempted)
rec = _record(case, status, checks, detail, res, audit, time.time() - t0)
results.append(rec)
print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}")
for case in cases:
t0 = time.time()
try:
res, audit, attempted, skip = run_case(case, client, models, variables,
do_audit=not args.no_audit)
if skip:
results.append(_record(case, "SKIP", [], skip, res, audit, time.time() - t0))
print(f"skip {case['id']}: {skip}")
continue
if audit and (audit.get("data") or {}).get("request_body") is not None:
audit_bodies_seen = True
status, checks, detail = evaluate(case, res, audit, attempted)
except Exception as e: # noqa: BLE001 — never let one case abort the run
res, audit, attempted, skip = None, None, False, None
err = f"{type(e).__name__}: {e}"
results.append(_record(case, "ERROR", [], err, None, None, time.time() - t0))
print(f"ERR {case['id']}: {err}")
continue
rec = _record(case, status, checks, detail, res, audit, time.time() - t0)
results.append(rec)
print(f"{report.STATUS_GLYPH.get(status, status):4} {case['id']}: {detail}")
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py` around lines 296 - 319,
The per-case error guard in the main loop is too narrow because `evaluate()` can
still raise and abort the whole run. Extend the existing `try`/`except` around
the `for case in cases` block in `run_qa.py` so it also covers `evaluate(case,
res, audit, attempted)` and the `_record(...)`/printing path that depends on it.
Keep the current `ERROR` handling pattern used for `run_case`, and make sure any
exception from `evaluate` or `apply_operator` is captured with the case id and
recorded instead of stopping the loop.

Comment on lines +54 to +55
"stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]},
"quality": [{"target": "stream", "contains_any": ["red", "blue", "yellow", "green"]}]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Remove green from the accepted primary colors.

This oracle currently accepts an incorrect answer, because green is not a primary color for this prompt.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 54 -
55, The oracle for chat.json currently allows an invalid primary color; update
the quality rule so the accepted colors no longer include green. Adjust the
stream validation entry under the target "stream" to keep the same structure,
but remove green from the contains_any list and leave only the valid colors
referenced by this prompt.

Comment thread docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json Outdated
Comment thread docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py Outdated
Comment on lines +135 to +137
present = sorted({os.path.basename(p).split("_")[0]
for p in glob.glob(os.path.join(runs[0], "*_*_*.json"))})
targets = [t for t in TARGETS if t in present]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Discover present targets across all trials, not just runs[0].

The harness is explicitly resilient to partial failures, but present is computed only from the first run dir. If trial 1 produced no/partial JSON (e.g. an early flaky variant) while later trials succeeded, those targets are dropped from the whole summary even though data exists in run2..N.

♻️ Proposed fix
-    present = sorted({os.path.basename(p).split("_")[0]
-                      for p in glob.glob(os.path.join(runs[0], "*_*_*.json"))})
+    present = sorted({os.path.basename(p).split("_")[0]
+                      for r in runs
+                      for p in glob.glob(os.path.join(r, "*_*_*.json"))})
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
present = sorted({os.path.basename(p).split("_")[0]
for p in glob.glob(os.path.join(runs[0], "*_*_*.json"))})
targets = [t for t in TARGETS if t in present]
present = sorted({os.path.basename(p).split("_")[0]
for r in runs
for p in glob.glob(os.path.join(r, "*_*_*.json"))})
targets = [t for t in TARGETS if t in present]
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py` around lines 135
- 137, The target discovery in summarize.py is only scanning the first run
directory, so later successful trials are ignored when the first trial is
partial or empty. Update the present/targets logic near the globbing code to
aggregate matching JSON basenames across all entries in runs, not just runs[0],
so TARGETS reflects any target that appears in any trial. Use the existing runs,
present, and TARGETS symbols to keep the fix localized.

Comment on lines +122 to +148
resource "aws_instance" "bench" {
ami = local.ami_id
instance_type = var.instance_type
key_name = aws_key_pair.bench.key_name
vpc_security_group_ids = [aws_security_group.bench.id]
subnet_id = tolist(data.aws_subnets.default.ids)[0]
associate_public_ip_address = true
user_data = local.user_data

# Only burstable (T-family) instances accept a credit specification. Standard
# credits avoid surprise burst charges there; fixed-performance types (the
# c7i.large default) omit this block entirely — and have no credit drift, which
# is exactly why they make the better latency reference.
dynamic "credit_specification" {
for_each = local.is_burstable ? [1] : []
content {
cpu_credits = "standard"
}
}

root_block_device {
volume_type = "gp3"
volume_size = var.root_volume_gb
}

tags = merge(var.tags, { Name = "gomodel-gateway-benchmark" })
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🔵 Trivial | 💤 Low value

Harden the instance: enforce IMDSv2 and encrypt the root volume.

Two cheap posture improvements flagged by static analysis on aws_instance.bench:

  • IMDSv1 is left enabled (no metadata_options), allowing token-less metadata access (AWS-0028 / CKV_AWS_79).
  • The gp3 root volume is unencrypted (AWS-0131).

Even for an ephemeral benchmark host these are one-line fixes.

🛡️ Proposed hardening
   root_block_device {
     volume_type = "gp3"
     volume_size = var.root_volume_gb
+    encrypted   = true
   }
+
+  metadata_options {
+    http_tokens   = "required"
+    http_endpoint = "enabled"
+  }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
resource "aws_instance" "bench" {
ami = local.ami_id
instance_type = var.instance_type
key_name = aws_key_pair.bench.key_name
vpc_security_group_ids = [aws_security_group.bench.id]
subnet_id = tolist(data.aws_subnets.default.ids)[0]
associate_public_ip_address = true
user_data = local.user_data
# Only burstable (T-family) instances accept a credit specification. Standard
# credits avoid surprise burst charges there; fixed-performance types (the
# c7i.large default) omit this block entirely — and have no credit drift, which
# is exactly why they make the better latency reference.
dynamic "credit_specification" {
for_each = local.is_burstable ? [1] : []
content {
cpu_credits = "standard"
}
}
root_block_device {
volume_type = "gp3"
volume_size = var.root_volume_gb
}
tags = merge(var.tags, { Name = "gomodel-gateway-benchmark" })
}
resource "aws_instance" "bench" {
ami = local.ami_id
instance_type = var.instance_type
key_name = aws_key_pair.bench.key_name
vpc_security_group_ids = [aws_security_group.bench.id]
subnet_id = tolist(data.aws_subnets.default.ids)[0]
associate_public_ip_address = true
user_data = local.user_data
# Only burstable (T-family) instances accept a credit specification. Standard
# credits avoid surprise burst charges there; fixed-performance types (the
# c7i.large default) omit this block entirely — and have no credit drift, which
# is exactly why they make the better latency reference.
dynamic "credit_specification" {
for_each = local.is_burstable ? [1] : []
content {
cpu_credits = "standard"
}
}
root_block_device {
volume_type = "gp3"
volume_size = var.root_volume_gb
encrypted = true
}
metadata_options {
http_tokens = "required"
http_endpoint = "enabled"
}
tags = merge(var.tags, { Name = "gomodel-gateway-benchmark" })
}
🧰 Tools
🪛 Checkov (3.3.1)

[medium] 122-148: Ensure that detailed monitoring is enabled for EC2 instances

(CKV_AWS_126)


[low] 122-148: Ensure that EC2 is EBS optimized

(CKV_AWS_135)


[high] 122-148: EC2 instance should not have public IP.

(CKV_AWS_88)


[high] 122-148: Ensure Instance Metadata Service Version 1 is not enabled

(CKV_AWS_79)

🪛 Trivy (0.69.3)

[error] 122-148: aws_instance should activate session tokens for Instance Metadata Service.

Instance does not require IMDS access to require a token.

Rule: AWS-0028

Resource: aws_instance.bench

Learn more

(IaC/AWS)


[error] 122-148: Instance with unencrypted block device.

Root block device is not encrypted.

Rule: AWS-0131

Resource: aws_instance.bench

Learn more

(IaC/AWS)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf` around lines 122 -
148, Harden aws_instance.bench by adding metadata_options to require IMDSv2 and
disable IMDSv1, and update the root_block_device in the same resource to
explicitly enable EBS encryption for the gp3 root volume. Make the changes
directly in the aws_instance "bench" block so the benchmark host keeps the same
behavior while satisfying the static analysis findings.

Source: Linters/SAST tools


| gateway | port | model | messages path | extra headers |
|---|--|---|---|---|
| GoModel | 8080 | `gpt-4o-mini` | `/v1/messages` | — |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

GoModel host port in the addressing table is inconsistent with the runner.

This table is captioned "handled by capture.py", but capture.py defaults GoModel to http://localhost:18080 and docker-compose.yml maps host 18080:8080 (explicitly to avoid clashing with a local dev GoModel on 8080). The other rows (4000/8787/8089) match their host ports, so listing 8080 here is misleading.

📝 Proposed fix
-| GoModel | 8080 | `gpt-4o-mini` | `/v1/messages` | — |
+| GoModel | 18080 | `gpt-4o-mini` | `/v1/messages` | — |
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
| GoModel | 8080 | `gpt-4o-mini` | `/v1/messages` ||
| GoModel | 18080 | `gpt-4o-mini` | `/v1/messages` ||
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/README.md` at line 69, The
GoModel entry in the addressing table is using the container port instead of the
host port, which makes it inconsistent with the runner configuration. Update the
GoModel row in the README table to reflect the host mapping used by capture.py
and docker-compose.yml, and keep it aligned with the other rows that list
host-facing ports. Locate the table row containing GoModel and correct the port
value so it matches the actual accessible address.

Comment thread docs/about/benchmarks.mdx

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 5

♻️ Duplicate comments (9)
docs/about/benchmarks.mdx (1)

75-76: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix directory casing in the reproduce snippet.

Still unfixed from prior review: git clone checks out into GoModel/, but cd gomodel/... uses lowercase and will fail on case-sensitive filesystems.

🐛 Proposed fix
 git clone https://github.com/ENTERPILOT/GoModel.git
-cd gomodel/docs/2026-06-25_aws_gateway_benchmark
+cd GoModel/docs/2026-06-25_aws_gateway_benchmark
 ./run.sh
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/about/benchmarks.mdx` around lines 75 - 76, The reproduce snippet has a
directory casing mismatch: the repository is cloned into GoModel/, but the
follow-up cd command uses a lowercase path that will break on case-sensitive
filesystems. Update the snippet to use the same casing as the cloned repository
name, and ensure the benchmark directory path in the repro instructions matches
the actual checkout location referenced by the clone step.
docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile (1)

4-4: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Pin the bench-tools base images by digest for reproducibility.

Floating tags remain unaddressed from prior review. Both the build stage (golang:1.26-alpine) and runtime stage (gcr.io/distroless/static-debian12:nonroot) can shift between builds, undermining benchmark reproducibility.

Suggested shape
-FROM golang:1.26-alpine AS build
+FROM golang:1.26-alpine@sha256:<digest> AS build
...
-FROM gcr.io/distroless/static-debian12:nonroot
+FROM gcr.io/distroless/static-debian12:nonroot@sha256:<digest>

Also applies to: 12-12

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile` at line
4, The Dockerfile still uses floating base image tags, which can change between
builds and break benchmark reproducibility. Update both the build stage image in
the Dockerfile and the runtime stage image to pinned digests instead of
tag-based references, and keep the existing stage names (build/runtime) intact
so the image sources remain stable across runs.
docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json (3)

54-55: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Remove green from the accepted primary colors.

This oracle still accepts an incorrect answer, so a wrong streamed response can pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 54 -
55, The oracle for the stream quality check still allows an incorrect response
because “green” is included in the accepted primary colors. Update the quality
expectation in the chat.json spec so the stream validation only accepts the
intended colors, and remove “green” from the contains_any list used by the
stream oracle.

75-78: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Assert that a usage chunk is present in the stream.

The current checks only require a non-empty stream and [DONE], so dropping stream_options.include_usage still passes.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 75 -
78, The stream validation in the chat JSON spec is too weak because it only
checks for non-empty output and a terminal [DONE], so it can miss regressions
where usage is not included. Update the expectation in the chat stream spec to
explicitly assert that a usage chunk appears in the streamed events, alongside
the existing stream and quality checks, so the scenario fails if
stream_options.include_usage is removed.

175-178: 🎯 Functional Correctness | 🟠 Major | 🏗️ Heavy lift

Validate the json_schema contract, not just a keyword.

A plain-text answer mentioning Paris still satisfies this case, so a dropped or schema-invalid response_format can pass.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 175 -
178, The current QA case only checks for a keyword in
$.choices[0].message.content, so it can pass even if response_format/json_schema
is missing or invalid. Update the spec in chat.json to validate the json_schema
contract directly using the existing response/quality structure, and reference
the affected expectation block around the choices[0].message.content check so
the test asserts schema-conforming output rather than just containing “paris”.
docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json (1)

56-65: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't lock the suite to today's unknown-field behavior.

Requiring 400 here bakes in the current forwarding bug; the stable invariant is “no 5xx, no crash, audit captures x_qa_marker,” while allowing either forwarded-400 or sanitized-200.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json` around lines 56 -
65, The error expectation in the QA spec is too strict because it locks the
suite to the current unknown-field forwarding behavior. Update the check around
the response expectations in this spec so it asserts the stable invariants
instead: no 5xx/no crash and the audit still captures x_qa_marker, while
permitting either a forwarded 400 from the provider or a sanitized 200. Keep the
references to the existing response and audit paths in the spec so the test
remains valid if the gateway behavior changes.
docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json (2)

30-35: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Bind the retrieve result to ${conversation_id}.

$.id is only checked for presence, so this can pass against the wrong record.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json` around
lines 30 - 35, The retrieve expectation only verifies that $.id exists, so it
can succeed for the wrong conversation record. Update the conversation retrieval
test in conversations.json to bind the returned id to ${conversation_id} and use
that bound value in the assertion, so the expectation is tied to the specific
retrieved conversation rather than any non-empty id.

68-88: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Update and delete should also assert the captured conversation id.

These cases validate shape only, so the flow still passes if path interpolation or object identity is wrong.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json` around
lines 68 - 88, The update and delete conversation cases only validate response
shape, so they can pass even if the wrong conversation is targeted. In
conversations.json, update the conversation-related expectations in the stateful
flow definitions for the metadata update and delete scenarios to also assert the
captured conversation_id from the created conversation is returned or matched,
using the existing request/expect entries for conversations.update and
conversations.delete to tie the assertions to the specific conversation
identity.
docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json (1)

31-36: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Assert batch output ordering.

The case proves only count and shape; reordered embeddings still pass despite the note saying order is preserved.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json` around lines
31 - 36, The batch embedding QA spec currently only checks response count and
embedding shape, so reordered results can still pass. Update the expectations in
the embeddings benchmark spec to assert the output order as well, using the
existing $.data checks in the embeddings.json test case so the returned
embeddings are validated in the same sequence as the input.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md`:
- Around line 156-157: Update the open-source table entry for GoModel so it
matches the footnote and does not imply a current hybrid license state; in the
ARTICLE.md benchmark table, change the “Yes / Partial ‡” wording to a consistent
“Yes ‡” style. Keep the existing ‡ footnote reference and ensure the wording
aligns with the other table entries and the GoModel footnote explanation.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json`:
- Around line 40-56: The Conversations QA case in conversations.use_in_responses
only verifies that /v1/responses completes, so it can pass even if
conversation.id is ignored. Update this test to prove the response is actually
attached to the conversation by adding a stronger assertion in the same
scenario, such as validating the captured conversation id in an audit/check step
or making a follow-up Responses turn that recalls “7” from the same
conversation. Use the existing request/expect structure in
conversations.use_in_responses to locate the change.

In `@docs/2026-06-25_aws_gateway_benchmark/RESULTS.md`:
- Around line 9-17: The methodology section in RESULTS.md describes run settings
that do not match the harness defaults used by remote/run-on-instance.sh. Update
the claimed warmup, variant warmup, wall cap, and resource-window values in the
Load/Parity bullets, or explicitly note the env var overrides used for this
benchmark run so the reported configuration matches the actual harness behavior.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf`:
- Around line 111-115: The Docker Compose plugin download in the Terraform setup
is not verified, so update the setup steps around the compose plugin fetch to
validate the downloaded binary before making it executable. Use the existing
docker-compose download flow in main.tf to add a SHA256 checksum check (or
equivalent published verification) for the version from
var.compose_plugin_version, and fail the provisioning step if the checksum does
not match.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf`:
- Around line 19-22: The ssh_ingress_cidr variable currently defaults to a
public CIDR, which makes the benchmark host SSH-open by default. Update the
terraform/variables.tf definition to remove the insecure default and require an
explicit value, then verify the security-group ingress in terraform/main.tf only
uses ssh_ingress_cidr when the caller provides it. Use the ssh_ingress_cidr
variable and the related ingress rule in main.tf to locate the fix.

---

Duplicate comments:
In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json`:
- Around line 54-55: The oracle for the stream quality check still allows an
incorrect response because “green” is included in the accepted primary colors.
Update the quality expectation in the chat.json spec so the stream validation
only accepts the intended colors, and remove “green” from the contains_any list
used by the stream oracle.
- Around line 75-78: The stream validation in the chat JSON spec is too weak
because it only checks for non-empty output and a terminal [DONE], so it can
miss regressions where usage is not included. Update the expectation in the chat
stream spec to explicitly assert that a usage chunk appears in the streamed
events, alongside the existing stream and quality checks, so the scenario fails
if stream_options.include_usage is removed.
- Around line 175-178: The current QA case only checks for a keyword in
$.choices[0].message.content, so it can pass even if response_format/json_schema
is missing or invalid. Update the spec in chat.json to validate the json_schema
contract directly using the existing response/quality structure, and reference
the affected expectation block around the choices[0].message.content check so
the test asserts schema-conforming output rather than just containing “paris”.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json`:
- Around line 30-35: The retrieve expectation only verifies that $.id exists, so
it can succeed for the wrong conversation record. Update the conversation
retrieval test in conversations.json to bind the returned id to
${conversation_id} and use that bound value in the assertion, so the expectation
is tied to the specific retrieved conversation rather than any non-empty id.
- Around line 68-88: The update and delete conversation cases only validate
response shape, so they can pass even if the wrong conversation is targeted. In
conversations.json, update the conversation-related expectations in the stateful
flow definitions for the metadata update and delete scenarios to also assert the
captured conversation_id from the created conversation is returned or matched,
using the existing request/expect entries for conversations.update and
conversations.delete to tie the assertions to the specific conversation
identity.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json`:
- Around line 31-36: The batch embedding QA spec currently only checks response
count and embedding shape, so reordered results can still pass. Update the
expectations in the embeddings benchmark spec to assert the output order as
well, using the existing $.data checks in the embeddings.json test case so the
returned embeddings are validated in the same sequence as the input.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json`:
- Around line 56-65: The error expectation in the QA spec is too strict because
it locks the suite to the current unknown-field forwarding behavior. Update the
check around the response expectations in this spec so it asserts the stable
invariants instead: no 5xx/no crash and the audit still captures x_qa_marker,
while permitting either a forwarded 400 from the provider or a sanitized 200.
Keep the references to the existing response and audit paths in the spec so the
test remains valid if the gateway behavior changes.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile`:
- Line 4: The Dockerfile still uses floating base image tags, which can change
between builds and break benchmark reproducibility. Update both the build stage
image in the Dockerfile and the runtime stage image to pinned digests instead of
tag-based references, and keep the existing stage names (build/runtime) intact
so the image sources remain stable across runs.

In `@docs/about/benchmarks.mdx`:
- Around line 75-76: The reproduce snippet has a directory casing mismatch: the
repository is cloned into GoModel/, but the follow-up cd command uses a
lowercase path that will break on case-sensitive filesystems. Update the snippet
to use the same casing as the cloned repository name, and ensure the benchmark
directory path in the repro instructions matches the actual checkout location
referenced by the clone step.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: dfa0c154-20d2-4e8d-83e5-33c57ee96a90

📥 Commits

Reviewing files that changed from the base of the PR and between cfe0d9c and 304476a.

⛔ Files ignored due to path filters (5)
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/cover.png is excluded by !**/*.png
📒 Files selected for processing (42)
  • docs/2026-06-25_aws_gateway_benchmark/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
  • docs/2026-06-25_aws_gateway_benchmark/README.md
  • docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/qa/README.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/models.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
  • docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
  • docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
  • docs/2026-06-25_aws_gateway_benchmark/run.sh
  • docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
  • docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
  • docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
  • docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/translation/README.md
  • docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
  • docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
  • docs/about/benchmarks.mdx

Comment on lines +156 to +157
| Open-source | Yes / Partial ‡ | Partial ‡ | Partial ‡ | Yes |

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Clarify the open-source footnote for GoModel.

The table cell "Yes / Partial ‡" is confusing. The ‡ footnote explains GoModel is "open-source today, with some enterprise-grade features planned to stay private" — meaning it is fully open-source now, with future plans. The "Yes / Partial" construction suggests a current hybrid state that doesn't match the footnote. Use "Yes ‡" consistently with how LiteLLM is marked "Yes" (also with ‡ qualification).

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md` around lines 156 - 157,
Update the open-source table entry for GoModel so it matches the footnote and
does not imply a current hybrid license state; in the ARTICLE.md benchmark
table, change the “Yes / Partial ‡” wording to a consistent “Yes ‡” style. Keep
the existing ‡ footnote reference and ensure the wording aligns with the other
table entries and the GoModel footnote explanation.

Comment on lines +40 to +56
"id": "conversations.use_in_responses",
"title": "Conversations: link a Responses call to a conversation",
"provider": "openai",
"modality": ["stateful", "text"],
"request": {
"path": "/v1/responses",
"body": {
"model": "@openai.chat",
"conversation": {"id": "${conversation_id}"},
"input": "Remember the number 7.",
"max_output_tokens": 32
}
},
"expect": {
"status": 200,
"response": [{"path": "$.status", "equals": "completed"}]
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

This step never proves the response is attached to the conversation.

A stateless /v1/responses call that ignores conversation.id still passes because the only assertion is status=completed. Add either an audit check for the captured id or a follow-up turn that recalls the stored value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json` around
lines 40 - 56, The Conversations QA case in conversations.use_in_responses only
verifies that /v1/responses completes, so it can pass even if conversation.id is
ignored. Update this test to prove the response is actually attached to the
conversation by adding a stronger assertion in the same scenario, such as
validating the captured conversation id in an audit/check step or making a
follow-up Responses turn that recalls “7” from the same conversation. Use the
existing request/expect structure in conversations.use_in_responses to locate
the change.

Comment thread docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
Comment on lines +111 to +115
# Docker Compose v2 plugin (pinned).
mkdir -p /usr/libexec/docker/cli-plugins
curl -fsSL -o /usr/libexec/docker/cli-plugins/docker-compose \
"https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64"
chmod +x /usr/libexec/docker/cli-plugins/docker-compose

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | ⚡ Quick win

Verify the Docker Compose plugin download with a checksum.

The curl fetches the compose plugin binary without checksum or signature verification. For reproducibility and supply-chain integrity, verify the published SHA256 after download.

🔏 Proposed verification step
     curl -fsSL -o /usr/libexec/docker/cli-plugins/docker-compose \
       "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64"
+    # Verify checksum if available; otherwise at least record it for audit
+    curl -fsSL -o /tmp/docker-compose.sha256 \
+      "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64.sha256"
+    (cd /tmp && sha256sum -c docker-compose.sha256)
     chmod +x /usr/libexec/docker/cli-plugins/docker-compose
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf` around lines 111 -
115, The Docker Compose plugin download in the Terraform setup is not verified,
so update the setup steps around the compose plugin fetch to validate the
downloaded binary before making it executable. Use the existing docker-compose
download flow in main.tf to add a SHA256 checksum check (or equivalent published
verification) for the version from var.compose_plugin_version, and fail the
provisioning step if the checksum does not match.

Comment on lines +19 to +22
variable "ssh_ingress_cidr" {
description = "CIDR allowed to SSH in. Set to <your-ip>/32. Defaults to fully open if left empty (NOT recommended)."
type = string
default = "0.0.0.0/0"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Require an explicit SSH CIDR instead of opening SSH to the internet by default.

Line 22 defaults ssh_ingress_cidr to 0.0.0.0/0, and docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf forwards that straight into the security-group ingress rule. That leaves port 22 globally reachable on every fresh benchmark host.

Suggested fix
 variable "ssh_ingress_cidr" {
-  description = "CIDR allowed to SSH in. Set to <your-ip>/32. Defaults to fully open if left empty (NOT recommended)."
+  description = "CIDR allowed to SSH in. Set to <your-ip>/32."
   type        = string
-  default     = "0.0.0.0/0"
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
variable "ssh_ingress_cidr" {
description = "CIDR allowed to SSH in. Set to <your-ip>/32. Defaults to fully open if left empty (NOT recommended)."
type = string
default = "0.0.0.0/0"
variable "ssh_ingress_cidr" {
description = "CIDR allowed to SSH in. Set to <your-ip>/32."
type = string
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf` around lines 19
- 22, The ssh_ingress_cidr variable currently defaults to a public CIDR, which
makes the benchmark host SSH-open by default. Update the terraform/variables.tf
definition to remove the insecure default and require an explicit value, then
verify the security-group ingress in terraform/main.tf only uses
ssh_ingress_cidr when the caller provides it. Use the ssh_ingress_cidr variable
and the related ingress rule in main.tf to locate the fix.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 10

♻️ Duplicate comments (5)
docs/2026-06-25_aws_gateway_benchmark/README.md (2)

54-54: 📐 Maintainability & Code Quality | 🟡 Minor | 💤 Low value

Add language to fenced code block.

Still missing language tag per markdownlint MD040. Use text.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/README.md` at line 54, The fenced code
block in the README is missing a language tag, which triggers markdownlint
MD040. Update the markdown fence to use the appropriate language identifier for
this snippet, using text for the block in the README so the linter passes.

3-5: 🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

README still contradicts run.sh defaults.

Claimed as addressed but line 3 still says "provisions a free-tier AWS instance" while run.sh defaults to c7i.large (explicitly "NOT free tier") with N=20000. Lines 109-120 compound this by describing t2.micro/N=500/c=10 as defaults. Please align documentation with actual defaults or vice-versa.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/README.md` around lines 3 - 5, The
README content is out of sync with the actual benchmark defaults in run.sh,
especially around instance type and workload size. Update the README wording in
the benchmark overview and the defaults section to match the real defaults used
by run.sh (or change run.sh to match the documented defaults), and use the
run.sh entrypoint plus its configured instance/classic parameters as the source
of truth so terms like free-tier, t2.micro, N=500, and c=10 are consistent.
docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md (1)

156-157: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Clarify GoModel open-source status.

Still "Yes / Partial ‡" which implies a current hybrid state. The ‡ footnote explains GoModel is fully open-source now with future enterprise features planned private. Change to "Yes ‡" for consistency with LiteLLM's marking and accuracy.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md` around lines 156 - 157, The
comparison table still shows GoModel as “Yes / Partial ‡”, which suggests an
ongoing hybrid status instead of fully open-source. Update the GoModel entry in
the benchmark table in ARTICLE.md to match the footnote and LiteLLM’s style by
changing it to “Yes ‡”, keeping the existing footnote reference intact.
docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile (1)

4-4: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Pin base images by digest for reproducibility.

Floating tags remain unaddressed from previous review. For a benchmark claiming reproducibility, immutable digests ensure identical builds across time.

-FROM golang:1.26-alpine AS build
+FROM golang:1.26-alpine@sha256:<digest> AS build
...
-FROM gcr.io/distroless/static-debian12:nonroot
+FROM gcr.io/distroless/static-debian12:nonroot@sha256:<digest>

Also applies to: 12-12

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile` at line
4, The base image reference in the Dockerfile still uses a floating tag, which
breaks reproducibility. Update the FROM instruction for the build stage to use
an immutable digest instead of the golang:1.26-alpine tag, and apply the same
digest pinning to the other base image reference noted in the review so the
benchmark builds are stable over time.
docs/about/benchmarks.mdx (1)

75-76: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix directory casing in reproduce snippet.

Still unaddressed: git clone creates GoModel/ but cd gomodel/... uses lowercase, failing on case-sensitive filesystems.

 git clone https://github.com/ENTERPILOT/GoModel.git
-cd gomodel/docs/2026-06-25_aws_gateway_benchmark
+cd GoModel/docs/2026-06-25_aws_gateway_benchmark
 ./run.sh
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/about/benchmarks.mdx` around lines 75 - 76, The reproduce snippet has a
casing mismatch between the cloned repository directory and the follow-up cd
command, which will break on case-sensitive filesystems. Update the snippet so
the directory name used after git clone matches the actual repository folder
name exactly, and verify the benchmark reproduction path under the
docs/about/benchmarks.mdx example stays consistent with the clone target.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py`:
- Around line 52-55: The `evaluate()` assertion handling in `assertions.py`
should treat `contains` and `not_contains` the same way as
`contains_any`/`contains_all` by coercing `expected` to a string before calling
`.lower()`. Update the `op == "contains"` and `op == "not_contains"` branches to
use `str(expected).lower()` so non-string spec values do not raise
`AttributeError` and can be evaluated consistently.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py`:
- Line 90: The `modality` formatting in `report.py` can break when
`r.get("modality")` is a string, causing `",".join(...)` to split it into
characters. Update the logic around the `modality = ",".join(...)` assignment to
normalize the `modality` value in the row `r` before joining. In the
report-building code, detect string values and wrap them in a single-item list,
otherwise keep list-like values as-is, so the output stays correctly
comma-separated.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go`:
- Around line 72-77: The benchmark entrypoint in main should validate flags
before proceeding: reject non-positive values for the fixed-count concurrency
flag, reject negative values for the request count flag, and fail immediately on
an unknown dialect instead of letting buildBody silently default to chat. Add
these checks near the existing url validation in main and keep the error
messages tied to the parsed flags so invalid benchmark runs exit fast.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go`:
- Around line 97-100: The redaction denylist in sensitiveHeaders is too narrow
and can leak credentials through /__log. Update the header filtering used by the
mock logger in main.go to cover common secret-bearing headers like
Proxy-Authorization and Cookie, plus any gateway/provider-specific key headers
you know about, and add a broader name-based check for headers containing key,
token, secret, or authorization. Keep the logic centralized around the existing
sensitiveHeaders map so all capture/log paths automatically redact these values.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh`:
- Around line 27-28: The run-on-instance script does not stop when changing into
its own directory fails, so subsequent path handling can target the wrong
location. Update the startup logic around SCRIPT_DIR and cd to explicitly detect
a failed directory change and exit immediately with an error before any
benchmark paths are used.
- Around line 207-218: The record_image() helper can emit invalid JSON when
docker save fails for a missing optional image, because the compressed size
pipeline may produce mixed output. Update record_image() to first verify the
image exists before attempting docker image inspect or docker save, and if it is
absent set compressed_bytes (and related size fields if needed) to 0/unknown
consistently so the generated "$RESULTS_DIR/${gw}_image.json" remains valid
JSON.

In `@docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py`:
- Line 159: The summarize.py cleanup has Ruff E702 violations from multiple
statements on one line. Split the semicolon-separated prints in the affected
spots, including the header output near the code that builds hdr and the other
repeated print lines around the referenced locations, so each statement is on
its own line and passes linting.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/capture.py`:
- Around line 165-166: The gateway selection in capture.py silently drops
invalid values when building gateways from args.gateways, which can make a typo
look like a successful run. Update the argument handling around the gateways
filtering logic to validate every requested name against GATEWAYS, raise an
error if any unknown names are present, and also fail if the resulting list is
empty so the capture does not proceed with no valid gateways.
- Around line 128-132: The mock reset helper currently swallows all failures,
which can leave stale recorder state and corrupt later captures. Update
mock_reset in capture.py to fail fast by surfacing the exception from the
urllib.request.urlopen call (or retry a small number of times and then raise)
instead of using a broad except that passes. Keep the fix localized to
mock_reset and ensure the reset request to MOCK + "/__reset" is treated as
required before subsequent __log reads.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml`:
- Around line 49-52: The gateway service images in the docker-compose setup are
still using mutable tags, which makes benchmark reruns non-reproducible. Update
the image references for the litellm and the other affected services in this
compose file to use immutable versioned tags or digests instead of
main-stable/latest. Keep the existing service definitions and profiles, but pin
the image values so the benchmark package always runs against the same gateway
builds.

---

Duplicate comments:
In `@docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md`:
- Around line 156-157: The comparison table still shows GoModel as “Yes /
Partial ‡”, which suggests an ongoing hybrid status instead of fully
open-source. Update the GoModel entry in the benchmark table in ARTICLE.md to
match the footnote and LiteLLM’s style by changing it to “Yes ‡”, keeping the
existing footnote reference intact.

In `@docs/2026-06-25_aws_gateway_benchmark/README.md`:
- Line 54: The fenced code block in the README is missing a language tag, which
triggers markdownlint MD040. Update the markdown fence to use the appropriate
language identifier for this snippet, using text for the block in the README so
the linter passes.
- Around line 3-5: The README content is out of sync with the actual benchmark
defaults in run.sh, especially around instance type and workload size. Update
the README wording in the benchmark overview and the defaults section to match
the real defaults used by run.sh (or change run.sh to match the documented
defaults), and use the run.sh entrypoint plus its configured instance/classic
parameters as the source of truth so terms like free-tier, t2.micro, N=500, and
c=10 are consistent.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile`:
- Line 4: The base image reference in the Dockerfile still uses a floating tag,
which breaks reproducibility. Update the FROM instruction for the build stage to
use an immutable digest instead of the golang:1.26-alpine tag, and apply the
same digest pinning to the other base image reference noted in the review so the
benchmark builds are stable over time.

In `@docs/about/benchmarks.mdx`:
- Around line 75-76: The reproduce snippet has a casing mismatch between the
cloned repository directory and the follow-up cd command, which will break on
case-sensitive filesystems. Update the snippet so the directory name used after
git clone matches the actual repository folder name exactly, and verify the
benchmark reproduction path under the docs/about/benchmarks.mdx example stays
consistent with the clone target.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: a5afb1f5-5f6e-4b83-ab49-37b993f3df3e

📥 Commits

Reviewing files that changed from the base of the PR and between 304476a and dc36712.

⛔ Files ignored due to path filters (5)
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg is excluded by !**/*.svg
  • docs/2026-06-25_aws_gateway_benchmark/cover.png is excluded by !**/*.png
📒 Files selected for processing (42)
  • docs/2026-06-25_aws_gateway_benchmark/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md
  • docs/2026-06-25_aws_gateway_benchmark/README.md
  • docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/qa/README.md
  • docs/2026-06-25_aws_gateway_benchmark/qa/models.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json
  • docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
  • docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
  • docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
  • docs/2026-06-25_aws_gateway_benchmark/run.sh
  • docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py
  • docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
  • docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
  • docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/translation/README.md
  • docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/capture.py
  • docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json
  • docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml
  • docs/about/benchmarks.mdx

Comment on lines +52 to +55
if op == "contains":
return expected.lower() in str(value).lower(), f"contains {expected!r}"
if op == "not_contains":
return expected.lower() not in str(value).lower(), f"not contains {expected!r}"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🔵 Trivial | ⚡ Quick win

Coerce expected to str in contains/not_contains for consistency.

contains_any/contains_all already defend against non-string operands via str(w).lower(), but contains/not_contains call expected.lower() directly. A non-string operand in a spec (e.g. {"contains": 200}) raises AttributeError, which currently propagates up through evaluate() and can abort the run.

♻️ Proposed fix
         if op == "contains":
-            return expected.lower() in str(value).lower(), f"contains {expected!r}"
+            return str(expected).lower() in str(value).lower(), f"contains {expected!r}"
         if op == "not_contains":
-            return expected.lower() not in str(value).lower(), f"not contains {expected!r}"
+            return str(expected).lower() not in str(value).lower(), f"not contains {expected!r}"
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if op == "contains":
return expected.lower() in str(value).lower(), f"contains {expected!r}"
if op == "not_contains":
return expected.lower() not in str(value).lower(), f"not contains {expected!r}"
if op == "contains":
return str(expected).lower() in str(value).lower(), f"contains {expected!r}"
if op == "not_contains":
return str(expected).lower() not in str(value).lower(), f"not contains {expected!r}"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py` around lines 52
- 55, The `evaluate()` assertion handling in `assertions.py` should treat
`contains` and `not_contains` the same way as `contains_any`/`contains_all` by
coercing `expected` to a string before calling `.lower()`. Update the `op ==
"contains"` and `op == "not_contains"` branches to use `str(expected).lower()`
so non-string spec values do not raise `AttributeError` and can be evaluated
consistently.

for r in results:
q = quality_score(r)
qs = f"{q}%" if q is not None else ""
modality = ",".join(r.get("modality") or [])

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
# Check whether any spec declares modality as a string rather than a list.
fd -e json . docs/2026-06-25_aws_gateway_benchmark/qa/spec \
  --exec sh -c 'python3 - "$1" <<PY
import json,sys
for c in json.load(open(sys.argv[1])):
    m=c.get("modality")
    if m is not None and not isinstance(m,list):
        print(sys.argv[1], c.get("id"), repr(m))
PY' _ {}

Repository: ENTERPILOT/GoModel

Length of output: 156


Guard modality against non-list values to prevent malformed output.

If a spec defines modality as a string (e.g., "vision"), the current ",".join(r.get("modality") or []) splits it into characters ("v,i,s,i,o,n"). Ensure the value is always a list before joining:

modals = r.get("modality")
if isinstance(modals, str):
    modals = [modals]
modality = ",".join(modals or [])
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py` at line 90, The
`modality` formatting in `report.py` can break when `r.get("modality")` is a
string, causing `",".join(...)` to split it into characters. Update the logic
around the `modality = ",".join(...)` assignment to normalize the `modality`
value in the row `r` before joining. In the report-building code, detect string
values and wrap them in a single-item list, otherwise keep list-like values
as-is, so the output stays correctly comma-separated.

Comment on lines +72 to +77
if *url == "" {
fmt.Fprintln(os.Stderr, "usage: loadgen -url <url> [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
os.Exit(2)
}

body := buildBody(*dialect, *model, *stream)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Fail fast on invalid benchmark flags.

-c 0 deadlocks fixed-count mode on the semaphore, negative -n can panic in make, and an unknown -dialect silently falls back to chat while reporting the wrong variant.

🐛 Proposed validation
 	if *url == "" {
 		fmt.Fprintln(os.Stderr, "usage: loadgen -url <url> [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
 		os.Exit(2)
 	}
+	if *c <= 0 {
+		fmt.Fprintln(os.Stderr, "-c must be > 0")
+		os.Exit(2)
+	}
+	if *n < 0 {
+		fmt.Fprintln(os.Stderr, "-n must be >= 0")
+		os.Exit(2)
+	}
+	switch *dialect {
+	case "chat", "responses", "messages":
+	default:
+		fmt.Fprintln(os.Stderr, "-dialect must be one of: chat, responses, messages")
+		os.Exit(2)
+	}
 
 	body := buildBody(*dialect, *model, *stream)
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
if *url == "" {
fmt.Fprintln(os.Stderr, "usage: loadgen -url <url> [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
os.Exit(2)
}
body := buildBody(*dialect, *model, *stream)
if *url == "" {
fmt.Fprintln(os.Stderr, "usage: loadgen -url <url> [-n] [-c] [-dialect chat|responses|messages] [-stream] [-model] [-H 'K: V']")
os.Exit(2)
}
if *c <= 0 {
fmt.Fprintln(os.Stderr, "-c must be > 0")
os.Exit(2)
}
if *n < 0 {
fmt.Fprintln(os.Stderr, "-n must be >= 0")
os.Exit(2)
}
switch *dialect {
case "chat", "responses", "messages":
default:
fmt.Fprintln(os.Stderr, "-dialect must be one of: chat, responses, messages")
os.Exit(2)
}
body := buildBody(*dialect, *model, *stream)
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go`
around lines 72 - 77, The benchmark entrypoint in main should validate flags
before proceeding: reject non-positive values for the fixed-count concurrency
flag, reject negative values for the request count flag, and fail immediately on
an unknown dialect instead of letting buildBody silently default to chat. Add
these checks near the existing url validation in main and keep the error
messages tied to the parsed flags so invalid benchmark runs exit fast.

Comment on lines +97 to +100
var sensitiveHeaders = map[string]bool{
"authorization": true, "x-api-key": true, "api-key": true,
"x-portkey-api-key": true, "x-goog-api-key": true,
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Expand redaction for credential-bearing capture headers.

/__log can still expose secrets sent via headers such as Proxy-Authorization, Cookie, or gateway/provider-specific key headers not listed here. Prefer a denylist that includes common credential headers and/or redacts names containing key, token, secret, or authorization. As per coding guidelines, **/*.go: Never expose API keys, authorization headers, or secrets in errors or logs.

🛡️ Proposed hardening
 var sensitiveHeaders = map[string]bool{
-	"authorization": true, "x-api-key": true, "api-key": true,
-	"x-portkey-api-key": true, "x-goog-api-key": true,
+	"authorization": true, "proxy-authorization": true,
+	"cookie": true, "set-cookie": true,
+	"x-api-key": true, "api-key": true,
+	"x-portkey-api-key": true, "x-goog-api-key": true,
 }
-		if sensitiveHeaders[strings.ToLower(k)] {
+		lk := strings.ToLower(k)
+		if sensitiveHeaders[lk] ||
+			strings.Contains(lk, "authorization") ||
+			strings.Contains(lk, "token") ||
+			strings.Contains(lk, "secret") ||
+			strings.Contains(lk, "key") {
 			val = fmt.Sprintf("redacted(len=%d)", len(val))
 		}
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
var sensitiveHeaders = map[string]bool{
"authorization": true, "x-api-key": true, "api-key": true,
"x-portkey-api-key": true, "x-goog-api-key": true,
}
var sensitiveHeaders = map[string]bool{
"authorization": true, "proxy-authorization": true,
"cookie": true, "set-cookie": true,
"x-api-key": true, "api-key": true,
"x-portkey-api-key": true, "x-goog-api-key": true,
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go` around
lines 97 - 100, The redaction denylist in sensitiveHeaders is too narrow and can
leak credentials through /__log. Update the header filtering used by the mock
logger in main.go to cover common secret-bearing headers like
Proxy-Authorization and Cookie, plus any gateway/provider-specific key headers
you know about, and add a broader name-based check for headers containing key,
token, secret, or authorization. Keep the logic centralized around the existing
sensitiveHeaders map so all capture/log paths automatically redact these values.

Source: Coding guidelines

Comment on lines +27 to +28
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🩺 Stability & Availability | 🟠 Major | ⚡ Quick win

Stop if the script cannot enter its own directory.

Because this script deliberately avoids set -e, a failed cd continues with paths derived from an invalid SCRIPT_DIR, which can create/delete results in the wrong location.

🐛 Proposed fix
 SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
-cd "$SCRIPT_DIR"
+cd "$SCRIPT_DIR" || exit 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR"
SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
cd "$SCRIPT_DIR" || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 28-28: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh` around lines
27 - 28, The run-on-instance script does not stop when changing into its own
directory fails, so subsequent path handling can target the wrong location.
Update the startup logic around SCRIPT_DIR and cd to explicitly detect a failed
directory change and exit immediately with an error before any benchmark paths
are used.

Source: Linters/SAST tools

Comment on lines +207 to +218
record_image() { # gateway image_ref -> results/<gw>_image.json
local gw="$1" ref="$2"
local size digest compressed
size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)"
digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)"
# Compressed size = what you actually pull/store: gzip the saved image (uniform
# across the locally-built gomodel image and the pulled competitor images).
compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d ' ' || echo 0)"
printf '{"gateway":"%s","image":"%s","size_bytes":%s,"size_mb":%.1f,"compressed_bytes":%s,"compressed_mb":%.1f,"digest":"%s"}\n' \
"$gw" "$ref" "${size:-0}" "$(awk "BEGIN{print ${size:-0}/1048576}")" \
"${compressed:-0}" "$(awk "BEGIN{print ${compressed:-0}/1048576}")" "$digest" \
> "$RESULTS_DIR/${gw}_image.json"

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Avoid invalid image JSON when docker save fails.

If an optional competitor image was not pulled, the pipeline can emit both a gzip byte count and 0, making compressed_bytes invalid JSON. Check image existence before measuring.

🐛 Proposed fix
 record_image() {  # gateway image_ref -> results/<gw>_image.json
   local gw="$1" ref="$2"
   local size digest compressed
+  if ! docker image inspect "$ref" >/dev/null 2>&1; then
+    printf '{"gateway":"%s","image":"%s","size_bytes":0,"size_mb":0.0,"compressed_bytes":0,"compressed_mb":0.0,"digest":"unknown"}\n' \
+      "$gw" "$ref" > "$RESULTS_DIR/${gw}_image.json"
+    return 0
+  fi
   size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)"
   digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)"
@@
-  compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d ' ' || echo 0)"
+  compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d '[:space:]')" || compressed=0
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
record_image() { # gateway image_ref -> results/<gw>_image.json
local gw="$1" ref="$2"
local size digest compressed
size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)"
digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)"
# Compressed size = what you actually pull/store: gzip the saved image (uniform
# across the locally-built gomodel image and the pulled competitor images).
compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d ' ' || echo 0)"
printf '{"gateway":"%s","image":"%s","size_bytes":%s,"size_mb":%.1f,"compressed_bytes":%s,"compressed_mb":%.1f,"digest":"%s"}\n' \
"$gw" "$ref" "${size:-0}" "$(awk "BEGIN{print ${size:-0}/1048576}")" \
"${compressed:-0}" "$(awk "BEGIN{print ${compressed:-0}/1048576}")" "$digest" \
> "$RESULTS_DIR/${gw}_image.json"
record_image() { # gateway image_ref -> results/<gw>_image.json
local gw="$1" ref="$2"
local size digest compressed
if ! docker image inspect "$ref" >/dev/null 2>&1; then
printf '{"gateway":"%s","image":"%s","size_bytes":0,"size_mb":0.0,"compressed_bytes":0,"compressed_mb":0.0,"digest":"unknown"}\n' \
"$gw" "$ref" > "$RESULTS_DIR/${gw}_image.json"
return 0
fi
size="$(docker image inspect "$ref" --format '{{.Size}}' 2>/dev/null || echo 0)"
digest="$(docker image inspect "$ref" --format '{{if .RepoDigests}}{{index .RepoDigests 0}}{{else}}{{.Id}}{{end}}' 2>/dev/null || echo unknown)"
# Compressed size = what you actually pull/store: gzip the saved image (uniform
# across the locally-built gomodel image and the pulled competitor images).
compressed="$(docker save "$ref" 2>/dev/null | gzip -c | wc -c | tr -d '[:space:]')" || compressed=0
printf '{"gateway":"%s","image":"%s","size_bytes":%s,"size_mb":%.1f,"compressed_bytes":%s,"compressed_mb":%.1f,"digest":"%s"}\n' \
"$gw" "$ref" "${size:-0}" "$(awk "BEGIN{print ${size:-0}/1048576}")" \
"${compressed:-0}" "$(awk "BEGIN{print ${compressed:-0}/1048576}")" "$digest" \
> "$RESULTS_DIR/${gw}_image.json"
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh` around lines
207 - 218, The record_image() helper can emit invalid JSON when docker save
fails for a missing optional image, because the compressed size pipeline may
produce mixed output. Update record_image() to first verify the image exists
before attempting docker image inspect or docker save, and if it is absent set
compressed_bytes (and related size fields if needed) to 0/unknown consistently
so the generated "$RESULTS_DIR/${gw}_image.json" remains valid JSON.

print("\nLATENCY (ms; rps = completed req/s @ c={})".format(meta.get("concurrency", "?")))
hdr = (f"{'target':9} {'variant':18} {'ok/fail':>11} {'rps':>7} {'p50':>7} "
f"{'p99':>7} {'p99 range':>15} {'ttft':>7} {'ovhd':>7}")
print(hdr); print("-" * len(hdr))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Split semicolon-separated statements flagged by Ruff.

These E702 violations are easy to fix and can fail lint-gated checks.

🧹 Proposed cleanup
-    print(hdr); print("-" * len(hdr))
+    print(hdr)
+    print("-" * len(hdr))
@@
-        print(hdrc); print("-" * len(hdrc))
+        print(hdrc)
+        print("-" * len(hdrc))
@@
-    print(hdr2); print("-" * len(hdr2))
+    print(hdr2)
+    print("-" * len(hdr2))

Also applies to: 184-184, 200-200

🧰 Tools
🪛 Ruff (0.15.18)

[error] 159-159: Multiple statements on one line (semicolon)

(E702)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py` at line 159, The
summarize.py cleanup has Ruff E702 violations from multiple statements on one
line. Split the semicolon-separated prints in the affected spots, including the
header output near the code that builds hdr and the other repeated print lines
around the referenced locations, so each statement is on its own line and passes
linting.

Source: Linters/SAST tools

Comment on lines +128 to +132
def mock_reset():
try:
urllib.request.urlopen(urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5).read()
except Exception: # noqa: BLE001
pass

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Fail fast when the mock reset does not succeed.

This harness’s correctness depends on clearing the recorder before every request. If /__reset fails here, the next __log read can include stale upstream entries from the previous case and silently corrupt captures.json. Raise on reset failure (or retry and then raise) instead of swallowing it.

Suggested fix
 def mock_reset():
-    try:
-        urllib.request.urlopen(urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5).read()
-    except Exception:  # noqa: BLE001
-        pass
+    req = urllib.request.Request(MOCK + "/__reset", data=b"", method="POST")
+    resp = urllib.request.urlopen(req, timeout=5)
+    status = getattr(resp, "status", 200)
+    resp.read()
+    if status != 200:
+        raise RuntimeError(f"mock reset failed with HTTP {status}")
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def mock_reset():
try:
urllib.request.urlopen(urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5).read()
except Exception: # noqa: BLE001
pass
def mock_reset():
req = urllib.request.Request(MOCK + "/__reset", data=b"", method="POST")
resp = urllib.request.urlopen(req, timeout=5)
status = getattr(resp, "status", 200)
resp.read()
if status != 200:
raise RuntimeError(f"mock reset failed with HTTP {status}")
🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 129-129: Request-controlled URL passed to urlopen; validate against an allowlist to prevent SSRF.
Context: urllib.request.urlopen(urllib.request.Request(MOCK + "/__reset", data=b"", method="POST"), timeout=5)
Note: [CWE-918] Server-Side Request Forgery (SSRF).

(urlopen-unsanitized-data)

🪛 Ruff (0.15.18)

[warning] 129-132: Use contextlib.suppress(Exception) instead of try-except-pass

(SIM105)


[error] 130-130: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 130-130: Audit URL open for permitted schemes. Allowing use of file: or custom schemes is often unexpected.

(S310)


[error] 131-132: try-except-pass detected, consider logging the exception

(S110)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/capture.py` around lines
128 - 132, The mock reset helper currently swallows all failures, which can
leave stale recorder state and corrupt later captures. Update mock_reset in
capture.py to fail fast by surfacing the exception from the
urllib.request.urlopen call (or retry a small number of times and then raise)
instead of using a broad except that passes. Keep the fix localized to
mock_reset and ensure the reset request to MOCK + "/__reset" is treated as
required before subsequent __log reads.

Comment on lines +165 to +166
gateways = [g for g in args.gateways.split(",") if g in GATEWAYS]
corpus = json.load(open(args.corpus, encoding="utf-8"))

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Reject unknown --gateways values instead of dropping them silently.

A typo here can produce a partial or fully empty capture run that still exits successfully. Validate the requested names and error if any are unknown or if the filtered list is empty.

🧰 Tools
🪛 ast-grep (0.44.0)

[warning] 165-165: File path is request-/variable-derived; validate and normalize to prevent path traversal.
Context: open(args.corpus, encoding="utf-8")
Note: [CWE-22] Improper Limitation of a Pathname to a Restricted Directory ('Path Traversal').

(open-filename-from-request)

🪛 Ruff (0.15.18)

[warning] 166-166: Use a context manager for opening files

(SIM115)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/capture.py` around lines
165 - 166, The gateway selection in capture.py silently drops invalid values
when building gateways from args.gateways, which can make a typo look like a
successful run. Update the argument handling around the gateways filtering logic
to validate every requested name against GATEWAYS, raise an error if any unknown
names are present, and also fail if the resulting list is empty so the capture
does not proceed with no valid gateways.

Comment on lines +49 to +52
litellm:
profiles: ["all", "litellm"]
image: ${LITELLM_IMAGE:-ghcr.io/berriai/litellm:main-stable}
depends_on: [mock]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟠 Major | ⚡ Quick win

Pin the gateway images to immutable versions.

main-stable and latest make reruns non-reproducible: the same harness can exercise different gateway builds over time while the docs present these results as a reproducible benchmark package. Pin these images to versioned tags or digests before merge.

Also applies to: 60-63, 70-72

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml` around
lines 49 - 52, The gateway service images in the docker-compose setup are still
using mutable tags, which makes benchmark reruns non-reproducible. Update the
image references for the litellm and the other affected services in this compose
file to use immutable versioned tags or digests instead of main-stable/latest.
Keep the existing service definitions and profiles, but pin the image values so
the benchmark package always runs against the same gateway builds.

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 13

♻️ Duplicate comments (1)
docs/2026-06-25_aws_gateway_benchmark/qa/README.md (1)

103-109: 📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add language specifier to fenced code block.

The layout block is still missing a language tag, which triggers markdownlint rule MD040. Use text to keep it plain.

- ```
+ ```text
  run_qa.py        orchestrator + assertion evaluation + CLI
  models.json      logical model roles -> concrete model ids (edit this)
  spec/            declarative cases, one JSON file per endpoint group
  qalib/           stdlib helpers: config, paths, assertions, client, report
  output/          run artifacts (gitignored)

Based on previous review feedback, this was flagged but remains unaddressed.

<details>
<summary>🤖 Prompt for AI Agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In @docs/2026-06-25_aws_gateway_benchmark/qa/README.md around lines 103 - 109,
The fenced layout block in the README is missing a language tag and still
violates markdownlint MD040. Update the code fence around the directory listing
to use the plain-text specifier, and keep the change localized to the README
section containing the run_qa.py, models.json, spec/, qalib/, and output/
listing.


</details>

<!-- cr-comment:v1:ba2b8c834a2a215419aa4dea -->

</blockquote></details>

</blockquote></details>

<details>
<summary>🤖 Prompt for all review comments with AI agents</summary>

Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In @docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json:

  • Around line 52-56: The quality oracle in the chat spec still allows an
    over-permissive color set because contains_any includes green. Update the
    expectation in chat.json so the quality check for the stream only accepts
    the intended primary colors, and remove green unless there is an explicit note
    in the spec explaining why it must remain. Refer to the expect block and its
    quality entry when making the change.
  • Around line 60-81: The chat.openai.stream_usage spec still does not assert
    that a usage chunk appears in the streamed response. Update the expect block
    in chat.json for this case so it explicitly checks the stream for a usage
    event, using the existing stream-quality assertions alongside
    min_events/terminal; keep the change localized to the
    chat.openai.stream_usage entry and ensure it validates
    stream_options.include_usage survives translation.
  • Around line 158-181: The chat.openai.structured_json_schema case only checks
    for non-empty content and a keyword match, so it does not actually verify the
    response_format json_schema contract. Update this spec to assert that
    $.choices[0].message.content is valid JSON and conforms to the expected schema
    with country and capital present, using the chat.openai.structured_json_schema
    test entry and the existing response/quality harness operators (or a json_schema
    assertion if supported) so plain-text responses can’t pass.

In @docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json:

  • Around line 22-38: The conversations.get expectation only checks that the
    returned id is present, so it can pass on the wrong object. Update the expect
    block in the conversations.get entry to assert that the response $.id equals
    ${conversation_id}, while keeping the existing $.object check, so the test
    verifies path interpolation and identity using the conversations.get
    request/expect fields.
  • Around line 75-91: The conversations.delete expectation is missing an
    identity check for the deleted resource. Update the expect.response block in
    conversations.delete to also assert that $.id equals ${conversation_id} so
    the response is verified against the correct conversation, alongside the
    existing $.object and $.deleted checks.
  • Around line 40-58: The conversations.use_in_responses spec only checks
    $.status, so it does not verify that conversation.id was actually honored by
    /v1/responses. Update this scenario in conversations.json to add a stronger
    assertion in the expect block, such as validating the captured conversation id
    in the response or adding a follow-up turn that confirms the stored value is
    remembered. Use the existing request.conversation.id and expect structure to
    locate the change.
  • Around line 60-73: The conversations.update scenario is missing an identity
    assertion, so it only checks the object type and not the specific conversation
    returned. Update the expectations in conversations.update to also verify that
    the response id matches ${conversation_id}, alongside the existing $.object
    check, so the spec confirms the updated conversation is the one requested.

In @docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json:

  • Around line 22-39: The batch embeddings spec in embeddings.openai.batch only
    checks that at least three results are returned, so it does not verify the order
    implied by the note. Update the expect.response assertions in the embeddings
    test case to check each item’s index value matches its input position, using the
    $.data entries in the batch response, while keeping the existing embedding
    length assertion for a representative item.

In @docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json:

  • Around line 43-66: The OpenAI unknown-field case in errors.json is too brittle
    because it hard-codes a 400-only outcome in the expectations. Update the
    errors.openai_unknown_field_forwarded test to allow both accepted gateway
    sanitization and provider rejection by using a status set that includes 200 and
    400, while keeping the audit assertion on $.data.request_body.x_qa_marker and
    the response checks in expect.response so the stable invariants still verify
    behavior.

In @docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh:

  • Around line 172-176: The sustained-load sampler loop in the docker stats
    polling block is running without any delay, which can create measurement noise.
    Add a small fixed sleep inside the while loop in run-on-instance.sh after the
    docker stats/awk pipeline so the sampling rate is throttled, and keep the change
    scoped to the existing container-name polling logic using cname, csv, and
    STAT_AWK.

In @docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf:

  • Around line 25-28: The benchmark AMI selection is not actually reproducible
    because the ami_id variable defaults to empty and the main Terraform logic
    resolves that to the latest Amazon Linux SSM image. Update the ami_id variable
    and the AMI resolution path in main.tf so the benchmark uses a pinned AMI ID, or
    make ami_id required with no default; keep the existing variable and AMI lookup
    symbols consistent so callers must explicitly choose the image.

In @docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py:

  • Around line 61-64: The _cell helper in analyze.py should normalize AI-authored
    values before rendering the Markdown table, since it currently treats plain
    strings like iterables and leaves raw separators/newlines unescaped. Update
    _cell() to accept both list-like inputs and scalar strings by wrapping
    non-sequences as a single value, then escape Markdown-sensitive characters such
    as pipes and line breaks before joining/truncating. Keep the fix localized to
    _cell() so the table-building code continues to call it the same way.

In @docs/2026-06-25_aws_gateway_benchmark/translation/README.md:

  • Around line 35-41: The fenced block in the README summary is missing a
    language tag, which triggers markdownlint. Update the fence around the
    docker-compose/corpus/capture/analyze/output listing to use an explicit language
    or remove the fence if it is meant to be plain text, keeping the surrounding
    content unchanged.

Duplicate comments:
In @docs/2026-06-25_aws_gateway_benchmark/qa/README.md:

  • Around line 103-109: The fenced layout block in the README is missing a
    language tag and still violates markdownlint MD040. Update the code fence around
    the directory listing to use the plain-text specifier, and keep the change
    localized to the README section containing the run_qa.py, models.json, spec/,
    qalib/, and output/ listing.

</details>

<details>
<summary>🪄 Autofix (Beta)</summary>

Fix all unresolved CodeRabbit comments on this PR:

- [ ] <!-- {"checkboxId": "4b0d0e0a-96d7-4f10-b296-3a18ea78f0b9"} --> Push a commit to this branch (recommended)
- [ ] <!-- {"checkboxId": "ff5b1114-7d8c-49e6-8ac1-43f82af23a33"} --> Create a new PR with the fixes

</details>

---

<details>
<summary>ℹ️ Review info</summary>

<details>
<summary>⚙️ Run configuration</summary>

**Configuration used**: Organization UI

**Review profile**: ASSERTIVE

**Plan**: Pro

**Run ID**: `36f79096-f8f1-47db-87d4-19033dc0f9b0`

</details>

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between dc367128416a0778fb946e62b4b9d8383173647b and f8833e5588b1a4b34c0b97c041939107141a10e7.

</details>

<details>
<summary>⛔ Files ignored due to path filters (5)</summary>

* `docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-image.svg` is excluded by `!**/*.svg`
* `docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-latency-p99.svg` is excluded by `!**/*.svg`
* `docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-memory.svg` is excluded by `!**/*.svg`
* `docs/2026-06-25_aws_gateway_benchmark/charts/june-2026-throughput.svg` is excluded by `!**/*.svg`
* `docs/2026-06-25_aws_gateway_benchmark/cover.png` is excluded by `!**/*.png`

</details>

<details>
<summary>📒 Files selected for processing (42)</summary>

* `docs/2026-06-25_aws_gateway_benchmark/.gitignore`
* `docs/2026-06-25_aws_gateway_benchmark/ARTICLE.md`
* `docs/2026-06-25_aws_gateway_benchmark/README.md`
* `docs/2026-06-25_aws_gateway_benchmark/RESULTS.md`
* `docs/2026-06-25_aws_gateway_benchmark/qa/.gitignore`
* `docs/2026-06-25_aws_gateway_benchmark/qa/README.md`
* `docs/2026-06-25_aws_gateway_benchmark/qa/models.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/__init__.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/assertions.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/client.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/config.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/paths.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/qalib/report.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/run_qa.py`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/audio.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/embeddings.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/messages.json`
* `docs/2026-06-25_aws_gateway_benchmark/qa/spec/responses.json`
* `docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile`
* `docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod`
* `docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go`
* `docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go`
* `docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json`
* `docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml`
* `docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml`
* `docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh`
* `docs/2026-06-25_aws_gateway_benchmark/run.sh`
* `docs/2026-06-25_aws_gateway_benchmark/scripts/make_cover.py`
* `docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py`
* `docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf`
* `docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf`
* `docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf`
* `docs/2026-06-25_aws_gateway_benchmark/translation/.gitignore`
* `docs/2026-06-25_aws_gateway_benchmark/translation/README.md`
* `docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py`
* `docs/2026-06-25_aws_gateway_benchmark/translation/capture.py`
* `docs/2026-06-25_aws_gateway_benchmark/translation/corpus.json`
* `docs/2026-06-25_aws_gateway_benchmark/translation/docker-compose.yml`
* `docs/about/benchmarks.mdx`

</details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

Comment on lines +52 to +56
"expect": {
"status": 200,
"stream": {"min_events": 2, "terminal": "[DONE]", "text": [{"not_empty": true}]},
"quality": [{"target": "stream", "contains_any": ["red", "blue", "yellow", "green"]}]
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Primary colors oracle still includes green.

The contains_any list at line 55 still accepts green, which the previous review flagged as incorrect for this prompt. If this was intentionally retained to cover RGB interpretations, add a comment in the notes; otherwise the oracle is over-permissive.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 52 -
56, The quality oracle in the chat spec still allows an over-permissive color
set because `contains_any` includes `green`. Update the expectation in
`chat.json` so the `quality` check for the stream only accepts the intended
primary colors, and remove `green` unless there is an explicit note in the spec
explaining why it must remain. Refer to the `expect` block and its `quality`
entry when making the change.

Comment on lines +60 to +81
"id": "chat.openai.stream_usage",
"title": "OpenAI chat: stream_options include_usage emits a usage chunk",
"provider": "openai",
"modality": ["text", "streaming"],
"request": {
"path": "/v1/chat/completions",
"stream": true,
"body": {
"model": "@openai.chat",
"messages": [{"role": "user", "content": "Say hi."}],
"stream": true,
"stream_options": {"include_usage": true},
"max_tokens": 32
}
},
"expect": {
"status": 200,
"stream": {"min_events": 2, "terminal": "[DONE]"},
"quality": [{"target": "stream", "not_empty": true}]
},
"notes": "stream_options must survive translation; usage chunk is provider-shaped."
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Stream usage case still does not verify usage chunks are emitted.

Despite being marked "Addressed in commit dc36712", the chat.openai.stream_usage case still only asserts min_events, terminal, and not_empty. It does not specifically check for a usage chunk in the stream. A gateway that drops stream_options.include_usage entirely will pass this test.

🔧 Proposed fix

Add a stream quality assertion that requires a usage event:

"quality": [
  {"target": "stream", "contains_any": ["usage"]},
  {"target": "stream", "not_empty": true}
]

Or add a response path check if the usage is aggregated post-stream.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json` around lines 60 -
81, The `chat.openai.stream_usage` spec still does not assert that a usage chunk
appears in the streamed response. Update the `expect` block in `chat.json` for
this case so it explicitly checks the stream for a `usage` event, using the
existing stream-quality assertions alongside `min_events`/`terminal`; keep the
change localized to the `chat.openai.stream_usage` entry and ensure it validates
`stream_options.include_usage` survives translation.

Comment thread docs/2026-06-25_aws_gateway_benchmark/qa/spec/chat.json Outdated
Comment on lines +22 to +38
"id": "conversations.get",
"title": "Conversations: retrieve by id",
"provider": "openai",
"modality": ["stateful"],
"request": {
"method": "GET",
"path": "/v1/conversations/${conversation_id}"
},
"expect": {
"status": 200,
"response": [
{"path": "$.object", "equals": "conversation"},
{"path": "$.id", "not_empty": true}
]
},
"notes": "Reads back the conversation created above (uses ${conversation_id})."
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

conversations.get does not verify object identity.

The test only asserts $.id is non-empty, not that it equals the captured ${conversation_id}. A gateway returning any conversation-shaped payload passes. Add {"path": "$.id", "equals": "${conversation_id}"} to verify path interpolation and object identity.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json` around
lines 22 - 38, The conversations.get expectation only checks that the returned
id is present, so it can pass on the wrong object. Update the expect block in
the conversations.get entry to assert that the response $.id equals
${conversation_id}, while keeping the existing $.object check, so the test
verifies path interpolation and identity using the conversations.get
request/expect fields.

Comment on lines +40 to +58
"id": "conversations.use_in_responses",
"title": "Conversations: link a Responses call to a conversation",
"provider": "openai",
"modality": ["stateful", "text"],
"request": {
"path": "/v1/responses",
"body": {
"model": "@openai.chat",
"conversation": {"id": "${conversation_id}"},
"input": "Remember the number 7.",
"max_output_tokens": 32
}
},
"expect": {
"status": 200,
"response": [{"path": "$.status", "equals": "completed"}]
},
"notes": "Responses request bound to a conversation id (stateful threading)."
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

conversations.use_in_responses never proves response is attached to the conversation.

Only $.status == "completed" is checked, so a stateless /v1/responses call that ignores conversation.id entirely will pass. Add an audit check for the captured conversation id or a follow-up turn that recalls the stored value.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/conversations.json` around
lines 40 - 58, The `conversations.use_in_responses` spec only checks `$.status`,
so it does not verify that `conversation.id` was actually honored by
`/v1/responses`. Update this scenario in `conversations.json` to add a stronger
assertion in the `expect` block, such as validating the captured conversation id
in the response or adding a follow-up turn that confirms the stored value is
remembered. Use the existing `request.conversation.id` and `expect` structure to
locate the change.

Comment on lines +43 to +66
"id": "errors.openai_unknown_field_forwarded",
"title": "Behavior: unknown top-level fields are forwarded verbatim (provider rejects)",
"provider": "openai",
"modality": ["errors", "preservation"],
"request": {
"path": "/v1/chat/completions",
"body": {
"model": "@openai.chat",
"messages": [{"role": "user", "content": "hi"}],
"x_qa_marker": "keep-123",
"max_tokens": 8
}
},
"expect": {
"status": [400],
"response": [
{"path": "$.error.message", "not_empty": true},
{"path": "$.error.message", "contains": "x_qa_marker", "hard": false},
{"path": "$.error.type", "equals": "invalid_request_error", "hard": false}
],
"audit": [{"path": "$.data.request_body.x_qa_marker", "equals": "keep-123", "hard": false}]
},
"notes": "Documented finding (2026-06): GoModel does not strip unrecognized top-level fields; it forwards them, so a strict provider (OpenAI) returns 400 'Unrecognized request argument'. The audit confirms the field was captured inbound and passed through. If the gateway later sanitizes unknown fields, change expect.status to 200."
},

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Unknown field test still bakes current buggy behavior into the pass condition.

Requiring status: [400] means the suite will fail if the gateway later sanitizes or accepts unknown fields — even though the note explicitly says that outcome is acceptable. The note even instructs future maintainers to "change expect.status to 200" if behavior changes. This is a brittle test that encodes a bug as the pass condition.

Prefer asserting stable invariants: no 5xx, no crash, and audit captures x_qa_marker, while allowing either 400 (provider rejects) or 200 (gateway sanitizes). Use status: [200, 400] and keep the audit check.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/qa/spec/errors.json` around lines 43 -
66, The OpenAI unknown-field case in errors.json is too brittle because it
hard-codes a 400-only outcome in the expectations. Update the
`errors.openai_unknown_field_forwarded` test to allow both accepted gateway
sanitization and provider rejection by using a status set that includes 200 and
400, while keeping the audit assertion on `$.data.request_body.x_qa_marker` and
the response checks in `expect.response` so the stable invariants still verify
behavior.

Comment on lines +172 to +176
(
while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do
docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \
| awk "$STAT_AWK" >> "$csv" || true
done

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🚀 Performance & Scalability | 🟠 Major | ⚡ Quick win

Throttle the resource sampler loop.

This loop runs docker stats --no-stream as fast as possible during the sustained-load window, which can add host-side CPU/I/O noise and distort the resource/throughput measurements. Add a small fixed interval between samples.

Proposed fix
     while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do
       docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \
         | awk "$STAT_AWK" >> "$csv" || true
+      sleep 1
     done
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
(
while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do
docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \
| awk "$STAT_AWK" >> "$csv" || true
done
(
while docker ps --format '{{.Names}}' | grep -q "^${cname}$"; do
docker stats --no-stream --format '{{.MemUsage}};{{.CPUPerc}}' "$cname" 2>/dev/null \
| awk "$STAT_AWK" >> "$csv" || true
sleep 1
done
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh` around lines
172 - 176, The sustained-load sampler loop in the docker stats polling block is
running without any delay, which can create measurement noise. Add a small fixed
sleep inside the while loop in run-on-instance.sh after the docker stats/awk
pipeline so the sampling rate is throttled, and keep the change scoped to the
existing container-name polling logic using cname, csv, and STAT_AWK.

Comment on lines +25 to +28
variable "ami_id" {
description = "Override the AMI. Empty = latest Amazon Linux 2023 x86_64 via SSM (reproducible by policy, not by digest)."
type = string
default = ""

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟠 Major | ⚡ Quick win

Don't default a “reproducible” benchmark to the latest AMI.

docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf:19-29 resolves an empty ami_id through the Amazon Linux “latest” SSM parameter, so reruns can pick up a different OS image and Docker stack without any code change. Pin the tested AMI ID, or require callers to pass one explicitly.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf` around lines 25
- 28, The benchmark AMI selection is not actually reproducible because the
ami_id variable defaults to empty and the main Terraform logic resolves that to
the latest Amazon Linux SSM image. Update the ami_id variable and the AMI
resolution path in main.tf so the benchmark uses a pinned AMI ID, or make ami_id
required with no default; keep the existing variable and AMI lookup symbols
consistent so callers must explicitly choose the image.

Comment on lines +61 to +64
def _cell(items):
if not items:
return "—"
return "; ".join(str(x) for x in items)[:120]

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Normalize and escape AI-authored cell values before building the Markdown table.

_cell() assumes list-like input. If an analyst writes a plain string for one of these fields, the report renders it character-by-character, and raw | or newlines will break the table layout.

Proposed fix
 def _cell(items):
-    if not items:
+    if items in (None, "", []):
         return "—"
-    return "; ".join(str(x) for x in items)[:120]
+    if isinstance(items, (list, tuple)):
+        text = "; ".join(str(x) for x in items)
+    else:
+        text = str(items)
+    return text.replace("|", "\\|").replace("\n", "<br>")[:120]
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
def _cell(items):
if not items:
return "—"
return "; ".join(str(x) for x in items)[:120]
def _cell(items):
if items in (None, "", []):
return "—"
if isinstance(items, (list, tuple)):
text = "; ".join(str(x) for x in items)
else:
text = str(items)
return text.replace("|", "\\|").replace("\n", "<br>")[:120]
🧰 Tools
🪛 Ruff (0.15.18)

[warning] 61-61: Missing return type annotation for private function _cell

(ANN202)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/analyze.py` around lines 61
- 64, The _cell helper in analyze.py should normalize AI-authored values before
rendering the Markdown table, since it currently treats plain strings like
iterables and leaves raw separators/newlines unescaped. Update _cell() to accept
both list-like inputs and scalar strings by wrapping non-sequences as a single
value, then escape Markdown-sensitive characters such as pipes and line breaks
before joining/truncating. Keep the fix localized to _cell() so the
table-building code continues to call it the same way.

Comment on lines +35 to +41
```
docker-compose.yml mock (MOCK_RECORD=1) + all 4 gateways, reusing ../remote configs
corpus.json 12 gateway-agnostic cases across chat/responses/messages, stream + not
capture.py resets the mock, sends each case through each gateway, records 4 artifacts
analyze.py builds per-case AI-analysis prompts from the captures (one bundle per case)
output/ captures.json + the AI comparison report (gitignored)
```

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🟡 Minor | ⚡ Quick win

Add a language to this fenced block.

The unlabeled fence trips markdownlint and will keep docs lint noisy for a purely formatting issue.

🧰 Tools
🪛 markdownlint-cli2 (0.22.1)

[warning] 35-35: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/translation/README.md` around lines 35
- 41, The fenced block in the README summary is missing a language tag, which
triggers markdownlint. Update the fence around the
docker-compose/corpus/capture/analyze/output listing to use an explicit language
or remove the fence if it is meant to be plain text, keeping the surrounding
content unchanged.

Source: Linters/SAST tools

Adds docs/2026-06-25_aws_gateway_benchmark/: a reproducible AWS benchmark
comparing GoModel against LiteLLM, Portkey, and Bifrost on latency, throughput,
memory, image size, and cold start, using a shared recording-mock backend so the
numbers reflect gateway overhead. Includes RESULTS.md and the one-command
Terraform/Docker harness (run.sh, remote/, terraform/, scripts/summarize.py).

Refreshes docs/about/benchmarks.mdx to this June 2026 run (with a paid-AWS note).

The benchmark write-up (ARTICLE.md, cover, charts) and the co-located QA and
translation tooling are in a separate draft PR.

Terraform state, provider binaries, the generated SSH key, and raw run output are
gitignored.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 3

♻️ Duplicate comments (5)
docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile (2)

12-12: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Pin the runtime base image by digest.

The gcr.io/distroless/static-debian12:nonroot runtime image is also still unpinned. Pin to an immutable digest for reproducible benchmark builds.

-FROM gcr.io/distroless/static-debian12:nonroot
+FROM gcr.io/distroless/static-debian12:nonroot@sha256:<digest>
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile` at line
12, The runtime base image in the Dockerfile is still tag-based and unpinned,
which makes benchmark builds non-reproducible. Update the `FROM
gcr.io/distroless/static-debian12:nonroot` line to use the immutable digest form
for that exact image so the `Dockerfile` always builds from the same runtime
base.

4-4: 🗄️ Data Integrity & Integration | 🟠 Major | ⚡ Quick win

Pin the build-stage base image by digest.

The previously-flagged digest pinning was not applied. golang:1.26-alpine is still a floating tag that can shift under rebuilds, undermining benchmark reproducibility.

-FROM golang:1.26-alpine AS build
+FROM golang:1.26-alpine@sha256:<digest> AS build
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile` at line
4, The build-stage base image is still using the floating golang:1.26-alpine
tag, which can change between rebuilds. Update the FROM line in the Dockerfile
to use a digest-pinned golang image so the build stage remains reproducible;
keep the same build-stage alias (AS build) and replace the tag reference with
the resolved digest.
docs/about/benchmarks.mdx (1)

82-85: 🎯 Functional Correctness | 🟡 Minor | ⚡ Quick win

Fix directory casing in the reproduce snippet.

The previously-flagged casing mismatch persists. git clone creates a GoModel/ directory (matching the repo name), but cd gomodel/... uses lowercase and will fail on case-sensitive filesystems.

 git clone https://github.com/ENTERPILOT/GoModel.git
-cd gomodel/docs/2026-06-25_aws_gateway_benchmark
+cd GoModel/docs/2026-06-25_aws_gateway_benchmark
 ./run.sh
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/about/benchmarks.mdx` around lines 82 - 85, The reproduce snippet still
has a directory casing mismatch: the clone step creates a GoModel directory, but
the subsequent cd uses lowercase gomodel and will fail on case-sensitive
filesystems. Update the path in the benchmark instructions so the directory name
matches the repository name exactly, keeping the fix within the benchmark repro
snippet in benchmarks.mdx.
docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf (2)

122-148: 🔒 Security & Privacy | 🟠 Major | ⚡ Quick win

Harden the instance: enforce IMDSv2 and encrypt the root volume.

None of the previously-flagged posture improvements have been applied. The aws_instance.bench block still has:

  • IMDSv1 enabled (no metadata_options), allowing token-less metadata access.
  • Unencrypted gp3 root volume.

Even for an ephemeral benchmark host these are one-line fixes.

🛡️ Proposed hardening
   root_block_device {
     volume_type = "gp3"
     volume_size = var.root_volume_gb
+    encrypted   = true
   }
+
+  metadata_options {
+    http_tokens   = "required"
+    http_endpoint = "enabled"
+  }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf` around lines 122 -
148, The aws_instance.bench resource still lacks the requested hardening: add
metadata_options to require IMDSv2 and disable IMDSv1, and enable encryption on
the root_block_device for the gp3 volume. Update the aws_instance "bench" block
in main.tf so the instance enforces token-based metadata access and the root
volume is encrypted by default.

111-115: 🔒 Security & Privacy | 🟡 Minor | ⚡ Quick win

Verify the Docker Compose plugin download with a checksum.

The curl still fetches the compose plugin binary without checksum or signature verification. For reproducibility and supply-chain integrity, verify the published SHA256 after download.

🔏 Proposed verification step
     curl -fsSL -o /usr/libexec/docker/cli-plugins/docker-compose \
       "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64"
+    curl -fsSL -o /tmp/docker-compose.sha256 \
+      "https://github.com/docker/compose/releases/download/${var.compose_plugin_version}/docker-compose-linux-x86_64.sha256"
+    (cd /tmp && sha256sum -c docker-compose.sha256)
     chmod +x /usr/libexec/docker/cli-plugins/docker-compose
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf` around lines 111 -
115, The Docker Compose plugin download in the Terraform bootstrap script is not
being verified, so add SHA256 verification for the file fetched by the curl step
before chmod. Update the provisioning logic around the docker-compose download
to compute or check the expected checksum for the pinned compose_plugin_version,
and fail the script if the binary does not match. Use the existing
docker-compose install block as the place to add the checksum verification so
the download remains reproducible and integrity-checked.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Inline comments:
In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh`:
- Around line 83-89: The gw_model helper is not dialect-aware, so Bifrost
Messages requests are still using the openai/ prefix and mismatching the
/anthropic/v1/messages route. Update gw_model to accept the dialect alongside
the gateway and return an anthropic/<model-id> value for the bifrost + messages
case, while preserving the existing OpenAI/default behavior for other paths.
Make sure every gw_model call site passes the dialect argument consistently so
the routing logic in gw_path and gw_model stays aligned.

In `@docs/2026-06-25_aws_gateway_benchmark/RESULTS.md`:
- Line 11: The benchmark RESULTS.md has inconsistent hyphenation for the warmup
term, so standardize the wording to one style everywhere. Update the text around
the benchmark summary and any other occurrences in RESULTS.md to use the same
form consistently, and keep the chosen style aligned with the existing
terminology in the document.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf`:
- Around line 21-23: The instance_id output is missing a description, unlike the
other outputs. Update the output block for instance_id in the outputs definition
to include a clear description that explains what aws_instance.bench.id
represents, keeping it consistent with the other output declarations.

---

Duplicate comments:
In `@docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile`:
- Line 12: The runtime base image in the Dockerfile is still tag-based and
unpinned, which makes benchmark builds non-reproducible. Update the `FROM
gcr.io/distroless/static-debian12:nonroot` line to use the immutable digest form
for that exact image so the `Dockerfile` always builds from the same runtime
base.
- Line 4: The build-stage base image is still using the floating
golang:1.26-alpine tag, which can change between rebuilds. Update the FROM line
in the Dockerfile to use a digest-pinned golang image so the build stage remains
reproducible; keep the same build-stage alias (AS build) and replace the tag
reference with the resolved digest.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf`:
- Around line 122-148: The aws_instance.bench resource still lacks the requested
hardening: add metadata_options to require IMDSv2 and disable IMDSv1, and enable
encryption on the root_block_device for the gp3 volume. Update the aws_instance
"bench" block in main.tf so the instance enforces token-based metadata access
and the root volume is encrypted by default.
- Around line 111-115: The Docker Compose plugin download in the Terraform
bootstrap script is not being verified, so add SHA256 verification for the file
fetched by the curl step before chmod. Update the provisioning logic around the
docker-compose download to compute or check the expected checksum for the pinned
compose_plugin_version, and fail the script if the binary does not match. Use
the existing docker-compose install block as the place to add the checksum
verification so the download remains reproducible and integrity-checked.

In `@docs/about/benchmarks.mdx`:
- Around line 82-85: The reproduce snippet still has a directory casing
mismatch: the clone step creates a GoModel directory, but the subsequent cd uses
lowercase gomodel and will fail on case-sensitive filesystems. Update the path
in the benchmark instructions so the directory name matches the repository name
exactly, keeping the fix within the benchmark repro snippet in benchmarks.mdx.
🪄 Autofix (Beta)

Fix all unresolved CodeRabbit comments on this PR:

  • Push a commit to this branch (recommended)
  • Create a new PR with the fixes

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: ASSERTIVE

Plan: Pro

Run ID: c78ed3ee-2b58-40fe-82a5-08dd172f50ad

📥 Commits

Reviewing files that changed from the base of the PR and between f8833e5 and 97afaad.

📒 Files selected for processing (17)
  • docs/2026-06-25_aws_gateway_benchmark/.gitignore
  • docs/2026-06-25_aws_gateway_benchmark/README.md
  • docs/2026-06-25_aws_gateway_benchmark/RESULTS.md
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/Dockerfile
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/go.mod
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/loadgen/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/bench-tools/mock/main.go
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/bifrost-config.json
  • docs/2026-06-25_aws_gateway_benchmark/remote/configs/litellm-config.yaml
  • docs/2026-06-25_aws_gateway_benchmark/remote/docker-compose.yml
  • docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh
  • docs/2026-06-25_aws_gateway_benchmark/run.sh
  • docs/2026-06-25_aws_gateway_benchmark/scripts/summarize.py
  • docs/2026-06-25_aws_gateway_benchmark/terraform/main.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf
  • docs/2026-06-25_aws_gateway_benchmark/terraform/variables.tf
  • docs/about/benchmarks.mdx

Comment on lines +83 to +89
# Model name per gateway. Bifrost routes by an explicit "provider/model" prefix.
gw_model() { case "$1" in bifrost) echo "openai/gpt-4o-mini";; *) echo "gpt-4o-mini";; esac; }

# Path per (gateway, dialect). Bifrost exposes Anthropic Messages under /anthropic/v1/messages.
gw_path() { # target dialect default_path
if [[ "$1" == "bifrost" && "$2" == "messages" ]]; then echo "/anthropic/v1/messages"; else echo "$3"; fi
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🗄️ Data Integrity & Integration | 🟠 Major

🧩 Analysis chain

🌐 Web query:

Bifrost gateway Anthropic /anthropic/v1/messages model provider prefix documentation

💡 Result:

In the Bifrost gateway, the provider prefix is used to explicitly target a specific model provider, ensuring deterministic routing [1][2]. The standard format for specifying a model is provider/model [3][4]. For the Anthropic provider, the prefix is anthropic/ [1]. When using Bifrost's unified inference API or configuring routing rules, models are referenced using this prefix—for example, anthropic/claude-3-5-sonnet [5][6]. Key details regarding the Anthropic provider prefix in Bifrost include: Routing and Determinism: Using the anthropic/ prefix ensures the gateway directs the request to the Anthropic provider [1]. If you do not use a provider prefix (e.g., passing only "model": "claude-3-5-sonnet"), Bifrost uses its Model Catalog to automatically resolve a provider [7][2]. However, for deterministic behavior in production, explicit provider/model formatting is recommended [2]. Configuration: When defining allowed models in your provider configurations (e.g., within governance or routing policies), you can use the anthropic/ prefix to specify exactly which models are permitted through that provider [8][7][9]. Bifrost will strip the prefix internally to match the model against the requested name [8][9]. Compatibility APIs: Bifrost provides an Anthropic-compatible endpoint at /anthropic/v1/messages [3][6]. While this endpoint is designed to be a drop-in replacement for the native Anthropic API (allowing you to use the standard Anthropic SDK) [6][10], you can still utilize Bifrost's routing flexibility. By sending requests to this endpoint, you can still specify models with the anthropic/ prefix or even other prefixes if the gateway is configured to allow cross-provider routing within that compatible format [6]. In summary, anthropic/ is the canonical prefix used throughout Bifrost documentation and routing logic to distinguish Anthropic models from those of other providers [1][5].

Citations:


Make gw_model dialect-aware for Bifrost Anthropic routing.

The gw_model function currently returns openai/gpt-4o-mini for all Bifrost requests, even when gw_path redirects to /anthropic/v1/messages. Bifrost routing requires the anthropic/<model-id> prefix (e.g., anthropic/claude-3-5-sonnet) for the Anthropic Messages endpoint to correctly route to the Anthropic provider. Using the openai/ prefix on this path results in a provider mismatch, causing the benchmark to fail or route to the wrong provider.

Update gw_model to return anthropic/claude-3-5-sonnet (or the configured Anthropic model) when the gateway is bifrost and the dialect is messages.

Recommended changes
# Model name per gateway. Bifrost routes by an explicit "provider/model" prefix.
gw_model() {
  local gateway="$1" dialect="$2"
  if [[ "$gateway" == "bifrost" && "$dialect" == "messages" ]]; then
    echo "anthropic/claude-3-5-sonnet" # Matches Bifrost Anthropic path requirement
  elif [[ "$gateway" == "bifrost" ]]; then
    echo "openai/gpt-4o-mini"
  else
    echo "gpt-4o-mini"
  fi
}

Ensure the dialect argument is passed to gw_model wherever it is invoked.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/remote/run-on-instance.sh` around lines
83 - 89, The gw_model helper is not dialect-aware, so Bifrost Messages requests
are still using the openai/ prefix and mismatching the /anthropic/v1/messages
route. Update gw_model to accept the dialect alongside the gateway and return an
anthropic/<model-id> value for the bifrost + messages case, while preserving the
existing OpenAI/default behavior for other paths. Make sure every gw_model call
site passes the dialect argument consistently so the routing logic in gw_path
and gw_model stays aligned.

drift, so the tail is stable), Amazon Linux 2023, us-east-1.
- **Load**: N=8000 requests/variant, concurrency 10, **2 randomized-order trials**
(latency = median across trials; p99 shown with its min–max), 200-request
process warmup + 50-request per-variant warmup, per-variant wall cap 10 s, 8 s

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Inconsistent hyphenation: "warmup" vs "warm-up".

Line 11 uses "warmup" (no hyphen) while line 124 uses "Warm-up" (hyphen). Pick one style and apply consistently throughout.

🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/RESULTS.md` at line 11, The benchmark
RESULTS.md has inconsistent hyphenation for the warmup term, so standardize the
wording to one style everywhere. Update the text around the benchmark summary
and any other occurrences in RESULTS.md to use the same form consistently, and
keep the chosen style aligned with the existing terminology in the document.

Comment on lines +21 to +23
output "instance_id" {
value = aws_instance.bench.id
}

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

📐 Maintainability & Code Quality | 🔵 Trivial | 💤 Low value

Add a description to the instance_id output.

For consistency with the other outputs, add a description explaining what this value represents.

 output "instance_id" {
+  description = "EC2 instance ID of the benchmark host."
   value = aws_instance.bench.id
 }
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
output "instance_id" {
value = aws_instance.bench.id
}
output "instance_id" {
description = "EC2 instance ID of the benchmark host."
value = aws_instance.bench.id
}
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@docs/2026-06-25_aws_gateway_benchmark/terraform/outputs.tf` around lines 21 -
23, The instance_id output is missing a description, unlike the other outputs.
Update the output block for instance_id in the outputs definition to include a
clear description that explains what aws_instance.bench.id represents, keeping
it consistent with the other output declarations.

@SantiagoDePolonia SantiagoDePolonia merged commit 2eb4def into main Jun 26, 2026
21 checks passed
@SantiagoDePolonia

Copy link
Copy Markdown
Contributor Author

Review triage — fixed the harness/test issues that don't affect the measurements; flagging the ones that do.

Addressed in 857857f2 here (and the companion QA + translation fixes in #430): the security, robustness, lint, docs, and reproducibility items — explicit ssh_ingress_cidr + CIDR validation (no 0.0.0.0/0 default), IMDSv2 + encrypted root volume, Docker Compose plugin sha256 verification, base-image digest pins, tracked Terraform lockfile, quoted rsync -e, loadgen flag validation, expanded mock redaction + a recording-mode data-race fix, run-on-instance.sh guards, summarize.py lint, the inaccurate "free-tier" wording, and the benchmarks page now links the published write-up.

The points below touch the measurement itself — fixing any of them would change the published numbers and/or require re-running the benchmark, so I left them as-is for an explicit decision rather than silently altering results:

  1. Loadgen leaves the SSE body undrained on early returnremote/bench-tools/loadgen/main.go (~303). Returning at the terminal marker / idle gap without draining + closing the body blocks HTTP keep-alive reuse, so streaming requests can pay per-request connection setup. → affects streaming latency/throughput.

  2. Resource sampler loop is unthrottledremote/run-on-instance.sh (~176). The docker stats --no-stream loop has no delay; sampler CPU competes with the gateway under load. → affects CPU% and the req/s-per-CPU efficiency figures.

  3. gw_model isn't dialect-aware for Bifrost's Anthropic routeremote/run-on-instance.sh (~89). Bifrost always gets openai/gpt-4o-mini, even on the /anthropic/v1/messages workload. → could change Bifrost's measured behavior/coverage on the messages dialect.

  4. summarize.py discovers targets from runs[0] onlyscripts/summarize.py (~137). A target/workload present only in a later trial would be dropped or aggregated inconsistently. → affects which rows appear and the median/spread aggregation.

  5. Peer gateway images aren't pinnedremote/docker-compose.yml (~88). LiteLLM/Portkey/Bifrost are pulled by floating tags, so the exact versions measured aren't locked. (The translation harness in docs(benchmark): write-up, charts + QA/translation tooling (split from #429) #430 was pinned to the digests from the last run; this benchmark compose still floats.) → reproducibility of which software was compared.

  6. Terraform resolves the latest AMI at apply timeterraform/variables.tf (~28). The OS image can drift run-to-run. → environment determinism / run-to-run comparability.

Happy to change any of these — but each implies the headline numbers should be re-measured on a fresh run, so I kept them out of the "no-impact" cleanup.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants