Benchmarking - DO NOT MERGE by prakhargarg105 · Pull Request #4487 · redpanda-data/connect

prakhargarg105 · 2026-06-04T16:07:20Z

No description provided.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Flags contiguous >=60s spans where MB/sec drops below 0.8x reference. These bubble up as callouts in the rendered markdown so reviewers know when a run isn't safe to publish externally. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
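
The rule above can be sketched in Go as follows. This is a minimal illustration, not the PR's code: the Sample and Span types, the findAnomalies name, and the parameter order are all assumptions made for the example. Only the 0.8x threshold and the 60s minimum span come from the commit message.

```go
package main

import "fmt"

// Sample is one rolling-stats observation (hypothetical shape).
type Sample struct {
	TSec     float64 // seconds since the start of the measured window
	MBPerSec float64
}

// Span is a contiguous run of samples below the threshold.
type Span struct {
	StartSec, EndSec float64
}

// findAnomalies returns every contiguous span of at least minDurSec seconds
// in which throughput stays below ratio*reference.
func findAnomalies(samples []Sample, reference, ratio, minDurSec float64) []Span {
	threshold := ratio * reference
	var spans []Span
	var cur *Span
	closeSpan := func() {
		if cur != nil && cur.EndSec-cur.StartSec >= minDurSec {
			spans = append(spans, *cur)
		}
		cur = nil
	}
	for _, s := range samples {
		if s.MBPerSec >= threshold {
			closeSpan() // throughput recovered, so the current dip ends here
			continue
		}
		if cur == nil {
			cur = &Span{StartSec: s.TSec}
		}
		cur.EndSec = s.TSec
	}
	closeSpan() // a dip may run to the end of the window
	return spans
}

func main() {
	var samples []Sample
	for t := 0; t < 300; t += 10 {
		mb := 100.0
		if t >= 100 && t <= 170 {
			mb = 50.0 // 70s dip below 0.8 * 100
		}
		samples = append(samples, Sample{TSec: float64(t), MBPerSec: mb})
	}
	fmt.Println(findAnomalies(samples, 100, 0.8, 60))
}
```

A dip shorter than 60s is closed without being recorded, so brief stalls do not produce callouts.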

Defines the canonical per-run JSON shape and renders an appended section to docs/benchmark-results/<connector>.md from an embedded template. Anomalies surface as quoted callouts. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The runner shells onto the bench host via Systems Manager — no SSH keys, IAM-gated, audit-logged. A FakeSSM is exposed for tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Init/Apply/Destroy/Plan/Outputs for a stack directory, using the shared S3 backend with a per-stack state key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Loops over scenario.matrix.cpu_points, pins Connect to a cpuset starting at core 2 (cores 0,1 reserved), runs warmup+duration, parses rolling stats, and computes per-point summary + anomalies. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
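
As a rough sketch of the pinning step, the helper below builds a CPU list for each sweep point with cores 0 and 1 left free. The function name cpusetFor is hypothetical, and the assumption that the list is passed to something like `taskset -c` is mine. The commit message only says the cpuset starts at core 2.

```go
package main

import (
	"fmt"
	"strconv"
)

// cpusetFor returns a Linux CPU list (the format accepted by `taskset -c`)
// covering vcpus consecutive cores starting at firstCore.
func cpusetFor(vcpus, firstCore int) (string, error) {
	if vcpus < 1 || firstCore < 0 {
		return "", fmt.Errorf("invalid cpuset request: vcpus=%d firstCore=%d", vcpus, firstCore)
	}
	if vcpus == 1 {
		return strconv.Itoa(firstCore), nil
	}
	return fmt.Sprintf("%d-%d", firstCore, firstCore+vcpus-1), nil
}

func main() {
	// Cores 0 and 1 stay reserved, so every sweep point starts at core 2.
	for _, n := range []int{1, 2, 4, 8} {
		set, err := cpusetFor(n, 2)
		if err != nil {
			panic(err)
		}
		fmt.Printf("vcpus=%d cpuset=%s\n", n, set)
	}
}
```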

End-to-end orchestration: terraform init+apply (shared + stack), build the binary for linux/arm64, render the pipeline config, run the sweep, write JSON, append markdown, terraform destroy. Staging and workload scripts are placeholders filled in by Task 12. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

S3+DynamoDB backend, AWS provider with bench-session-id default tags, /16 VPC across 2 AZs with public+private subnets, IGW, public route table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Runner + load-gen EC2 instances on AL2023 arm64 with the SSM-managed instance role, gp3 root volumes, and a cloud-init that creates /opt/bench and installs psql + jq. A private versioned S3 bucket with a 180-day lifecycle hosts raw result JSON. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stageArtefacts uploads the redpanda-connect binary and rendered config to the results bucket, then SSMs the runner to pull them down. runSeeder does the same for the cdc-rows seeder binary and invokes its 'seed' subcommand on the load-gen host. renderWorkloadScript drives sustained inserts at the scenario's configured rate. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Provisions a Postgres 16 RDS instance with gp3 storage, configurable IOPS, and a parameter group for logical replication. Outputs the DSN as a sensitive value for downstream stacks. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Composes the shared VPC + SGs with the rds-postgres module via remote_state. Passes scenario-supplied sizing (instance class, storage, IOPS, params). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Two subcommands. 'seed' creates the table, then bulk-inserts N rows across 16 concurrent workers (~1000 rows per INSERT). 'workload' runs a 100ms-tick loop emitting <rate>/10 rows per tick across the table list, until duration elapses. DynamoDB path will be added by the dynamodb plan. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
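
The per-tick arithmetic of the 'workload' loop might look like the sketch below. The even split across tables (with any remainder going to the first tables) is an assumption. The commit message only states that each 100ms tick emits rate/10 rows across the table list. The function name rowsForTick is hypothetical.

```go
package main

import "fmt"

const ticksPerSecond = 10 // one tick every 100ms

// rowsForTick splits the rate/10 rows of a single tick across the tables.
// Remainder rows go to the first tables so the total always equals rate/10.
func rowsForTick(rate int, tables []string) map[string]int {
	perTick := rate / ticksPerSecond
	out := make(map[string]int, len(tables))
	if len(tables) == 0 {
		return out
	}
	base, extra := perTick/len(tables), perTick%len(tables)
	for i, t := range tables {
		n := base
		if i < extra {
			n++
		}
		out[t] = n
	}
	return out
}

func main() {
	// 5000 rows/sec against a single table is 500 rows per tick.
	fmt.Println(rowsForTick(5000, []string{"orders"}))
}
```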

A customer-shaped CDC workload: 75M orders rows, sustained 5K writes/sec for 15 minutes after a 2-minute warmup, swept across 1/2/4/8 vCPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

One command per lifecycle stage. The default `aws:bench` invocation provisions, runs the sweep, renders results, and tears down. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The local Docker benches stay the developer-iteration path; the new benchmarking/aws/ framework is the path for the published numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Switch the runner/load-gen defaults + the postgres orders-cdc scenario from c7i.* (Intel x86_64) to c8g.* (Graviton arm64) so the existing arm64 AL2023 AMI and arm64 Go builds work. Keep c7i.* in the vCPU lookup table for backward compatibility with existing test fixtures. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

translateInfraSource was emitting YAML for nested-map var values, which terraform's -var parser rejects. Switch to json.Marshal so the comment that already says "JSON-encoded" is now true. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
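
A minimal sketch of the encoding fix, assuming a helper that turns one variable into a `-var` argument. The name tfVarArg and the special-casing of plain strings are illustrative assumptions, not the PR's translateInfraSource. Terraform accepts JSON object syntax for complex `-var` values, which is why json.Marshal output works here.

```go
package main

import (
	"encoding/json"
	"fmt"
)

// tfVarArg renders one terraform -var argument. Plain strings pass through
// unchanged; maps, lists, and numbers are JSON-encoded.
func tfVarArg(name string, value any) (string, error) {
	if s, ok := value.(string); ok {
		return fmt.Sprintf("-var=%s=%s", name, s), nil
	}
	encoded, err := json.Marshal(value)
	if err != nil {
		return "", fmt.Errorf("encode var %q: %w", name, err)
	}
	return fmt.Sprintf("-var=%s=%s", name, encoded), nil
}

func main() {
	// Hypothetical nested-map variable, as a scenario might supply it.
	arg, err := tfVarArg("params", map[string]any{"a": "x", "b": 2})
	if err != nil {
		panic(err)
	}
	fmt.Println(arg)
}
```

json.Marshal sorts map keys, so the rendered argument is stable across runs.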

terraform -chdir=<stack> changes the working directory before resolving -backend-config, so a repo-relative backend.hcl path is interpreted inside the stack directory and fails to read. Resolve to absolute up front. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Three issues caught on the first real apply against AWS:
1. Security group descriptions used an em-dash; AWS rejects non-ASCII in GroupDescription. Replace with hyphens.
2. aws_s3_bucket_lifecycle_configuration rule now requires an explicit filter or prefix in newer provider versions. Add filter {} to fall under the all-objects default.
3. The destroy defer was registered after the shared apply succeeded, so a failed shared apply left orphan resources. Move the defer before any apply.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

wal_level isn't a user-settable parameter on AWS RDS — setting it directly fails ModifyDBParameterGroup with "Could not find parameter with name: wal_level". The RDS-specific equivalent is rds.logical_replication=1, which makes RDS set wal_level=logical for us. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

16.4 is no longer available in us-east-2 (RDS retires patch versions). 16.14 is the current latest in the postgres16 parameter group family. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

combineReset was writing \$POSTGRES_DSN as a shell-variable reference, but the SSM script environment doesn't carry that variable, so psql fell back to the local Unix socket and failed before the first sweep point. Substitute the DSN value directly (same pattern as renderWorkloadScript and runSeeder), and wrap with set -euo pipefail + ON_ERROR_STOP=1 so a real psql failure surfaces. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

postgres_cdc and other enterprise connectors refuse to start without a license. Add a --license-file flag (defaulting to $REDPANDA_LICENSE_FILEPATH), upload the license to the staging bucket alongside the binary + config, and set REDPANDA_LICENSE_FILEPATH=/opt/bench/license.jwt in the bench script env so Connect picks it up. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

stat() succeeds even when macOS TCC / sandbox blocks read, so the validation passed but stageArtefacts later failed after ~8 minutes of terraform apply. Open the file for read at validate-time so the permissions error surfaces before any AWS provisioning. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add *.license, *.jwt, rpcn_license, and rpcn.license patterns so a license file dropped at the repo root for benchmarking doesn't get accidentally committed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

postgres_cdc unconditionally reads its `tls:` field and overwrites the sslmode parsed from the DSN (see input_pg_stream.go:303). When the scenario didn't set `tls:`, FieldTLS returned a disabled tls.Config, so postgres_cdc connected without encryption and RDS rejected the replication slot with 'no pg_hba.conf entry ... no encryption'. Add tls: { enabled: true, skip_cert_verify: true } so the replication stream is encrypted. skip_cert_verify is acceptable for this benchmark because the connection stays inside the bench VPC. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

postgres_cdc uses NewTLSField (not NewTLSToggledField), so there's no 'enabled' toggle — the presence of the tls: block implies TLS. Drop the 'enabled: true' line that caused lint failure at every sweep point. Also: if the FIRST sweep point captures zero samples (Connect failed to start or the connector errored for the whole window), bail out of the sweep with a clear error rather than burning the remaining ~50min of sweep windows watching the same failure repeat. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Replace the brief 30-line README with a complete guide covering first-time setup, running benches, architecture, adding scenarios + connectors, known limitations (SSM truncation, TLS field shape, RDS quirks, SSO timeout, macOS TCC, clock skew), troubleshooting, and cost estimates. Captures everything learned during the foundation's smoke runs. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Add .claude/ (Claude Code harness state) and /runner (stray binary sometimes left at the repo root by go-run when invoked from there) so neither lands in commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Both canonical scenarios re-run with the db.r6g.4xlarge bump. Trimmed older 2xlarge results + vcpu4 diagnostic per "keep latest only" policy.
Headline numbers (Connect MB/s sustained / peak):
  Postgres: 102 / 108 (plateaus at vCPU=4)
  Mysql: 111 / 121 (slight scaling to vCPU=8)
KC patterns differ by engine:
  Postgres KC tight bottleneck at 17 MB/s (vCPU 2-4), breaks to 46 MB/s only at vCPU=8; Debezium pgoutput-specific
  Mysql KC scales to 54 MB/s at vCPU=4, REGRESSES to 39 at vCPU=8; Debezium MySQL sweet spot at 4 vCPU
Cross-engine ratios narrowed substantially vs 2xlarge:
  Mysql vCPU=8: 5.5x → 2.9x
  Postgres vCPU=4: 1.2x → 6x (KC bottleneck became more visible once Connect was no longer producer-bound)
Operational note: mysql 4xlarge first attempt hung at KC vCPU=1 (transient flake). Retry with --keep-on-fail succeeded; one SSM stdout buffer hang at vCPU=8 connect was unblocked by cancelling the stuck SSM command (output was buffered, not lost).
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The smoke failed with all 12M records failing because franz-go does not request topic auto-creation by default (broker auto_create alone is not enough). Explicitly create the topic via kadm (also a broker-readiness check, with retries), set a partition count, and surface the first produce error instead of swallowing it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds engineSpec NoDSN/ExtraEnvVars for IAM-authed engines, a dynamodb-bench TF module + stack, a BatchWriteItem load-gen seeder, and a scenario YAML with drop+recreate reset between sweep points. No KC counterpart - Debezium 2.7.x doesn't ship a DynamoDB connector and the bench cloud-init doesn't install a paid alternative, so this runs against --engines=connect only. Smoke pending. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

A sink's Connect pipeline has no benchmark processor, so rolling-stats log samples are empty by design; throughput is the Iceberg metric series in brokerSeries. MatrixRunner now branches on Direction for both Summary derivation and the first-point early-abort, fixing a spurious '0 samples' abort observed in the iceberg smoke. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…create fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ion via catalogx) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…arios Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…s w/ location Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…re-created) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Avoids a spurious class-not-found when a large plugin (iceberg) is still scanning at REST-API-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The runner download in stageArtefacts ran before the upload, so the binary was never present on the runner and the sink reset failed with '/opt/bench/iceberg-tablegen: No such file or directory'. Reorder so the sink-only upload precedes the staging download. Validated live: tablegen creates the Glue table and the KC sink task reaches RUNNING against it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…up.id override, add control/kafka timeouts) KC Tabular sink ran but committed nothing. The sink manages its OWN consumer group for transactional (exactly-once) offset commits; our consumer.override.group.id was interfering. Drop it and add the control/consumer timeouts from the repo's known-good connector.json (internal/impl/iceberg/bench/kafka-connector). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ursty sinks) Iceberg sink throughput is the table's committed-bytes growth; KC's Tabular sink commits in bursts so median-of-per-interval-rates reads 0 despite committing all data. Add Summary.MeanMBPerSec (mean of per-interval rates = total committed / window) and a 3-decimal 'mean MB/s' markdown column so bursty committers compare fairly with steady ones (Connect). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
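
To see why the median misleads for a bursty committer, consider this small sketch. The helper names and the sample rates are invented for illustration and are not measured data.

```go
package main

import (
	"fmt"
	"sort"
)

// mean of per-interval rates equals total committed / window when the
// intervals have equal length.
func mean(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	sum := 0.0
	for _, x := range xs {
		sum += x
	}
	return sum / float64(len(xs))
}

// median sorts a copy of xs and returns the middle value.
func median(xs []float64) float64 {
	if len(xs) == 0 {
		return 0
	}
	s := append([]float64(nil), xs...)
	sort.Float64s(s)
	mid := len(s) / 2
	if len(s)%2 == 1 {
		return s[mid]
	}
	return (s[mid-1] + s[mid]) / 2
}

func main() {
	// Illustrative per-interval MB/s: both sinks commit the same total.
	bursty := []float64{0, 0, 0, 0, 50, 0, 0, 0, 0, 50}
	steady := []float64{10, 10, 10, 10, 10, 10, 10, 10, 10, 10}
	fmt.Printf("bursty: median=%.3f mean=%.3f\n", median(bursty), mean(bursty))
	fmt.Printf("steady: median=%.3f mean=%.3f\n", median(steady), mean(steady))
}
```

Both series commit the same volume, but only the mean reports them as equal.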

…ion-independent Committed-bytes/sec confounded throughput with each engine's Parquet compression. Poll Iceberg total-records too and report records/sec (mean), the fair compression-independent sink throughput for Connect-vs-KC. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The full iceberg-sink sweep stranded at the kafka_connect points: the bench script spawns its own taskset-pinned connect-distributed JVM on :8083 after only 'systemctl stop' + 'sleep 2', while the systemd unit (Restart=on-failure) and a slow-to-release JVM under load contend for the same port. The loser crash-loops on BindException (600+ restarts observed) until the other JVM exits, so the connector submit hits a worker that never stably binds -> HTTP 500.
- kcscript.go: before spawning, stop the unit, pkill any lingering connect-distributed, and poll until :8083 is actually free (re-issuing stop each iteration to defeat on-failure relaunch). After SIGTERMing our JVM, poll :8083 free again before handing the port back to systemd.
- runner-user-data.tftpl: RestartSec=10 + TimeoutStopSec=120 so a transient collision backs off instead of tight-looping, and a clean stop waits for the JVM to flush state + release the socket.
- kcscript_test.go: regression test asserts both port-free waits + the straggler kill, in the correct order relative to spawn/hand-back.
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The json-orders seeder sets ProducerBatchMaxBytes(16 MiB) but created the topic with the broker-default max.message.bytes (~1 MiB). Redpanda validates each produce batch against the topic limit, so batches that fill toward 16 MiB are rejected with MESSAGE_TOO_LARGE. This is a rare spike at low volume (the 12M smoke passed) but a hard failure seeding 160M rows (200k/160M failed). Create the topic with max.message.bytes=64 MiB so the 16 MiB batches always clear, with headroom. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Connect's iceberg output is commit-latency-bound: serialized ~320ms Glue snapshot commits with no commit-linger, so under default config it commits ~300 records at a time at ~0.03 CPU cores and is flat across vCPU (~412 rec/s). The redpanda input alone sustains ~57k rec/s into a drop output, so the input is not the limiter — the small per-commit record count is. Thread a top-level `buffer` from the scenario pipeline block to the Connect config root (mirrors the existing cache_resources passthrough) so a fast input can be decoupled from the commit-bound output. With a memory buffer + larger batches, each commit carries ~50k records instead of ~300. Validated (connect-only 1-vCPU smoke): mean 412 -> 10,984 rec/s (~27x), peak ~35k rec/s, and CPU 0.03 -> 0.99 cores (now CPU-bound, so it scales with vCPU instead of being flat). orders-sink-smoke.yaml carries the tuned config (buffer 512MiB, max_in_flight 8, count 50000, period 10s). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Smoke ran Connect-only at 1 vCPU for 15min: median 5.7 MB/s, 3865 msg/sec, 0 anomalies. Producer-bound — load-gen sustained ~4K PutItems/sec out of 5K target before hitting BatchWriteItem InternalServerError near end. Connector ceiling is well above this; needs higher PutItem rate to probe. Also fixes scenario YAML: `billing: PROVISIONED` was being forwarded to `terraform apply` by translateInfraSource and rejected as an undeclared variable. The TF module hardcodes PROVISIONED anyway, so the key is removed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Auto-generated results appended across the iceberg-sink bench work: 1-vCPU smokes, the full [1,2,4,8] both-engine sweep, and the buffered Connect runs. See results/iceberg/orders-sink*/ for the raw JSON. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Match Connect's commit cadence to KC's Tabular coordinator (10s) using input-side batching, so the comparison is tuned-vs-tuned rather than default-vs-tuned. Add a pipeline.input_options passthrough in sinkTopology.Pipeline that merges scenario-supplied redpanda-input tuning (e.g. unordered_processing) into the input, protecting the bench-managed connection fields (brokers/topic/group) from being clobbered. orders-sink.yaml now configures input_options.unordered_processing (enabled, checkpoint_limit 100000, batching count 50000 / period 10s) + output max_in_flight 8 / batching 50000 / 10s. Input batching adds the read-ahead the default path lacks (measured ~110x over baseline, output_sent ~46k rec/s at 1 vCPU), so each commit carries ~10s of records like KC's coordinator. Trades cross-partition ordering (fine for a sink; the buffer variant is the order-preserving equivalent). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
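
The protective merge can be sketched as a two-pass map copy in which bench-managed keys are written last. The function name mergeInputOptions and the flat map shape are assumptions. The keys brokers, topic, and group echo the commit message's wording and are not necessarily the real field names.

```go
package main

import "fmt"

// mergeInputOptions overlays scenario-supplied tuning onto the input config
// while guaranteeing that bench-managed fields cannot be overridden.
func mergeInputOptions(managed, scenario map[string]any) map[string]any {
	merged := make(map[string]any, len(managed)+len(scenario))
	for k, v := range scenario {
		merged[k] = v
	}
	// Managed fields are written last so they always win on conflict.
	for k, v := range managed {
		merged[k] = v
	}
	return merged
}

func main() {
	managed := map[string]any{"brokers": "bench-broker:9092", "topic": "orders", "group": "bench"}
	scenario := map[string]any{
		"topic":                "something-else", // ignored: bench-managed
		"unordered_processing": map[string]any{"enabled": true, "checkpoint_limit": 100000},
	}
	fmt.Println(mergeInputOptions(managed, scenario))
}
```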

The repetitive 'x' padding compressed ~150x in Parquet, making committed-bytes/sec meaningless and giving KC an absurd ~7 B/record. Sample each record's payload from a 16 MiB pool of random alphanumeric bytes (JSON-safe) via a random window, and vary region/status across small sets. Payloads are now distinct and barely compressible, so Parquet file sizes are representative and the MB/s axis is symmetric across engines for a fair head-to-head. Pool + window keeps seeding fast (no per-byte RNG on the hot path). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tic payload) Tuned-vs-tuned head-to-head: Connect input batching at KC's 10s commit cadence, high-entropy realistic payload. Both records/s and MB/s now agree. Connect scales (15k->178k rec/s across 1->8 vCPU), KC plateaus (56k->88k); KC leads at 1-2 vCPU, Connect crosses over by 4 vCPU and ends ~2x ahead at 8. results/iceberg/orders-sink/2026-06-04T19-01-06Z.json. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…quota The reset bash recreated the source table with hardcoded WriteCapacityUnits=10000, which is fine for the 1-vCPU smoke (reset doesn't fire) but would tank the table mid-sweep. Wire read_capacity / write_capacity through the dynamodb stack outputs and the aws_dynamodb_cdc engineSpec's ExtraEnvVars, so the bash references ${READ_CAPACITY}/${WRITE_CAPACITY} and tracks scenario sizing automatically. Also resize the workload after the 120K WCU smoke hit AWS's default per-table quota (47-min CreateTable hang) and the 10K PutItems/sec smoke saturated 40K WCU late in the run. New sizing: 40K WCU provisioned, 9K PutItems/sec sustained, 10% headroom under quota. Smoke #3 result (10 MB/s p50 @ 1 vCPU) appended to dynamodb.md. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The drop+recreate-table-between-points pattern (mirroring postgres/mysql) doesn't fit DDB. The aws_dynamodb_cdc connector reads from Streams, not table items, so leftover rows can't leak between sweep points. All we need to reset is the connector's CHECKPOINT table — dropping it forces the next process to fall back to start_from:latest and skip to the current stream tail. The table-recreate path also kept hitting AWS's per-table teardown latency (20-30min for a 40K WCU table), causing TF's default 10m delete timeout to fire mid-destroy. Override to 30m so the final teardown completes cleanly; this is the price of keeping the source table around across sweep points (~150GB by end of full sweep, negligible storage cost). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

The seeder's writeBatch returned the first BatchWriteItem error verbatim (including transient AWS errors like InternalServerError and ProvisionedThroughputExceededException), which killed the worker goroutine and cascaded to the whole workload exiting. This masqueraded as a Connect-side throughput drop in earlier runs: the bench saw rates collapse from 36 MB/s to 0 (workload fully died) or to ~11 MB/s (workload partially died, some workers continued) after ~9 minutes, and the connector was incorrectly suspected.
Two changes:
- writeBatch now retries up to 8 times with exponential backoff (50ms→2s, ~6s total) on any non-context-cancellation error, not just UnprocessedItems.
- After max retries, the batch is dropped rather than returned as an error. Sustaining the configured rate is more important than exactly-once delivery for a bench: better to underdeliver one batch than terminate the load generator.
Verified at 1 vCPU: throughput now sustained 36 MB/s for 10 minutes (was: 36 MB/s for 9 min, then collapse to 0 or ~11), 0 anomalies.
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Documents the four-point sweep result (36→15→12→0 MB/s) with the analysis showing it's an artefact of the bench reset strategy not fitting DDB Streams' shard lifecycle, not a Connect anti-scaling behaviour. Lists falsified hypotheses (AWS GetRecords throttle, GOMAXPROCS lock contention, downstream ShouldThrottle backpressure) and the load-gen-side bugs found during the investigation (4K cap, transient error worker death) which were fixed at d19c10b / 817a6a3 / e2fb34e. Identifies the open root cause hypothesis (shard rotation outpacing the connector's 30s shard refresh interval) and three paths forward for someone returning to this work. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…iant Switch orders-sink.yaml from input batching (unordered_processing) to a top-level memory buffer (512 MiB) + output batching at KC's 10s cadence. The buffer decouples the fast redpanda input from the commit-latency-bound iceberg output the same way, but PRESERVES cross-partition ordering (no unordered_processing). Measured equivalent throughput in isolation. Input batching variant preserved in git history (a1d88d9). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ults Add iceberg-findings.md — the full RPCN-vs-KC writeup: fair tuned-vs-tuned comparison (Connect scales, crosses over KC at 4 vCPU; KC efficient at 1-2), the commit-latency-bound default trap (~412 rec/s) and the buffer/input-batching fixes, the buffer-vs-input-batching ordering/throughput tradeoff, negative results (max_in_flight inert, coalescing doesn't fix low-core), and the low-core per-record-CPU gap + recommendations. Plus the buffer-variant full sweep results appended to iceberg.md. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

prakhargarg105 and others added 30 commits June 1, 2026 09:49

test(bench/aws): add empty-input and T-field assertions to stats tests

6ee6dc5

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add SSM RunCommand wrapper with streaming output

d613726

The runner shells onto the bench host via Systems Manager — no SSH keys, IAM-gated, audit-logged. A FakeSSM is exposed for tests. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add Terraform CLI wrapper

5eb45c5

Init/Apply/Destroy/Plan/Outputs for a stack directory, using the shared S3 backend with a per-stack state key. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add terraform shared stack — providers, vars, VPC

6cf5337

S3+DynamoDB backend, AWS provider with bench-session-id default tags, /16 VPC across 2 AZs with public+private subnets, IGW, public route table. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add terraform stack — postgres

e635c56

Composes the shared VPC + SGs with the rds-postgres module via remote_state. Passes scenario-supplied sizing (instance class, storage, IOPS, params). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add postgres orders-cdc scenario

6794c97

A customer-shaped CDC workload: 75M orders rows, sustained 5K writes/sec for 15 minutes after a 2-minute warmup, swept across 1/2/4/8 vCPU. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

feat(bench/aws): add operator-facing Taskfile

8ca6783

One command per lifecycle stage. The default `aws:bench` invocation provisions, runs the sweep, renders results, and tears down. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

docs(benchmarking): point at the new AWS framework

c8168f4

The local Docker benches stay the developer-iteration path; the new benchmarking/aws/ framework is the path for the published numbers. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

fix(bench/aws): bump default RDS Postgres engine to 16.14

88f66c5

16.4 is no longer available in us-east-2 (RDS retires patch versions). 16.14 is the current latest in the postgres16 parameter group family. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

chore: gitignore Redpanda license files

578cff0

Add *.license, *.jwt, rpcn_license, and rpcn.license patterns so a license file dropped at the repo root for benchmarking doesn't get accidentally committed. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

prakhargarg105 and others added 29 commits June 2, 2026 09:35

docs(bench): design for pre-created iceberg tables (KC Glue-REST auto…

5080cdd

…create fix) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

docs(bench): plan for pre-created iceberg tables (KC Glue-REST fix)

9129db8

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(bench): iceberg-tablegen tool (pre-create Iceberg table w/ locat…

5b70b4c

…ion via catalogx) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(bench): build+stage iceberg-tablegen on the runner for sink scen…

e7a0c62

…arios Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(bench): sink ResetScript pre-creates both engines' iceberg table…

8a59154

…s w/ location Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

feat(bench): KC iceberg connector auto-create-enabled=false (tables p…

ce7cfcc

…re-created) Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

fix(bench): KC bench script waits for connector class before submit

8dbe0ee

Avoids a spurious class-not-found when a large plugin (iceberg) is still scanning at REST-API-up. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

twmb mentioned this pull request Jun 10, 2026

Expand CONTRIBUTING guidelines and audit them in the review skill #4493

Merged

prakhargarg105 wants to merge 256 commits into redpanda-data:main from prakhargarg105:benchmarking
