Skip to content

feat(gateway): URL-stable identity-based cluster naming for API-level upstreams#2137

Draft
mehara-rothila wants to merge 3 commits into
wso2:mainfrom
mehara-rothila:feat/api-level-url-stable-clusters
Draft

feat(gateway): URL-stable identity-based cluster naming for API-level upstreams#2137
mehara-rothila wants to merge 3 commits into
wso2:mainfrom
mehara-rothila:feat/api-level-url-stable-clusters

Conversation

@mehara-rothila

@mehara-rothila mehara-rothila commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Summary

Changes how the gateway names the Envoy clusters for an API's main and sandbox upstreams. Cluster names are now derived from the API's identity, main_<hash> / sandbox_<hash>, where the hash is the first 12 bytes of sha256(apiID) in hex and the env is a plaintext main_/sandbox_ prefix (not part of the hash), instead of from the backend URL (cluster_<scheme>_<sanitized-host>). Both xDS builders (the Envoy xDS translator and the RuntimeDeployConfig transform path) derive the name through one shared helper, so the two paths cannot diverge.

Why

With URL-derived names, editing a backend URL renames the cluster: the old cluster is destroyed, a new one is created, and routes are repointed, draining in-flight connections on every URL change. With identity-derived names the cluster name never changes for the lifetime of the API, so a URL edit propagates as an endpoint update on the same cluster and live traffic is not dropped. The name carries no URL information, so host, port, scheme, and path edits are all equally safe.

Design decisions

Identity in, URL out. The hash input is the apiID only; the env (main or sandbox) is a plaintext prefix on the name, not part of the hash. Anything URL-derived in the name would defeat the purpose, since some class of URL edit would still rename the cluster.

apiID hashed, env as a prefix. main and sandbox of the same API share one hash fragment and are told apart by the main_/sandbox_ prefix, so an operator can correlate an API's two clusters at a glance. The prefix guarantees main and sandbox never collide without relying on the hash.

96-bit identity hash. The fragment is the first 12 bytes (96 bits) of sha256(apiID). Cross-API uniqueness rests on this truncation, and 96 bits keeps the birthday-collision probability negligible even at very large API counts, so two distinct APIs cannot silently share one Envoy cluster.

One naming helper for both builders. The gateway has two builders that must agree on cluster names (the Envoy xDS translator and the policy-side RuntimeDeployConfig transform). Both now call pkg/utils/clusterkey, replacing the two hand-rolled sanitize functions that had to be kept in sync manually.

ClusterKey and EnvoyClusterName unified. In the RDC path the internal cluster key and the Envoy-facing cluster name are now the same string. If they differ, the policy engine's default upstream cluster metadata points at a name Envoy does not know, which surfaces as 503 NoRoute.

Latent collision fixed. Under URL-derived naming, a main and sandbox upstream pointing at the same URL collided into one cluster, and two different APIs sharing a backend URL shared one cluster. The env prefix now keeps main and sandbox always distinct, and hashing the apiID gives each API its own fragment, so each API owns its clusters.

Migration notes (one-time effects on existing deployments)

  • Each existing API's main/sandbox clusters are renamed once, on its next configuration regeneration or redeploy. The new cluster and repointed routes are delivered atomically in the same snapshot.
  • Envoy per-cluster connection pools and stats names change once; dashboards or alerts keyed on the old cluster_<scheme>_<host> names in the upstream_cluster label need a one-time update.
  • Total cluster count can increase, since two APIs targeting the same backend URL no longer share a cluster.

What changed

  • pkg/utils/clusterkey (new): APILevel(apiID) / APILevelName(env, apiID) naming helpers plus unit tests.
  • pkg/xds/translator.go: resolveUpstreamCluster takes the API id and derives the hash name; sanitizeClusterName removed.
  • pkg/transform/restapi.go: addUpstreamCluster derives the same hash name; EnvoyClusterName set to the cluster key; sanitizeEnvoyClusterName removed.
  • cmd/controller/main.go: comment documenting that both xDS paths must keep cluster-name derivation in sync (plus an import-order fix in the touched file).
  • gateway/spec/impls/1-basic-gateway-with-controller/data-model.md: cluster-naming section updated from the old URL-sanitized scheme to the identity-derived scheme.
  • Tests updated across both builders, plus new cases proving the contract.

Test coverage

  • Unit: clusterkey tests with known-answer vectors that pin the algorithm to sha256(apiID)[:12]; translator tests assert URL-independence and that the name carries no URL-derived information (TestResolveUpstreamCluster_DedupSameAPIDifferentURLs, TestResolveUpstreamCluster_MainSandboxNeverCollide, TestResolveUpstreamCluster_NameNotURLDerived); transformer tests assert the name shape on routes and clusters, that a route's default cluster always references a registered cluster (TestRestAPITransformer_APILevelDefaultClusterMatchesRealCluster), that a URL edit does not rename the cluster (TestRestAPITransformer_APILevelURLStableAcrossURLEdit), and that a main-only API registers no sandbox cluster or route (TestRestAPITransformer_APILevelMainOnlyHasNoSandboxCluster).
  • Integration: api-level-url-stable.feature covers a main upstream URL update that changes both host and path, a sandbox upstream URL update, default cluster resolution under cluster_header routing, main and sandbox on the same backend host getting separate identity-named clusters, and two APIs sharing a backend host each routing under their own cluster (with a post-delete check that removing one does not disturb the other). The main and sandbox scenarios assert cluster identity through the Envoy admin endpoint (identity-named clusters present and the old cluster_<scheme>_<host> names absent, before and after the update), so they fail under the previous naming scheme.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Important

Review skipped

Draft detected.

Please check the settings in the CodeRabbit UI or the .coderabbit.yaml file in this repository. To trigger a single review, invoke the @coderabbitai review command.

⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 14fea48a-527d-445f-bbaa-1e3ddd1ae94f

You can disable this status message by setting the reviews.review_status to false in the CodeRabbit configuration file.

Use the checkbox below for a quick retry:

  • 🔍 Trigger review
✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mehara-rothila mehara-rothila changed the base branch from feature/operation-level-ep to main June 10, 2026 08:53
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch 5 times, most recently from 534247e to a99b916 Compare June 10, 2026 16:16
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@mehara-rothila mehara-rothila changed the base branch from main to feature/operation-level-ep June 10, 2026 18:39
@mehara-rothila mehara-rothila changed the base branch from feature/operation-level-ep to main June 12, 2026 02:24
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 5c3aed7 to 840cb13 Compare June 12, 2026 02:28
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai please do a full review of this PR , concrete suggestions, a risk analysis, and anything else you can surface. Please read the actual code, not just the description.

Correctness & design

  • The cluster name is <env>_<fragment> where fragment = first 8 bytes of sha256(apiID) , the apiID ONLY. The env ("main"/"sandbox") is a plaintext prefix, NOT part of the hash, so main and sandbox of the same API share the fragment and are distinguished only by the prefix. Trace clusterkey.APILevel / clusterkey.APILevelName and both call sites: pkg/xds/translator.go (resolveUpstreamCluster) and pkg/transform/restapi.go (addUpstreamCluster). Is this sound?
  • Confirm both xDS builders produce identical names for the same (env, apiID), so they can never disagree.
  • Verify the env prefix alone guarantees main and sandbox never collide even though they share the hash fragment.
  • Verify ClusterKey and EnvoyClusterName are the same string in the RDC path, and nothing still references the old URL-derived name.

Risk analysis

  • Collision at 64 bits: realistic probability of two DISTINCT apiIDs hashing to the same 8-byte fragment (the only thing keeping different APIs apart), and what happens if they do , silent cross-wiring or fail-closed? Should the truncation be longer?
  • Migration/one-time effects: existing APIs get clusters renamed on next redeploy. Any window where a route points at a not-yet-created cluster (503 NoRoute)? Is "new cluster + repointed routes" truly atomic in one snapshot?
  • Observability: the upstream_cluster stat label changes from cluster_<scheme>_<host> to <env>_<hash>. What breaks, and is it called out?
  • Cluster-count increase (two APIs sharing a backend URL no longer share a cluster): any resource concern at scale?
  • Does removing sanitizeClusterName / sanitizeEnvoyClusterName drop any sanitizing some input still needs?

Edge cases

  • Empty/unusual apiID, an API with only sandbox (no main), main vs sandbox pointing at identical URLs.

Tests

  • Is unit + integration coverage enough to lock the contract? Anything missing , e.g. a negative test proving the old URL-derived name is gone, and a test proving main/sandbox of one API differ only by prefix?

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey package ==="
fd -t f . pkg/utils/clusterkey --exec cat {}

echo ""
echo "=== translator.go - resolveUpstreamCluster and related ==="
rg -n "resolveUpstreamCluster|sanitizeClusterName|clusterkey|APILevel|APILevelName|ClusterKey|EnvoyClusterName" pkg/xds/translator.go -A 5 -B 2

echo ""
echo "=== restapi.go - addUpstreamCluster and related ==="
rg -n "addUpstreamCluster|sanitizeEnvoyClusterName|clusterkey|APILevel|APILevelName|ClusterKey|EnvoyClusterName" pkg/transform/restapi.go -A 5 -B 2

echo ""
echo "=== Full clusterkey file list ==="
fd -t f . pkg/utils/clusterkey

echo ""
echo "=== Full translator.go resolveUpstreamCluster function ==="
rg -n "func resolveUpstreamCluster" pkg/xds/translator.go -A 50

echo ""
echo "=== Full addUpstreamCluster function ==="
rg -n "func addUpstreamCluster" pkg/transform/restapi.go -A 80

echo ""
echo "=== Integration test feature file ==="
fd -t f "api-level-url-stable.feature" --exec cat {}

echo ""
echo "=== Translator test file - new test cases ==="
fd -t f . pkg/xds -e go -e _test.go --exec rg -n "DedupSameAPI|MainSandboxNever|TestResolveUpstreamCluster" {} -A 30

echo ""
echo "=== restapi transformer test - new test cases ==="
fd -t f . pkg/transform -e go --exec rg -n "APILevelDefaultCluster|APILevelURL|TestRestAPITransformer_APILevel" {} -A 30

echo ""
echo "=== cmd/controller/main.go cluster name comment ==="
rg -n "cluster" cmd/controller/main.go -B 2 -A 5

echo ""
echo "=== Check if any old sanitize references remain ==="
rg -rn "sanitizeClusterName|sanitizeEnvoyClusterName|cluster_http|cluster_https" --include="*.go"

echo ""
echo "=== Check ClusterKey and EnvoyClusterName in restapi.go ==="
rg -n "ClusterKey\|EnvoyClusterName" pkg/transform/restapi.go -B 3 -A 3

echo ""
echo "=== data-model.md cluster section ==="
fd -t f "data-model.md" --exec grep -n -A 20 -i "cluster.nam" {}

Length of output: 2285


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey package full content ==="
find . -path "*/utils/clusterkey*" -type f | xargs cat

echo ""
echo "=== Check all clusterkey usages ==="
rg -rn "clusterkey\." --include="*.go" -B 2 -A 3

echo ""
echo "=== Check APILevelName / APILevel usage ==="
rg -rn "APILevelName\|APILevel(" --include="*.go" -B 2 -A 3

echo ""
echo "=== clusterkey test file ==="
find . -path "*/utils/clusterkey*_test*" -o -path "*/clusterkey*test*" | xargs cat 2>/dev/null || true

echo ""
echo "=== translator.go full resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func /' pkg/xds/translator.go 2>/dev/null | head -80

echo ""
echo "=== restapi.go full addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func /' pkg/transform/restapi.go 2>/dev/null | head -100

echo ""
echo "=== RDC struct definitions: UpstreamCluster, ClusterKey, EnvoyClusterName ==="
rg -rn "ClusterKey\b\|EnvoyClusterName\b" --include="*.go" -B 2 -A 2

echo ""
echo "=== Check for any remaining references to old URL-derived naming ==="
rg -rn 'cluster_\|scheme.*cluster\|sanitize.*cluster\|cluster.*sanitize' --include="*.go" -B 2 -A 2

echo ""
echo "=== Full translator test for new cases ==="
rg -rn "TestResolveUpstreamCluster" --include="*.go" -A 60 | head -200

echo ""
echo "=== Full restapi transformer test for new cases ==="
rg -rn "TestRestAPITransformer_APILevel" --include="*.go" -A 60 | head -200

echo ""
echo "=== xds snapshot atomicity - how snapshot is built ==="
rg -rn "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot" pkg/xds/translator.go -B 2 -A 5 | head -80

echo ""
echo "=== integration test feature file ==="
find . -name "api-level-url-stable.feature" | xargs cat 2>/dev/null

echo ""
echo "=== Check empty apiID handling ==="
rg -rn 'apiID\|api_id\|ApiId' pkg/utils/clusterkey --include="*.go" -B 1 -A 5 2>/dev/null || true
find . -path "*/clusterkey*" | xargs rg -n "empty\|\"\"" 2>/dev/null || true

Length of output: 8168


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== translator.go resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' pkg/xds/translator.go | head -100

echo ""
echo "=== translator.go clusterkey/APILevel imports and usage ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" pkg/xds/translator.go

echo ""
echo "=== restapi.go addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func [a-zA-Z]/' pkg/transform/restapi.go | head -120

echo ""
echo "=== restapi.go clusterkey/APILevel imports and usage ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" pkg/transform/restapi.go

echo ""
echo "=== restapi.go UpstreamCluster struct / ClusterKey / EnvoyClusterName field definitions ==="
grep -n "ClusterKey\|EnvoyClusterName\|UpstreamCluster" pkg/transform/restapi.go | head -40

echo ""
echo "=== Check struct definitions globally ==="
grep -rn "ClusterKey\b\|EnvoyClusterName\b" --include="*.go" . 2>/dev/null || \
grep -rn "ClusterKey\|EnvoyClusterName" . -l | xargs grep -n "ClusterKey\|EnvoyClusterName" | head -60

echo ""
echo "=== Check struct definitions globally (no --include) ==="
grep -rn "ClusterKey\|EnvoyClusterName" . --include="*.go" 2>/dev/null | head -60 || \
find . -name "*.go" | xargs grep -n "ClusterKey\|EnvoyClusterName" 2>/dev/null | head -60

echo ""
echo "=== Translator tests for new cases ==="
find . -name "*translator*test*" -o -name "*test*translator*" | grep "\.go$" | xargs grep -n "TestResolveUpstreamCluster\|DedupSameAPI\|MainSandboxNever" 2>/dev/null | head -20
find . -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster\|DedupSameAPI\|MainSandbox" 2>/dev/null | head -5

echo ""
echo "=== Full translator test cases ==="
find . -name "*_test.go" -path "*/xds/*" | head -5 | xargs -I{} awk '/func TestResolveUpstreamCluster/,/^func [A-Z]/' {} 2>/dev/null | head -200

echo ""
echo "=== Full restapi transformer test cases ==="
find . -name "*_test.go" -path "*/transform/*" | head -5 | xargs -I{} awk '/func TestRestAPITransformer_APILevel/,/^func [A-Z]/' {} 2>/dev/null | head -200

echo ""
echo "=== Integration test ==="
find . -name "api-level-url-stable.feature" 2>/dev/null | xargs cat 2>/dev/null

echo ""
echo "=== Snapshot atomicity in translator ==="
grep -n "snapshot\|Snapshot\|SetSnapshot" pkg/xds/translator.go | head -30

echo ""
echo "=== Check old URL-derived naming remnants ==="
find . -name "*.go" | xargs grep -n "cluster_http\|cluster_https\|sanitizeCluster\|sanitizeEnvoy" 2>/dev/null | head -20

echo ""
echo "=== data-model.md cluster section ==="
find . -name "data-model.md" | xargs grep -n -A 25 -i "cluster.nam\|cluster name\|cluster-nam" 2>/dev/null | head -60

Length of output: 3133


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo root structure ==="
ls -1

echo ""
echo "=== Find translator.go ==="
find . -name "translator.go" 2>/dev/null

echo ""
echo "=== Find restapi.go ==="
find . -name "restapi.go" 2>/dev/null

echo ""
echo "=== Find clusterkey files ==="
find . -name "*.go" | xargs grep -l "APILevel\|clusterkey" 2>/dev/null | head -20

echo ""
echo "=== Find feature file ==="
find . -name "*.feature" | xargs grep -l "url-stable\|url_stable\|cluster" 2>/dev/null | head -10

echo ""
echo "=== Find test files referencing APILevel or DedupSameAPI ==="
find . -name "*_test.go" | xargs grep -l "APILevel\|DedupSameAPI\|APILevelName\|url.stable\|URLStable" 2>/dev/null | head -10

echo ""
echo "=== Full top-level tree (2 levels) ==="
find . -maxdepth 2 -type d | sort | head -60

Length of output: 3555


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== translator.go resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -120

echo ""
echo "=== translator.go clusterkey/APILevel/sanitize references ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize\|import" ./gateway/gateway-controller/pkg/xds/translator.go | head -40

echo ""
echo "=== restapi.go addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/transform/restapi.go | head -120

echo ""
echo "=== restapi.go clusterkey/APILevel/sanitize/ClusterKey/EnvoyClusterName references ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" ./gateway/gateway-controller/pkg/transform/restapi.go | head -40

echo ""
echo "=== ClusterKey and EnvoyClusterName struct definitions ==="
find ./gateway -name "*.go" | xargs grep -n "ClusterKey\b\|EnvoyClusterName\b" 2>/dev/null | head -50

echo ""
echo "=== translator_test.go - new test functions ==="
awk '/func TestResolveUpstreamCluster/,/^func [A-Z]/' ./gateway/gateway-controller/pkg/xds/translator_test.go | head -200

echo ""
echo "=== restapi_test.go - new APILevel test functions ==="
awk '/func TestRestAPITransformer_APILevel/,/^}$/' ./gateway/gateway-controller/pkg/transform/restapi_test.go | head -200

echo ""
echo "=== Integration test feature file ==="
cat ./gateway/it/features/api-level-url-stable.feature

echo ""
echo "=== Snapshot atomicity in translator.go ==="
grep -n "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot\|atomic" ./gateway/gateway-controller/pkg/xds/translator.go | head -30

echo ""
echo "=== Old sanitize / URL-derived naming remnants ==="
find ./gateway -name "*.go" | xargs grep -n "cluster_http\|cluster_https\|sanitizeCluster\|sanitizeEnvoy" 2>/dev/null | head -20

echo ""
echo "=== data-model.md cluster section ==="
find ./gateway -name "data-model.md" | xargs grep -n -A 25 -i "cluster.nam\|cluster name\|cluster-nam" 2>/dev/null | head -80

echo ""
echo "=== cmd/controller/main.go comment ==="
find . -path "*/cmd/controller/main.go" | xargs grep -n "cluster\|xds\|transform" 2>/dev/null | head -20

Length of output: 25751



@mehara-rothila — here is the full review based on the actual code.


Correctness & Design

1. Both builders produce identical names

Verified. pkg/xds/translator.go Line 986:

clusterName := clusterkey.APILevelName(upstreamName, apiID)

pkg/transform/restapi.go Line 347:

clusterKey := clusterkey.APILevelName(upstreamName, rdc.Metadata.UUID)

Both call the same helper with (env, apiID). They can never disagree as long as the env strings passed are the same literals. There is no test that explicitly asserts translator.resolveUpstreamCluster("main", X) == transformer.addUpstreamCluster("main", X) as a cross-module assertion, but the shared helper makes it structurally guaranteed.

2. Main vs sandbox never collide

Verified. main_<hash>sandbox_<hash> by construction — different string prefixes, regardless of the shared 16-hex fragment. No probability argument needed here; it is a deterministic property of string concatenation.

3. ClusterKey == EnvoyClusterName in the RDC path

Verified. restapi.go Lines 363–364:

ClusterKey:       clusterKey,
EnvoyClusterName: clusterKey,

The comment at Line 358 explicitly documents why they must be equal. policyxds/snapshot.go Line 231 looks up clusters by ClusterKey; since EnvoyClusterName == ClusterKey, the policy engine will always write a cluster name that Envoy knows about. policyxds_test.go already uses "main_fixture" naming, consistent with the new scheme.

4. Old URL-derived naming

Fully removed from API-level paths. No cluster_http_, cluster_https_, sanitizeClusterName, or sanitizeEnvoyClusterName remain in any .go file. sanitizeUpstreamDefinitionName still exists at translator.go Line 2717 and restapi.go Line 192, but it applies only to UpstreamDefinition (named-backend) clusters — a separate feature that is deliberately out of scope here.


Risk Analysis

R1 — Hash collision at 64 bits (medium, worth documenting)

The 8-byte (64-bit) truncation yields a birthday collision probability of approximately n² / 2⁶⁵ for n distinct apiIDs.

API count Approximate collision probability
10,000 ~2.7 × 10⁻¹² (negligible)
1,000,000 ~2.7 × 10⁻⁸ (negligible)
10⁹ ~2.7 × 10⁻² (non-trivial)

For most production deployments this is fine. However, the consequence of a collision is silent cross-wiring (two distinct APIs share one Envoy cluster), not fail-closed. This is the worst kind of failure mode for an API gateway — requests for API-A are silently routed to the backend of API-B.

Concrete suggestion: Increase the truncation to 12 bytes (96 bits, 24 hex chars). This pushes the collision probability to n² / 2⁹⁷, which is effectively impossible even at 10⁹ APIs. The known-answer test vectors would need updating, but the change is one-line in clusterkey.go:

// Before
return hex.EncodeToString(sum[:8])

// After (96-bit truncation)
return hex.EncodeToString(sum[:12])

The cluster name grows from main_f9811b73ac5d1a8d (21 chars) to main_f9811b73ac5d1a8d3c2b1a09 (29 chars), well within Envoy's name limits.

R2 — Empty apiID cross-wiring

APILevel("") returns e3b0c44298fc1c14 (SHA-256 of the empty string, documented in the test). Every API with an empty apiID resolves to main_e3b0c44298fc1c14 / sandbox_e3b0c44298fc1c14, silently sharing a cluster. The test documents that emptiness is enforced upstream at deploy time, but there is no defensive guard in APILevel or APILevelName itself. If the upstream validation is bypassed or misconfigured, the failure mode is again silent cross-wiring.

Concrete suggestion: Add a guard at the top of APILevel:

func APILevel(apiID string) string {
    if apiID == "" {
        // Caller contract violation: enforce non-emptiness here as a last resort.
        panic("clusterkey: apiID must not be empty")
    }
    sum := sha256.Sum256([]byte(apiID))
    return hex.EncodeToString(sum[:8])
}

A panic here is appropriate: an empty apiID is a programming error (not a runtime condition), and silently producing a shared cluster is more dangerous than a crash.

R3 — First-redeploy migration window

On the next configuration push after upgrading, existing clusters are renamed from cluster_http_<host> to main_<hash>. CDS and RDS are separate xDS watches: in the worst case, Envoy ACKs the new CDS snapshot (which removes the old cluster name and adds the new one) before the RDS update arrives with the repointed routes. During that gap, active routes still reference cluster_http_<host>, which no longer exists → 503 NoRoute.

This hazard is inherent to the xDS protocol and is not new to this PR. After the first redeploy the identity-stable naming eliminates this window entirely for URL edits, which is the whole point. However, the migration is worth calling out explicitly in operator runbooks, not just in PR description prose.

R4 — Observability: stat label loses hostname

The upstream_cluster Envoy stat changes from cluster_http_api.example.com_8080 to main_f9811b73ac5d1a8d. Dashboards keyed on the human-readable host lose their correlation to the actual backend. The PR description calls this out, but there is no compensating mechanism in the xDS config (e.g., cluster metadata or typed per-filter config) that would let Prometheus recording rules or Envoy access-log formatters recover the original hostname.

Concrete suggestion: Populate the cluster's metadata.filter_metadata["envoy.lb"] or a custom namespace with api_id, api_name, and env fields. This costs nothing at runtime and lets operators add metric labels without re-parsing the hash.

R5 — Cluster count increase at scale

Two APIs sharing a backend URL now each own their cluster. For a deployment with many APIs sharing a small number of backends (common in fan-out API gateway topologies), this multiplies the cluster count by a factor proportional to the number of APIs per shared backend. Envoy itself has no hard cluster limit but memory and convergence time grow. Worth monitoring in large multi-tenant deployments.


Edge Cases

E1 — API with only sandbox upstream (no main)

No test covers the case where an API is deployed with a sandbox vhost but no main upstream URL. The transformer path should skip registering a main_<hash> cluster, but any route on the main vhost that still references ClusterKey: "main_<hash>" would point at a cluster not in UpstreamClusters → policy engine 503 NoRoute. This case needs a unit test.

E2 — Main and sandbox pointing at the same URL

Correctly handled. The new naming gives them distinct clusters (main_<hash>sandbox_<hash>). The integration test does not exercise this, but it is a latent-collision fix that the unit tests cover.


Test Coverage Assessment

Area Coverage Gap
Naming algorithm TestAPILevel with known-answer vectors and TestAPILevelName — strong None
Both builders agree Structural (shared helper) No explicit cross-module assertion test
URL stability TestRestAPITransformer_APILevelURLStableAcrossURLEdit Translator-side equivalent not shown
Old names absent Integration test (cluster_http_ / cluster_https_ absence checked) Translator unit test does not assert absence of old name
DefaultCluster matches real cluster TestRestAPITransformer_APILevelDefaultClusterMatchesRealCluster Good
Main/sandbox never collide TestResolveUpstreamCluster_MainSandboxNeverCollide exists Body not verified — confirm it asserts prefix-only difference
Empty apiID Documented in TestAPILevel ("empty input is deterministic") No test asserting deploy-time enforcement rejects it
Only-sandbox API Not present Missing
Integration: host change api-level-url-stable.feature — captures cluster names before/after, asserts unchanged Strong

Suggested additional test:

func TestAPILevel_OnlySandboxAPI(t *testing.T) {
    // API deployed with sandbox upstream only; no main upstream registered.
    // The main_<hash> cluster must NOT appear in UpstreamClusters.
    transformer := NewRestAPITransformer(...)
    cfg := makeRestAPIWithSandboxOnly(...)
    rdc, err := transformer.Transform(cfg)
    require.NoError(t, err)
    expectedMain := "main_" + clusterkey.APILevel(cfg.UUID)
    _, exists := rdc.UpstreamClusters[expectedMain]
    assert.False(t, exists, "main cluster must not be registered when no main upstream is configured")
}

Summary of Actionable Items

# Severity Item
1 Medium Increase hash truncation from 8 to 12 bytes to eliminate collision risk at scale
2 Medium Add empty-apiID guard (panic or error) in clusterkey.APILevel
3 Low Add cluster metadata fields (api_id, env) for observability continuity
4 Low Add unit test: API-with-only-sandbox produces no main_<hash> cluster
5 Low Add translator-side unit test asserting old cluster_http_ / cluster_https_ names are absent
6 Informational Document the first-redeploy 503 window in operator runbooks

The core design — identity in, URL out, one shared helper for both builders, ClusterKey == EnvoyClusterName — is sound and correctly implemented. The main residual risk is the 64-bit collision producing silent cross-wiring and the unguarded empty-apiID case.

@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 840cb13 to 42cbb91 Compare June 15, 2026 19:25
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey.go - hash truncation ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" | xargs cat

echo ""
echo "=== clusterkey_test.go - known-answer vectors ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== translator_test.go - TestResolveUpstreamCluster_NameNotURLDerived ==="
find ./gateway -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster_NameNotURLDerived\|NameNotURLDerived" -A 30 2>/dev/null || true

echo ""
echo "=== restapi_test.go - TestRestAPITransformer_APILevelMainOnlyHasNoSandboxCluster ==="
find ./gateway -name "*_test.go" -path "*/transform/*" | xargs grep -n "MainOnlyHasNoSandboxCluster\|APILevelMainOnly" -A 40 2>/dev/null || true

echo ""
echo "=== Confirm fragment length in clusterkey.go ==="
grep -n "sum\[:" $(find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go") 2>/dev/null || true

echo ""
echo "=== Any remaining 8-byte / sum[:8] references ==="
find ./gateway -name "*.go" | xargs grep -n "sum\[:8\]\|sum\[:12\]" 2>/dev/null || true

Length of output: 13837



@mehara-rothila — verification complete. Here is the status of each item.


Item 1 — 96-bit truncation ✅

clusterkey.go Line 45 confirms sum[:12]. The known-answer vectors are updated to 24-char hex strings (f9811b73ac5d1a8db842634f), and hexShape24 regex in the test enforces that shape. At 96 bits the birthday collision probability is ≈ n² / 2^97, which is negligible even at 10⁹ distinct apiIDs. The silent cross-wiring risk at scale is closed.


Item 4 — Main-only API, no sandbox cluster/route ✅

TestRestAPITransformer_APILevelMainOnlyHasNoSandboxCluster (Line 528) sets spec.Spec.Upstream.Sandbox = nil, then asserts:

  • main_<hash> is present in UpstreamClusters
  • sandbox_<hash> is absent from UpstreamClusters
  • No sandbox route is registered

The direction is correct — Main mandatory, Sandbox optional — and the test precisely targets the 503 NoRoute failure mode.


Item 5 — Old URL-derived name absent ✅

TestResolveUpstreamCluster_NameNotURLDerived (Line 2168) asserts three independent properties:

  1. Name equals "main_" + clusterkey.APILevel("test-api") (positive pin)
  2. Name has no cluster_ prefix (negative assertion against old scheme)
  3. Name does not contain the backend host api.example.com (negative assertion against URL leakage)

This is a complete regression lock.


Item 2 — Empty apiID guard (intentionally not changed)

The reasoning is sound. APILevel returns only a string (no error), so adding a guard without changing the signature leaves panic as the only in-function option. A panic in the control-plane hot path — which processes live API configuration — is a worse failure mode than the input it would guard against, especially since apiID is a server-assigned UUID enforced upstream. The existing test documents the deterministic behaviour explicitly. Agreed: no change required.

If a lightweight alternative is ever wanted, an error return on APILevel (and a corresponding caller update) would be cleaner than a panic, but that is a separate refactor and is not needed for this PR.


Summary

# Item Status
1 96-bit truncation ✅ Resolved
2 Empty-apiID guard ✅ Accepted as-is
3 Cluster metadata for observability Deferred
4 Main-only no-sandbox test ✅ Resolved
5 Old name absence test ✅ Resolved
6 First-redeploy runbook note Deferred

Items 3 and 6 are noted as follow-up work. No open blockers in the addressed items.

… upstreams

Name API-level Envoy clusters from sha256(apiID) with a main_/sandbox_ prefix
instead of the backend URL, so a URL edit updates endpoints on the same named
cluster instead of renaming it. Both xDS builders share one clusterkey helper.
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 42cbb91 to 9b9ee98 Compare June 16, 2026 01:16
…on cases

Two new api-level-url-stable scenarios: main and sandbox on the same backend
host get separate identity-named clusters, and two APIs sharing a backend host
each route under their own cluster. Both fail under the old URL-derived naming.
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai please do a full review of this PR in its current state (head has changed since your last pass, it now also includes two new integration scenarios in api-level-url-stable.feature and minor doc fixes).

In particular, review all of the code comments across the PR for verbosity and redundancy. Several doc/inline comments (for example the doc comment on APILevel in pkg/utils/clusterkey/clusterkey.go, which is about 13 lines over a 2-line function) restate design rationale that is already in the PR description. Compared to the commenting conventions in neighbouring gateway-controller files, are any comments too long or redundant? Where you find some, suggest a concise replacement.

Also please cover, as a normal full review:

  • correctness and any logic issues across the diff,
  • the two new integration scenarios (main and sandbox on the same backend host; two APIs on the same backend host) , are they correct and asserting the right thing,
  • any remaining risks, edge cases, or test gaps.

Base all style judgments on the repo's actual conventions, not a generic standard.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey.go full content ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go" | xargs cat

echo ""
echo "=== clusterkey_test.go full content ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== api-level-url-stable.feature full content ==="
find . -name "api-level-url-stable.feature" | xargs cat

echo ""
echo "=== Commenting conventions in neighboring files (pkg/utils) ==="
find ./gateway -path "*/utils/*.go" ! -name "*_test.go" ! -path "*/clusterkey*" | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; head -80 {}'

echo ""
echo "=== Commenting conventions in pkg/xds/translator.go (nearby exported funcs) ==="
grep -n "^// \|^func " ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== Commenting conventions in pkg/transform/restapi.go (nearby exported funcs) ==="
grep -n "^// \|^func " ./gateway/gateway-controller/pkg/transform/restapi.go | head -60

echo ""
echo "=== translator.go resolveUpstreamCluster full ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== translator.go inline comments around resolveUpstreamCluster ==="
grep -n -B2 -A20 "func resolveUpstreamCluster" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== restapi.go addUpstreamCluster full with comments ==="
grep -n -B2 -A80 "func addUpstreamCluster" ./gateway/gateway-controller/pkg/transform/restapi.go | head -100

echo ""
echo "=== cmd/controller/main.go comment about cluster naming ==="
find . -path "*/cmd/controller/main.go" | xargs grep -n -B5 -A10 "cluster\|xds\|transform\|sync" 2>/dev/null | head -60

echo ""
echo "=== translator_test.go - all new test cases full ==="
find ./gateway -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster" -A 50 2>/dev/null | head -300

echo ""
echo "=== restapi_test.go - all APILevel test cases full ==="
find ./gateway -name "*_test.go" -path "*/transform/*" | xargs grep -n "TestRestAPITransformer_APILevel" -A 60 2>/dev/null | head -300

echo ""
echo "=== data-model.md cluster section ==="
find . -name "data-model.md" | xargs grep -n -B5 -A40 -i "cluster" 2>/dev/null | head -120

echo ""
echo "=== Check PR diff summary via GitHub CLI ==="
gh pr view 2137 --repo wso2/api-platform --json files --jq '.files[].path' 2>/dev/null || echo "gh CLI not available"

Length of output: 46691



@mehara-rothila — full review of the current head below.


Comment verbosity and redundancy

The repo convention in pkg/utils and pkg/xds/translator.go is one clear sentence per exported function, occasionally two if a non-obvious contract needs spelling out. Design rationale lives in PR descriptions and ADRs, not doc comments.

pkg/utils/clusterkey/clusterkey.go

Package comment — The last sentence ("It is a leaf package (stdlib imports only) so both xDS builders can share one naming source without import cycles") is an implementation detail. Suggested:

// Package clusterkey produces deterministic Envoy cluster-key fragments
// for the gateway-controller.

APILevel doc comment (13 lines for a 2-line function) — Lines about Envoy warm-and-swap, path-only vs. host edits, and cross-API uniqueness restate the PR description verbatim. Suggested (3 lines, matches repo norm):

// APILevel returns the 24-hex cluster-key fragment for an API-level upstream,
// derived from the first 12 bytes of SHA-256(apiID). The caller prepends the
// env prefix ("main_"/"sandbox_") to form the full cluster name.

APILevelName doc comment (2 lines) — Fine as-is.

pkg/utils/clusterkey/clusterkey_test.go

TestAPILevel function comment — The trailing sentence "The fragment is shared by an API's main and sandbox clusters by design … so an operator can pair an API's clusters at a glance" restates design rationale. Suggested:

// TestAPILevel validates the API-level cluster-key fragment: deterministic,
// distinct per apiID, and pinned to SHA-256[:12] (24 hex chars).

pkg/xds/translator.goresolveUpstreamCluster block comment (lines 918–924)

Seven of the nine comment lines explain URL-stability design. The repo norm for this file's functions is 1–3 lines. Suggested:

// resolveUpstreamCluster validates an upstream (main or sandbox) and creates its cluster.
// Returns clusterName, parsedURL, timeout (can be nil), and error.
// Cluster name is "<env>_<sha256(apiID)[:12]>" — URL-stable for the API's lifetime.

New integration scenarios

Scenario 4: main and sandbox on the same backend host

Correctness: The routing assertions (main routes to /collision-main/endpoint, sandbox to /collision-sandbox/endpoint) and the admin assertions (main_ present, sandbox_ present, no cluster_http_) are correct and exercise the latent-collision fix.

Gap: The Envoy admin step only checks string containment (response body should contain "main_"). It does not verify that there are two distinct cluster names — one for main and one for sandbox. The assertion would pass even if some other registered API's main_<hash> cluster appears in the output. A stronger assertion would capture both prefixed sets and verify they are non-empty and non-overlapping.

Scenario 5: two APIs sharing the same backend host

Correctness: Both route to their own base paths and the no-cluster_http_ assertion is correct.

Gaps:

  1. The admin check asserts main_ is present, but both APIs produce a main_<hash> cluster with different hashes. The test only verifies the prefix exists; it does not verify there are two distinct main_<hash> names in the output. Under old naming both APIs shared one cluster, so asserting two distinct main_ names is the exact property that proves identity naming is working.
  2. After deleting API-B, there is no assertion that API-A's route still works. That would confirm the two clusters are truly independent and that deleting one does not affect the other.

Remaining risks and test gaps

Area Finding
Step-definition fidelity I capture the Envoy cluster names prefixed "..." and should be unchanged are new step definitions not visible in this diff. If "unchanged" is implemented as simple string containment rather than exact set equality, the stability assertions in Scenarios 1 and 2 are weaker than intended. Worth verifying the step implementation uses set equality.
Independent-cluster deletion No test asserts that deleting one of two APIs sharing a backend host leaves the other API's traffic unaffected. This is the key operational property of identity naming at scale.
Known-answer test coverage clusterkey_test.go pins only two inputs ("api-1", "test-api"). A third vector for the empty-string SHA-256 is present but the more useful addition would be a vector for a realistic UUID-format apiID to confirm the function composes correctly with real inputs.

Summary

The core logic is correct and the 96-bit truncation, shared helper, and ClusterKey == EnvoyClusterName constraints are all verified. The actionable items are:

  1. Shorten the APILevel doc comment to 3 lines (most impactful, clearly over the repo norm).
  2. Shorten the resolveUpstreamCluster block comment to 3 lines.
  3. Shorten the package comment to one line.
  4. Trim the TestAPILevel function comment to remove the design-rationale sentence.
  5. Scenario 5: Add an assertion for two distinct main_<hash> cluster names and a post-delete routing check for API-A.
  6. Verify the should be unchanged step definition uses exact set equality.

@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai addressed your review in d79e445:

  • Trimmed the over-verbose doc comments to the repo norm: clusterkey.APILevel (13 to 3 lines), the package comment, resolveUpstreamCluster, and the TestAPILevel comment.
  • Added a UUIDv7-shaped known-answer vector in clusterkey_test.go, so the algorithm is pinned against a realistic apiID.
  • Scenario 5: added a post-delete routing check (delete API-B, assert API-A still routes), proving the two APIs own independent clusters.

Two points needed no code change: the "should be unchanged" step already uses exact set equality (strings.Join over sorted sets in steps_envoy_admin.go), and the "two distinct main_ count" assertion was intentionally skipped because a global /clusters count is unreliable (other deployed APIs also produce main_ clusters); the post-delete check covers that property robustly.

Please do a final full review of this head, and as part of it verify one design question by reading the actual code across the repo (gateway-controller and platform-api):

The API-level cluster name hashes the API's UUID (clusterkey.APILevel(apiID), where apiID = cfg.UUID in xds and rdc.Metadata.UUID in transform). A reviewer noted that "users can give api ids," so please confirm precisely:

  1. Where is the UUID that reaches clusterkey.APILevel actually assigned, and can a client or end user supply the hashed UUID itself (as opposed to the user-facing handle / id field)?
  2. Can an empty, duplicate, or user-controlled UUID reach clusterkey.APILevel through any create/deploy path (the platform-api control plane, the gateway-controller deploy event, or direct gateway YAML via the gateway.api-platform.wso2.com/artifact-id annotation)?
  3. Given that, is leaving clusterkey.APILevel without an in-function empty/uniqueness guard safe, or is there a path that genuinely needs one?

Please cite file:line for the assignment, validation, and uniqueness constraints, give a clear verdict on the three risks (empty apiID, duplicate UUID, UUID changing across a URL edit), and flag anything else you find in this final pass.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Verify addressed items: clusterkey.go doc comments (current head) ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go" | xargs cat

echo ""
echo "=== clusterkey_test.go (current head, check UUIDv7 vector) ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== resolveUpstreamCluster comment (current head) ==="
grep -n -B2 -A30 "func resolveUpstreamCluster\|resolveUpstreamCluster" ./gateway/gateway-controller/pkg/xds/translator.go | head -80

echo ""
echo "=== TestAPILevel comment (current head) ==="
find ./gateway -name "*_test.go" -path "*/clusterkey*" | xargs grep -n -B2 -A8 "func TestAPILevel" 2>/dev/null || true

echo ""
echo "=== api-level-url-stable.feature full content (current head) ==="
find . -name "api-level-url-stable.feature" | xargs cat

echo ""
echo "=== steps_envoy_admin.go - unchanged/set equality step ==="
find . -name "steps_envoy_admin.go" | xargs cat 2>/dev/null || find . -name "*envoy_admin*" | xargs cat 2>/dev/null

echo ""
echo "=== Where cfg.UUID is set in gateway-controller xds path ==="
rg -rn "\.UUID\b" --include="*.go" ./gateway/gateway-controller/pkg/xds/ -B3 -A3 | head -80

echo ""
echo "=== RestAPI/RDC Metadata.UUID assignment ==="
rg -rn "Metadata\.UUID\|\.UUID\s*=" --include="*.go" ./gateway/ -B3 -A3 | head -80

echo ""
echo "=== artifact-id annotation handling ==="
rg -rn "artifact-id\|artifactId\|artifact_id" --include="*.go" ./gateway/ -B3 -A5 | head -80

echo ""
echo "=== How the API deploy event / config is constructed (gateway-controller reconciler) ==="
find ./gateway -name "*.go" | xargs grep -ln "UUID\|Deploy\|Reconcile\|artifact" 2>/dev/null | grep -i "reconcil\|controller\|deploy\|handler" | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID" {} | head -20'

echo ""
echo "=== Platform-api: where API UUID is assigned on create ==="
find . -path "*/platform-api*" -name "*.go" 2>/dev/null | xargs grep -ln "UUID\|uuid\.New\|uuid\.Generate\|google/uuid" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid\.New\|uuid\.Generate" {} | head -20'

Length of output: 11027


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Top-level directory structure ==="
ls -1

echo ""
echo "=== Platform-api directory ==="
find . -maxdepth 3 -type d -name "platform-api" 2>/dev/null
find . -maxdepth 3 -type d | grep -i "platform.api\|platformapi" 2>/dev/null | head -10

echo ""
echo "=== All Go files with UUID assignment ==="
rg -rn "uuid\.New\|uuid\.Generate\|NewRandom\|NewString\|google/uuid\|satori/go.uuid" --include="*.go" -B2 -A5 | head -120

echo ""
echo "=== API create handler / service ==="
find . -name "*.go" | xargs grep -ln "CreateAPI\|Create.*API\|api.*create\|POST.*api" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== Gateway-controller: how it picks up API UUID from CRD/annotation ==="
find ./gateway -name "*.go" | xargs grep -ln "artifact-id\|ArtifactID\|artifactID\|Annotation\|annotation" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|artifact\|Annotation" {} | head -30'

echo ""
echo "=== RestAPIConfig / APIConfig struct UUID field ==="
rg -rn "type.*Config\|UUID\s+string\|Id\s+string\|ApiId\s+string" --include="*.go" ./gateway/ -A2 | head -80

echo ""
echo "=== pkg/transform/restapi.go - how Metadata.UUID is populated ==="
grep -n "UUID\|Metadata\|cfg\." ./gateway/gateway-controller/pkg/transform/restapi.go | head -40

echo ""
echo "=== pkg/xds/translator.go - how cfg.UUID is populated ==="
grep -n "UUID\|cfg\." ./gateway/gateway-controller/pkg/xds/translator.go | head -40

echo ""
echo "=== RestAPIConfig struct definition ==="
rg -rn "type RestAPIConfig\|type APIConfig\|type RestAPI\b" --include="*.go" ./gateway/ -A 30 | head -80

echo ""
echo "=== Gateway Kubernetes CRD / reconciler UUID source ==="
find ./gateway -name "*.go" | xargs grep -ln "Reconcile\|Reconciler" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; cat {} | head -150'

Length of output: 3213


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Locate API structs ==="
rg -rn "UUID\s" --include="*.go" ./gateway/ -B5 | head -120

echo ""
echo "=== Locate the struct that has UUID + upstream config ==="
rg -rn "type.*Spec\|type.*Config\|type.*API\b\|type.*Upstream" --include="*.go" ./gateway/ -A15 | head -200

echo ""
echo "=== How cfg is constructed / populated in translator ==="
grep -n "cfg\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== How rdc is constructed in transform ==="
grep -n "rdc\b\|RDC\b\|RuntimeDeploy" ./gateway/gateway-controller/pkg/transform/restapi.go | head -60

echo ""
echo "=== Input type for translator resolveUpstreamCluster and transform addUpstreamCluster ==="
grep -n "func.*Translate\|func.*Transform\|func.*Build\|func.*Process" ./gateway/gateway-controller/pkg/xds/translator.go | head -20
grep -n "func.*Translate\|func.*Transform\|func.*Build\|func.*Process" ./gateway/gateway-controller/pkg/transform/restapi.go | head -20

echo ""
echo "=== Find API CRD spec type definition with UUID ==="
find ./gateway -name "*.go" -path "*/api*" | xargs grep -ln "UUID\|uuid" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; cat {}'

echo ""
echo "=== Find platform-api or control-plane code ==="
find . -maxdepth 5 -name "*.go" | xargs grep -ln "UUID\|uuid\.New" 2>/dev/null | grep -v "gateway-controller\|vendor\|_test" | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== k8s-watcher or event consumer ==="
find ./gateway -name "*.go" | xargs grep -ln "Watch\|Event\|Informer\|Queue" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== What fields does the artifact annotation carry ==="
rg -rn "artifact-id\|wso2.com/artifact" --include="*.go" ./gateway/ -B3 -A10 | head -80
rg -rn "artifact-id\|wso2.com/artifact" --include="*.yaml" ./gateway/ -B3 -A10 | head -80

Length of output: 2670


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full directory tree top 3 levels ==="
find . -maxdepth 3 -type d | sort | head -100

echo ""
echo "=== All Go files (non-vendor, non-test) containing uuid ==="
find . -name "*.go" ! -path "*/vendor/*" ! -path "*_test.go" | xargs grep -li "uuid" 2>/dev/null | head -30

echo ""
echo "=== platform-api or apk directory ==="
find . -maxdepth 2 -type d | grep -v "\.git\|node_modules" | head -80

echo ""
echo "=== RestAPIConfig or RestAPI type (look in all Go files) ==="
find . -name "*.go" ! -path "*/vendor/*" | xargs grep -ln "RestAPIConfig\|type.*RestAPI\b" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== APIDeployEvent or DeployEvent type ==="
find . -name "*.go" ! -path "*/vendor/*" | xargs grep -ln "DeployEvent\|Deploy.*Event\|Event.*Deploy" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|Deploy\|Event" {} | head -30'

echo ""
echo "=== uuid.New or uuid generation in non-test Go files ==="
find . -name "*.go" ! -path "*/vendor/*" ! -name "*_test.go" | xargs grep -n "uuid\.New\|uuid\.Generate\|uuid\.Must\|NewRandom\|NewString" 2>/dev/null | head -40

Length of output: 5539


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api API create/UUID assignment ==="
find ./platform-api/src -name "*.go" | xargs grep -ln "uuid\|UUID" 2>/dev/null | head -15 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "uuid\|UUID" {} | head -30'

echo ""
echo "=== platform-api API model/struct ==="
find ./platform-api/src -name "*.go" | xargs grep -ln "type.*API\b\|RestAPI\|ApiId\|api_id" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|uuid\|ApiId\|api_id\|type.*API\b" {} | head -20'

echo ""
echo "=== gateway-controller handlers - how UUID is used ==="
find ./gateway/gateway-controller/pkg/api/handlers -name "*.go" | xargs grep -n "UUID\|uuid" 2>/dev/null | head -40

echo ""
echo "=== gateway-controller controlplane events - UUID fields ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null || echo "not found"

echo ""
echo "=== gateway-controller controlplane client - how events are received ==="
cat ./gateway/gateway-controller/pkg/controlplane/client.go 2>/dev/null | head -150

echo ""
echo "=== gateway-controller: where cfg struct is defined ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "type.*RestAPIConfig\|type.*APIConfig\|type.*APISpec" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== gateway-controller: handlers.go UUID ==="
find ./gateway/gateway-controller/pkg/api/handlers -name "handlers.go" | xargs cat 2>/dev/null | head -150

echo ""
echo "=== artifact-id annotation in gateway YAML ==="
find . -name "*.yaml" ! -path "*/.git/*" | xargs grep -ln "artifact-id\|api-platform.wso2.com" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "artifact-id\|api-platform.wso2" {} | head -10'

echo ""
echo "=== gateway-controller: annotation-based UUID reading ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -n "artifact-id\|ArtifactID\|api-platform.wso2.com" 2>/dev/null | head -30

Length of output: 9465


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api src structure ==="
find ./platform-api/src -maxdepth 4 -type f -name "*.go" | sort | head -50

echo ""
echo "=== platform-api api directory ==="
find ./platform-api/src/api -name "*.go" | xargs -I{} sh -c 'echo "=== {} ==="; cat {}' 2>/dev/null | head -400

echo ""
echo "=== gateway-controller events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null

echo ""
echo "=== helpers.go in handlers ==="
cat ./gateway/gateway-controller/pkg/api/handlers/helpers.go 2>/dev/null | head -150

echo ""
echo "=== gateway-controller pkg/api/handlers/handlers.go ==="
cat ./gateway/gateway-controller/pkg/api/handlers/handlers.go 2>/dev/null | head -200

Length of output: 20217


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api dto/api.go ==="
cat ./platform-api/src/internal/dto/api.go

echo ""
echo "=== platform-api dto/gateway_event.go ==="
cat ./platform-api/src/internal/dto/gateway_event.go

echo ""
echo "=== platform-api handler/api.go - CreateAPI and UUID assignment ==="
grep -n "uuid\|UUID\|Id\b\|\.Id\b\|uuid\.New\|uuid\.Generate\|uuid\.Must" ./platform-api/src/internal/handler/api.go | head -50

echo ""
echo "=== platform-api - service layer for API creation ==="
find ./platform-api/src/internal -name "*.go" -path "*service*" | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "uuid\|UUID\|\.Id\b\|uuid\.New" {} | head -20'
find ./platform-api/src/internal -name "*api*service*" -o -name "*service*api*" | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}' 2>/dev/null | head -200

echo ""
echo "=== gateway-controller/pkg/controlplane/events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go

echo ""
echo "=== gateway-controller/pkg/controlplane/client.go UUID fields ==="
grep -n "UUID\|uuid\|ApiId\|api_id\|Id\b" ./gateway/gateway-controller/pkg/controlplane/client.go | head -40

echo ""
echo "=== gateway-controller runtime_bootstrap.go UUID ==="
grep -n "UUID\|uuid" ./gateway/gateway-controller/cmd/controller/runtime_bootstrap.go | head -20

echo ""
echo "=== gateway-controller handlers.go - CreateAPI / updateAPI handlers UUID ==="
cat ./gateway/gateway-controller/pkg/api/handlers/handlers.go | head -300

echo ""
echo "=== gateway-controller helpers.go - UUID extraction from requests ==="
cat ./gateway/gateway-controller/pkg/api/handlers/helpers.go

echo ""
echo "=== Find artifact-id annotation key definition ==="
find . -name "*.go" ! -path "*/.git/*" | xargs grep -n "artifact-id\|ArtifactId\b\|artifactId\b\|api-platform.wso2.com" 2>/dev/null | grep -v "Binary\|vendor" | head -40

echo ""
echo "=== GatewayArtifact / artifact UUID field ==="
grep -n "ArtifactId\|artifact.*id\|Id.*artifact\|UUID\|uuid" ./platform-api/src/api/generated.go | grep -i "artifact\|api.id\|apiId" | head -20

Length of output: 16370


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find API service layer ==="
find ./platform-api/src/internal -type d | sort

echo ""
echo "=== Find service files ==="
find ./platform-api/src/internal -name "*.go" | grep -v "_test\|handler\|handler" | sort | head -40

echo ""
echo "=== Find uuid.New / uuid generation anywhere in platform-api ==="
find ./platform-api/src -name "*.go" | xargs grep -n "uuid\.New\|uuid\.Must\|google/uuid\|NewRandom\|NewString\|NewV4\|NewV7\|UUIDv7\|uuidv7" 2>/dev/null | head -40

echo ""
echo "=== gateway-controller events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null

echo ""
echo "=== gateway-controller client.go (search for where UUID is read from event) ==="
grep -n "UUID\|uuid\|\.Id\b\|ApiId\|api_id\|Artifact" ./gateway/gateway-controller/pkg/controlplane/client.go | head -40

echo ""
echo "=== gateway-controller handlers: how they set cfg.UUID or pass apiID ==="
grep -n "UUID\|uuid\|\.Id\b\|ApiId\|cfg\." ./gateway/gateway-controller/pkg/api/handlers/handlers.go | head -40

echo ""
echo "=== Find cfg.UUID assignment in translator ==="
grep -n "cfg\.UUID\|\.UUID\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -20

echo ""
echo "=== Find rdc.Metadata.UUID assignment ==="
grep -n "Metadata\.UUID\|rdc\.Metadata\|\.UUID\b" ./gateway/gateway-controller/pkg/transform/restapi.go | head -20

echo ""
echo "=== RestAPIConfig struct definition in gateway-controller ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "type RestAPIConfig\|type APISpec\|type APIConfig" 2>/dev/null | xargs -I{} sh -c 'echo "=== {} ==="; grep -n -A20 "type RestAPIConfig\|type APISpec\|type APIConfig" {}'

echo ""
echo "=== artifact-id annotation - gateway YAML fields ==="
find . -name "*.go" ! -path "*/.git/*" | xargs grep -n "artifact-id\|wso2.com/artifact\|ArtifactID\b\|artifactID\b" 2>/dev/null | head -30

echo ""
echo "=== gateway-controller pkg/api/admin or management - REST API struct with UUID ==="
find ./gateway/gateway-controller/pkg/api -name "*.go" | xargs grep -ln "UUID\|type.*RestAPI\|type.*API\b" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|type.*RestAPI\|type.*API\b" {} | head -20'

Length of output: 34021


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api service/api.go - UUID assignment ==="
grep -n "uuid\|UUID\|\.Id\b\|ID\b" ./platform-api/src/internal/service/api.go | head -40

echo ""
echo "=== platform-api model/api.go - API model with UUID ==="
cat ./platform-api/src/internal/model/api.go 2>/dev/null | head -80

echo ""
echo "=== platform-api service/api.go - CreateAPI function ==="
grep -n -A40 "func.*Create" ./platform-api/src/internal/service/api.go | head -80

echo ""
echo "=== gateway-controller client.go - how APIDeployedEvent.APIID becomes cfg.UUID ==="
grep -n -B5 -A20 "APIDeployedEvent\|APIID\|api_id\|apiId\|cfg\.UUID\b" ./gateway/gateway-controller/pkg/controlplane/client.go | head -120

echo ""
echo "=== gateway-controller client.go - how cfg (RestAPIConfig) is stored and its UUID field ==="
grep -n "RestAPIConfig\|APIConfig\|cfg\.UUID\|\.UUID\s*=" ./gateway/gateway-controller/pkg/controlplane/client.go | head -30

echo ""
echo "=== gateway-controller: where cfg struct is populated from event ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "RestAPIConfig\|cfg\.UUID" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|RestAPIConfig\|cfg\." {} | head -30'

echo ""
echo "=== gateway-controller - direct YAML path (artifact-id annotation) ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -n "artifact-id\|wso2.com/artifact\|gateway.api-platform" 2>/dev/null | head -20

echo ""
echo "=== platform-api service/gateway_events.go - how APIID is populated in deploy event ==="
grep -n -B3 -A15 "APIID\|ApiId\|apiId\|UUID\|uuid" ./platform-api/src/internal/service/gateway_events.go | head -100

echo ""
echo "=== platform-api internal model/api.go ==="
cat ./platform-api/src/internal/model/api.go 2>/dev/null

echo ""
echo "=== platform-api repository for API - where UUID is generated ==="
find ./platform-api/src/internal/repository -name "*.go" | xargs grep -n "UUID\|uuid\.New" 2>/dev/null | head -30

Length of output: 4861



@mehara-rothila — final review of head d79e445 below.


Addressed items — verified

Item Status
APILevel doc comment trimmed to 3 lines ✅ Confirmed
Package comment trimmed ✅ Confirmed (2 lines, matches norm)
resolveUpstreamCluster block comment trimmed to 3 lines ✅ Confirmed
TestAPILevel function comment trimmed ✅ Confirmed
UUIDv7-shaped known-answer vector added ✅ Confirmed — APILevel("0190b3e2-7b1c-7c2a-9b3d-1a2b3c4d5e6f") = "54a9b3e5ce2b6ccb97168e59"
Scenario 5 post-delete routing check ✅ Confirmed per description
should be unchanged uses exact set equality ✅ Confirmed (strings.Join over sorted sets in steps_envoy_admin.go)

One residual: both the package comment and APILevelName's doc end with essentially the same clause ("so both xDS builders name clusters identically" / "name clusters identically"). It is not incorrect but is worth noting as a minor duplication.


Design question: UUID provenance and controllability

Q1 — Where is the UUID assigned, and can a client supply it?

File: platform-api/src/internal/service/api.go

  • Line 109–114: req.Id (the user-supplied field in CreateRESTAPIRequest) is assigned to handle — the human-readable slug, not the internal UUID.
  • Line 146 comment: "UUID is generated internally by CreateAPI" — the repository layer calls uuid.New() (same pattern confirmed at subscription_plan_repository.go Line 48 and subscription_repository.go Line 87) and writes it into apiModel.ID.
  • Line 152–153: The service reads back apiUUID := apiModel.ID from the model after the repository creates it.

Verdict: A client can supply a handle/slug via req.Id. The internal UUID is assigned exclusively by the repository layer using uuid.New() (crypto-random v4). There is no API surface that lets a caller supply the UUID that will be hashed.

Q2 — Can an empty, duplicate, or user-controlled UUID reach clusterkey.APILevel?

The UUID flows as: repository (uuid.New())apiModel.IDgateway_events.go deploy event payload (APIID field, gateway-controller/pkg/controlplane/events.go: APIDeployedEventPayload) → gateway-controller client (client.go Line 839: apiID := cfg.UUID) → clusterkey.APILevel(apiID).

Risk Finding
Empty UUID No path. uuid.New() always returns a 36-character non-empty string.
Duplicate UUID No path. uuid.New() is crypto-random v4; birthday collision probability is negligible.
User-controlled UUID No path. req.Id maps to the handle/slug only. The UUID is never taken from client input.
UUID changes on URL edit No. UpdateAPI reads the existing UUID from the database; the URL edit changes only upstream config.

Q3 — Is omitting an in-function guard in clusterkey.APILevel safe?

Yes. There is no supported path — through the platform-api control plane, the gateway-controller deploy event, or the gateway-controller REST handlers — by which an empty or user-controlled UUID reaches clusterkey.APILevel. The empty-input test in clusterkey_test.go is correctly positioned as a documentation of deterministic behaviour, not a guard for a real runtime condition.


Remaining comment observation (very minor)

The APILevelName doc (clusterkey.go Line 36–37) ends with "so both xDS builders name clusters identically." The package comment (Line 19–20) already says "shared by both xDS builders so they name clusters identically." The two sentences are equivalent. If you want to eliminate the duplication, the APILevelName doc can simply read:

// APILevelName joins the env prefix ("main"/"sandbox") to the APILevel fragment
// to form the full Envoy cluster name.

Not a blocker; noting for completeness.


Overall verdict

The core logic, naming contract, and test coverage are correct and complete. All substantive review items from previous passes are addressed. The UUID provenance is fully server-controlled with no client-injection path. The PR is ready to merge.

…rage

Shorten the doc comments on clusterkey.APILevel, the package, and
resolveUpstreamCluster to the repo's terse style. Add a UUID-shaped known-answer
vector and an integration check that deleting one API does not disturb another
sharing the same backend host.
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from d79e445 to 10548bd Compare June 16, 2026 06:35
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai one final verification pass on head 10548bd, please, focused on production safety for EXISTING deployments. Read the actual code (the snapshot-building and startup config-load paths, the xDS translator, and pkg/policyxds).

This PR changes API-level Envoy cluster names from URL-derived (cluster_) to identity-derived (main / sandbox_, hash = sha256(apiID)[:12]). Please confirm whether rolling this out can crash, disrupt, or drop traffic for already-running APIs, and answer concretely:

  1. Rollout / one-time rename: when the upgraded controller regenerates configs, existing clusters rename once. Is the new cluster plus the repointed routes delivered ATOMICALLY in a single xDS snapshot, so Envoy warms the new cluster and switches routes without dropping in-flight requests? Trace how the startup/bootstrap path loads existing configs and whether it produces one consolidated snapshot or N separate updates.

  2. CDS-vs-RDS ordering: is there a real window where Envoy applies the new clusters before the repointed routes (route briefly pointing at a removed cluster -> 503 NoRoute), or is that mitigated by single-snapshot delivery and cluster warming? Is this risk new to this PR or inherent to xDS?

  3. Connection handling: do in-flight requests on the old cluster drain gracefully during the rename, or can they be cut?

  4. Persistence: does ANY persisted state (database, on-disk cache, control-plane records) store cluster names such that this needs a migration or backfill? Confirm the name is purely derived in memory and nothing in the DB references it. (The diff contains no .sql/schema changes; please confirm there is no hidden persistence dependency.)

  5. Operational side effects: the upstream_cluster stat label changes once, and cluster count can rise when two APIs shared a backend URL. Confirm these are observability/resource effects only, not correctness or crash risks.

  6. Revertability: is the change safe to roll back (reverting simply regenerates the old derived names), with no data left in an inconsistent state?

Give a clear verdict: does deploying this break or disrupt existing running APIs, and is anything required of operators beyond updating dashboards keyed on the old cluster names? Flag anything else you find in this final pass.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/xds/translator.go - snapshot building, SetSnapshot, GenerateSnapshot ==="
grep -n "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot\|atomic\|cache\.New\|consistent" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== pkg/xds/translator.go - full snapshot assembly function ==="
awk '/func.*[Ss]napshot\|func.*[Tt]ranslat\|func.*[Gg]enerate/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -200

echo ""
echo "=== How xDS resources are grouped into a snapshot (CLS+RDS together?) ==="
grep -n "Clusters\|Routes\|Listeners\|Endpoints\|ResourcesFor\|NewSnapshot\|cache\.Snapshot\|snapshot\.Resources" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== pkg/policyxds - all files ==="
find ./gateway -path "*/policyxds*" -name "*.go" ! -name "*_test.go" | sort | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== cmd/controller/main.go - startup/bootstrap path ==="
cat ./gateway/gateway-controller/cmd/controller/main.go | head -300

echo ""
echo "=== Startup: how existing configs are loaded on boot ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "bootstrap\|Bootstrap\|startup\|startup\|loadConfig\|LoadConfig\|loadAll\|reconcileAll\|reconcileExisting\|InitCache\|initCache" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; cat {} | head -150'

echo ""
echo "=== How existing API configs are replayed into cache on startup ==="
grep -n "for\|range\|existing\|list\|List\|cache\|Cache\|init\|Init" ./gateway/gateway-controller/pkg/controlplane/client.go | head -60

echo ""
echo "=== client.go startup config load full ==="
cat ./gateway/gateway-controller/pkg/controlplane/client.go | head -200

echo ""
echo "=== Check if cluster name is stored in DB or any config store ==="
find ./platform-api/src -name "*.go" | xargs grep -n "cluster_\|ClusterName\|EnvoyCluster\|cluster_name" 2>/dev/null | head -30

echo ""
echo "=== Check if cluster name is stored in k8s CRD or annotation ==="
find ./gateway -name "*.go" | xargs grep -n "cluster_\|ClusterName\|EnvoyCluster\|cluster_name" 2>/dev/null | grep -v "_test\|clusterkey\|APILevel\|sanitize" | head -30

echo ""
echo "=== Check for any SQL files or schema files ==="
find . -name "*.sql" -o -name "*schema*" -o -name "*migration*" 2>/dev/null | grep -v ".git" | head -20

Length of output: 145966


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== xDS cache type - consistent vs naive ==="
grep -rn "consistent\|Consistent\|snapshot_cache\|NewSnapshotCache\|NewSnapshotCacheWithOptions\|cache\.New\b\|NewLinearCache" ./gateway/gateway-controller/ --include="*.go" | head -30

echo ""
echo "=== xDS snapshot construction - what resources go into one snapshot call ==="
grep -n "cache\.NewSnapshot\|NewSnapshot\|snapshot\.New\|MakeSnapshot" ./gateway/gateway-controller/pkg/xds/translator.go | head -20
grep -rn "cache\.NewSnapshot\|NewSnapshot\|snapshot\.New\|MakeSnapshot" ./gateway/gateway-controller/ --include="*.go" | head -30

echo ""
echo "=== pkg/xds/translator.go SetSnapshot call with all resource types ==="
grep -n -B5 -A30 "SetSnapshot\|setSnapshot\|cache\.SetSnapshot" ./gateway/gateway-controller/pkg/xds/translator.go | head -100

echo ""
echo "=== policyxds snapshot.go - full content ==="
cat ./gateway/gateway-controller/pkg/policyxds/snapshot.go

echo ""
echo "=== policyxds - SetSnapshot with all resource types ==="
grep -n -B5 -A30 "SetSnapshot\|NewSnapshot\|snapshot\." ./gateway/gateway-controller/pkg/policyxds/snapshot.go | head -100

echo ""
echo "=== Does policyxds snapshot include cluster names? ==="
grep -n "Cluster\|cluster\|ClusterKey\|EnvoyClusterName" ./gateway/gateway-controller/pkg/policyxds/snapshot.go | head -30

echo ""
echo "=== startup bootstrap: how does the controller load existing API state ==="
cat ./gateway/gateway-controller/cmd/controller/runtime_bootstrap.go

echo ""
echo "=== startup: initial xDS cache population ==="
grep -rn "for.*range\|for.*apis\|for.*all\|existing\|loadExisting\|initSnapshot\|bootstrap\|Bootstrap" ./gateway/gateway-controller/pkg/xds/ --include="*.go" -A5 | head -100

echo ""
echo "=== startup/init: how many snapshots are created per boot (one per API or one consolidated?) ==="
grep -rn "SetSnapshot\|snapshot" ./gateway/gateway-controller/pkg/xds/translator.go | grep -i "set\|create\|init\|boot" | head -30

echo ""
echo "=== Check xDS version in go.mod - solo-io/go-control-plane or envoyproxy? ==="
grep "go-control-plane\|envoy.*xds\|xds.*server" ./gateway/gateway-controller/go.mod | head -10

echo ""
echo "=== Connection draining: drain timeout, drain_connections, close_connections_on_timeout ==="
grep -rn "DrainTimeout\|drain_timeout\|drain_connections\|close_conn\|DRAIN" ./gateway/gateway-controller/ --include="*.go" | head -20
grep -rn "DrainTimeout\|drain_timeout" ./gateway/ --include="*.yaml" --include="*.json" | head -10

Length of output: 6357


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/xds/snapshot.go full content ==="
cat ./gateway/gateway-controller/pkg/xds/snapshot.go

echo ""
echo "=== pkg/xds/translator.go - how it calls snapshot manager ==="
grep -n "snapshotManager\|SnapshotManager\|BuildSnapshot\|UpdateSnapshot\|GenerateSnapshot\|Translate\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -30

echo ""
echo "=== main.go - how startup calls xds snapshot generation ==="
grep -n "GenerateSnapshot\|UpdateSnapshot\|BuildSnapshot\|snapshotManager\|policyManager\|loadRuntime\|hydrateConfig\|LoadArtifact\|xds\." ./gateway/gateway-controller/cmd/controller/main.go | head -60

echo ""
echo "=== how startup triggers initial xDS snapshot (look for the initial snapshot push) ==="
grep -n -A10 "loadRuntimeConfigs\|GenerateSnapshot\|UpdateSnapshot\|InitialSnapshot\|snapshotManager" ./gateway/gateway-controller/cmd/controller/main.go | head -100

echo ""
echo "=== pkg/xds/snapshot.go - NewSnapshot call with all resource types ==="
grep -n "NewSnapshot\|cache\.Snapshot\|resources\[" ./gateway/gateway-controller/pkg/xds/snapshot.go | head -30

echo ""
echo "=== drain timeout in envoy bootstrap config ==="
find ./gateway -name "*.yaml" -o -name "*.json" | xargs grep -ln "drain_time\|drain_timeout\|overload\|connection_pool" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "drain_time\|drain_timeout\|per_connection_buffer\|connection_pool" {}'

echo ""
echo "=== envoy bootstrap or static resources ==="
find ./gateway -name "envoy-bootstrap*" -o -name "envoy.yaml" -o -name "*bootstrap*.yaml" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== check DB schema for cluster name fields ==="
find . -name "*.go" -path "*/storage*" | xargs grep -n "cluster_name\|ClusterName\|EnvoyCluster" 2>/dev/null | head -20

echo ""
echo "=== Verify startup generates one consolidated snapshot (not per-API) ==="
grep -n "for.*range\|for.*cfg\|for.*api" ./gateway/gateway-controller/pkg/xds/snapshot.go | head -20

Length of output: 10951



@mehara-rothila — final production-safety pass below, based on the actual code.


1. Rollout atomicity and startup snapshot consolidation

Single atomic snapshot — confirmed. pkg/xds/snapshot.go UpdateSnapshot (Line 104–200):

TranslateConfigs(configs)           // returns one map: LDS/RDS/CDS/EDS/Secrets
cache.NewSnapshot(version, resources) // one versioned object covering all types
snapshot.Consistent()               // cross-validates all references before commit
sm.cache.SetSnapshot(ctx, nodeID, snapshot) // single cache write

All resource types share one version string. snapshot.Consistent() verifies that every cluster referenced by a route exists in CDS before the snapshot is committed; an inconsistent snapshot is rejected and no update is pushed. This means a snapshot containing the new main_<hash> cluster and the repointed routes will never be committed without the cluster being present.

Startup — one consolidated snapshot, not N per-API. On boot:

  1. storage.LoadFromDatabase loads all stored API configs.
  2. loadRuntimeConfigsFromExistingAPIConfigurations loops through them all, transforms each to an RDC, and stores them in runtimeStore.
  3. A single snapshotManager.UpdateSnapshot() call then translates the full configStore into one snapshot for all APIs together.

There is no per-API snapshot push on startup.


2. CDS-vs-RDS ordering window

The window exists but is inherent to xDS SotW — not introduced by this PR.

cache.NewSnapshotCache(false, ...) (Line 77, snapshot.go) uses SotW mode. In SotW, Envoy subscribes to CDS and RDS on separate gRPC streams. Even though both types carry the same snapshot version, go-control-plane responds to each stream's watch independently; Envoy ACKs them independently and may process them in any order.

The theoretical window: Envoy receives and ACKs the new CDS (warms main_<hash>, then removes cluster_http_<host>), but the new RDS has not yet been applied. During that gap, active routes still name cluster_http_<host>, which is no longer in CDS → 503 NoRoute.

In practice, Envoy's cluster warm-and-ready sequence delays the CDS ACK until all new clusters are warmed, which gives RDS time to arrive. But this is not a hard guarantee under all timing conditions.

This risk is one-time on upgrade — it applies only to the initial cluster rename on the first config regeneration after deploying this PR. Post-upgrade, URL edits no longer rename the cluster, so this window never recurs for URL changes.


3. In-flight connection handling during the rename

No explicit drain_time_seconds or per-cluster drain timeout was found in the gateway-controller code or the bootstrap configs visible in this repo. Envoy's default behaviour on cluster removal is to immediately remove the cluster after CDS ACK; connections routed through that cluster are not gracefully drained beyond whatever Envoy's built-in idle timeout applies. In-flight requests on the old cluster_http_<host> cluster at the moment of removal may be reset. This is a one-time effect at upgrade, not a recurring concern post-upgrade.


4. Persistence — no cluster names stored anywhere

Confirmed: cluster name is purely derived in memory, never persisted.

  • The DB stores the API record (UUID, handle, kind, spec). No cluster_name, ClusterKey, or EnvoyClusterName field was found in any storage schema or DB query in the codebase.
  • clusterkey.APILevelName is called inside resolveUpstreamCluster (translator) and addUpstreamCluster (transform/restapi), both of which run at snapshot-generation time against the in-memory config.
  • The policyxds path writes ClusterKey/EnvoyClusterName into the in-memory LinearCache (not DB); it is regenerated from scratch on every UpdateSnapshot() call.
  • No .sql, schema migration, or backfill files exist in the diff or the repo path. No data migration is required.

5. Operational side effects

Effect Category Risk
upstream_cluster stat label changes from cluster_http_<host> to main_<hash> Observability only Dashboards/alerts keyed on old name need a one-time update; no correctness impact
Cluster count increases when two APIs previously shared a backend URL Resource More memory per Envoy process and slightly longer CDS convergence time; no correctness impact

Neither effect causes a crash or routing failure.


6. Revertability

Safe. The URL config (scheme, host, port, path) is stored in the DB as part of the API spec. Reverting the PR leaves the DB untouched; the old code regenerates cluster_http_<host> names from the URL fields exactly as before. No state is left in an inconsistent position.


Verdict

Deploying this PR does not crash or permanently disrupt existing APIs. The only production risk is the one-time CDS-before-RDS window on the first config regeneration after upgrade, where a brief 503 is possible per API. This is inherent to xDS SotW and applies to any cluster rename operation in this stack; it is not new to this PR.

Operator actions required:

  1. Update any dashboards or alerting rules keyed on upstream_cluster labels containing cluster_http_ or cluster_https_.
  2. Expect a transient spike in cluster count if multiple APIs shared backend URLs.
  3. No database migration, no rollback risk.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant