Skip to content

feat(gateway): URL-stable identity-based cluster naming for API-level upstreams#2137

Open
mehara-rothila wants to merge 3 commits into
wso2:feature/operation-level-epfrom
mehara-rothila:feat/api-level-url-stable-clusters
Open

feat(gateway): URL-stable identity-based cluster naming for API-level upstreams#2137
mehara-rothila wants to merge 3 commits into
wso2:feature/operation-level-epfrom
mehara-rothila:feat/api-level-url-stable-clusters

Conversation

@mehara-rothila

@mehara-rothila mehara-rothila commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Purpose

API-level Envoy cluster names were derived from the backend URL (cluster_<scheme>_<sanitized-host>). Because the name depended on the URL, editing a backend URL renamed the cluster: the old cluster was destroyed, a new one created, and routes repointed, draining in-flight connections on every URL change. This makes a routine config edit disruptive to live traffic.

Goals

Make an API's main/sandbox cluster names stable for the lifetime of the API, so a backend-URL edit becomes an endpoint update on the same named cluster instead of a cluster rename. As a side effect, fix a latent collision where a main and sandbox upstream on the same URL, or two different APIs sharing a backend URL, collapsed into one shared cluster.

Approach

Derive the cluster name from the API identity instead of the URL: main_<hash> / sandbox_<hash>, where the hash is the first 12 bytes of sha256(apiID) and the env is a plaintext main_/sandbox_ prefix (not part of the hash). A single shared helper (pkg/utils/clusterkey) is used by both xDS builders (the Envoy xDS translator and the RuntimeDeployConfig transform path), so the two paths cannot diverge, and the two hand-rolled URL-sanitizing functions are removed. ClusterKey and EnvoyClusterName are unified to the same string so the policy engine never points Envoy at an unknown cluster (which would surface as 503 NoRoute).

Rollout is a one-time, in-memory effect: on the next config regeneration each API's clusters are renamed once, delivered atomically (new cluster plus repointed routes in a single xDS snapshot, validated by snapshot.Consistent()), so live traffic is not dropped. There is no database migration, the name is derived at translate time and never persisted.

User stories

N/A. This is an internal gateway cluster-naming change with no user-facing API or behavior change.

Documentation

N/A for external product docs. In-repo, the cluster-naming section of gateway/spec/impls/1-basic-gateway-with-controller/data-model.md is updated from the old URL-sanitized scheme to the identity-derived scheme.

Automation tests

  • Unit tests

    New pkg/utils/clusterkey tests with known-answer vectors (including a UUID-shaped apiID). Translator and transformer tests assert URL-independence, that the name carries no URL-derived information, that ClusterKey == EnvoyClusterName, and that a main-only API registers no sandbox cluster or route. Covers the new helper and both builders' naming paths.

  • Integration tests

    gateway/it/features/api-level-url-stable.feature: a main upstream URL update changing both host and path, a sandbox URL update, default-cluster resolution under cluster_header routing, main and sandbox on the same backend host getting separate identity-named clusters, and two APIs on the same backend host routing independently (with a post-delete check that removing one does not disturb the other). The main and sandbox scenarios assert cluster identity through the Envoy admin endpoint, so they fail under the previous naming scheme.

Security checks

  • Followed secure coding standards in http://wso2.com/technical-reports/wso2-secure-engineering-guidelines? yes
  • Ran FindSecurityBugs plugin and verified report? N/A (this change is in a Go module; FindSecurityBugs is a Java/SpotBugs plugin)
  • Confirmed that this PR doesn't commit any keys, passwords, tokens, usernames, or other secrets? yes

Samples

No new configuration or API surface is introduced (cluster naming is internal), so this shows the observable behavior with any REST API.

Deploy an API with a main upstream:

spec:
  upstream:
    main:
      url: http://backend-a:8080
  operations:
    - method: GET
      path: /users

The gateway names the Envoy cluster from the API identity, e.g. main_5b50e1f10eb52155ebe4d947 (the main_ prefix plus the first 12 bytes of sha256(apiID)), not from the URL.

Now change the upstream to a different host and port:

spec:
  upstream:
    main:
      url: http://backend-b:9090

The cluster name stays main_5b50e1f10eb52155ebe4d947, so the route keeps pointing at the same named cluster and Envoy updates the endpoint in place. Under the old scheme the name would have changed from cluster_http_backend-a_8080 to cluster_http_backend-b_9090, renaming the cluster on every URL edit.

Related PRs

N/A.

Test environment

GitHub Actions CI on Linux: Go unit tests and the gateway integration-test suite. No JDK, database, or browser dependency for this change.

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor

Review Change Stack

📝 Walkthrough

Summary

This PR implements identity-based cluster naming for API-level upstreams in the gateway controller, replacing URL-derived names with stable identity names. Cluster names now follow the pattern <env>_<hash>, where env is either main_ or sandbox_ and hash is the first 12 bytes of SHA-256(apiID) encoded as 24 hex characters.

Problem Solved

Previously, cluster names were derived from upstream URLs (e.g., cluster_http_backend-a_8080). When a backend URL was edited, the cluster name would change, causing in-flight connections to drain even for routine configuration changes.

Solution

The new naming scheme makes cluster names stable for the API's lifetime. URL edits now result in endpoint updates on the same named cluster without connection disruption.

Key Changes

  • New cluster-key utility (pkg/utils/clusterkey/): Provides shared APILevel() and APILevelName() helpers used by both xDS builders (translator and transformer) to eliminate naming divergence.

  • Translator updates (pkg/xds/translator.go): Now derives cluster names using the shared helper, passing the API UUID to ensure URL-independent naming. Removed URL-based sanitization for API-level clusters.

  • Transformer updates (pkg/transform/restapi.go): Uses the same cluster-key helper for RuntimeDeployConfig transforms. EnvoyClusterName and ClusterKey now reference identical values to keep policy routing aligned with actual Envoy cluster names.

  • Policy xDS test updates (pkg/policyxds/policyxds_test.go): Updated to use new identity-based cluster naming.

  • Comprehensive test coverage:

    • Unit tests with known-answer vectors and determinism checks
    • Translator tests verifying URL-independence and deduplication across different URLs
    • Transformer tests validating cluster naming shape and registry matching
    • Integration tests covering main/sandbox upstreams on the same backend host and two APIs sharing a backend host, including post-deletion routing verification
  • Feature documentation (gateway/it/features/api-level-url-stable.feature): New integration test scenarios with Envoy admin API assertions to verify cluster names are identity-derived rather than URL-derived.

  • Integration test infrastructure (gateway/it/steps_envoy_admin.go): New Envoy admin step definitions for capturing and validating cluster names via the Envoy /clusters endpoint.

Benefits

  • Cluster names remain stable across upstream URL changes, eliminating unnecessary connection drains
  • Collisions between APIs sharing the same backend URL or between main/sandbox upstreams on the same URL are now avoided
  • Configuration changes deliver atomically in a single xDS snapshot, preventing transient routing errors
  • No database migration required; cluster names are derived at translation time

Walkthrough

This PR replaces URL-derived Envoy upstream cluster naming with a deterministic, API-identity-based scheme. A new clusterkey package introduces APILevel(apiID) (first 12 bytes of SHA-256 as 24 hex chars) and APILevelName(env, apiID) (env prefix + fragment). Both the xDS translator (resolveUpstreamCluster) and the REST API transformer now use this shared helper, aligning cluster names across translation paths. The old sanitizeClusterName helper is removed. Policy-xDS test fixtures, unit tests, spec documentation, new E2E Gherkin scenarios, and Envoy admin step definitions are all updated or added accordingly.

Sequence Diagram(s)

sequenceDiagram
  participant GatewayController
  participant xDSTranslator
  participant RestAPITransformer
  participant clusterkey

  GatewayController->>xDSTranslator: translateAPIConfig(cfg)
  xDSTranslator->>clusterkey: APILevelName(upstreamName, cfg.UUID)
  clusterkey-->>xDSTranslator: deterministic cluster name
  xDSTranslator-->>GatewayController: Envoy xDS snapshot with identity-named clusters

  GatewayController->>RestAPITransformer: Transform(rdc)
  RestAPITransformer->>clusterkey: APILevelName(upstreamName, rdc.Metadata.UUID)
  clusterkey-->>RestAPITransformer: deterministic cluster name
  RestAPITransformer-->>GatewayController: ClusterKey + EnvoyClusterName aligned
Loading

Suggested Reviewers

  • RakhithaRR
  • Tharsanan1
  • tharindu1st
  • malinthaprasan
  • VirajSalaka
  • ashera96
🚥 Pre-merge checks | ✅ 4 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 58.33% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (4 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: implementing URL-stable, identity-based cluster naming for API-level upstreams in the gateway.
Description check ✅ Passed The description comprehensively covers all template sections: purpose, goals, approach, user stories, documentation, automation tests, security checks, samples, and test environment.
Linked Issues check ✅ Passed Check skipped because no linked issues were found for this pull request.
Out of Scope Changes check ✅ Passed Check skipped because no linked issues were found for this pull request.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests

Warning

There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure.

🔧 golangci-lint (2.12.2)

level=error msg="[linters_context] typechecking error: pattern ./...: directory prefix . does not contain modules listed in go.work or their selected dependencies"


Thanks for using CodeRabbit! It's free for OSS, and your support helps us grow. If you like it, consider giving us a shout-out.

❤️ Share

Comment @coderabbitai help to get the list of available commands and usage tips.

@mehara-rothila mehara-rothila changed the base branch from feature/operation-level-ep to main June 10, 2026 08:53
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch 5 times, most recently from 534247e to a99b916 Compare June 10, 2026 16:16
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai full review

@coderabbitai

coderabbitai Bot commented Jun 10, 2026

Copy link
Copy Markdown
Contributor
✅ Action performed

Full review finished.

@mehara-rothila mehara-rothila changed the base branch from main to feature/operation-level-ep June 10, 2026 18:39
@mehara-rothila mehara-rothila changed the base branch from feature/operation-level-ep to main June 12, 2026 02:24
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 5c3aed7 to 840cb13 Compare June 12, 2026 02:28
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai please do a full review of this PR , concrete suggestions, a risk analysis, and anything else you can surface. Please read the actual code, not just the description.

Correctness & design

  • The cluster name is <env>_<fragment> where fragment = first 8 bytes of sha256(apiID) , the apiID ONLY. The env ("main"/"sandbox") is a plaintext prefix, NOT part of the hash, so main and sandbox of the same API share the fragment and are distinguished only by the prefix. Trace clusterkey.APILevel / clusterkey.APILevelName and both call sites: pkg/xds/translator.go (resolveUpstreamCluster) and pkg/transform/restapi.go (addUpstreamCluster). Is this sound?
  • Confirm both xDS builders produce identical names for the same (env, apiID), so they can never disagree.
  • Verify the env prefix alone guarantees main and sandbox never collide even though they share the hash fragment.
  • Verify ClusterKey and EnvoyClusterName are the same string in the RDC path, and nothing still references the old URL-derived name.

Risk analysis

  • Collision at 64 bits: realistic probability of two DISTINCT apiIDs hashing to the same 8-byte fragment (the only thing keeping different APIs apart), and what happens if they do , silent cross-wiring or fail-closed? Should the truncation be longer?
  • Migration/one-time effects: existing APIs get clusters renamed on next redeploy. Any window where a route points at a not-yet-created cluster (503 NoRoute)? Is "new cluster + repointed routes" truly atomic in one snapshot?
  • Observability: the upstream_cluster stat label changes from cluster_<scheme>_<host> to <env>_<hash>. What breaks, and is it called out?
  • Cluster-count increase (two APIs sharing a backend URL no longer share a cluster): any resource concern at scale?
  • Does removing sanitizeClusterName / sanitizeEnvoyClusterName drop any sanitizing some input still needs?

Edge cases

  • Empty/unusual apiID, an API with only sandbox (no main), main vs sandbox pointing at identical URLs.

Tests

  • Is unit + integration coverage enough to lock the contract? Anything missing , e.g. a negative test proving the old URL-derived name is gone, and a test proving main/sandbox of one API differ only by prefix?

@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey package ==="
fd -t f . pkg/utils/clusterkey --exec cat {}

echo ""
echo "=== translator.go - resolveUpstreamCluster and related ==="
rg -n "resolveUpstreamCluster|sanitizeClusterName|clusterkey|APILevel|APILevelName|ClusterKey|EnvoyClusterName" pkg/xds/translator.go -A 5 -B 2

echo ""
echo "=== restapi.go - addUpstreamCluster and related ==="
rg -n "addUpstreamCluster|sanitizeEnvoyClusterName|clusterkey|APILevel|APILevelName|ClusterKey|EnvoyClusterName" pkg/transform/restapi.go -A 5 -B 2

echo ""
echo "=== Full clusterkey file list ==="
fd -t f . pkg/utils/clusterkey

echo ""
echo "=== Full translator.go resolveUpstreamCluster function ==="
rg -n "func resolveUpstreamCluster" pkg/xds/translator.go -A 50

echo ""
echo "=== Full addUpstreamCluster function ==="
rg -n "func addUpstreamCluster" pkg/transform/restapi.go -A 80

echo ""
echo "=== Integration test feature file ==="
fd -t f "api-level-url-stable.feature" --exec cat {}

echo ""
echo "=== Translator test file - new test cases ==="
fd -t f . pkg/xds -e go -e _test.go --exec rg -n "DedupSameAPI|MainSandboxNever|TestResolveUpstreamCluster" {} -A 30

echo ""
echo "=== restapi transformer test - new test cases ==="
fd -t f . pkg/transform -e go --exec rg -n "APILevelDefaultCluster|APILevelURL|TestRestAPITransformer_APILevel" {} -A 30

echo ""
echo "=== cmd/controller/main.go cluster name comment ==="
rg -n "cluster" cmd/controller/main.go -B 2 -A 5

echo ""
echo "=== Check if any old sanitize references remain ==="
rg -rn "sanitizeClusterName|sanitizeEnvoyClusterName|cluster_http|cluster_https" --include="*.go"

echo ""
echo "=== Check ClusterKey and EnvoyClusterName in restapi.go ==="
rg -n "ClusterKey\|EnvoyClusterName" pkg/transform/restapi.go -B 3 -A 3

echo ""
echo "=== data-model.md cluster section ==="
fd -t f "data-model.md" --exec grep -n -A 20 -i "cluster.nam" {}

Length of output: 2285


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey package full content ==="
find . -path "*/utils/clusterkey*" -type f | xargs cat

echo ""
echo "=== Check all clusterkey usages ==="
rg -rn "clusterkey\." --include="*.go" -B 2 -A 3

echo ""
echo "=== Check APILevelName / APILevel usage ==="
rg -rn "APILevelName\|APILevel(" --include="*.go" -B 2 -A 3

echo ""
echo "=== clusterkey test file ==="
find . -path "*/utils/clusterkey*_test*" -o -path "*/clusterkey*test*" | xargs cat 2>/dev/null || true

echo ""
echo "=== translator.go full resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func /' pkg/xds/translator.go 2>/dev/null | head -80

echo ""
echo "=== restapi.go full addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func /' pkg/transform/restapi.go 2>/dev/null | head -100

echo ""
echo "=== RDC struct definitions: UpstreamCluster, ClusterKey, EnvoyClusterName ==="
rg -rn "ClusterKey\b\|EnvoyClusterName\b" --include="*.go" -B 2 -A 2

echo ""
echo "=== Check for any remaining references to old URL-derived naming ==="
rg -rn 'cluster_\|scheme.*cluster\|sanitize.*cluster\|cluster.*sanitize' --include="*.go" -B 2 -A 2

echo ""
echo "=== Full translator test for new cases ==="
rg -rn "TestResolveUpstreamCluster" --include="*.go" -A 60 | head -200

echo ""
echo "=== Full restapi transformer test for new cases ==="
rg -rn "TestRestAPITransformer_APILevel" --include="*.go" -A 60 | head -200

echo ""
echo "=== xds snapshot atomicity - how snapshot is built ==="
rg -rn "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot" pkg/xds/translator.go -B 2 -A 5 | head -80

echo ""
echo "=== integration test feature file ==="
find . -name "api-level-url-stable.feature" | xargs cat 2>/dev/null

echo ""
echo "=== Check empty apiID handling ==="
rg -rn 'apiID\|api_id\|ApiId' pkg/utils/clusterkey --include="*.go" -B 1 -A 5 2>/dev/null || true
find . -path "*/clusterkey*" | xargs rg -n "empty\|\"\"" 2>/dev/null || true

Length of output: 8168


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== translator.go resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' pkg/xds/translator.go | head -100

echo ""
echo "=== translator.go clusterkey/APILevel imports and usage ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" pkg/xds/translator.go

echo ""
echo "=== restapi.go addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func [a-zA-Z]/' pkg/transform/restapi.go | head -120

echo ""
echo "=== restapi.go clusterkey/APILevel imports and usage ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" pkg/transform/restapi.go

echo ""
echo "=== restapi.go UpstreamCluster struct / ClusterKey / EnvoyClusterName field definitions ==="
grep -n "ClusterKey\|EnvoyClusterName\|UpstreamCluster" pkg/transform/restapi.go | head -40

echo ""
echo "=== Check struct definitions globally ==="
grep -rn "ClusterKey\b\|EnvoyClusterName\b" --include="*.go" . 2>/dev/null || \
grep -rn "ClusterKey\|EnvoyClusterName" . -l | xargs grep -n "ClusterKey\|EnvoyClusterName" | head -60

echo ""
echo "=== Check struct definitions globally (no --include) ==="
grep -rn "ClusterKey\|EnvoyClusterName" . --include="*.go" 2>/dev/null | head -60 || \
find . -name "*.go" | xargs grep -n "ClusterKey\|EnvoyClusterName" 2>/dev/null | head -60

echo ""
echo "=== Translator tests for new cases ==="
find . -name "*translator*test*" -o -name "*test*translator*" | grep "\.go$" | xargs grep -n "TestResolveUpstreamCluster\|DedupSameAPI\|MainSandboxNever" 2>/dev/null | head -20
find . -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster\|DedupSameAPI\|MainSandbox" 2>/dev/null | head -5

echo ""
echo "=== Full translator test cases ==="
find . -name "*_test.go" -path "*/xds/*" | head -5 | xargs -I{} awk '/func TestResolveUpstreamCluster/,/^func [A-Z]/' {} 2>/dev/null | head -200

echo ""
echo "=== Full restapi transformer test cases ==="
find . -name "*_test.go" -path "*/transform/*" | head -5 | xargs -I{} awk '/func TestRestAPITransformer_APILevel/,/^func [A-Z]/' {} 2>/dev/null | head -200

echo ""
echo "=== Integration test ==="
find . -name "api-level-url-stable.feature" 2>/dev/null | xargs cat 2>/dev/null

echo ""
echo "=== Snapshot atomicity in translator ==="
grep -n "snapshot\|Snapshot\|SetSnapshot" pkg/xds/translator.go | head -30

echo ""
echo "=== Check old URL-derived naming remnants ==="
find . -name "*.go" | xargs grep -n "cluster_http\|cluster_https\|sanitizeCluster\|sanitizeEnvoy" 2>/dev/null | head -20

echo ""
echo "=== data-model.md cluster section ==="
find . -name "data-model.md" | xargs grep -n -A 25 -i "cluster.nam\|cluster name\|cluster-nam" 2>/dev/null | head -60

Length of output: 3133


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Repo root structure ==="
ls -1

echo ""
echo "=== Find translator.go ==="
find . -name "translator.go" 2>/dev/null

echo ""
echo "=== Find restapi.go ==="
find . -name "restapi.go" 2>/dev/null

echo ""
echo "=== Find clusterkey files ==="
find . -name "*.go" | xargs grep -l "APILevel\|clusterkey" 2>/dev/null | head -20

echo ""
echo "=== Find feature file ==="
find . -name "*.feature" | xargs grep -l "url-stable\|url_stable\|cluster" 2>/dev/null | head -10

echo ""
echo "=== Find test files referencing APILevel or DedupSameAPI ==="
find . -name "*_test.go" | xargs grep -l "APILevel\|DedupSameAPI\|APILevelName\|url.stable\|URLStable" 2>/dev/null | head -10

echo ""
echo "=== Full top-level tree (2 levels) ==="
find . -maxdepth 2 -type d | sort | head -60

Length of output: 3555


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== translator.go resolveUpstreamCluster ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -120

echo ""
echo "=== translator.go clusterkey/APILevel/sanitize references ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize\|import" ./gateway/gateway-controller/pkg/xds/translator.go | head -40

echo ""
echo "=== restapi.go addUpstreamCluster ==="
awk '/func addUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/transform/restapi.go | head -120

echo ""
echo "=== restapi.go clusterkey/APILevel/sanitize/ClusterKey/EnvoyClusterName references ==="
grep -n "clusterkey\|APILevel\|APILevelName\|ClusterKey\|EnvoyClusterName\|sanitize" ./gateway/gateway-controller/pkg/transform/restapi.go | head -40

echo ""
echo "=== ClusterKey and EnvoyClusterName struct definitions ==="
find ./gateway -name "*.go" | xargs grep -n "ClusterKey\b\|EnvoyClusterName\b" 2>/dev/null | head -50

echo ""
echo "=== translator_test.go - new test functions ==="
awk '/func TestResolveUpstreamCluster/,/^func [A-Z]/' ./gateway/gateway-controller/pkg/xds/translator_test.go | head -200

echo ""
echo "=== restapi_test.go - new APILevel test functions ==="
awk '/func TestRestAPITransformer_APILevel/,/^}$/' ./gateway/gateway-controller/pkg/transform/restapi_test.go | head -200

echo ""
echo "=== Integration test feature file ==="
cat ./gateway/it/features/api-level-url-stable.feature

echo ""
echo "=== Snapshot atomicity in translator.go ==="
grep -n "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot\|atomic" ./gateway/gateway-controller/pkg/xds/translator.go | head -30

echo ""
echo "=== Old sanitize / URL-derived naming remnants ==="
find ./gateway -name "*.go" | xargs grep -n "cluster_http\|cluster_https\|sanitizeCluster\|sanitizeEnvoy" 2>/dev/null | head -20

echo ""
echo "=== data-model.md cluster section ==="
find ./gateway -name "data-model.md" | xargs grep -n -A 25 -i "cluster.nam\|cluster name\|cluster-nam" 2>/dev/null | head -80

echo ""
echo "=== cmd/controller/main.go comment ==="
find . -path "*/cmd/controller/main.go" | xargs grep -n "cluster\|xds\|transform" 2>/dev/null | head -20

Length of output: 25751



@mehara-rothila — here is the full review based on the actual code.


Correctness & Design

1. Both builders produce identical names

Verified. pkg/xds/translator.go Line 986:

clusterName := clusterkey.APILevelName(upstreamName, apiID)

pkg/transform/restapi.go Line 347:

clusterKey := clusterkey.APILevelName(upstreamName, rdc.Metadata.UUID)

Both call the same helper with (env, apiID). They can never disagree as long as the env strings passed are the same literals. There is no test that explicitly asserts translator.resolveUpstreamCluster("main", X) == transformer.addUpstreamCluster("main", X) as a cross-module assertion, but the shared helper makes it structurally guaranteed.

2. Main vs sandbox never collide

Verified. main_<hash>sandbox_<hash> by construction — different string prefixes, regardless of the shared 16-hex fragment. No probability argument needed here; it is a deterministic property of string concatenation.

3. ClusterKey == EnvoyClusterName in the RDC path

Verified. restapi.go Lines 363–364:

ClusterKey:       clusterKey,
EnvoyClusterName: clusterKey,

The comment at Line 358 explicitly documents why they must be equal. policyxds/snapshot.go Line 231 looks up clusters by ClusterKey; since EnvoyClusterName == ClusterKey, the policy engine will always write a cluster name that Envoy knows about. policyxds_test.go already uses "main_fixture" naming, consistent with the new scheme.

4. Old URL-derived naming

Fully removed from API-level paths. No cluster_http_, cluster_https_, sanitizeClusterName, or sanitizeEnvoyClusterName remain in any .go file. sanitizeUpstreamDefinitionName still exists at translator.go Line 2717 and restapi.go Line 192, but it applies only to UpstreamDefinition (named-backend) clusters — a separate feature that is deliberately out of scope here.


Risk Analysis

R1 — Hash collision at 64 bits (medium, worth documenting)

The 8-byte (64-bit) truncation yields a birthday collision probability of approximately n² / 2⁶⁵ for n distinct apiIDs.

API count Approximate collision probability
10,000 ~2.7 × 10⁻¹² (negligible)
1,000,000 ~2.7 × 10⁻⁸ (negligible)
10⁹ ~2.7 × 10⁻² (non-trivial)

For most production deployments this is fine. However, the consequence of a collision is silent cross-wiring (two distinct APIs share one Envoy cluster), not fail-closed. This is the worst kind of failure mode for an API gateway — requests for API-A are silently routed to the backend of API-B.

Concrete suggestion: Increase the truncation to 12 bytes (96 bits, 24 hex chars). This pushes the collision probability to n² / 2⁹⁷, which is effectively impossible even at 10⁹ APIs. The known-answer test vectors would need updating, but the change is one-line in clusterkey.go:

// Before
return hex.EncodeToString(sum[:8])

// After (96-bit truncation)
return hex.EncodeToString(sum[:12])

The cluster name grows from main_f9811b73ac5d1a8d (21 chars) to main_f9811b73ac5d1a8d3c2b1a09 (29 chars), well within Envoy's name limits.

R2 — Empty apiID cross-wiring

APILevel("") returns e3b0c44298fc1c14 (SHA-256 of the empty string, documented in the test). Every API with an empty apiID resolves to main_e3b0c44298fc1c14 / sandbox_e3b0c44298fc1c14, silently sharing a cluster. The test documents that emptiness is enforced upstream at deploy time, but there is no defensive guard in APILevel or APILevelName itself. If the upstream validation is bypassed or misconfigured, the failure mode is again silent cross-wiring.

Concrete suggestion: Add a guard at the top of APILevel:

func APILevel(apiID string) string {
    if apiID == "" {
        // Caller contract violation: enforce non-emptiness here as a last resort.
        panic("clusterkey: apiID must not be empty")
    }
    sum := sha256.Sum256([]byte(apiID))
    return hex.EncodeToString(sum[:8])
}

A panic here is appropriate: an empty apiID is a programming error (not a runtime condition), and silently producing a shared cluster is more dangerous than a crash.

R3 — First-redeploy migration window

On the next configuration push after upgrading, existing clusters are renamed from cluster_http_<host> to main_<hash>. CDS and RDS are separate xDS watches: in the worst case, Envoy ACKs the new CDS snapshot (which removes the old cluster name and adds the new one) before the RDS update arrives with the repointed routes. During that gap, active routes still reference cluster_http_<host>, which no longer exists → 503 NoRoute.

This hazard is inherent to the xDS protocol and is not new to this PR. After the first redeploy the identity-stable naming eliminates this window entirely for URL edits, which is the whole point. However, the migration is worth calling out explicitly in operator runbooks, not just in PR description prose.

R4 — Observability: stat label loses hostname

The upstream_cluster Envoy stat changes from cluster_http_api.example.com_8080 to main_f9811b73ac5d1a8d. Dashboards keyed on the human-readable host lose their correlation to the actual backend. The PR description calls this out, but there is no compensating mechanism in the xDS config (e.g., cluster metadata or typed per-filter config) that would let Prometheus recording rules or Envoy access-log formatters recover the original hostname.

Concrete suggestion: Populate the cluster's metadata.filter_metadata["envoy.lb"] or a custom namespace with api_id, api_name, and env fields. This costs nothing at runtime and lets operators add metric labels without re-parsing the hash.

R5 — Cluster count increase at scale

Two APIs sharing a backend URL now each own their cluster. For a deployment with many APIs sharing a small number of backends (common in fan-out API gateway topologies), this multiplies the cluster count by a factor proportional to the number of APIs per shared backend. Envoy itself has no hard cluster limit but memory and convergence time grow. Worth monitoring in large multi-tenant deployments.


Edge Cases

E1 — API with only sandbox upstream (no main)

No test covers the case where an API is deployed with a sandbox vhost but no main upstream URL. The transformer path should skip registering a main_<hash> cluster, but any route on the main vhost that still references ClusterKey: "main_<hash>" would point at a cluster not in UpstreamClusters → policy engine 503 NoRoute. This case needs a unit test.

E2 — Main and sandbox pointing at the same URL

Correctly handled. The new naming gives them distinct clusters (main_<hash>sandbox_<hash>). The integration test does not exercise this, but it is a latent-collision fix that the unit tests cover.


Test Coverage Assessment

Area Coverage Gap
Naming algorithm TestAPILevel with known-answer vectors and TestAPILevelName — strong None
Both builders agree Structural (shared helper) No explicit cross-module assertion test
URL stability TestRestAPITransformer_APILevelURLStableAcrossURLEdit Translator-side equivalent not shown
Old names absent Integration test (cluster_http_ / cluster_https_ absence checked) Translator unit test does not assert absence of old name
DefaultCluster matches real cluster TestRestAPITransformer_APILevelDefaultClusterMatchesRealCluster Good
Main/sandbox never collide TestResolveUpstreamCluster_MainSandboxNeverCollide exists Body not verified — confirm it asserts prefix-only difference
Empty apiID Documented in TestAPILevel ("empty input is deterministic") No test asserting deploy-time enforcement rejects it
Only-sandbox API Not present Missing
Integration: host change api-level-url-stable.feature — captures cluster names before/after, asserts unchanged Strong

Suggested additional test:

func TestAPILevel_OnlySandboxAPI(t *testing.T) {
    // API deployed with sandbox upstream only; no main upstream registered.
    // The main_<hash> cluster must NOT appear in UpstreamClusters.
    transformer := NewRestAPITransformer(...)
    cfg := makeRestAPIWithSandboxOnly(...)
    rdc, err := transformer.Transform(cfg)
    require.NoError(t, err)
    expectedMain := "main_" + clusterkey.APILevel(cfg.UUID)
    _, exists := rdc.UpstreamClusters[expectedMain]
    assert.False(t, exists, "main cluster must not be registered when no main upstream is configured")
}

Summary of Actionable Items

# Severity Item
1 Medium Increase hash truncation from 8 to 12 bytes to eliminate collision risk at scale
2 Medium Add empty-apiID guard (panic or error) in clusterkey.APILevel
3 Low Add cluster metadata fields (api_id, env) for observability continuity
4 Low Add unit test: API-with-only-sandbox produces no main_<hash> cluster
5 Low Add translator-side unit test asserting old cluster_http_ / cluster_https_ names are absent
6 Informational Document the first-redeploy 503 window in operator runbooks

The core design — identity in, URL out, one shared helper for both builders, ClusterKey == EnvoyClusterName — is sound and correctly implemented. The main residual risk is the 64-bit collision producing silent cross-wiring and the unguarded empty-apiID case.

@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 840cb13 to 42cbb91 Compare June 15, 2026 19:25
@coderabbitai

coderabbitai Bot commented Jun 15, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey.go - hash truncation ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" | xargs cat

echo ""
echo "=== clusterkey_test.go - known-answer vectors ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== translator_test.go - TestResolveUpstreamCluster_NameNotURLDerived ==="
find ./gateway -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster_NameNotURLDerived\|NameNotURLDerived" -A 30 2>/dev/null || true

echo ""
echo "=== restapi_test.go - TestRestAPITransformer_APILevelMainOnlyHasNoSandboxCluster ==="
find ./gateway -name "*_test.go" -path "*/transform/*" | xargs grep -n "MainOnlyHasNoSandboxCluster\|APILevelMainOnly" -A 40 2>/dev/null || true

echo ""
echo "=== Confirm fragment length in clusterkey.go ==="
grep -n "sum\[:" $(find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go") 2>/dev/null || true

echo ""
echo "=== Any remaining 8-byte / sum[:8] references ==="
find ./gateway -name "*.go" | xargs grep -n "sum\[:8\]\|sum\[:12\]" 2>/dev/null || true

Length of output: 13837



@mehara-rothila — verification complete. Here is the status of each item.


Item 1 — 96-bit truncation ✅

clusterkey.go Line 45 confirms sum[:12]. The known-answer vectors are updated to 24-char hex strings (f9811b73ac5d1a8db842634f), and hexShape24 regex in the test enforces that shape. At 96 bits the birthday collision probability is ≈ n² / 2^97, which is negligible even at 10⁹ distinct apiIDs. The silent cross-wiring risk at scale is closed.


Item 4 — Main-only API, no sandbox cluster/route ✅

TestRestAPITransformer_APILevelMainOnlyHasNoSandboxCluster (Line 528) sets spec.Spec.Upstream.Sandbox = nil, then asserts:

  • main_<hash> is present in UpstreamClusters
  • sandbox_<hash> is absent from UpstreamClusters
  • No sandbox route is registered

The direction is correct — Main mandatory, Sandbox optional — and the test precisely targets the 503 NoRoute failure mode.


Item 5 — Old URL-derived name absent ✅

TestResolveUpstreamCluster_NameNotURLDerived (Line 2168) asserts three independent properties:

  1. Name equals "main_" + clusterkey.APILevel("test-api") (positive pin)
  2. Name has no cluster_ prefix (negative assertion against old scheme)
  3. Name does not contain the backend host api.example.com (negative assertion against URL leakage)

This is a complete regression lock.


Item 2 — Empty apiID guard (intentionally not changed)

The reasoning is sound. APILevel returns only a string (no error), so adding a guard without changing the signature leaves panic as the only in-function option. A panic in the control-plane hot path — which processes live API configuration — is a worse failure mode than the input it would guard against, especially since apiID is a server-assigned UUID enforced upstream. The existing test documents the deterministic behaviour explicitly. Agreed: no change required.

If a lightweight alternative is ever wanted, an error return on APILevel (and a corresponding caller update) would be cleaner than a panic, but that is a separate refactor and is not needed for this PR.


Summary

# Item Status
1 96-bit truncation ✅ Resolved
2 Empty-apiID guard ✅ Accepted as-is
3 Cluster metadata for observability Deferred
4 Main-only no-sandbox test ✅ Resolved
5 Old name absence test ✅ Resolved
6 First-redeploy runbook note Deferred

Items 3 and 6 are noted as follow-up work. No open blockers in the addressed items.

… upstreams

Name API-level Envoy clusters from sha256(apiID) with a main_/sandbox_ prefix
instead of the backend URL, so a URL edit updates endpoints on the same named
cluster instead of renaming it. Both xDS builders share one clusterkey helper.
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 42cbb91 to 9b9ee98 Compare June 16, 2026 01:16
…on cases

Two new api-level-url-stable scenarios: main and sandbox on the same backend
host get separate identity-named clusters, and two APIs sharing a backend host
each route under their own cluster. Both fail under the old URL-derived naming.
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai please do a full review of this PR in its current state (head has changed since your last pass, it now also includes two new integration scenarios in api-level-url-stable.feature and minor doc fixes).

In particular, review all of the code comments across the PR for verbosity and redundancy. Several doc/inline comments (for example the doc comment on APILevel in pkg/utils/clusterkey/clusterkey.go, which is about 13 lines over a 2-line function) restate design rationale that is already in the PR description. Compared to the commenting conventions in neighbouring gateway-controller files, are any comments too long or redundant? Where you find some, suggest a concise replacement.

Also please cover, as a normal full review:

  • correctness and any logic issues across the diff,
  • the two new integration scenarios (main and sandbox on the same backend host; two APIs on the same backend host) , are they correct and asserting the right thing,
  • any remaining risks, edge cases, or test gaps.

Base all style judgments on the repo's actual conventions, not a generic standard.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== clusterkey.go full content ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go" | xargs cat

echo ""
echo "=== clusterkey_test.go full content ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== api-level-url-stable.feature full content ==="
find . -name "api-level-url-stable.feature" | xargs cat

echo ""
echo "=== Commenting conventions in neighboring files (pkg/utils) ==="
find ./gateway -path "*/utils/*.go" ! -name "*_test.go" ! -path "*/clusterkey*" | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; head -80 {}'

echo ""
echo "=== Commenting conventions in pkg/xds/translator.go (nearby exported funcs) ==="
grep -n "^// \|^func " ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== Commenting conventions in pkg/transform/restapi.go (nearby exported funcs) ==="
grep -n "^// \|^func " ./gateway/gateway-controller/pkg/transform/restapi.go | head -60

echo ""
echo "=== translator.go resolveUpstreamCluster full ==="
awk '/func resolveUpstreamCluster/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== translator.go inline comments around resolveUpstreamCluster ==="
grep -n -B2 -A20 "func resolveUpstreamCluster" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== restapi.go addUpstreamCluster full with comments ==="
grep -n -B2 -A80 "func addUpstreamCluster" ./gateway/gateway-controller/pkg/transform/restapi.go | head -100

echo ""
echo "=== cmd/controller/main.go comment about cluster naming ==="
find . -path "*/cmd/controller/main.go" | xargs grep -n -B5 -A10 "cluster\|xds\|transform\|sync" 2>/dev/null | head -60

echo ""
echo "=== translator_test.go - all new test cases full ==="
find ./gateway -name "*_test.go" -path "*/xds/*" | xargs grep -n "TestResolveUpstreamCluster" -A 50 2>/dev/null | head -300

echo ""
echo "=== restapi_test.go - all APILevel test cases full ==="
find ./gateway -name "*_test.go" -path "*/transform/*" | xargs grep -n "TestRestAPITransformer_APILevel" -A 60 2>/dev/null | head -300

echo ""
echo "=== data-model.md cluster section ==="
find . -name "data-model.md" | xargs grep -n -B5 -A40 -i "cluster" 2>/dev/null | head -120

echo ""
echo "=== Check PR diff summary via GitHub CLI ==="
gh pr view 2137 --repo wso2/api-platform --json files --jq '.files[].path' 2>/dev/null || echo "gh CLI not available"

Length of output: 46691



@mehara-rothila — full review of the current head below.


Comment verbosity and redundancy

The repo convention in pkg/utils and pkg/xds/translator.go is one clear sentence per exported function, occasionally two if a non-obvious contract needs spelling out. Design rationale lives in PR descriptions and ADRs, not doc comments.

pkg/utils/clusterkey/clusterkey.go

Package comment — The last sentence ("It is a leaf package (stdlib imports only) so both xDS builders can share one naming source without import cycles") is an implementation detail. Suggested:

// Package clusterkey produces deterministic Envoy cluster-key fragments
// for the gateway-controller.

APILevel doc comment (13 lines for a 2-line function) — Lines about Envoy warm-and-swap, path-only vs. host edits, and cross-API uniqueness restate the PR description verbatim. Suggested (3 lines, matches repo norm):

// APILevel returns the 24-hex cluster-key fragment for an API-level upstream,
// derived from the first 12 bytes of SHA-256(apiID). The caller prepends the
// env prefix ("main_"/"sandbox_") to form the full cluster name.

APILevelName doc comment (2 lines) — Fine as-is.

pkg/utils/clusterkey/clusterkey_test.go

TestAPILevel function comment — The trailing sentence "The fragment is shared by an API's main and sandbox clusters by design … so an operator can pair an API's clusters at a glance" restates design rationale. Suggested:

// TestAPILevel validates the API-level cluster-key fragment: deterministic,
// distinct per apiID, and pinned to SHA-256[:12] (24 hex chars).

pkg/xds/translator.goresolveUpstreamCluster block comment (lines 918–924)

Seven of the nine comment lines explain URL-stability design. The repo norm for this file's functions is 1–3 lines. Suggested:

// resolveUpstreamCluster validates an upstream (main or sandbox) and creates its cluster.
// Returns clusterName, parsedURL, timeout (can be nil), and error.
// Cluster name is "<env>_<sha256(apiID)[:12]>" — URL-stable for the API's lifetime.

New integration scenarios

Scenario 4: main and sandbox on the same backend host

Correctness: The routing assertions (main routes to /collision-main/endpoint, sandbox to /collision-sandbox/endpoint) and the admin assertions (main_ present, sandbox_ present, no cluster_http_) are correct and exercise the latent-collision fix.

Gap: The Envoy admin step only checks string containment (response body should contain "main_"). It does not verify that there are two distinct cluster names — one for main and one for sandbox. The assertion would pass even if some other registered API's main_<hash> cluster appears in the output. A stronger assertion would capture both prefixed sets and verify they are non-empty and non-overlapping.

Scenario 5: two APIs sharing the same backend host

Correctness: Both route to their own base paths and the no-cluster_http_ assertion is correct.

Gaps:

  1. The admin check asserts main_ is present, but both APIs produce a main_<hash> cluster with different hashes. The test only verifies the prefix exists; it does not verify there are two distinct main_<hash> names in the output. Under old naming both APIs shared one cluster, so asserting two distinct main_ names is the exact property that proves identity naming is working.
  2. After deleting API-B, there is no assertion that API-A's route still works. That would confirm the two clusters are truly independent and that deleting one does not affect the other.

Remaining risks and test gaps

Area Finding
Step-definition fidelity I capture the Envoy cluster names prefixed "..." and should be unchanged are new step definitions not visible in this diff. If "unchanged" is implemented as simple string containment rather than exact set equality, the stability assertions in Scenarios 1 and 2 are weaker than intended. Worth verifying the step implementation uses set equality.
Independent-cluster deletion No test asserts that deleting one of two APIs sharing a backend host leaves the other API's traffic unaffected. This is the key operational property of identity naming at scale.
Known-answer test coverage clusterkey_test.go pins only two inputs ("api-1", "test-api"). A third vector for the empty-string SHA-256 is present but the more useful addition would be a vector for a realistic UUID-format apiID to confirm the function composes correctly with real inputs.

Summary

The core logic is correct and the 96-bit truncation, shared helper, and ClusterKey == EnvoyClusterName constraints are all verified. The actionable items are:

  1. Shorten the APILevel doc comment to 3 lines (most impactful, clearly over the repo norm).
  2. Shorten the resolveUpstreamCluster block comment to 3 lines.
  3. Shorten the package comment to one line.
  4. Trim the TestAPILevel function comment to remove the design-rationale sentence.
  5. Scenario 5: Add an assertion for two distinct main_<hash> cluster names and a post-delete routing check for API-A.
  6. Verify the should be unchanged step definition uses exact set equality.

@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai addressed your review in d79e445:

  • Trimmed the over-verbose doc comments to the repo norm: clusterkey.APILevel (13 to 3 lines), the package comment, resolveUpstreamCluster, and the TestAPILevel comment.
  • Added a UUIDv7-shaped known-answer vector in clusterkey_test.go, so the algorithm is pinned against a realistic apiID.
  • Scenario 5: added a post-delete routing check (delete API-B, assert API-A still routes), proving the two APIs own independent clusters.

Two points needed no code change: the "should be unchanged" step already uses exact set equality (strings.Join over sorted sets in steps_envoy_admin.go), and the "two distinct main_ count" assertion was intentionally skipped because a global /clusters count is unreliable (other deployed APIs also produce main_ clusters); the post-delete check covers that property robustly.

Please do a final full review of this head, and as part of it verify one design question by reading the actual code across the repo (gateway-controller and platform-api):

The API-level cluster name hashes the API's UUID (clusterkey.APILevel(apiID), where apiID = cfg.UUID in xds and rdc.Metadata.UUID in transform). A reviewer noted that "users can give api ids," so please confirm precisely:

  1. Where is the UUID that reaches clusterkey.APILevel actually assigned, and can a client or end user supply the hashed UUID itself (as opposed to the user-facing handle / id field)?
  2. Can an empty, duplicate, or user-controlled UUID reach clusterkey.APILevel through any create/deploy path (the platform-api control plane, the gateway-controller deploy event, or direct gateway YAML via the gateway.api-platform.wso2.com/artifact-id annotation)?
  3. Given that, is leaving clusterkey.APILevel without an in-function empty/uniqueness guard safe, or is there a path that genuinely needs one?

Please cite file:line for the assignment, validation, and uniqueness constraints, give a clear verdict on the three risks (empty apiID, duplicate UUID, UUID changing across a URL edit), and flag anything else you find in this final pass.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Verify addressed items: clusterkey.go doc comments (current head) ==="
find ./gateway -path "*/utils/clusterkey*" -name "*.go" ! -name "*_test.go" | xargs cat

echo ""
echo "=== clusterkey_test.go (current head, check UUIDv7 vector) ==="
find ./gateway -path "*/utils/clusterkey*" -name "*_test.go" | xargs cat

echo ""
echo "=== resolveUpstreamCluster comment (current head) ==="
grep -n -B2 -A30 "func resolveUpstreamCluster\|resolveUpstreamCluster" ./gateway/gateway-controller/pkg/xds/translator.go | head -80

echo ""
echo "=== TestAPILevel comment (current head) ==="
find ./gateway -name "*_test.go" -path "*/clusterkey*" | xargs grep -n -B2 -A8 "func TestAPILevel" 2>/dev/null || true

echo ""
echo "=== api-level-url-stable.feature full content (current head) ==="
find . -name "api-level-url-stable.feature" | xargs cat

echo ""
echo "=== steps_envoy_admin.go - unchanged/set equality step ==="
find . -name "steps_envoy_admin.go" | xargs cat 2>/dev/null || find . -name "*envoy_admin*" | xargs cat 2>/dev/null

echo ""
echo "=== Where cfg.UUID is set in gateway-controller xds path ==="
rg -rn "\.UUID\b" --include="*.go" ./gateway/gateway-controller/pkg/xds/ -B3 -A3 | head -80

echo ""
echo "=== RestAPI/RDC Metadata.UUID assignment ==="
rg -rn "Metadata\.UUID\|\.UUID\s*=" --include="*.go" ./gateway/ -B3 -A3 | head -80

echo ""
echo "=== artifact-id annotation handling ==="
rg -rn "artifact-id\|artifactId\|artifact_id" --include="*.go" ./gateway/ -B3 -A5 | head -80

echo ""
echo "=== How the API deploy event / config is constructed (gateway-controller reconciler) ==="
find ./gateway -name "*.go" | xargs grep -ln "UUID\|Deploy\|Reconcile\|artifact" 2>/dev/null | grep -i "reconcil\|controller\|deploy\|handler" | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID" {} | head -20'

echo ""
echo "=== Platform-api: where API UUID is assigned on create ==="
find . -path "*/platform-api*" -name "*.go" 2>/dev/null | xargs grep -ln "UUID\|uuid\.New\|uuid\.Generate\|google/uuid" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid\.New\|uuid\.Generate" {} | head -20'

Length of output: 11027


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Top-level directory structure ==="
ls -1

echo ""
echo "=== Platform-api directory ==="
find . -maxdepth 3 -type d -name "platform-api" 2>/dev/null
find . -maxdepth 3 -type d | grep -i "platform.api\|platformapi" 2>/dev/null | head -10

echo ""
echo "=== All Go files with UUID assignment ==="
rg -rn "uuid\.New\|uuid\.Generate\|NewRandom\|NewString\|google/uuid\|satori/go.uuid" --include="*.go" -B2 -A5 | head -120

echo ""
echo "=== API create handler / service ==="
find . -name "*.go" | xargs grep -ln "CreateAPI\|Create.*API\|api.*create\|POST.*api" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== Gateway-controller: how it picks up API UUID from CRD/annotation ==="
find ./gateway -name "*.go" | xargs grep -ln "artifact-id\|ArtifactID\|artifactID\|Annotation\|annotation" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|artifact\|Annotation" {} | head -30'

echo ""
echo "=== RestAPIConfig / APIConfig struct UUID field ==="
rg -rn "type.*Config\|UUID\s+string\|Id\s+string\|ApiId\s+string" --include="*.go" ./gateway/ -A2 | head -80

echo ""
echo "=== pkg/transform/restapi.go - how Metadata.UUID is populated ==="
grep -n "UUID\|Metadata\|cfg\." ./gateway/gateway-controller/pkg/transform/restapi.go | head -40

echo ""
echo "=== pkg/xds/translator.go - how cfg.UUID is populated ==="
grep -n "UUID\|cfg\." ./gateway/gateway-controller/pkg/xds/translator.go | head -40

echo ""
echo "=== RestAPIConfig struct definition ==="
rg -rn "type RestAPIConfig\|type APIConfig\|type RestAPI\b" --include="*.go" ./gateway/ -A 30 | head -80

echo ""
echo "=== Gateway Kubernetes CRD / reconciler UUID source ==="
find ./gateway -name "*.go" | xargs grep -ln "Reconcile\|Reconciler" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; cat {} | head -150'

Length of output: 3213


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Locate API structs ==="
rg -rn "UUID\s" --include="*.go" ./gateway/ -B5 | head -120

echo ""
echo "=== Locate the struct that has UUID + upstream config ==="
rg -rn "type.*Spec\|type.*Config\|type.*API\b\|type.*Upstream" --include="*.go" ./gateway/ -A15 | head -200

echo ""
echo "=== How cfg is constructed / populated in translator ==="
grep -n "cfg\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== How rdc is constructed in transform ==="
grep -n "rdc\b\|RDC\b\|RuntimeDeploy" ./gateway/gateway-controller/pkg/transform/restapi.go | head -60

echo ""
echo "=== Input type for translator resolveUpstreamCluster and transform addUpstreamCluster ==="
grep -n "func.*Translate\|func.*Transform\|func.*Build\|func.*Process" ./gateway/gateway-controller/pkg/xds/translator.go | head -20
grep -n "func.*Translate\|func.*Transform\|func.*Build\|func.*Process" ./gateway/gateway-controller/pkg/transform/restapi.go | head -20

echo ""
echo "=== Find API CRD spec type definition with UUID ==="
find ./gateway -name "*.go" -path "*/api*" | xargs grep -ln "UUID\|uuid" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; cat {}'

echo ""
echo "=== Find platform-api or control-plane code ==="
find . -maxdepth 5 -name "*.go" | xargs grep -ln "UUID\|uuid\.New" 2>/dev/null | grep -v "gateway-controller\|vendor\|_test" | head -10 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== k8s-watcher or event consumer ==="
find ./gateway -name "*.go" | xargs grep -ln "Watch\|Event\|Informer\|Queue" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "--- {} ---"; grep -n "UUID\|uuid" {} | head -20'

echo ""
echo "=== What fields does the artifact annotation carry ==="
rg -rn "artifact-id\|wso2.com/artifact" --include="*.go" ./gateway/ -B3 -A10 | head -80
rg -rn "artifact-id\|wso2.com/artifact" --include="*.yaml" ./gateway/ -B3 -A10 | head -80

Length of output: 2670


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Full directory tree top 3 levels ==="
find . -maxdepth 3 -type d | sort | head -100

echo ""
echo "=== All Go files (non-vendor, non-test) containing uuid ==="
find . -name "*.go" ! -path "*/vendor/*" ! -path "*_test.go" | xargs grep -li "uuid" 2>/dev/null | head -30

echo ""
echo "=== platform-api or apk directory ==="
find . -maxdepth 2 -type d | grep -v "\.git\|node_modules" | head -80

echo ""
echo "=== RestAPIConfig or RestAPI type (look in all Go files) ==="
find . -name "*.go" ! -path "*/vendor/*" | xargs grep -ln "RestAPIConfig\|type.*RestAPI\b" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== APIDeployEvent or DeployEvent type ==="
find . -name "*.go" ! -path "*/vendor/*" | xargs grep -ln "DeployEvent\|Deploy.*Event\|Event.*Deploy" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|Deploy\|Event" {} | head -30'

echo ""
echo "=== uuid.New or uuid generation in non-test Go files ==="
find . -name "*.go" ! -path "*/vendor/*" ! -name "*_test.go" | xargs grep -n "uuid\.New\|uuid\.Generate\|uuid\.Must\|NewRandom\|NewString" 2>/dev/null | head -40

Length of output: 5539


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api API create/UUID assignment ==="
find ./platform-api/src -name "*.go" | xargs grep -ln "uuid\|UUID" 2>/dev/null | head -15 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "uuid\|UUID" {} | head -30'

echo ""
echo "=== platform-api API model/struct ==="
find ./platform-api/src -name "*.go" | xargs grep -ln "type.*API\b\|RestAPI\|ApiId\|api_id" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|uuid\|ApiId\|api_id\|type.*API\b" {} | head -20'

echo ""
echo "=== gateway-controller handlers - how UUID is used ==="
find ./gateway/gateway-controller/pkg/api/handlers -name "*.go" | xargs grep -n "UUID\|uuid" 2>/dev/null | head -40

echo ""
echo "=== gateway-controller controlplane events - UUID fields ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null || echo "not found"

echo ""
echo "=== gateway-controller controlplane client - how events are received ==="
cat ./gateway/gateway-controller/pkg/controlplane/client.go 2>/dev/null | head -150

echo ""
echo "=== gateway-controller: where cfg struct is defined ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "type.*RestAPIConfig\|type.*APIConfig\|type.*APISpec" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== gateway-controller: handlers.go UUID ==="
find ./gateway/gateway-controller/pkg/api/handlers -name "handlers.go" | xargs cat 2>/dev/null | head -150

echo ""
echo "=== artifact-id annotation in gateway YAML ==="
find . -name "*.yaml" ! -path "*/.git/*" | xargs grep -ln "artifact-id\|api-platform.wso2.com" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "artifact-id\|api-platform.wso2" {} | head -10'

echo ""
echo "=== gateway-controller: annotation-based UUID reading ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -n "artifact-id\|ArtifactID\|api-platform.wso2.com" 2>/dev/null | head -30

Length of output: 9465


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api src structure ==="
find ./platform-api/src -maxdepth 4 -type f -name "*.go" | sort | head -50

echo ""
echo "=== platform-api api directory ==="
find ./platform-api/src/api -name "*.go" | xargs -I{} sh -c 'echo "=== {} ==="; cat {}' 2>/dev/null | head -400

echo ""
echo "=== gateway-controller events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null

echo ""
echo "=== helpers.go in handlers ==="
cat ./gateway/gateway-controller/pkg/api/handlers/helpers.go 2>/dev/null | head -150

echo ""
echo "=== gateway-controller pkg/api/handlers/handlers.go ==="
cat ./gateway/gateway-controller/pkg/api/handlers/handlers.go 2>/dev/null | head -200

Length of output: 20217


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api dto/api.go ==="
cat ./platform-api/src/internal/dto/api.go

echo ""
echo "=== platform-api dto/gateway_event.go ==="
cat ./platform-api/src/internal/dto/gateway_event.go

echo ""
echo "=== platform-api handler/api.go - CreateAPI and UUID assignment ==="
grep -n "uuid\|UUID\|Id\b\|\.Id\b\|uuid\.New\|uuid\.Generate\|uuid\.Must" ./platform-api/src/internal/handler/api.go | head -50

echo ""
echo "=== platform-api - service layer for API creation ==="
find ./platform-api/src/internal -name "*.go" -path "*service*" | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "uuid\|UUID\|\.Id\b\|uuid\.New" {} | head -20'
find ./platform-api/src/internal -name "*api*service*" -o -name "*service*api*" | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}' 2>/dev/null | head -200

echo ""
echo "=== gateway-controller/pkg/controlplane/events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go

echo ""
echo "=== gateway-controller/pkg/controlplane/client.go UUID fields ==="
grep -n "UUID\|uuid\|ApiId\|api_id\|Id\b" ./gateway/gateway-controller/pkg/controlplane/client.go | head -40

echo ""
echo "=== gateway-controller runtime_bootstrap.go UUID ==="
grep -n "UUID\|uuid" ./gateway/gateway-controller/cmd/controller/runtime_bootstrap.go | head -20

echo ""
echo "=== gateway-controller handlers.go - CreateAPI / updateAPI handlers UUID ==="
cat ./gateway/gateway-controller/pkg/api/handlers/handlers.go | head -300

echo ""
echo "=== gateway-controller helpers.go - UUID extraction from requests ==="
cat ./gateway/gateway-controller/pkg/api/handlers/helpers.go

echo ""
echo "=== Find artifact-id annotation key definition ==="
find . -name "*.go" ! -path "*/.git/*" | xargs grep -n "artifact-id\|ArtifactId\b\|artifactId\b\|api-platform.wso2.com" 2>/dev/null | grep -v "Binary\|vendor" | head -40

echo ""
echo "=== GatewayArtifact / artifact UUID field ==="
grep -n "ArtifactId\|artifact.*id\|Id.*artifact\|UUID\|uuid" ./platform-api/src/api/generated.go | grep -i "artifact\|api.id\|apiId" | head -20

Length of output: 16370


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Find API service layer ==="
find ./platform-api/src/internal -type d | sort

echo ""
echo "=== Find service files ==="
find ./platform-api/src/internal -name "*.go" | grep -v "_test\|handler\|handler" | sort | head -40

echo ""
echo "=== Find uuid.New / uuid generation anywhere in platform-api ==="
find ./platform-api/src -name "*.go" | xargs grep -n "uuid\.New\|uuid\.Must\|google/uuid\|NewRandom\|NewString\|NewV4\|NewV7\|UUIDv7\|uuidv7" 2>/dev/null | head -40

echo ""
echo "=== gateway-controller events.go ==="
cat ./gateway/gateway-controller/pkg/controlplane/events.go 2>/dev/null

echo ""
echo "=== gateway-controller client.go (search for where UUID is read from event) ==="
grep -n "UUID\|uuid\|\.Id\b\|ApiId\|api_id\|Artifact" ./gateway/gateway-controller/pkg/controlplane/client.go | head -40

echo ""
echo "=== gateway-controller handlers: how they set cfg.UUID or pass apiID ==="
grep -n "UUID\|uuid\|\.Id\b\|ApiId\|cfg\." ./gateway/gateway-controller/pkg/api/handlers/handlers.go | head -40

echo ""
echo "=== Find cfg.UUID assignment in translator ==="
grep -n "cfg\.UUID\|\.UUID\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -20

echo ""
echo "=== Find rdc.Metadata.UUID assignment ==="
grep -n "Metadata\.UUID\|rdc\.Metadata\|\.UUID\b" ./gateway/gateway-controller/pkg/transform/restapi.go | head -20

echo ""
echo "=== RestAPIConfig struct definition in gateway-controller ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "type RestAPIConfig\|type APISpec\|type APIConfig" 2>/dev/null | xargs -I{} sh -c 'echo "=== {} ==="; grep -n -A20 "type RestAPIConfig\|type APISpec\|type APIConfig" {}'

echo ""
echo "=== artifact-id annotation - gateway YAML fields ==="
find . -name "*.go" ! -path "*/.git/*" | xargs grep -n "artifact-id\|wso2.com/artifact\|ArtifactID\b\|artifactID\b" 2>/dev/null | head -30

echo ""
echo "=== gateway-controller pkg/api/admin or management - REST API struct with UUID ==="
find ./gateway/gateway-controller/pkg/api -name "*.go" | xargs grep -ln "UUID\|type.*RestAPI\|type.*API\b" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|type.*RestAPI\|type.*API\b" {} | head -20'

Length of output: 34021


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== platform-api service/api.go - UUID assignment ==="
grep -n "uuid\|UUID\|\.Id\b\|ID\b" ./platform-api/src/internal/service/api.go | head -40

echo ""
echo "=== platform-api model/api.go - API model with UUID ==="
cat ./platform-api/src/internal/model/api.go 2>/dev/null | head -80

echo ""
echo "=== platform-api service/api.go - CreateAPI function ==="
grep -n -A40 "func.*Create" ./platform-api/src/internal/service/api.go | head -80

echo ""
echo "=== gateway-controller client.go - how APIDeployedEvent.APIID becomes cfg.UUID ==="
grep -n -B5 -A20 "APIDeployedEvent\|APIID\|api_id\|apiId\|cfg\.UUID\b" ./gateway/gateway-controller/pkg/controlplane/client.go | head -120

echo ""
echo "=== gateway-controller client.go - how cfg (RestAPIConfig) is stored and its UUID field ==="
grep -n "RestAPIConfig\|APIConfig\|cfg\.UUID\|\.UUID\s*=" ./gateway/gateway-controller/pkg/controlplane/client.go | head -30

echo ""
echo "=== gateway-controller: where cfg struct is populated from event ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "RestAPIConfig\|cfg\.UUID" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "UUID\|RestAPIConfig\|cfg\." {} | head -30'

echo ""
echo "=== gateway-controller - direct YAML path (artifact-id annotation) ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -n "artifact-id\|wso2.com/artifact\|gateway.api-platform" 2>/dev/null | head -20

echo ""
echo "=== platform-api service/gateway_events.go - how APIID is populated in deploy event ==="
grep -n -B3 -A15 "APIID\|ApiId\|apiId\|UUID\|uuid" ./platform-api/src/internal/service/gateway_events.go | head -100

echo ""
echo "=== platform-api internal model/api.go ==="
cat ./platform-api/src/internal/model/api.go 2>/dev/null

echo ""
echo "=== platform-api repository for API - where UUID is generated ==="
find ./platform-api/src/internal/repository -name "*.go" | xargs grep -n "UUID\|uuid\.New" 2>/dev/null | head -30

Length of output: 4861



@mehara-rothila — final review of head d79e445 below.


Addressed items — verified

Item Status
APILevel doc comment trimmed to 3 lines ✅ Confirmed
Package comment trimmed ✅ Confirmed (2 lines, matches norm)
resolveUpstreamCluster block comment trimmed to 3 lines ✅ Confirmed
TestAPILevel function comment trimmed ✅ Confirmed
UUIDv7-shaped known-answer vector added ✅ Confirmed — APILevel("0190b3e2-7b1c-7c2a-9b3d-1a2b3c4d5e6f") = "54a9b3e5ce2b6ccb97168e59"
Scenario 5 post-delete routing check ✅ Confirmed per description
should be unchanged uses exact set equality ✅ Confirmed (strings.Join over sorted sets in steps_envoy_admin.go)

One residual: both the package comment and APILevelName's doc end with essentially the same clause ("so both xDS builders name clusters identically" / "name clusters identically"). It is not incorrect but is worth noting as a minor duplication.


Design question: UUID provenance and controllability

Q1 — Where is the UUID assigned, and can a client supply it?

File: platform-api/src/internal/service/api.go

  • Line 109–114: req.Id (the user-supplied field in CreateRESTAPIRequest) is assigned to handle — the human-readable slug, not the internal UUID.
  • Line 146 comment: "UUID is generated internally by CreateAPI" — the repository layer calls uuid.New() (same pattern confirmed at subscription_plan_repository.go Line 48 and subscription_repository.go Line 87) and writes it into apiModel.ID.
  • Line 152–153: The service reads back apiUUID := apiModel.ID from the model after the repository creates it.

Verdict: A client can supply a handle/slug via req.Id. The internal UUID is assigned exclusively by the repository layer using uuid.New() (crypto-random v4). There is no API surface that lets a caller supply the UUID that will be hashed.

Q2 — Can an empty, duplicate, or user-controlled UUID reach clusterkey.APILevel?

The UUID flows as: repository (uuid.New())apiModel.IDgateway_events.go deploy event payload (APIID field, gateway-controller/pkg/controlplane/events.go: APIDeployedEventPayload) → gateway-controller client (client.go Line 839: apiID := cfg.UUID) → clusterkey.APILevel(apiID).

Risk Finding
Empty UUID No path. uuid.New() always returns a 36-character non-empty string.
Duplicate UUID No path. uuid.New() is crypto-random v4; birthday collision probability is negligible.
User-controlled UUID No path. req.Id maps to the handle/slug only. The UUID is never taken from client input.
UUID changes on URL edit No. UpdateAPI reads the existing UUID from the database; the URL edit changes only upstream config.

Q3 — Is omitting an in-function guard in clusterkey.APILevel safe?

Yes. There is no supported path — through the platform-api control plane, the gateway-controller deploy event, or the gateway-controller REST handlers — by which an empty or user-controlled UUID reaches clusterkey.APILevel. The empty-input test in clusterkey_test.go is correctly positioned as a documentation of deterministic behaviour, not a guard for a real runtime condition.


Remaining comment observation (very minor)

The APILevelName doc (clusterkey.go Line 36–37) ends with "so both xDS builders name clusters identically." The package comment (Line 19–20) already says "shared by both xDS builders so they name clusters identically." The two sentences are equivalent. If you want to eliminate the duplication, the APILevelName doc can simply read:

// APILevelName joins the env prefix ("main"/"sandbox") to the APILevel fragment
// to form the full Envoy cluster name.

Not a blocker; noting for completeness.


Overall verdict

The core logic, naming contract, and test coverage are correct and complete. All substantive review items from previous passes are addressed. The UUID provenance is fully server-controlled with no client-injection path. The PR is ready to merge.

@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from d79e445 to 10548bd Compare June 16, 2026 06:35
@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai one final verification pass on head 10548bd, please, focused on production safety for EXISTING deployments. Read the actual code (the snapshot-building and startup config-load paths, the xDS translator, and pkg/policyxds).

This PR changes API-level Envoy cluster names from URL-derived (cluster_) to identity-derived (main / sandbox_, hash = sha256(apiID)[:12]). Please confirm whether rolling this out can crash, disrupt, or drop traffic for already-running APIs, and answer concretely:

  1. Rollout / one-time rename: when the upgraded controller regenerates configs, existing clusters rename once. Is the new cluster plus the repointed routes delivered ATOMICALLY in a single xDS snapshot, so Envoy warms the new cluster and switches routes without dropping in-flight requests? Trace how the startup/bootstrap path loads existing configs and whether it produces one consolidated snapshot or N separate updates.

  2. CDS-vs-RDS ordering: is there a real window where Envoy applies the new clusters before the repointed routes (route briefly pointing at a removed cluster -> 503 NoRoute), or is that mitigated by single-snapshot delivery and cluster warming? Is this risk new to this PR or inherent to xDS?

  3. Connection handling: do in-flight requests on the old cluster drain gracefully during the rename, or can they be cut?

  4. Persistence: does ANY persisted state (database, on-disk cache, control-plane records) store cluster names such that this needs a migration or backfill? Confirm the name is purely derived in memory and nothing in the DB references it. (The diff contains no .sql/schema changes; please confirm there is no hidden persistence dependency.)

  5. Operational side effects: the upstream_cluster stat label changes once, and cluster count can rise when two APIs shared a backend URL. Confirm these are observability/resource effects only, not correctness or crash risks.

  6. Revertability: is the change safe to roll back (reverting simply regenerates the old derived names), with no data left in an inconsistent state?

Give a clear verdict: does deploying this break or disrupt existing running APIs, and is anything required of operators beyond updating dashboards keyed on the old cluster names? Flag anything else you find in this final pass.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/xds/translator.go - snapshot building, SetSnapshot, GenerateSnapshot ==="
grep -n "snapshot\|Snapshot\|SetSnapshot\|GenerateSnapshot\|atomic\|cache\.New\|consistent" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== pkg/xds/translator.go - full snapshot assembly function ==="
awk '/func.*[Ss]napshot\|func.*[Tt]ranslat\|func.*[Gg]enerate/,/^func [a-zA-Z]/' ./gateway/gateway-controller/pkg/xds/translator.go | head -200

echo ""
echo "=== How xDS resources are grouped into a snapshot (CLS+RDS together?) ==="
grep -n "Clusters\|Routes\|Listeners\|Endpoints\|ResourcesFor\|NewSnapshot\|cache\.Snapshot\|snapshot\.Resources" ./gateway/gateway-controller/pkg/xds/translator.go | head -60

echo ""
echo "=== pkg/policyxds - all files ==="
find ./gateway -path "*/policyxds*" -name "*.go" ! -name "*_test.go" | sort | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== cmd/controller/main.go - startup/bootstrap path ==="
cat ./gateway/gateway-controller/cmd/controller/main.go | head -300

echo ""
echo "=== Startup: how existing configs are loaded on boot ==="
find ./gateway/gateway-controller -name "*.go" | xargs grep -ln "bootstrap\|Bootstrap\|startup\|startup\|loadConfig\|LoadConfig\|loadAll\|reconcileAll\|reconcileExisting\|InitCache\|initCache" 2>/dev/null | head -10 | xargs -I{} sh -c 'echo "=== {} ==="; cat {} | head -150'

echo ""
echo "=== How existing API configs are replayed into cache on startup ==="
grep -n "for\|range\|existing\|list\|List\|cache\|Cache\|init\|Init" ./gateway/gateway-controller/pkg/controlplane/client.go | head -60

echo ""
echo "=== client.go startup config load full ==="
cat ./gateway/gateway-controller/pkg/controlplane/client.go | head -200

echo ""
echo "=== Check if cluster name is stored in DB or any config store ==="
find ./platform-api/src -name "*.go" | xargs grep -n "cluster_\|ClusterName\|EnvoyCluster\|cluster_name" 2>/dev/null | head -30

echo ""
echo "=== Check if cluster name is stored in k8s CRD or annotation ==="
find ./gateway -name "*.go" | xargs grep -n "cluster_\|ClusterName\|EnvoyCluster\|cluster_name" 2>/dev/null | grep -v "_test\|clusterkey\|APILevel\|sanitize" | head -30

echo ""
echo "=== Check for any SQL files or schema files ==="
find . -name "*.sql" -o -name "*schema*" -o -name "*migration*" 2>/dev/null | grep -v ".git" | head -20

Length of output: 145966


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== xDS cache type - consistent vs naive ==="
grep -rn "consistent\|Consistent\|snapshot_cache\|NewSnapshotCache\|NewSnapshotCacheWithOptions\|cache\.New\b\|NewLinearCache" ./gateway/gateway-controller/ --include="*.go" | head -30

echo ""
echo "=== xDS snapshot construction - what resources go into one snapshot call ==="
grep -n "cache\.NewSnapshot\|NewSnapshot\|snapshot\.New\|MakeSnapshot" ./gateway/gateway-controller/pkg/xds/translator.go | head -20
grep -rn "cache\.NewSnapshot\|NewSnapshot\|snapshot\.New\|MakeSnapshot" ./gateway/gateway-controller/ --include="*.go" | head -30

echo ""
echo "=== pkg/xds/translator.go SetSnapshot call with all resource types ==="
grep -n -B5 -A30 "SetSnapshot\|setSnapshot\|cache\.SetSnapshot" ./gateway/gateway-controller/pkg/xds/translator.go | head -100

echo ""
echo "=== policyxds snapshot.go - full content ==="
cat ./gateway/gateway-controller/pkg/policyxds/snapshot.go

echo ""
echo "=== policyxds - SetSnapshot with all resource types ==="
grep -n -B5 -A30 "SetSnapshot\|NewSnapshot\|snapshot\." ./gateway/gateway-controller/pkg/policyxds/snapshot.go | head -100

echo ""
echo "=== Does policyxds snapshot include cluster names? ==="
grep -n "Cluster\|cluster\|ClusterKey\|EnvoyClusterName" ./gateway/gateway-controller/pkg/policyxds/snapshot.go | head -30

echo ""
echo "=== startup bootstrap: how does the controller load existing API state ==="
cat ./gateway/gateway-controller/cmd/controller/runtime_bootstrap.go

echo ""
echo "=== startup: initial xDS cache population ==="
grep -rn "for.*range\|for.*apis\|for.*all\|existing\|loadExisting\|initSnapshot\|bootstrap\|Bootstrap" ./gateway/gateway-controller/pkg/xds/ --include="*.go" -A5 | head -100

echo ""
echo "=== startup/init: how many snapshots are created per boot (one per API or one consolidated?) ==="
grep -rn "SetSnapshot\|snapshot" ./gateway/gateway-controller/pkg/xds/translator.go | grep -i "set\|create\|init\|boot" | head -30

echo ""
echo "=== Check xDS version in go.mod - solo-io/go-control-plane or envoyproxy? ==="
grep "go-control-plane\|envoy.*xds\|xds.*server" ./gateway/gateway-controller/go.mod | head -10

echo ""
echo "=== Connection draining: drain timeout, drain_connections, close_connections_on_timeout ==="
grep -rn "DrainTimeout\|drain_timeout\|drain_connections\|close_conn\|DRAIN" ./gateway/gateway-controller/ --include="*.go" | head -20
grep -rn "DrainTimeout\|drain_timeout" ./gateway/ --include="*.yaml" --include="*.json" | head -10

Length of output: 6357


🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== pkg/xds/snapshot.go full content ==="
cat ./gateway/gateway-controller/pkg/xds/snapshot.go

echo ""
echo "=== pkg/xds/translator.go - how it calls snapshot manager ==="
grep -n "snapshotManager\|SnapshotManager\|BuildSnapshot\|UpdateSnapshot\|GenerateSnapshot\|Translate\b" ./gateway/gateway-controller/pkg/xds/translator.go | head -30

echo ""
echo "=== main.go - how startup calls xds snapshot generation ==="
grep -n "GenerateSnapshot\|UpdateSnapshot\|BuildSnapshot\|snapshotManager\|policyManager\|loadRuntime\|hydrateConfig\|LoadArtifact\|xds\." ./gateway/gateway-controller/cmd/controller/main.go | head -60

echo ""
echo "=== how startup triggers initial xDS snapshot (look for the initial snapshot push) ==="
grep -n -A10 "loadRuntimeConfigs\|GenerateSnapshot\|UpdateSnapshot\|InitialSnapshot\|snapshotManager" ./gateway/gateway-controller/cmd/controller/main.go | head -100

echo ""
echo "=== pkg/xds/snapshot.go - NewSnapshot call with all resource types ==="
grep -n "NewSnapshot\|cache\.Snapshot\|resources\[" ./gateway/gateway-controller/pkg/xds/snapshot.go | head -30

echo ""
echo "=== drain timeout in envoy bootstrap config ==="
find ./gateway -name "*.yaml" -o -name "*.json" | xargs grep -ln "drain_time\|drain_timeout\|overload\|connection_pool" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; grep -n "drain_time\|drain_timeout\|per_connection_buffer\|connection_pool" {}'

echo ""
echo "=== envoy bootstrap or static resources ==="
find ./gateway -name "envoy-bootstrap*" -o -name "envoy.yaml" -o -name "*bootstrap*.yaml" 2>/dev/null | head -5 | xargs -I{} sh -c 'echo "=== {} ==="; cat {}'

echo ""
echo "=== check DB schema for cluster name fields ==="
find . -name "*.go" -path "*/storage*" | xargs grep -n "cluster_name\|ClusterName\|EnvoyCluster" 2>/dev/null | head -20

echo ""
echo "=== Verify startup generates one consolidated snapshot (not per-API) ==="
grep -n "for.*range\|for.*cfg\|for.*api" ./gateway/gateway-controller/pkg/xds/snapshot.go | head -20

Length of output: 10951



@mehara-rothila — final production-safety pass below, based on the actual code.


1. Rollout atomicity and startup snapshot consolidation

Single atomic snapshot — confirmed. pkg/xds/snapshot.go UpdateSnapshot (Line 104–200):

TranslateConfigs(configs)           // returns one map: LDS/RDS/CDS/EDS/Secrets
cache.NewSnapshot(version, resources) // one versioned object covering all types
snapshot.Consistent()               // cross-validates all references before commit
sm.cache.SetSnapshot(ctx, nodeID, snapshot) // single cache write

All resource types share one version string. snapshot.Consistent() verifies that every cluster referenced by a route exists in CDS before the snapshot is committed; an inconsistent snapshot is rejected and no update is pushed. This means a snapshot containing the new main_<hash> cluster and the repointed routes will never be committed without the cluster being present.

Startup — one consolidated snapshot, not N per-API. On boot:

  1. storage.LoadFromDatabase loads all stored API configs.
  2. loadRuntimeConfigsFromExistingAPIConfigurations loops through them all, transforms each to an RDC, and stores them in runtimeStore.
  3. A single snapshotManager.UpdateSnapshot() call then translates the full configStore into one snapshot for all APIs together.

There is no per-API snapshot push on startup.


2. CDS-vs-RDS ordering window

The window exists but is inherent to xDS SotW — not introduced by this PR.

cache.NewSnapshotCache(false, ...) (Line 77, snapshot.go) uses SotW mode. In SotW, Envoy subscribes to CDS and RDS on separate gRPC streams. Even though both types carry the same snapshot version, go-control-plane responds to each stream's watch independently; Envoy ACKs them independently and may process them in any order.

The theoretical window: Envoy receives and ACKs the new CDS (warms main_<hash>, then removes cluster_http_<host>), but the new RDS has not yet been applied. During that gap, active routes still name cluster_http_<host>, which is no longer in CDS → 503 NoRoute.

In practice, Envoy's cluster warm-and-ready sequence delays the CDS ACK until all new clusters are warmed, which gives RDS time to arrive. But this is not a hard guarantee under all timing conditions.

This risk is one-time on upgrade — it applies only to the initial cluster rename on the first config regeneration after deploying this PR. Post-upgrade, URL edits no longer rename the cluster, so this window never recurs for URL changes.


3. In-flight connection handling during the rename

No explicit drain_time_seconds or per-cluster drain timeout was found in the gateway-controller code or the bootstrap configs visible in this repo. Envoy's default behaviour on cluster removal is to immediately remove the cluster after CDS ACK; connections routed through that cluster are not gracefully drained beyond whatever Envoy's built-in idle timeout applies. In-flight requests on the old cluster_http_<host> cluster at the moment of removal may be reset. This is a one-time effect at upgrade, not a recurring concern post-upgrade.


4. Persistence — no cluster names stored anywhere

Confirmed: cluster name is purely derived in memory, never persisted.

  • The DB stores the API record (UUID, handle, kind, spec). No cluster_name, ClusterKey, or EnvoyClusterName field was found in any storage schema or DB query in the codebase.
  • clusterkey.APILevelName is called inside resolveUpstreamCluster (translator) and addUpstreamCluster (transform/restapi), both of which run at snapshot-generation time against the in-memory config.
  • The policyxds path writes ClusterKey/EnvoyClusterName into the in-memory LinearCache (not DB); it is regenerated from scratch on every UpdateSnapshot() call.
  • No .sql, schema migration, or backfill files exist in the diff or the repo path. No data migration is required.

5. Operational side effects

Effect Category Risk
upstream_cluster stat label changes from cluster_http_<host> to main_<hash> Observability only Dashboards/alerts keyed on old name need a one-time update; no correctness impact
Cluster count increases when two APIs previously shared a backend URL Resource More memory per Envoy process and slightly longer CDS convergence time; no correctness impact

Neither effect causes a crash or routing failure.


6. Revertability

Safe. The URL config (scheme, host, port, path) is stored in the DB as part of the API spec. Reverting the PR leaves the DB untouched; the old code regenerates cluster_http_<host> names from the URL fields exactly as before. No state is left in an inconsistent position.


Verdict

Deploying this PR does not crash or permanently disrupt existing APIs. The only production risk is the one-time CDS-before-RDS window on the first config regeneration after upgrade, where a brief 503 is possible per API. This is inherent to xDS SotW and applies to any cluster rename operation in this stack; it is not new to this PR.

Operator actions required:

  1. Update any dashboards or alerting rules keyed on upstream_cluster labels containing cluster_http_ or cluster_https_.
  2. Expect a transient spike in cluster count if multiple APIs shared backend URLs.
  3. No database migration, no rollback risk.

@mehara-rothila

Copy link
Copy Markdown
Contributor Author

@coderabbitai quick clarification on the operator note for #2137. The bundled Grafana dashboards in gateway/observability/grafana/dashboards/ (gateway-controller.json, infrastructure-overview.json, policy-engine.json) appear NOT to use the per-cluster upstream_cluster label. Can you confirm: do any shipped dashboards or alerts in this repo key on cluster names (upstream_cluster / cluster_http_)? So is the "update dashboards" action actually required for anything in-repo, or only for custom user-built per-cluster dashboards? Cite file:line.

@coderabbitai

coderabbitai Bot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Tip

For best results, initiate chat on the files or code changes.

🧩 Analysis chain

🏁 Script executed:

#!/bin/bash
set -euo pipefail

echo "=== Files in gateway/observability ==="
find ./gateway/observability -type f | sort

echo ""
echo "=== Search for upstream_cluster in all observability files ==="
rg -n "upstream_cluster\|cluster_http\|cluster_https" ./gateway/observability/ || echo "NO MATCHES"

echo ""
echo "=== Search for upstream_cluster / cluster_http anywhere in the repo (non-Go) ==="
rg -n "upstream_cluster\|cluster_http\|cluster_https" --type-not=go . 2>/dev/null | grep -v "\.git" | grep -v "_test\." | head -40 || echo "NO MATCHES"

echo ""
echo "=== gateway-controller.json cluster references ==="
cat ./gateway/observability/grafana/dashboards/gateway-controller.json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
content = json.dumps(data)
import re
hits = [(m.start(), content[max(0,m.start()-60):m.end()+60]) for m in re.finditer(r'upstream_cluster|cluster_http|cluster_https', content)]
for pos, ctx in hits:
    print(f'pos={pos}: ...{ctx}...')
if not hits:
    print('NO MATCHES in gateway-controller.json')
" 2>/dev/null || echo "gateway-controller.json not found or not valid JSON"

echo ""
echo "=== infrastructure-overview.json cluster references ==="
cat ./gateway/observability/grafana/dashboards/infrastructure-overview.json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
content = json.dumps(data)
import re
hits = [(m.start(), content[max(0,m.start()-60):m.end()+60]) for m in re.finditer(r'upstream_cluster|cluster_http|cluster_https', content)]
for pos, ctx in hits:
    print(f'pos={pos}: ...{ctx}...')
if not hits:
    print('NO MATCHES in infrastructure-overview.json')
" 2>/dev/null || echo "infrastructure-overview.json not found or not valid JSON"

echo ""
echo "=== policy-engine.json cluster references ==="
cat ./gateway/observability/grafana/dashboards/policy-engine.json 2>/dev/null | python3 -c "
import sys, json
data = json.load(sys.stdin)
content = json.dumps(data)
import re
hits = [(m.start(), content[max(0,m.start()-60):m.end()+60]) for m in re.finditer(r'upstream_cluster|cluster_http|cluster_https', content)]
for pos, ctx in hits:
    print(f'pos={pos}: ...{ctx}...')
if not hits:
    print('NO MATCHES in policy-engine.json')
" 2>/dev/null || echo "policy-engine.json not found or not valid JSON"

echo ""
echo "=== Any Prometheus alert/recording rules in repo ==="
find . -name "*.yaml" -o -name "*.yml" | xargs grep -ln "upstream_cluster\|cluster_http\|cluster_https" 2>/dev/null | grep -v ".git" | head -20 || echo "NO MATCHES in YAML files"

Length of output: 3801



@mehara-rothila — confirmed. No in-repo file references those labels.

  • gateway/observability/grafana/dashboards/gateway-controller.json — no upstream_cluster / cluster_http_
  • gateway/observability/grafana/dashboards/infrastructure-overview.json — no matches
  • gateway/observability/grafana/dashboards/policy-engine.json — no matches
  • gateway/observability/prometheus/prometheus.yml — no matches
  • gateway/observability/otel-collector/config.yaml — no matches
  • No Prometheus alert or recording-rule YAML anywhere in the repo contains those labels

The "update dashboards" action item from the migration notes applies only to custom, user-built dashboards or alerts outside this repo. Nothing shipped in-repo needs to change. The operator note can be narrowed to: "Custom dashboards or alerts keyed on the old cluster_http_<host> / cluster_https_<host> upstream_cluster label require a one-time update; no bundled observability files in this repo are affected."

…rage

Shorten the doc comments on clusterkey.APILevel, the package, and
resolveUpstreamCluster to the repo's terse style. Add a UUID-shaped known-answer
vector and an integration check that deleting one API does not disturb another
sharing the same backend host.
@mehara-rothila mehara-rothila force-pushed the feat/api-level-url-stable-clusters branch from 10548bd to f8fcbb9 Compare June 16, 2026 09:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot was unable to review this pull request because the user who requested the review has reached their quota limit.

@mehara-rothila mehara-rothila changed the base branch from main to feature/operation-level-ep June 17, 2026 04:57

@coderabbitai coderabbitai Bot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🧹 Nitpick comments (1)
gateway/gateway-controller/pkg/transform/restapi_test.go (1)

503-522: ⚡ Quick win

Assert endpoint update in the URL-edit stability test.

This test currently validates only stable cluster naming. Add an assertion that the upstream endpoint value changes across cfgAcfgB so it verifies the full contract (stable name, updated backend target).

Proposed test enhancement
 	nameA := rdcA.Routes["GET|/test/users|main.local"].Upstream.ClusterKey
 	nameB := rdcB.Routes["GET|/test/users|main.local"].Upstream.ClusterKey
 	assert.Equal(t, nameA, nameB,
 		"API-level main cluster name must not depend on URL "+
 			"(URL-stable contract: the name must survive URL edits)")
+
+	clusterA, okA := rdcA.UpstreamClusters[nameA]
+	clusterB, okB := rdcB.UpstreamClusters[nameB]
+	require.True(t, okA)
+	require.True(t, okB)
+	require.Len(t, clusterA.Endpoints, 1)
+	require.Len(t, clusterB.Endpoints, 1)
+	assert.NotEqual(t, clusterA.Endpoints[0], clusterB.Endpoints[0],
+		"URL edit should update endpoint details on the same cluster name")
 }
🤖 Prompt for AI Agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

In `@gateway/gateway-controller/pkg/transform/restapi_test.go` around lines 503 -
522, The test TestRestAPITransformer_APILevelURLStableAcrossURLEdit validates
that cluster naming remains stable when the upstream URL changes from cfgA to
cfgB, but it does not verify that the actual upstream endpoint values differ
between rdcA and rdcB. Add an assertion after the existing assert.Equal call
that compares the upstream endpoint values from both rdcA and rdcB routes and
verifies they are different (using assert.NotEqual), ensuring the test validates
the complete contract that cluster names are stable while backend targets are
updated.
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.

Nitpick comments:
In `@gateway/gateway-controller/pkg/transform/restapi_test.go`:
- Around line 503-522: The test
TestRestAPITransformer_APILevelURLStableAcrossURLEdit validates that cluster
naming remains stable when the upstream URL changes from cfgA to cfgB, but it
does not verify that the actual upstream endpoint values differ between rdcA and
rdcB. Add an assertion after the existing assert.Equal call that compares the
upstream endpoint values from both rdcA and rdcB routes and verifies they are
different (using assert.NotEqual), ensuring the test validates the complete
contract that cluster names are stable while backend targets are updated.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Run ID: 2fee9be3-4100-4c76-a981-42a203012318

📥 Commits

Reviewing files that changed from the base of the PR and between eee0e2b and f8fcbb9.

📒 Files selected for processing (12)
  • gateway/gateway-controller/cmd/controller/main.go
  • gateway/gateway-controller/pkg/policyxds/policyxds_test.go
  • gateway/gateway-controller/pkg/transform/restapi.go
  • gateway/gateway-controller/pkg/transform/restapi_test.go
  • gateway/gateway-controller/pkg/utils/clusterkey/clusterkey.go
  • gateway/gateway-controller/pkg/utils/clusterkey/clusterkey_test.go
  • gateway/gateway-controller/pkg/xds/translator.go
  • gateway/gateway-controller/pkg/xds/translator_test.go
  • gateway/it/features/api-level-url-stable.feature
  • gateway/it/steps_envoy_admin.go
  • gateway/it/suite_test.go
  • gateway/spec/impls/1-basic-gateway-with-controller/data-model.md

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants