feat: replace ingress-nginx smoke cluster with Gateway API and Envoy Gateway #3386

Open
avirajkhare00 wants to merge 1 commit into apache:master from avirajkhare00:master

@avirajkhare00
Contributor

Which issue does this PR address?

Closes #3098

Rationale

Ingress NGINX was officially retired in March 2026. Our Helm smoke/CI
cluster setup still pulled the retired controller, which gets no further
security or bug fixes.

What changed?

scripts/ci/setup-helm-smoke-cluster.sh

  • Removed nginx-specific kind config (ingress-ready label, hostPort
    mappings, admission-webhook polling helpers).
  • Installs Envoy Gateway via Helm first, then upgrades Gateway API CRDs to
    v1.5.0 with --server-side --force-conflicts. This ordering avoids the
    safe-upgrade ValidatingAdmissionPolicy blocking EG's older bundled CRDs.
  • Creates an EnvoyProxy (NodePort), GatewayClass, and Gateway resource,
    then waits for the Programmed condition.

scripts/ci/test-helm.sh

  • Removed HELM_SMOKE_INGRESS_CLASS; chart is now deployed without Ingress
    objects.
  • After helm install, applies two HTTPRoute resources pointing at the chart
    Services (server :3000, UI :3050).
  • Uses kubectl port-forward to expose the gateway on 127.0.0.1:8080,
    which works on both Linux CI runners and macOS (unlike direct NodePort
    access, which fails inside Docker Desktop's VM).

Local Execution

  • Passed
  • Pre-commit hooks ran

AI Usage

If AI tools were used, please answer:

  1. Which tools? (e.g., GitHub Copilot, Claude, ChatGPT) -> Claude
  2. Scope of usage? - fixed existing code
  3. How did you verify the generated code works correctly? -> ran and verified all locally
  4. Can you explain every line of the code if asked? -> YES

github-actions Bot commented Jun 1, 2026

Thanks for the PR. It is labeled S-waiting-on-review and queued for review.

Slash commands (own line, regular comment) move it around the queue:

  • /ready - back to S-waiting-on-review after addressing feedback
  • /request-review @user-or-team - request a reviewer

See CONTRIBUTING.md for details.

github-actions Bot added the S-waiting-on-review label (PR is waiting on a reviewer) on Jun 1, 2026
Comment thread scripts/ci/test-helm.sh
kubectl port-forward "svc/${svc_name}" "${pf_port}:80" \
  --namespace "$HELM_SMOKE_GATEWAY_NAMESPACE" \
  >/dev/null 2>&1 &
HELM_SMOKE_GW_PF_PID=$!

get_gateway_base_url is invoked as gateway_url="$(get_gateway_base_url)" at line 441, so it runs in a command-substitution subshell. that means HELM_SMOKE_GW_PF_PID=$! sets the variable only inside the subshell: the parent's copy stays empty, so kill "${HELM_SMOKE_GW_PF_PID:-}" at line 477 runs kill "" and does nothing, and the kubectl port-forward is orphaned while still holding 127.0.0.1:8080. the comment at lines 270-271 about killing it regardless of exit path isn't true. on the ephemeral ci runner the orphan dies with the vm, so the workflow still passes, but running this locally as a dev tool leaves a stale port-forward that breaks the next run's bind on :8080.

separately, there's no trap, so even with correct scoping the kill only fires on the success path: a failed curl/test aborts under set -e before reaching line 477. the cleanest fix is to start the port-forward directly in smoke() so the pid lands in the parent, plus one trap ... EXIT that kills it and removes the temp values file.
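
a minimal sketch of the scoping problem and the suggested shape of the fix, not the script's actual code. `sleep 30` stands in for `kubectl port-forward`, and `PF_PID`, `TMP_VALUES`, `start_tunnel`, and `cleanup` are hypothetical names chosen for illustration:

```shell
#!/usr/bin/env bash
set -euo pipefail

PF_PID=""

# Broken pattern: $(...) runs in a subshell, so the assignment made there
# never reaches the parent shell.
seen_in_subshell="$(PF_PID="set-in-subshell"; echo "${PF_PID}")"
echo "subshell: PF_PID=${seen_in_subshell}; parent: PF_PID='${PF_PID}'"

# Returning a value through stdout is fine; only side effects are lost.
TMP_VALUES="$(mktemp)"

# Hypothetical helper: `sleep 30` stands in for `kubectl port-forward`.
start_tunnel() {
  sleep 30 >/dev/null 2>&1 &
  PF_PID=$!
}

# Runs when the script exits, whether it succeeds or aborts under set -e.
cleanup() {
  if [ -n "${PF_PID}" ]; then
    kill "${PF_PID}" 2>/dev/null || true
  fi
  rm -f "${TMP_VALUES}"
}
trap cleanup EXIT

# Fixed pattern: call the helper directly so $! lands in the parent shell.
start_tunnel
echo "parent: PF_PID=${PF_PID}"
```

the first echo shows the parent's `PF_PID` is still empty after the subshell assignment, which is the same reason the kill at line 477 has nothing to act on. registering the trap before starting the background process means a failure in between still cleans up the temp file.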

# Install EG first so its bundled GW API CRDs land before the v1.5 safe-upgrade VAP.
# We then upgrade those CRDs to the pinned version via server-side apply.
echo "Installing Envoy Gateway ${HELM_SMOKE_ENVOY_GATEWAY_VERSION}..."
helm upgrade --install eg \

envoy gateway v1.4.1 is pinned here while the gateway api crds get upgraded to v1.5.0 (line 138) on a v1.35.0 kind node (line 54). per the envoy gateway compat matrix, eg v1.4 targets gateway api v1.3.0 and k8s v1.30-v1.33, so all three are out of the supported matrix (no eg release through v1.6 lists k8s v1.35 or gw-api v1.5). it probably still reconciles since the core Gateway/HTTPRoute types are v1 GA and the v1.5 fields are additive, but it's unsupported, and if the controller can't satisfy a v1.5 default or policy the Gateway never reaches Programmed and the wait at line 184 burns the full 5m then fails. worth bumping eg to a release that lists these versions, or pinning gw-api/kind back to the v1.4 matrix. (the k8s v1.35 pin is carried over from the old nginx path; only eg + gw-api are new here.)

Comment thread scripts/ci/test-helm.sh

svc_selector="gateway.envoyproxy.io/owning-gateway-name=${HELM_SMOKE_GATEWAY_NAME},gateway.envoyproxy.io/owning-gateway-namespace=${HELM_SMOKE_GATEWAY_NAMESPACE}"

svc_name="$(kubectl get svc \

single-shot lookup with no retry. the Gateway reaching Programmed doesn't guarantee the owning-gateway Service object exists yet, so a transient miss here fails the run. also on an empty result kubectl ... jsonpath='{.items[0]...}' exits non-zero and prints a raw array index out of bounds to stderr - the [ -z ] guard below does still catch it (the assignment doesn't abort under set -e in this command-sub form), but both the raw error and the friendly one end up printed. add 2>/dev/null to the lookup and wrap it in a short retry.
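
one possible shape for that retry, as a sketch only. `retry_nonempty` and `fake_lookup` are hypothetical names; `fake_lookup` simulates a lookup that fails twice before the Service appears, and the real `kubectl get svc` command would take its place:

```shell
#!/usr/bin/env bash
set -euo pipefail

# retry_nonempty ATTEMPTS DELAY_SECONDS CMD...
# Runs CMD until it prints something on stdout, discarding its stderr.
# Returns 1 if every attempt comes back empty.
retry_nonempty() {
  local attempts="$1" delay="$2" out i
  shift 2
  for ((i = 1; i <= attempts; i++)); do
    out="$("$@" 2>/dev/null || true)"
    if [ -n "$out" ]; then
      printf '%s\n' "$out"
      return 0
    fi
    if [ "$i" -lt "$attempts" ]; then
      sleep "$delay"
    fi
  done
  return 1
}

# Stand-in for the kubectl lookup: fails twice, then prints a name.
# The counter lives in a file because each call runs in a subshell.
counter_file="$(mktemp)"
echo 0 > "$counter_file"
fake_lookup() {
  local n
  n=$(( $(cat "$counter_file") + 1 ))
  echo "$n" > "$counter_file"
  if [ "$n" -lt 3 ]; then
    echo "error: array index out of bounds" >&2
    return 1
  fi
  echo "envoy-demo-svc"
}

svc_name="$(retry_nonempty 5 0 fake_lookup)"
attempts_used="$(cat "$counter_file")"
rm -f "$counter_file"
echo "found ${svc_name} after ${attempts_used} attempts"
```

the raw jsonpath error never reaches the log because stderr is dropped inside the helper, so only the script's own friendly message would print if all attempts fail.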

type: Kubernetes
kubernetes:
  envoyService:
    type: NodePort

type: NodePort here is effectively unused. the new kind config (lines 82-87) has no extraPortMappings, and the smoke test reaches envoy only via kubectl port-forward svc/...:80, which tunnels through the apiserver to the ClusterIP regardless of service type - the allocated node port is never dialed. default ClusterIP is enough, so the provider override can be dropped.

Comment thread scripts/ci/test-helm.sh
gateway_url="$(get_gateway_base_url)"

kubectl version --client=true > "$HELM_SMOKE_REPORT_DIR/kubectl-version.txt"
kubectl -n "$HELM_SMOKE_NAMESPACE" rollout status "deployment/$HELM_SMOKE_RELEASE" --timeout="$HELM_SMOKE_TIMEOUT"

these two rollout status calls are redundant - the helm upgrade --install --wait above already blocks until both deployments are available, and you return early if helm fails, so by here they're already rolled out and these return immediately. can drop them (they do still emit a line to the ci log if that's worth keeping).

Comment thread scripts/ci/test-helm.sh
>/dev/null 2>&1 &
HELM_SMOKE_GW_PF_PID=$!

# Wait up to 15 s for the tunnel to accept connections.

the comment says up to 15 s, but the loop is 15 iterations of curl --max-time 2 plus sleep 1, so worst case is closer to 45s if the connection hangs to max-time each iter (~30s on the connect-timeout path). just the comment is off, the loop itself is fine.
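
the 45s figure is just the per-iteration cost times the loop count. a quick check, with variable names that are illustrative rather than taken from the script:

```shell
#!/usr/bin/env bash
set -euo pipefail

iterations=15   # loop count
max_time=2      # curl --max-time, in seconds
pause=1         # sleep between attempts, in seconds

# Each iteration can take up to max_time for curl plus the fixed pause.
worst_case=$(( iterations * (max_time + pause) ))
echo "worst case: ${worst_case}s"
```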

Comment thread scripts/ci/test-helm.sh
get_gateway_base_url() {
  local svc_selector
  local svc_name
  local pf_port=8080

pf_port=8080 is the one knob in this script without a ${VAR:-default} override - every other tunable is env-overridable. minor consistency thing; an env override like HELM_SMOKE_GATEWAY_PF_PORT would let someone work around a busy host loopback.
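
a sketch of what that override could look like, using the HELM_SMOKE_GATEWAY_PF_PORT name floated above. the range check is an extra assumption on top of the suggestion, not something the script does today:

```shell
#!/usr/bin/env bash
set -euo pipefail

# Fall back to 8080 when the (proposed) override is unset or empty.
pf_port="${HELM_SMOKE_GATEWAY_PF_PORT:-8080}"

# Reject values that are not a valid TCP port before using them.
if ! [[ "$pf_port" =~ ^[0-9]+$ ]] || [ "$pf_port" -lt 1 ] || [ "$pf_port" -gt 65535 ]; then
  echo "invalid HELM_SMOKE_GATEWAY_PF_PORT: ${pf_port}" >&2
  exit 1
fi

echo "gateway base URL: http://127.0.0.1:${pf_port}"
```

inside get_gateway_base_url the same expansion works with `local`, so the default behaviour stays at 8080 when nothing is exported.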

github-actions Bot added the S-waiting-on-author label (PR is waiting on author response) and removed the S-waiting-on-review label (PR is waiting on a reviewer) on Jun 1, 2026
Contributor

@hubcio hubcio left a comment


a couple of findings that don't map to a line in this diff:

the helm smoke job didn't actually run on this PR. .github/config/components.yml gates the helm component on paths: helm/** only, and this PR only touches scripts/ci/*.sh, so detect-changes skipped the validate+smoke matrix - the green checks here are lint/shellcheck/license, none of them helm-named. so this whole rewrite (the envoy gateway install, the v1.5 crd apply, the port-forward, the curl loops) shipped without ci ever exercising it, and future edits to these smoke scripts will keep skipping too. adding scripts/ci/setup-helm-smoke-cluster.sh and scripts/ci/test-helm.sh to the helm component paths would gate them - rust-bench-dashboard already co-lists its scripts/dashboard/** the same way. this is the one i'd fix first, it's why everything else here went unexercised.

minor: HELM_SMOKE_GATEWAY_NAMESPACE and HELM_SMOKE_GATEWAY_NAME are defined with matching defaults in both scripts, and the HTTPRoute parentRef has to match the Gateway. overriding one of them in only one script silently breaks route attach - worth a note that they have to be set together.

Labels

S-waiting-on-author PR is waiting on author response

Development

Successfully merging this pull request may close these issues.

Helm smoke tests: move off retired ingress, use Gateway API

2 participants