feat: replace ingress-nginx smoke cluster with Gateway API and Envoy Gateway#3386
feat: replace ingress-nginx smoke cluster with Gateway API and Envoy Gateway#3386avirajkhare00 wants to merge 1 commit into
Conversation
|
Thanks for the PR. It is labeled Slash commands (own line, regular comment) move it around the queue:
See CONTRIBUTING.md for details. |
| kubectl port-forward "svc/${svc_name}" "${pf_port}:80" \ | ||
| --namespace "$HELM_SMOKE_GATEWAY_NAMESPACE" \ | ||
| >/dev/null 2>&1 & | ||
| HELM_SMOKE_GW_PF_PID=$! |
There was a problem hiding this comment.
get_gateway_base_url is invoked as gateway_url="$(get_gateway_base_url)" at line 441, so it runs in a command-substitution subshell. that means this HELM_SMOKE_GW_PF_PID=$! sets the var only inside the subshell - the parent's copy stays empty, so kill "${HELM_SMOKE_GW_PF_PID:-}" at line 477 runs kill "" and does nothing, and the kubectl port-forward orphans holding 127.0.0.1:8080. the comment at 270-271 about killing it regardless of exit path isn't true. on the ephemeral ci runner the orphan dies with the vm so the workflow still passes, but running this locally as a dev tool leaves a stale port-forward that breaks the next run's bind on :8080. separately there's no trap, so even with correct scoping the kill only fires on the success path - a failed curl/test aborts under set -e before reaching line 477. cleanest fix is to start the port-forward directly in smoke() so the pid lands in the parent, plus one trap ... EXIT that kills it and removes the temp values file.
| # Install EG first so its bundled GW API CRDs land before the v1.5 safe-upgrade VAP. | ||
| # We then upgrade those CRDs to the pinned version via server-side apply. | ||
| echo "Installing Envoy Gateway ${HELM_SMOKE_ENVOY_GATEWAY_VERSION}..." | ||
| helm upgrade --install eg \ |
There was a problem hiding this comment.
envoy gateway v1.4.1 is pinned here while the gateway api crds get upgraded to v1.5.0 (line 138) on a v1.35.0 kind node (line 54). per the envoy gateway compat matrix, eg v1.4 targets gateway api v1.3.0 and k8s v1.30-v1.33, so all three are out of the supported matrix (no eg release through v1.6 lists k8s v1.35 or gw-api v1.5). it probably still reconciles since the core Gateway/HTTPRoute types are v1 GA and the v1.5 fields are additive, but it's unsupported, and if the controller can't satisfy a v1.5 default or policy the Gateway never reaches Programmed and the wait at line 184 burns the full 5m then fails. worth bumping eg to a release that lists these versions, or pinning gw-api/kind back to the v1.4 matrix. (the k8s v1.35 pin is carried over from the old nginx path; only eg + gw-api are new here.)
|
|
||
| svc_selector="gateway.envoyproxy.io/owning-gateway-name=${HELM_SMOKE_GATEWAY_NAME},gateway.envoyproxy.io/owning-gateway-namespace=${HELM_SMOKE_GATEWAY_NAMESPACE}" | ||
|
|
||
| svc_name="$(kubectl get svc \ |
There was a problem hiding this comment.
single-shot lookup with no retry. the Gateway reaching Programmed doesn't guarantee the owning-gateway Service object exists yet, so a transient miss here fails the run. also on an empty result kubectl ... jsonpath='{.items[0]...}' exits non-zero and prints a raw array index out of bounds to stderr - the [ -z ] guard below does still catch it (the assignment doesn't abort under set -e in this command-sub form), but both the raw error and the friendly one end up printed. add 2>/dev/null to the lookup and wrap it in a short retry.
| type: Kubernetes | ||
| kubernetes: | ||
| envoyService: | ||
| type: NodePort |
There was a problem hiding this comment.
type: NodePort here is effectively unused. the new kind config (lines 82-87) has no extraPortMappings, and the smoke test reaches envoy only via kubectl port-forward svc/...:80, which tunnels through the apiserver to the ClusterIP regardless of service type - the allocated node port is never dialed. default ClusterIP is enough, so the provider override can be dropped.
| gateway_url="$(get_gateway_base_url)" | ||
|
|
||
| kubectl version --client=true > "$HELM_SMOKE_REPORT_DIR/kubectl-version.txt" | ||
| kubectl -n "$HELM_SMOKE_NAMESPACE" rollout status "deployment/$HELM_SMOKE_RELEASE" --timeout="$HELM_SMOKE_TIMEOUT" |
There was a problem hiding this comment.
these two rollout status calls are redundant - the helm upgrade --install --wait above already blocks until both deployments are available, and you return early if helm fails, so by here they're already rolled out and these return immediately. can drop them (they do still emit a line to the ci log if that's worth keeping).
| >/dev/null 2>&1 & | ||
| HELM_SMOKE_GW_PF_PID=$! | ||
|
|
||
| # Wait up to 15 s for the tunnel to accept connections. |
There was a problem hiding this comment.
the comment says up to 15 s, but the loop is 15 iterations of curl --max-time 2 plus sleep 1, so worst case is closer to 45s if the connection hangs to max-time each iter (~30s on the connect-timeout path). just the comment is off, the loop itself is fine.
| get_gateway_base_url() { | ||
| local svc_selector | ||
| local svc_name | ||
| local pf_port=8080 |
There was a problem hiding this comment.
pf_port=8080 is the one knob in this script without a ${VAR:-default} override - every other tunable is env-overridable. minor consistency thing; an env override like HELM_SMOKE_GATEWAY_PF_PORT would let someone work around a busy host loopback.
hubcio
left a comment
There was a problem hiding this comment.
a couple of findings that don't map to a line in this diff:
the helm smoke job didn't actually run on this PR. .github/config/components.yml gates the helm component on paths: helm/** only, and this PR only touches scripts/ci/*.sh, so detect-changes skipped the validate+smoke matrix - the green checks here are lint/shellcheck/license, none of them helm-named. so this whole rewrite (the envoy gateway install, the v1.5 crd apply, the port-forward, the curl loops) shipped without ci ever exercising it, and future edits to these smoke scripts will keep skipping too. adding scripts/ci/setup-helm-smoke-cluster.sh and scripts/ci/test-helm.sh to the helm component paths would gate them - rust-bench-dashboard already co-lists its scripts/dashboard/** the same way. this is the one i'd fix first, it's why everything else here went unexercised.
minor: HELM_SMOKE_GATEWAY_NAMESPACE and HELM_SMOKE_GATEWAY_NAME are defined with matching defaults in both scripts, and the HTTPRoute parentRef has to match the Gateway. overriding one of them in only one script silently breaks route attach - worth a note that they have to be set together.
Which issue does this PR address?
Closes #3098
Rationale
Ingress NGINX was officially retired in March 2026. Our Helm smoke/CI
cluster setup still pulled the retired controller, which gets no further
security or bug fixes.
What changed?
scripts/ci/setup-helm-smoke-cluster.shmappings, admission-webhook polling helpers).
v1.5.0 with
--server-side --force-conflicts. This ordering avoids thesafe-upgrade ValidatingAdmissionPolicy blocking EG's older bundled CRDs.
then waits for the Programmed condition.
scripts/ci/test-helm.shHELM_SMOKE_INGRESS_CLASS; chart is now deployed without Ingressobjects.
Services (server :3000, UI :3050).
kubectl port-forwardto expose the gateway on127.0.0.1:8080,which works on both Linux CI runners and macOS (unlike direct NodePort
access, which fails inside Docker Desktop's VM).
Local Execution
AI Usage
If AI tools were used, please answer: