Envoy Gateway extension server#192
Merged
Merged
Conversation
Introduce the Envoy Gateway extension server that performs WAF (Coraza) and Connector tunnel mutation directly on the post-translation xDS snapshot via the PostTranslateModify gRPC hook, replacing per-gateway EnvoyPatchPolicy emission. Core: - internal/extensionserver/: gRPC PostTranslateModify server (cache-backed PolicyIndex, TPP/Connector mutation, mTLS, metrics, optional OTEL tracing), launched via `manager extension-server` subcommand (cmd/main.go). - config/extension-server/: HA Deployment, RBAC, cert-manager mTLS, NetworkPolicy, PDB, Service. - config/tools/envoy-gateway-downstream/: dedicated Envoy Gateway release wired to the extension server (separate from the shared EG to avoid cluster-scoped resource collisions). Controller wiring: - gateway.eppEmissionEnabled config flag (default true): when false, the TPP and HTTPProxy controllers emit zero EnvoyPatchPolicies (the extension server owns mutation); rollback = set true. - replicate TrafficProtectionPolicy/HTTPProxy/Connector to the downstream edge so the extension server can read them without upstream connectivity. - connector Ready-state downstream annotation; per-gateway programmed metric. - downstreamclient: guard the upstream-owner label against an empty cluster name. Regenerated RBAC/deepcopy. Verified: golangci-lint run ./... = 0; go build/vet/ test ./... all pass. E2E consolidation, the performance harness, and observability land in follow-up PRs on top of this core. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the ad-hoc os.Args[1]=="extension-server" peek with a cobra command tree under a "network-services" parent command: network-services manager run the controller manager network-services envoy-gateway-extension-server run the extension server Both subcommands bind their existing flag.FlagSet via Flags().AddGoFlagSet, so the zap flag binding and flag surface are unchanged; flags use pflag's double-dash form, which the deployment manifests already use. Rename the built binary manager -> network-services (Dockerfile, Makefile) for consistency with the repo and the published image, and point the manager and extension-server Deployments at the matching subcommand. cobra was promoted from an indirect to a direct dependency.
1bbc750 to
bcc10c1
Compare
cmd/main.go is now a thin entrypoint: it keeps only the ldflag-injected version vars (so -X main.version=... stays valid) and assembles the cobra root with the manager and extension-server subcommands. The controller-manager command moves to internal/cmd/manager (package managercmd) as NewCommand(BuildInfo), mirroring how the extension server lives in internal/extensionserver/cmd. The scheme/init and the initializeClusterDiscovery/ignoreCanceled helpers move with it. Build info is passed in via a BuildInfo struct instead of package-level vars. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r for TLS Replace the self-signed cert-manager PKI baked into config/extension-server with the platform cert pattern from datum-cloud/infra (mirrors the NSO control-plane manager.yaml): - Pod server cert via the cert-manager CSI driver (csi.cert-manager.io) issued by the datum-control-plane ClusterIssuer; mounted at /tls (tls.crt, tls.key). - CA bundle via the trust-manager datum-control-plane-trust-bundle ConfigMap, mounted at /tls-ca (ca.crt); --tls-client-ca points there. - Delete the SelfSigned Issuer + self-signed CA + CA Issuer entirely. - EG's mTLS client cert (extension-server-eg-client-tls) stays a cert-manager Certificate (EG's extensionManager API takes a SecretObjectReference) but is now issued by datum-control-plane instead of the self-signed CA. The cert CA/issuers/CSI driver/trust bundle are infra-owned; this repo declares only the workload's volumes + the EG client cert. Open items documented inline for infra: the EG certificateRef CA Secret, the trust-bundle injection label on the deploy namespace, and dns-names if deployed outside network-services-operator-system. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The publish-kustomize-bundle step only ran `kustomize edit set image` against config/manager, so the extension-server overlay shipped with newTag: latest in the OCI bundle — version drift for a workload that runs the same image as the manager. Add config/extension-server to image-overlays so both are pinned to the release tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atch - Delete config/extension-server/extensionmanager-patch.yaml: a standalone kubectl-patch doc for the shared envoy-gateway HelmRelease, referenced by no kustomization and redundant with the dedicated downstream EG (which bakes the same extensionManager wiring into its Helm values). - Repoint the downstream EG extensionManager certificateRef off the now-deleted self-signed extension-server-ca Secret to the datum-control-plane CA (open item for infra to confirm exact Secret name/namespace). - Trim tutorial-style/verbose comments across the extension-server manifests (~50-66% fewer comment lines), keeping the load-bearing operational rationale. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ension-server Rename the extension server's Kubernetes resources from `extension-server` to `envoy-gateway-extension-server` so they're self-describing and match the subcommand. Covers the Deployment, Service, ServiceAccount, PDB, Role/RoleBinding, and NetworkPolicy names, plus the references that must follow: serviceAccountName, the RoleBinding subject, the server cert dns-names, and the downstream EG extensionManager service.fqdn.hostname. The app.kubernetes.io/component label (extension-server) and the EG client cert Secret (extension-server-eg-client-tls) are left unchanged — a categorical label and EG's own cert, respectively. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ments - Default the cert issuer/CA refs to placeholders (placeholder-issuer, placeholder-ca-bundle, placeholder-ca/placeholder-namespace) that an overlay must patch; the base no longer names any environment-specific issuer. - Cut YAML comments down to only the load-bearing ones: placeholder/overlay-patch directives and a few non-obvious production gotchas (includeAll mandatory, failOpen/maxUnavailable coupling, startup cache-sync window). Removed tutorial/restated-default/banner prose throughout. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onsistently Finish the rename so every extension-server config uses the self-documenting form: the app.kubernetes.io/component label value (across all manifests and their selectors) and the EG client-cert Secret (extension-server-eg-client-tls -> envoy-gateway-extension-server-eg-client-tls, plus the downstream EG clientCertificateRef that consumes it). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The downstream edge cluster running the dedicated Envoy Gateway needs the NSO CRDs that the GatewayResourceReplicator mirrors into it and the extension server's cache reads locally: TrafficProtectionPolicy, HTTPProxy, and Connector. Without them the replicator's downstream writes and the extension server's informers fail to sync — observed as a staging failure where no downstream gateway programmed. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Temporarily point validate-kustomize at the datum-cloud/actions feature branch (datum-cloud/actions#80) so the downstream CRD overlay's ../bases references validate the same way Flux deploys them. Pin back to a release tag once it merges and a tag is cut. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
datum-cloud/actions#80 (build with LoadRestrictionsNone) merged and shipped in v1.16.0. Pin back from the feature branch to the release tag. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ides Reference all extension-server flags through env vars ($(GRPC_ADDR), …, $(SERVER_CONFIG)) with defaults set in the deployment. Overlays can now override any value with a strategic-merge patch on env (matched by name) instead of replacing the args list. SERVER_CONFIG defaults to empty so the operator coraza config is opt-in per environment.
ecv
approved these changes
Jun 17, 2026
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Why
NSO mutates the edge data plane (WAF + Connector tunneling) by emitting a per-gateway
EnvoyPatchPolicyfor every gateway — which scales poorly and couples mutation to upstream connectivity. This PR introduces the Envoy Gateway extension server: it performs the same mutations directly on the post-translation xDS snapshot via EG'sPostTranslateModifygRPC hook, co-located with the data plane.Implements the Envoy Gateway extension server enhancement proposal (#187).
What's in this PR (the core)
Extension server —
internal/extensionserver/PostTranslateModifyserver with a cache-backedPolicyIndex(TrafficProtectionPolicy, HTTPProxy, Connector, Namespace), Coraza WAF injection + per-route config, Connector cluster/route rewiring, mTLS, Prometheus metrics, and optional OTEL tracing. Launched via amanager extension-serversubcommand.Deployment —
config/extension-server/(HA Deployment, RBAC, cert-manager mTLS, NetworkPolicy, PDB, Service) andconfig/tools/envoy-gateway-downstream/(a dedicated EG release wired to the extension server, kept separate from the shared EG to avoid cluster-scoped resource collisions).Controller wiring
gateway.eppEmissionEnabledconfig flag (defaulttrue— no behavior change until flipped): whenfalse, the TPP/HTTPProxy controllers emit zero EnvoyPatchPolicies and the extension server owns mutation. Rollback = settrue.programmedmetric.Regenerated RBAC + deepcopy.
Verification
golangci-lint run ./...→ 0 issues;go build ./...,go vet ./...,go test ./...all pass. The extension server has its own unit tests (cache/mutate/server), and the controller gating/replication paths have tests.Scope / follow-ups
Deliberately core only. These land in separate PRs on top:
config/extension-server-e2e)test/perf/)Built on the merged Go 1.26 / EG 1.8 / Gateway API 1.5 base (#189) and lint baseline (#191).
🤖 Generated with Claude Code