Skip to content

Envoy Gateway extension server#192

Merged
scotwells merged 13 commits into
mainfrom
feat/eg-extension-server
Jun 18, 2026
Merged

Envoy Gateway extension server#192
scotwells merged 13 commits into
mainfrom
feat/eg-extension-server

Conversation

@scotwells

Copy link
Copy Markdown
Contributor

Why

NSO mutates the edge data plane (WAF + Connector tunneling) by emitting a per-gateway EnvoyPatchPolicy for every gateway — which scales poorly and couples mutation to upstream connectivity. This PR introduces the Envoy Gateway extension server: it performs the same mutations directly on the post-translation xDS snapshot via EG's PostTranslateModify gRPC hook, co-located with the data plane.

Implements the Envoy Gateway extension server enhancement proposal (#187).

What's in this PR (the core)

Extension serverinternal/extensionserver/

  • gRPC PostTranslateModify server with a cache-backed PolicyIndex (TrafficProtectionPolicy, HTTPProxy, Connector, Namespace), Coraza WAF injection + per-route config, Connector cluster/route rewiring, mTLS, Prometheus metrics, and optional OTEL tracing. Launched via a manager extension-server subcommand.

Deploymentconfig/extension-server/ (HA Deployment, RBAC, cert-manager mTLS, NetworkPolicy, PDB, Service) and config/tools/envoy-gateway-downstream/ (a dedicated EG release wired to the extension server, kept separate from the shared EG to avoid cluster-scoped resource collisions).

Controller wiring

  • gateway.eppEmissionEnabled config flag (default true — no behavior change until flipped): when false, the TPP/HTTPProxy controllers emit zero EnvoyPatchPolicies and the extension server owns mutation. Rollback = set true.
  • Replicate TrafficProtectionPolicy/HTTPProxy/Connector to the downstream edge so the extension server reads them without upstream connectivity.
  • Connector Ready-state downstream annotation; per-gateway programmed metric.
  • Defensive guard against an empty upstream-owner cluster-name label.

Regenerated RBAC + deepcopy.

Verification

golangci-lint run ./...0 issues; go build ./..., go vet ./..., go test ./... all pass. The extension server has its own unit tests (cache/mutate/server), and the controller gating/replication paths have tests.

Scope / follow-ups

Deliberately core only. These land in separate PRs on top:

  • E2E consolidation (two-cluster chainsaw + bootstrap, config/extension-server-e2e)
  • Performance / scaling harness (test/perf/)
  • Observability (jsonnet mixin + dashboards/alerts)

Built on the merged Go 1.26 / EG 1.8 / Gateway API 1.5 base (#189) and lint baseline (#191).

🤖 Generated with Claude Code

scotwells and others added 2 commits June 17, 2026 15:35
Introduce the Envoy Gateway extension server that performs WAF (Coraza) and
Connector tunnel mutation directly on the post-translation xDS snapshot via the
PostTranslateModify gRPC hook, replacing per-gateway EnvoyPatchPolicy emission.

Core:
- internal/extensionserver/: gRPC PostTranslateModify server (cache-backed
  PolicyIndex, TPP/Connector mutation, mTLS, metrics, optional OTEL tracing),
  launched via `manager extension-server` subcommand (cmd/main.go).
- config/extension-server/: HA Deployment, RBAC, cert-manager mTLS, NetworkPolicy,
  PDB, Service.
- config/tools/envoy-gateway-downstream/: dedicated Envoy Gateway release wired to
  the extension server (separate from the shared EG to avoid cluster-scoped
  resource collisions).

Controller wiring:
- gateway.eppEmissionEnabled config flag (default true): when false, the TPP and
  HTTPProxy controllers emit zero EnvoyPatchPolicies (the extension server owns
  mutation); rollback = set true.
- replicate TrafficProtectionPolicy/HTTPProxy/Connector to the downstream edge so
  the extension server can read them without upstream connectivity.
- connector Ready-state downstream annotation; per-gateway programmed metric.
- downstreamclient: guard the upstream-owner label against an empty cluster name.

Regenerated RBAC/deepcopy. Verified: golangci-lint run ./... = 0; go build/vet/
test ./... all pass.

E2E consolidation, the performance harness, and observability land in follow-up
PRs on top of this core.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Replace the ad-hoc os.Args[1]=="extension-server" peek with a cobra command
tree under a "network-services" parent command:

  network-services manager                          run the controller manager
  network-services envoy-gateway-extension-server   run the extension server

Both subcommands bind their existing flag.FlagSet via Flags().AddGoFlagSet, so
the zap flag binding and flag surface are unchanged; flags use pflag's
double-dash form, which the deployment manifests already use.

Rename the built binary manager -> network-services (Dockerfile, Makefile) for
consistency with the repo and the published image, and point the manager and
extension-server Deployments at the matching subcommand. cobra was promoted
from an indirect to a direct dependency.
@scotwells scotwells force-pushed the feat/eg-extension-server branch from 1bbc750 to bcc10c1 Compare June 17, 2026 21:07
scotwells and others added 11 commits June 17, 2026 16:16
cmd/main.go is now a thin entrypoint: it keeps only the ldflag-injected version
vars (so -X main.version=... stays valid) and assembles the cobra root with the
manager and extension-server subcommands.

The controller-manager command moves to internal/cmd/manager (package
managercmd) as NewCommand(BuildInfo), mirroring how the extension server lives
in internal/extensionserver/cmd. The scheme/init and the
initializeClusterDiscovery/ignoreCanceled helpers move with it. Build info is
passed in via a BuildInfo struct instead of package-level vars.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…r for TLS

Replace the self-signed cert-manager PKI baked into config/extension-server with
the platform cert pattern from datum-cloud/infra (mirrors the NSO control-plane
manager.yaml):

- Pod server cert via the cert-manager CSI driver (csi.cert-manager.io) issued by
  the datum-control-plane ClusterIssuer; mounted at /tls (tls.crt, tls.key).
- CA bundle via the trust-manager datum-control-plane-trust-bundle ConfigMap,
  mounted at /tls-ca (ca.crt); --tls-client-ca points there.
- Delete the SelfSigned Issuer + self-signed CA + CA Issuer entirely.
- EG's mTLS client cert (extension-server-eg-client-tls) stays a cert-manager
  Certificate (EG's extensionManager API takes a SecretObjectReference) but is now
  issued by datum-control-plane instead of the self-signed CA.

The cert CA/issuers/CSI driver/trust bundle are infra-owned; this repo declares
only the workload's volumes + the EG client cert. Open items documented inline
for infra: the EG certificateRef CA Secret, the trust-bundle injection label on
the deploy namespace, and dns-names if deployed outside network-services-operator-system.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The publish-kustomize-bundle step only ran `kustomize edit set image` against
config/manager, so the extension-server overlay shipped with newTag: latest in
the OCI bundle — version drift for a workload that runs the same image as the
manager. Add config/extension-server to image-overlays so both are pinned to the
release tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…atch

- Delete config/extension-server/extensionmanager-patch.yaml: a standalone
  kubectl-patch doc for the shared envoy-gateway HelmRelease, referenced by no
  kustomization and redundant with the dedicated downstream EG (which bakes the
  same extensionManager wiring into its Helm values).
- Repoint the downstream EG extensionManager certificateRef off the now-deleted
  self-signed extension-server-ca Secret to the datum-control-plane CA (open item
  for infra to confirm exact Secret name/namespace).
- Trim tutorial-style/verbose comments across the extension-server manifests
  (~50-66% fewer comment lines), keeping the load-bearing operational rationale.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ension-server

Rename the extension server's Kubernetes resources from `extension-server` to
`envoy-gateway-extension-server` so they're self-describing and match the
subcommand. Covers the Deployment, Service, ServiceAccount, PDB, Role/RoleBinding,
and NetworkPolicy names, plus the references that must follow: serviceAccountName,
the RoleBinding subject, the server cert dns-names, and the downstream EG
extensionManager service.fqdn.hostname.

The app.kubernetes.io/component label (extension-server) and the EG client cert
Secret (extension-server-eg-client-tls) are left unchanged — a categorical label
and EG's own cert, respectively.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ments

- Default the cert issuer/CA refs to placeholders (placeholder-issuer,
  placeholder-ca-bundle, placeholder-ca/placeholder-namespace) that an overlay
  must patch; the base no longer names any environment-specific issuer.
- Cut YAML comments down to only the load-bearing ones: placeholder/overlay-patch
  directives and a few non-obvious production gotchas (includeAll mandatory,
  failOpen/maxUnavailable coupling, startup cache-sync window). Removed
  tutorial/restated-default/banner prose throughout.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…onsistently

Finish the rename so every extension-server config uses the self-documenting
form: the app.kubernetes.io/component label value (across all manifests and
their selectors) and the EG client-cert Secret
(extension-server-eg-client-tls -> envoy-gateway-extension-server-eg-client-tls,
plus the downstream EG clientCertificateRef that consumes it).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The downstream edge cluster running the dedicated Envoy Gateway needs the
NSO CRDs that the GatewayResourceReplicator mirrors into it and the
extension server's cache reads locally: TrafficProtectionPolicy, HTTPProxy,
and Connector. Without them the replicator's downstream writes and the
extension server's informers fail to sync — observed as a staging failure
where no downstream gateway programmed.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Temporarily point validate-kustomize at the datum-cloud/actions feature
branch (datum-cloud/actions#80) so the downstream CRD overlay's ../bases
references validate the same way Flux deploys them. Pin back to a release
tag once it merges and a tag is cut.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
datum-cloud/actions#80 (build with LoadRestrictionsNone) merged and shipped
in v1.16.0. Pin back from the feature branch to the release tag.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ides

Reference all extension-server flags through env vars ($(GRPC_ADDR), …,
$(SERVER_CONFIG)) with defaults set in the deployment. Overlays can now
override any value with a strategic-merge patch on env (matched by name)
instead of replacing the args list. SERVER_CONFIG defaults to empty so the
operator coraza config is opt-in per environment.
@scotwells scotwells changed the title Envoy Gateway extension server (core) Envoy Gateway extension server Jun 17, 2026
@scotwells scotwells marked this pull request as ready for review June 17, 2026 23:51
@scotwells scotwells requested review from a team, ecv and kevwilliams June 17, 2026 23:51
@scotwells scotwells merged commit a238174 into main Jun 18, 2026
10 of 11 checks passed
@scotwells scotwells deleted the feat/eg-extension-server branch June 18, 2026 00:57
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants