Phase 5: Dynamic Webhook Middleware Kubernetes Controller #4564

Sanskarzz wants to merge 1 commit into stacklok:main
Conversation
Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR: add a section to your PR description with the following format:

## Large PR Justification
[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative: consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk. See our Contributing Guidelines for more details.

This review will be automatically dismissed once you add the justification section.

✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

Large PR justification has been provided. Thank you!
Codecov Report

❌ Patch coverage is

    @@            Coverage Diff            @@
    ##             main    #4564    +/-   ##
    ========================================
      Coverage   69.07%   69.07%
    ========================================
      Files         531      534      +3
      Lines       55170    55436    +266
    ========================================
    + Hits        38110    38294    +184
    - Misses      14132    14197     +65
    - Partials     2928     2945     +17

☔ View full report in Codecov by Sentry.
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
Hey @JAORMX I need some guidance on fixing the E2E tests for this PR.
So, I dug into the CI failures on this PR and there are three issues causing the operator e2e tests to fail. All three are in the test code, not the controller logic itself.

1. Finalizer deadlock during chainsaw cleanup. This is the big one. The finalizer blocks deletion of the MCPWebhookConfig while an MCPServer still references it, so chainsaw's default cleanup order deadlocks. The fix: add explicit cleanup ordering so the MCPServer gets deleted before the MCPWebhookConfig. The finalizer logic is actually working correctly here; it's just the test that doesn't account for it.

2. `failurePolicy` casing. The CRD enum is lowercase (`fail`/`ignore`), but the chainsaw test uses a different casing, which the API server rejects.

3. Empty spec vs CEL rule. The CRD has a CEL validation rule that rejects an empty spec, which conflicts with the chainsaw test's empty-spec case.
JAORMX left a comment:
Thanks for the work here! The overall structure is solid and the CRD design makes sense. The main theme across most of my comments is: the issue says to follow the MCPExternalAuthConfig pattern, and there are a few spots where this controller diverges from it in ways that cause real problems.
The biggest ones:

- Missing MCPServer watch in `SetupWithManager` -- this is what causes the chainsaw e2e test to fail. Without it, creating an MCPWebhookConfig won't trigger re-reconciliation of a failed MCPServer that references it. The `ReferencingServers` status also goes stale.
- `handleDeletion` returns an error instead of `RequeueAfter` + condition -- this triggers exponential backoff and is the other half of why the chainsaw cleanup times out at ~48 seconds.
- Sequential `Status().Update()` calls in the Reconcile loop will produce 409 Conflict errors in practice. The auth config controller avoids this by returning immediately from the hash-change path.
- `handleWebhookConfig` doesn't call `setReadyCondition` -- breaks the pattern every other handler follows.
On the security side, GenerateWebhookEnvVars needs the same sanitization that externalauth.go applies to env var names, and the non-deterministic map iteration will cause spurious pod restarts.
The chainsaw tests also have a couple of issues I mentioned in a separate comment (failurePolicy casing, empty spec vs CEL rule).
Looking forward to the next iteration!
```go
func (r *MCPWebhookConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mcpv1alpha1.MCPWebhookConfig{}).
		Complete(r)
}
```
So, this is the big one. The MCPExternalAuthConfig controller watches MCPServer (and MCPRemoteProxy) changes via handler.EnqueueRequestsFromMapFunc in its SetupWithManager. That's how ReferencingWorkloads stays up to date when servers come and go.
This controller only watches its own type. That means:

- `ReferencingServers` won't update when an MCPServer is created or deleted
- When a new MCPWebhookConfig is created that an existing (failed) MCPServer references, there's no trigger to re-reconcile that MCPServer
- The chainsaw reconciliation test will hang waiting for the MCPServer to enter Running... because nothing wakes it up
Take a look at how the auth config controller does it (mapMCPServerToExternalAuthConfig + the Watches call). That's the pattern we need here.
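For reference, a minimal sketch of that wiring. Names here are illustrative, not this PR's actual identifiers: `mapMCPServerToWebhookConfig` mirrors the auth config controller's map function, and `WebhookConfigRef` is the spec field the PR description mentions.

```go
// Sketch only -- mirrors the MCPExternalAuthConfig pattern; names are illustrative.
func (r *MCPWebhookConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mcpv1alpha1.MCPWebhookConfig{}).
		// Wake up the referenced MCPWebhookConfig whenever an MCPServer changes.
		Watches(
			&mcpv1alpha1.MCPServer{},
			handler.EnqueueRequestsFromMapFunc(r.mapMCPServerToWebhookConfig),
		).
		Complete(r)
}

// mapMCPServerToWebhookConfig maps an MCPServer event to a reconcile
// request for the MCPWebhookConfig it references, if any.
func (r *MCPWebhookConfigReconciler) mapMCPServerToWebhookConfig(_ context.Context, obj client.Object) []reconcile.Request {
	server, ok := obj.(*mcpv1alpha1.MCPServer)
	if !ok || server.Spec.WebhookConfigRef == nil {
		return nil
	}
	return []reconcile.Request{{NamespacedName: types.NamespacedName{
		Name:      server.Spec.WebhookConfigRef.Name,
		Namespace: server.Namespace,
	}}}
}
```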
```go
return ctrl.Result{}, fmt.Errorf("MCPWebhookConfig %s is still referenced by MCPServers: %v",
	webhookConfig.Name, serverNames)
```
Returning an error here triggers controller-runtime's exponential backoff. So you'll get retries at 1s, 2s, 4s, 8s... eventually backing off to 16+ minutes between checks. That's also why the chainsaw cleanup times out after ~48 seconds.
The auth config controller handles this differently: it sets a DeletionBlocked condition (so users can see why deletion is stuck via kubectl describe), then returns RequeueAfter: 30 * time.Second. No error, no backoff, just a steady re-check.
Note that this is an expected state, not an error. The controller is doing its job by blocking deletion while references exist. It should communicate that calmly, not panic about it :)
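Sketched against the auth config controller's approach (the condition type and reason strings here are assumptions, not the project's actual constants):

```go
// Sketch only: set a visible condition, then re-check on a steady interval.
meta.SetStatusCondition(&webhookConfig.Status.Conditions, metav1.Condition{
	Type:    "DeletionBlocked",
	Status:  metav1.ConditionTrue,
	Reason:  "ReferencedByMCPServers",
	Message: fmt.Sprintf("still referenced by MCPServers: %v", serverNames),
})
if err := r.Status().Update(ctx, webhookConfig); err != nil {
	return ctrl.Result{}, err
}
// No error returned, so no exponential backoff -- just a calm periodic re-check.
return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
```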
```go
if err := r.Update(ctx, &server); err != nil {
	logger.Error(err, "Failed to update MCPServer annotation", "mcpserver", server.Name)
}
```
This error is logged but not returned. If the annotation update fails, the reconciler thinks everything went fine, the MCPServer never gets the hash annotation, and nothing triggers a retry.
Per our Go style rules: return errors by default, never silently swallow them. Either return the error or collect failures and return them after the loop.
```go
// Update condition if it changed
if conditionChanged {
	if err := r.Status().Update(ctx, webhookConfig); err != nil {
```
When hashChanged is true, handleConfigHashChange already does a Status().Update() (which bumps the resourceVersion on the API server). Then control falls through here and this second Status().Update() operates on a stale resourceVersion... so you'll get 409 Conflict errors.
The auth config controller avoids this by returning immediately from the hash change path. Consider doing the same, or consolidate all status mutations into a single write at the end of the reconcile loop.
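One way to sketch the early return (variable and helper names follow the comment above and may not match the PR exactly):

```go
// Sketch only: the hashChanged path writes status and returns immediately,
// so the condition update below never operates on a stale resourceVersion.
if hashChanged {
	if err := r.handleConfigHashChange(ctx, webhookConfig); err != nil {
		return ctrl.Result{}, err
	}
	return ctrl.Result{}, nil // status already written; skip further updates
}
```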
```go
// Update status to reflect the error
mcpServer.Status.Phase = mcpv1alpha1.MCPServerPhaseFailed
```
Every other handler in the reconcile loop (handleExternalAuthConfig, handleTelemetryConfig, handleToolConfig, handleOIDCConfig) calls setReadyCondition when failing. This one only sets Phase. So a webhook config failure won't show up in the Ready condition that users and tooling check via kubectl get.
Add a setReadyCondition(mcpServer, metav1.ConditionFalse, mcpv1alpha1.ConditionReasonNotReady, err.Error()) call here to keep it consistent with the rest.
```go
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:shortName=mwc
```
Missing `categories=toolhive` here. Every other CRD in the project has it, and without it `kubectl get toolhive` won't list MCPWebhookConfig resources. Should be:

```go
// +kubebuilder:resource:shortName=mwc,categories=toolhive
```
```go
// ReferencingServers lists the names of MCPServers currently using this configuration
// +optional
ReferencingServers []string `json:"referencingServers,omitempty"`
```
All the other config CRDs (auth, OIDC, telemetry, tool) use []WorkloadReference with +listType=map and +listMapKey=name for this field. Using []string here is inconsistent and would need a breaking API change if MCPRemoteProxy or VirtualMCPServer ever gains webhook support.
Would be good to match the established pattern here.
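For illustration, the field reshaped to the established pattern (assuming the shared `WorkloadReference` type the other config CRDs use):

```go
// ReferencingServers lists the workloads currently using this configuration.
// +optional
// +listType=map
// +listMapKey=name
ReferencingServers []WorkloadReference `json:"referencingServers,omitempty"`
```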
```go
for name, ref := range secretsToExpose {
	envVarName := fmt.Sprintf("TOOLHIVE_SECRET_%s", name)
```
Two things here:

1. The map iteration is non-deterministic, so the env var order will be random on each reconcile. `deploymentNeedsUpdate` uses `DeepEqual` on the env var slice, so this will trigger spurious pod restarts. Sort the output by env var name before returning (or use `slices.SortFunc`).

2. The env var name has no sanitization. The existing `externalauth.go` uses `strings.ToUpper` and `envVarSanitizer.ReplaceAllString` to produce valid POSIX env var names. A secret named `my-hmac.key` would produce `TOOLHIVE_SECRET_my-hmac.key` here... which is not a valid env var name. Also worth adding a distinguishing prefix (like `WEBHOOK_HMAC_` instead of just `SECRET_`) to avoid collisions with other `TOOLHIVE_SECRET_*` vars from the auth side.
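A stdlib sketch combining both fixes; the `webhookEnvVars` helper and the `TOOLHIVE_WEBHOOK_HMAC_` prefix are illustrative, not this PR's actual names:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// envVarSanitizer replaces anything outside [A-Z0-9_] with an underscore,
// yielding valid POSIX environment variable names.
var envVarSanitizer = regexp.MustCompile(`[^A-Z0-9_]`)

// webhookEnvVars renders "NAME=value" pairs with sanitized, prefixed names,
// sorted so repeated reconciles produce identical slices.
func webhookEnvVars(secrets map[string]string) []string {
	out := make([]string, 0, len(secrets))
	for name, val := range secrets {
		key := "TOOLHIVE_WEBHOOK_HMAC_" + envVarSanitizer.ReplaceAllString(strings.ToUpper(name), "_")
		out = append(out, key+"="+val)
	}
	sort.Strings(out) // deterministic order avoids spurious DeepEqual diffs
	return out
}

func main() {
	fmt.Println(webhookEnvVars(map[string]string{"my-hmac.key": "x"}))
	// [TOOLHIVE_WEBHOOK_HMAC_MY_HMAC_KEY=x]
}
```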
```go
// +kubebuilder:validation:Enum=fail;ignore
// +kubebuilder:default=fail
// +optional
FailurePolicy webhook.FailurePolicy `json:"failurePolicy,omitempty"`
```
Nit: API types generally shouldn't import runtime implementation packages. The existing pattern (e.g., ExternalAuthType in the auth CRD) is to define the type locally in the API package. Importing pkg/webhook here creates a dependency from the API layer into the implementation layer, which can cause import cycle headaches down the road.
Consider defining FailurePolicy as a local type in api/v1alpha1/.
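A sketch of a locally defined type plus the case normalization the PR description mentions for `buildWebhookConfig` (all names here are illustrative):

```go
package main

import (
	"fmt"
	"strings"
)

// FailurePolicy is defined in the API package itself, so api/v1alpha1
// carries no import of pkg/webhook. Sketch only.
type FailurePolicy string

const (
	FailurePolicyFail   FailurePolicy = "fail"
	FailurePolicyIgnore FailurePolicy = "ignore"
)

// NormalizeFailurePolicy lower-cases the value so the thv-proxy runner
// accepts either casing written in the CRD.
func NormalizeFailurePolicy(p FailurePolicy) FailurePolicy {
	return FailurePolicy(strings.ToLower(string(p)))
}

func main() {
	fmt.Println(NormalizeFailurePolicy("Fail"))
	// fail
}
```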
Hey @JAORMX

@Sanskarzz completely understandable. I'm sorry you're dealing with such a thing. I wish you and your family the best and hope everything turns out OK. Don't worry about coming back to this, it can wait til you have more time, or I can pick it up. You'd, of course, keep credit for this work since you started it. Thanks for the heads up.
[WIP]
Summary
This PR implements the fifth phase of the dynamic webhook middleware configuration system (RFC THV-0017), introducing Kubernetes custom resource definitions (CRDs), their respective controller reconciling mechanisms, and integration into the core `MCPServer` lifecycle.

Fixes #3401
Large PR Justification
This is a new feature package with a large test suite, and it needs to land as one coherent phase.
Key Changes
MCPWebhookConfig CRD Creation
- Added the `MCPWebhookConfig` CRD in `api/v1alpha1` matching the specifications described in RFC THV-0017.
- Supports `Validating` and `Mutating` webhooks.
- Supports `HMACSecretRef` for signing request payloads.
- Supports `TLSConfig` (CA, Client Cert, and Key secrets) for rigorous mTLS connections.
- Supports `fail`/`ignore` for `FailurePolicy` to align with the runner's runtime validation requirements.

Controller Logic and Finalizers
- Implemented `MCPWebhookConfigReconciler` in `cmd/thv-operator/controllers/`.
- Tracks `.Status.ConfigHash`, calculating changes to the configuration.
- Tracks referencing `MCPServers` via `.Status.ReferencingServers`.
- Added a finalizer that blocks deletion of an `MCPWebhookConfig` while actively referenced by an `MCPServer`.

MCPServer Controller Integration
- Added `WebhookConfigRef` natively into `MCPServerSpec`.
- Extended `MCPServerStatus` to explicitly trace configuration hashes linked via annotation hooks.
- Updated change detection (`deploymentNeedsUpdate`) to trace webhook Secret updates.
- Updated `createRunConfigFromMCPServer` to evaluate and translate webhook settings locally using newly extracted utility functions in `pkg/controllerutil/webhook.go`.
- Normalized `FailurePolicy` in `buildWebhookConfig` to ensure compatibility with the `thv-proxy` runner, regardless of the case used in the CRD.

Testing and Verification
- Unit tests covering the types (`mcpwebhookconfig_types_test.go`), the controller logic (`mcpwebhookconfig_controller_test.go`), and utilities (`webhook_test.go`).
- `chainsaw` tests ensuring valid configurations proceed through creation securely, rejecting any malformed specs early on with CEL validation.

Type of change
Test plan
- Unit tests (`task test`)
- E2E tests (`task test-e2e`)
- Lint (`task lint-fix`)

Manual Verification
Manual testing was performed using a local Kind cluster and the `fetch` MCPServer:
1. Deployed the operator locally with `task operator-deploy-local`.
2. Deployed an echo server with `kubectl apply -f manual-testing-phase5/echo-server.yaml`.
3. Created an `MCPWebhookConfig` pointing to the echo server with `insecureSkipVerify: true`.
4. Created a `fetch` MCPServer referencing the config.
5. Verified the controller reconciled the `MCPWebhookConfig` and generated a `configHash`.
6. Verified the `fetch` server picked up the configuration and started the `thv-proxy` runner.
7. Checked the `fetch` pod logs and confirmed that the mutating webhook middleware was active and correctly invoking the echo server (resulting in "denied request" logs as expected, since the echo server doesn't return a valid `allowed: true` response).
8. Updated the `MCPWebhookConfig` (e.g., changed the failure policy or URL).
9. Confirmed the operator restarted the `fetch` pod automatically to load the new settings.