
Phase 5: Dynamic Webhook Middleware Kubernetes Controller #4564

Open
Sanskarzz wants to merge 1 commit into stacklok:main from Sanskarzz:dynamicwebhook5

Conversation


@Sanskarzz Sanskarzz commented Apr 4, 2026

[WIP]

Summary

This PR implements the fifth phase of the dynamic webhook middleware configuration system (RFC THV-0017), introducing Kubernetes custom resource definitions (CRDs), their respective controller reconciling mechanisms, and integration into the core MCPServer lifecycle.

Fixes #3401

Large PR Justification

This is a new feature package with a large test suite, and it needs to land as one coherent phase.

Key Changes

  1. MCPWebhookConfig CRD Creation

    • Introduced MCPWebhookConfig CRD in api/v1alpha1 matching the specifications described in RFC THV-0017.
    • Allows users to declaratively specify sets of Validating and Mutating webhooks.
    • Includes full configuration for security integrations:
      • HMACSecretRef for signing request payloads.
      • TLSConfig (CA, Client Cert, and Key secrets) for rigorous mTLS connections.
    • Fix: Updated CRD markers to use lowercase fail/ignore for FailurePolicy to align with the runner's runtime validation requirements.
  2. Controller Logic and Finalizers

    • Created the MCPWebhookConfigReconciler in cmd/thv-operator/controllers/.
    • The controller computes and maintains .Status.ConfigHash to detect changes to the configuration.
    • Adds finalizers to guard cross-references and tracks all referencing MCPServers via .Status.ReferencingServers.
    • Integrated safety guards preventing the deletion of an MCPWebhookConfig while actively referenced by an MCPServer.
  3. MCPServer Controller Integration

    • Embedded WebhookConfigRef natively into MCPServerSpec.
    • Updated MCPServerStatus to track the applied webhook configuration hash, linked via annotations.
    • Adapted the pod environment builder (deploymentNeedsUpdate) to detect webhook Secret updates.
    • Upgraded createRunConfigFromMCPServer to evaluate and translate webhook settings locally using newly extracted utility functions in pkg/controllerutil/webhook.go.
    • Fix: Implemented robust lowercasing of FailurePolicy in buildWebhookConfig to ensure compatibility with the thv-proxyrunner, regardless of the case used in the CRD.
  4. Testing and Verification

    • Added robust unit test coverage confirming behavior for mcpwebhookconfig_types_test.go, the controller logic (mcpwebhookconfig_controller_test.go), and utilities (webhook_test.go).
    • Introduced comprehensive end-to-end chainsaw tests ensuring valid configurations are created successfully and malformed specs are rejected early by CEL validation rules.
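The FailurePolicy lowercasing fix from item 3 can be sketched as a small normalization helper. This is illustrative only — the function name and exact behavior are assumptions, not the actual buildWebhookConfig code:

```go
package main

import (
	"fmt"
	"strings"
)

// normalizeFailurePolicy lowercases a CRD-supplied value so the proxy
// runner, which validates against "fail"/"ignore", accepts it regardless
// of the casing used in the manifest. Illustrative sketch only.
func normalizeFailurePolicy(p string) (string, error) {
	switch strings.ToLower(p) {
	case "fail", "ignore":
		return strings.ToLower(p), nil
	case "":
		return "fail", nil // mirror the CRD default
	default:
		return "", fmt.Errorf("unsupported failurePolicy %q", p)
	}
}

func main() {
	v, _ := normalizeFailurePolicy("Fail")
	fmt.Println(v) // fail
}
```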

Type of change

  • Bug fix
  • New feature
  • Refactoring (no behavior change)
  • Dependency update
  • Documentation
  • Other (describe):

Test plan

  • Unit tests (task test)
  • E2E tests (task test-e2e)
  • Linting (task lint-fix)
  • Manual testing (describe below)

Manual Verification

Manual testing was performed using a local Kind cluster and the fetch MCPServer.

  1. Setup:
    • Deployed the operator using task operator-deploy-local.
    • Deployed an echo webhook server: kubectl apply -f manual-testing-phase5/echo-server.yaml.
     spec:
       containers:
       - name: echo
         image: ealen/echo-server:latest
    
  2. Configuration:
    • Created an MCPWebhookConfig pointing to the echo server with insecureSkipVerify: true.
    • Created a fetch MCPServer referencing the config.
  3. Execution:
    • Verified that the operator successfully reconciled the MCPWebhookConfig and generated a configHash.
    • Verified that the fetch server picked up the configuration and started the thv-proxyrunner.
    • Result: Inspected the fetch pod logs and confirmed that the mutating webhook middleware was active and correctly invoking the echo server (resulting in "denied request" logs as expected since the echo server doesn't return a valid allowed: true response).
  4. Dynamic Updates:
    • Updated the MCPWebhookConfig (e.g., changed the failure policy or URL).
    • Verified that the operator detected the change and restarted the fetch pod automatically to load the new settings.
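The MCPWebhookConfig used in step 2 looked roughly like the following. This is illustrative only — the apiVersion and exact field names come from the CRD schema introduced in this PR and may differ:

```yaml
apiVersion: toolhive.stacklok.dev/v1alpha1
kind: MCPWebhookConfig
metadata:
  name: echo-webhooks
spec:
  mutating:
    - name: echo
      url: http://echo-server.default.svc.cluster.local/mutate
      failurePolicy: fail
      tls:
        insecureSkipVerify: true
```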

@github-actions github-actions bot added the size/XL Extra large PR: 1000+ lines changed label Apr 13, 2026

@github-actions github-actions bot left a comment


Large PR Detected

This PR exceeds 1000 lines of changes and requires justification before it can be reviewed.

How to unblock this PR:

Add a section to your PR description with the following format:

## Large PR Justification

[Explain why this PR must be large, such as:]
- Generated code that cannot be split
- Large refactoring that must be atomic
- Multiple related changes that would break if separated
- Migration or data transformation

Alternative:

Consider splitting this PR into smaller, focused changes (< 1000 lines each) for easier review and reduced risk.

See our Contributing Guidelines for more details.


This review will be automatically dismissed once you add the justification section.

@github-actions
✅ Large PR justification has been provided. The size review has been dismissed and this PR can now proceed with normal review.

@github-actions github-actions bot dismissed their stale review April 13, 2026 11:47

Large PR justification has been provided. Thank you!


codecov bot commented Apr 13, 2026

Codecov Report

❌ Patch coverage is 72.93233% with 72 lines in your changes missing coverage. Please review.
✅ Project coverage is 69.07%. Comparing base (bfac267) to head (c55d5c7).
⚠️ Report is 3 commits behind head on main.

Files with missing lines Patch % Lines
...perator/controllers/mcpwebhookconfig_controller.go 64.34% 29 Missing and 12 partials ⚠️
...d/thv-operator/controllers/mcpserver_controller.go 57.89% 18 Missing and 6 partials ⚠️
cmd/thv-operator/main.go 0.00% 5 Missing ⚠️
...md/thv-operator/controllers/mcpserver_runconfig.go 50.00% 1 Missing and 1 partial ⚠️
Additional details and impacted files
@@           Coverage Diff            @@
##             main    #4564    +/-   ##
========================================
  Coverage   69.07%   69.07%            
========================================
  Files         531      534     +3     
  Lines       55170    55436   +266     
========================================
+ Hits        38110    38294   +184     
- Misses      14132    14197    +65     
- Partials     2928     2945    +17     


@Sanskarzz Sanskarzz force-pushed the dynamicwebhook5 branch 2 times, most recently from 37488cf to cf06a57 Compare April 15, 2026 13:18
@Sanskarzz Sanskarzz force-pushed the dynamicwebhook5 branch 2 times, most recently from 4588a30 to 31a895d Compare April 15, 2026 15:07
Signed-off-by: Sanskarzz <sanskar.gur@gmail.com>
@Sanskarzz Sanskarzz marked this pull request as ready for review April 15, 2026 19:01

@claude claude bot left a comment


Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@Sanskarzz

Hey @JAORMX I need some guidance on fixing the E2E tests for this PR.


JAORMX commented Apr 16, 2026

So, I dug into the CI failures on this PR and there are three issues causing the operator e2e tests to fail. All three are in the test code, not the controller logic itself.

1. Finalizer deadlock during chainsaw cleanup

This is the big one. The mcpwebhookconfig-reconciliation test creates an MCPServer that references an MCPWebhookConfig. The controller correctly adds a finalizer to prevent deletion while referenced... but chainsaw cleans up in reverse creation order. So it tries to delete the MCPWebhookConfig first, the finalizer blocks it because the MCPServer still exists, and boom... context deadline exceeded after ~48 seconds.

The fix: add explicit cleanup ordering so the MCPServer gets deleted before the MCPWebhookConfig. The finalizer logic is actually working correctly here. It's just the test that doesn't account for it.

2. failurePolicy case mismatch in the validation test

The CRD enum is fail/ignore (lowercase), but the validation test uses Fail and Ignore (capitalized). The API server will reject those values outright since they're not in the enum. The test step is called accept-valid-webhooks and expects success... which it won't get.

3. Empty spec vs CEL rule

The CRD has a CEL validation rule: size(self.validating) + size(self.mutating) > 0. But the validation test has a step called accept-empty-webhook that applies spec: {} and expects it to succeed. That's 0 + 0 > 0 which is false, so the API server rejects it. Either the CEL rule needs to go (if empty specs should be valid) or the test needs to expect rejection.
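If the rule stays, it would typically be attached to the spec type as a kubebuilder CEL marker along these lines (marker reproduced from the rule quoted above; the message text is illustrative):

```go
// +kubebuilder:validation:XValidation:rule="size(self.validating) + size(self.mutating) > 0",message="at least one validating or mutating webhook must be specified"
type MCPWebhookConfigSpec struct {
	// ...
}
```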

Note that the api-workloads failure is a pre-existing flaky test unrelated to this PR.


@JAORMX JAORMX left a comment


Thanks for the work here! The overall structure is solid and the CRD design makes sense. The main theme across most of my comments is: the issue says to follow the MCPExternalAuthConfig pattern, and there are a few spots where this controller diverges from it in ways that cause real problems.

The biggest ones:

  1. Missing MCPServer watch in SetupWithManager -- this is what causes the chainsaw e2e test to fail. Without it, creating an MCPWebhookConfig won't trigger re-reconciliation of a failed MCPServer that references it. The ReferencingServers status also goes stale.

  2. handleDeletion returns an error instead of RequeueAfter + condition -- this triggers exponential backoff and is the other half of why the chainsaw cleanup times out at ~48 seconds.

  3. Sequential Status().Update() calls in the Reconcile loop will produce 409 Conflict errors in practice. The auth config controller avoids this by returning immediately from the hash-change path.

  4. handleWebhookConfig doesn't call setReadyCondition -- breaks the pattern every other handler follows.

On the security side, GenerateWebhookEnvVars needs the same sanitization that externalauth.go applies to env var names, and the non-deterministic map iteration will cause spurious pod restarts.

The chainsaw tests also have a couple of issues I mentioned in a separate comment (failurePolicy casing, empty spec vs CEL rule).

Looking forward to the next iteration!

Comment on lines +248 to +251
func (r *MCPWebhookConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
return ctrl.NewControllerManagedBy(mgr).
For(&mcpv1alpha1.MCPWebhookConfig{}).
Complete(r)

So, this is the big one. The MCPExternalAuthConfig controller watches MCPServer (and MCPRemoteProxy) changes via handler.EnqueueRequestsFromMapFunc in its SetupWithManager. That's how ReferencingWorkloads stays up to date when servers come and go.

This controller only watches its own type. That means:

  1. ReferencingServers won't update when an MCPServer is created or deleted
  2. When a new MCPWebhookConfig is created that an existing (failed) MCPServer references, there's no trigger to re-reconcile that MCPServer
  3. The chainsaw reconciliation test will hang waiting for the MCPServer to enter Running... because nothing wakes it up

Take a look at how the auth config controller does it (mapMCPServerToExternalAuthConfig + the Watches call). That's the pattern we need here.
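A minimal sketch of that wiring, assuming the controller-runtime v0.15+ `Watches` signature; the mapper name and the `WebhookConfigRef` field are illustrative, not the actual code in this PR:

```go
func (r *MCPWebhookConfigReconciler) SetupWithManager(mgr ctrl.Manager) error {
	return ctrl.NewControllerManagedBy(mgr).
		For(&mcpv1alpha1.MCPWebhookConfig{}).
		// Re-reconcile the webhook config whenever a referencing MCPServer changes.
		Watches(
			&mcpv1alpha1.MCPServer{},
			handler.EnqueueRequestsFromMapFunc(r.mapMCPServerToWebhookConfig),
		).
		Complete(r)
}

// mapMCPServerToWebhookConfig (illustrative name) maps an MCPServer event to
// a reconcile request for the MCPWebhookConfig it references, if any.
func (r *MCPWebhookConfigReconciler) mapMCPServerToWebhookConfig(_ context.Context, obj client.Object) []reconcile.Request {
	server, ok := obj.(*mcpv1alpha1.MCPServer)
	if !ok || server.Spec.WebhookConfigRef == nil {
		return nil
	}
	return []reconcile.Request{{NamespacedName: types.NamespacedName{
		Namespace: server.Namespace,
		Name:      server.Spec.WebhookConfigRef.Name,
	}}}
}
```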

Comment on lines +186 to +189
}

return ctrl.Result{}, fmt.Errorf("MCPWebhookConfig %s is still referenced by MCPServers: %v",
webhookConfig.Name, serverNames)

Returning an error here triggers controller-runtime's exponential backoff. So you'll get retries at 1s, 2s, 4s, 8s... eventually backing off to 16+ minutes between checks. That's also why the chainsaw cleanup times out after ~48 seconds.

The auth config controller handles this differently: it sets a DeletionBlocked condition (so users can see why deletion is stuck via kubectl describe), then returns RequeueAfter: 30 * time.Second. No error, no backoff, just a steady re-check.

Note that this is an expected state, not an error. The controller is doing its job by blocking deletion while references exist. It should communicate that calmly, not panic about it :)
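A sketch of the suggested shape for that branch of handleDeletion — the condition type, reason strings, and 30-second interval mirror the description above, and the helper names are illustrative:

```go
if len(referencingServers) > 0 {
	// Surface why deletion is stuck via `kubectl describe`.
	meta.SetStatusCondition(&webhookConfig.Status.Conditions, metav1.Condition{
		Type:    "DeletionBlocked",
		Status:  metav1.ConditionTrue,
		Reason:  "ReferencedByMCPServers",
		Message: fmt.Sprintf("still referenced by MCPServers: %v", referencingServers),
	})
	if err := r.Status().Update(ctx, webhookConfig); err != nil {
		return ctrl.Result{}, err
	}
	// Expected state, not an error: steady re-checks, no exponential backoff.
	return ctrl.Result{RequeueAfter: 30 * time.Second}, nil
}
```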

Comment on lines +152 to +155

if err := r.Update(ctx, &server); err != nil {
logger.Error(err, "Failed to update MCPServer annotation", "mcpserver", server.Name)
}

This error is logged but not returned. If the annotation update fails, the reconciler thinks everything went fine, the MCPServer never gets the hash annotation, and nothing triggers a retry.

Per our Go style rules: return errors by default, never silently swallow them. Either return the error or collect failures and return them after the loop.

Comment on lines +92 to +97
}
}

// Update condition if it changed
if conditionChanged {
if err := r.Status().Update(ctx, webhookConfig); err != nil {

When hashChanged is true, handleConfigHashChange already does a Status().Update() (which bumps the resourceVersion on the API server). Then control falls through here and this second Status().Update() operates on a stale resourceVersion... so you'll get 409 Conflict errors.

The auth config controller avoids this by returning immediately from the hash change path. Consider doing the same, or consolidate all status mutations into a single write at the end of the reconcile loop.

Comment on lines +274 to +275
// Update status to reflect the error
mcpServer.Status.Phase = mcpv1alpha1.MCPServerPhaseFailed

Every other handler in the reconcile loop (handleExternalAuthConfig, handleTelemetryConfig, handleToolConfig, handleOIDCConfig) calls setReadyCondition when failing. This one only sets Phase. So a webhook config failure won't show up in the Ready condition that users and tooling check via kubectl get.

Add a setReadyCondition(mcpServer, metav1.ConditionFalse, mcpv1alpha1.ConditionReasonNotReady, err.Error()) call here to keep it consistent with the rest.

//
// +kubebuilder:object:root=true
// +kubebuilder:subresource:status
// +kubebuilder:resource:shortName=mwc

Missing categories=toolhive here. Every other CRD in the project has it, and without it kubectl get toolhive won't list MCPWebhookConfig resources. Should be:

// +kubebuilder:resource:shortName=mwc,categories=toolhive


// ReferencingServers lists the names of MCPServers currently using this configuration
// +optional
ReferencingServers []string `json:"referencingServers,omitempty"`

All the other config CRDs (auth, OIDC, telemetry, tool) use []WorkloadReference with +listType=map and +listMapKey=name for this field. Using []string here is inconsistent and would need a breaking API change if MCPRemoteProxy or VirtualMCPServer ever gains webhook support.

Would be good to match the established pattern here.

Comment on lines +94 to +95
for name, ref := range secretsToExpose {
envVarName := fmt.Sprintf("TOOLHIVE_SECRET_%s", name)

Two things here:

  1. The map iteration is non-deterministic, so the env var order will be random on each reconcile. deploymentNeedsUpdate uses DeepEqual on the env var slice, so this will trigger spurious pod restarts. Sort the output by env var name before returning (or use slices.SortFunc).

  2. The env var name has no sanitization. The existing externalauth.go uses strings.ToUpper and envVarSanitizer.ReplaceAllString to produce valid POSIX env var names. A secret named my-hmac.key would produce TOOLHIVE_SECRET_my-hmac.key here... which is not a valid env var name. Also worth adding a distinguishing prefix (like WEBHOOK_HMAC_ instead of just SECRET_) to avoid collisions with other TOOLHIVE_SECRET_* vars from the auth side.
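Both points can be addressed with a sanitizer regex plus a sort before returning. A stdlib sketch of the idea — the prefix and function name are illustrative, and the real sanitizer lives in the operator's externalauth.go:

```go
package main

import (
	"fmt"
	"regexp"
	"sort"
	"strings"
)

// Characters outside [A-Z0-9_] are not valid in POSIX env var names.
var envVarSanitizer = regexp.MustCompile(`[^A-Z0-9_]`)

// webhookEnvVars (illustrative name) uppercases and sanitizes each secret
// name, applies a webhook-specific prefix to avoid collisions with other
// TOOLHIVE_SECRET_* vars, and sorts the result so repeated reconciles
// produce an identical slice for DeepEqual comparisons.
func webhookEnvVars(secrets map[string]string) []string {
	out := make([]string, 0, len(secrets))
	for name, val := range secrets {
		key := envVarSanitizer.ReplaceAllString(strings.ToUpper(name), "_")
		out = append(out, fmt.Sprintf("TOOLHIVE_WEBHOOK_HMAC_%s=%s", key, val))
	}
	sort.Strings(out) // deterministic order across reconciles
	return out
}

func main() {
	fmt.Println(webhookEnvVars(map[string]string{"my-hmac.key": "s3cret"}))
}
```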

Comment on lines +56 to +59
// +kubebuilder:validation:Enum=fail;ignore
// +kubebuilder:default=fail
// +optional
FailurePolicy webhook.FailurePolicy `json:"failurePolicy,omitempty"`

Nit: API types generally shouldn't import runtime implementation packages. The existing pattern (e.g., ExternalAuthType in the auth CRD) is to define the type locally in the API package. Importing pkg/webhook here creates a dependency from the API layer into the implementation layer, which can cause import cycle headaches down the road.

Consider defining FailurePolicy as a local type in api/v1alpha1/.
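A local definition matching the CRD enum might look like this (const names are illustrative, mirroring the ExternalAuthType pattern rather than the actual code):

```go
package main

import "fmt"

// FailurePolicy defined locally in the API package, so api/v1alpha1 does
// not import the pkg/webhook implementation. Values match the CRD enum.
type FailurePolicy string

const (
	FailurePolicyFail   FailurePolicy = "fail"
	FailurePolicyIgnore FailurePolicy = "ignore"
)

func main() {
	fmt.Println(FailurePolicyFail, FailurePolicyIgnore)
}
```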

@Sanskarzz

Hey @JAORMX
I wanted to let you know that I’m currently dealing with a family medical emergency, so I haven’t been able to start working on the review comments for this PR. I’ll do my best to push review commits in the next 4–5 days.


JAORMX commented Apr 20, 2026

@Sanskarzz completely understandable. I'm sorry you're dealing with such a thing. I wish you and your family the best and hope everything turns out OK. Don't worry about coming back to this, it can wait til you have more time, or I can pick it up. You'd, of course, keep credit for this work since you started it.

Thanks for the heads up.


Labels

size/XL Extra large PR: 1000+ lines changed


Development

Successfully merging this pull request may close these issues.

Webhook Middleware Phase 5: Kubernetes CRD and controller integration
