Skip to content

Add replicas and sessionStorage to MCPRemoteProxy for HA#5237

Open
aron-muon wants to merge 8 commits into
stacklok:mainfrom
aron-muon:add-mcpremoteproxy-session-storage
Open

Add replicas and sessionStorage to MCPRemoteProxy for HA#5237
aron-muon wants to merge 8 commits into
stacklok:mainfrom
aron-muon:add-mcpremoteproxy-session-storage

Conversation

@aron-muon

@aron-muon aron-muon commented May 9, 2026

Copy link
Copy Markdown
Contributor

Summary

When MCPRemoteProxy runs with multiple replicas behind a load balancer that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple AZs), every non-initialize request fails with Session not found. The transparent proxy validates Mcp-Session-Id against pod-local in-memory state on every hop — transparent_proxy.go even calls this out:

// Guard: reject non-initialize requests with unknown session IDs.
// When multiple proxyrunner replicas share a Redis session store,
// a valid session will always be found.

On top of that, MCPRemoteProxy had no way to run multiple replicas in the first place: the operator hardcoded replicas: 1 on Deployment creation, so operators had to pin counts externally (e.g. a min=max HPA workaround that exploits the reconciler preserving the live count).

This PR completes the HA story by mirroring the two pieces MCPServer (#4368) and VirtualMCPServer (#4367) already have: a spec.replicas field and Redis-backed shared session storage.

What changed:

  • Add MCPRemoteProxySpec.Replicas — same *int32 semantics as VirtualMCPServer.spec.replicas: nil leaves Deployment.Spec.Replicas unset (apiserver defaults to 1; an HPA or other external controller owns scaling and is never fought by the reconciler), non-nil is operator-authoritative — set on create, drift-checked in deploymentNeedsUpdate, synced on update.
  • Add MCPRemoteProxySpec.SessionStorage — same SessionStorageConfig shape used by MCPServer and VirtualMCPServer. populateScalingConfigForRemoteProxy writes the non-sensitive Redis parameters into runner.ScalingConfig.SessionRedis (before PopulateMiddlewareConfigs, since rate limiting reads SessionRedis); buildRedisPasswordEnvVarForRemoteProxy injects THV_SESSION_REDIS_PASSWORD via SecretKeyRef so the password never lands in the RunConfig ConfigMap.
  • Surface the shared SessionStorageWarning condition when replicas > 1 without Redis session storage — status-condition parity with the MCPServer and VirtualMCPServer validators.
  • Cross-reference sessionAffinity and sessionStorage in the field docs — Redis-backed storage delivers HA only with sessionAffinity: None, and None without Redis is the exact failure mode this PR fixes.

The change is intentionally a near-verbatim mirror of the MCPServer / VirtualMCPServer implementations to keep review easy — it's the same pattern reviewers have already accepted twice.

Type of change

  • Bug fix
  • New feature

Test plan

  • Unit tests — full suite via task test. New coverage: TestPopulateScalingConfigForRemoteProxy (4 cases incl. password-never-serialized), TestBuildRedisPasswordEnvVarForRemoteProxy (4 cases), TestDeploymentForMCPRemoteProxyReplicas (nil/0/3), TestMCPRemoteProxyDeploymentNeedsUpdateReplicas (drift matrix), TestMCPRemoteProxyEnsureDeploymentReplicaSync (fake-cluster preserve-vs-sync), TestValidateSessionStorageForReplicasRemoteProxy (condition matrix)
  • Build verification — task build and task build-operator
  • CRDs/docs regenerated — task operator-generate, task operator-manifests, task crdref-gen (controller-gen v0.17.3)

API Compatibility

  • This PR does not break the v1beta1 API. Both added fields (spec.replicas, spec.sessionStorage) are optional; when omitted, behavior is identical to today (the only delta: a freshly created Deployment carries no explicit replicas: 1 and relies on the apiserver default of 1).

Changes

File Lines Change
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go +39 Add Replicas *int32 and SessionStorage *SessionStorageConfig with cross-referenced docs
cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go +10 Generated DeepCopy for both fields (controller-gen v0.17.3)
cmd/thv-operator/controllers/mcpremoteproxy_controller.go +53 Replica drift check in deploymentNeedsUpdate; non-nil sync / nil preserve on update; validateSessionStorageForReplicas advisory condition
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go +36 Deployment.Spec.Replicas from spec (was hardcoded 1); Redis password env var via SecretKeyRef
cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go +29 populateScalingConfigForRemoteProxy, called before PopulateMiddlewareConfigs
cmd/thv-operator/controllers/mcpremoteproxy_replicas_test.go +255 Replica create/drift/sync + condition tests
cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go +66 Redis password env var tests; nil-replicas assertion
cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go +86 TestPopulateScalingConfigForRemoteProxy
deploy/charts/operator-crds/{files,templates}/...mcpremoteproxies.yaml +162 ea Generated replicas + sessionStorage schema
docs/operator/crd-api.md +4/−1 Generated CRD API reference

Does this introduce a user-facing change?

Yes. Two new optional fields on MCPRemoteProxy:

  • spec.replicas — desired proxy replica count. Leave unset to keep scaling external (HPA-compatible: the operator never overwrites the live count); set it to make the operator authoritative.
  • spec.sessionStorage — Redis-backed shared session state across replicas, same shape as MCPServer.spec.sessionStorage. For HA set replicas >= 2, sessionStorage.provider: redis, and sessionAffinity: None; a SessionStorageWarning status condition flags replicas > 1 without Redis.

🤖 Generated with Claude Code

When MCPRemoteProxy runs with multiple replicas behind a load balancer
that doesn't preserve client-IP affinity (e.g. AWS ALB across multiple
AZs), every non-initialize request fails with `Session not found` because
the transparent proxy validates `Mcp-Session-Id` against pod-local
in-memory state on every hop. From transparent_proxy.go:

    // Guard: reject non-initialize requests with unknown session IDs.
    // When multiple proxyrunner replicas share a Redis session store,
    // a valid session will always be found.

The transport layer already supports a Redis-backed session store via
runner.ScalingConfig.SessionRedis — MCPServer and VirtualMCPServer wire
it through. MCPRemoteProxy simply never populated it.

This change ports the symmetric work from MCPServer (PR stacklok#4368) and
VirtualMCPServer (PR stacklok#4367) to MCPRemoteProxy:

- Add MCPRemoteProxySpec.SessionStorage field (same SessionStorageConfig
  shape used by MCPServer / VirtualMCPServer)
- populateScalingConfigForRemoteProxy: write the non-sensitive Redis
  parameters (address/db/keyPrefix) into runner.ScalingConfig.SessionRedis
- buildRedisPasswordEnvVarForRemoteProxy: inject THV_SESSION_REDIS_PASSWORD
  on the proxy Deployment via SecretKeyRef when sessionStorage.passwordRef
  is set, so the password never lands in the RunConfig ConfigMap

Tests:
- TestPopulateScalingConfigForRemoteProxy mirrors TestPopulateScalingConfig
  from mcpserver_runconfig_test.go (4 cases including a check that the
  password never leaks into the serialized SessionRedis)
- TestBuildRedisPasswordEnvVarForRemoteProxy mirrors TestBuildRedisPasswordEnvVar
  from virtualmcpserver_deployment_test.go (4 cases covering the matrix
  of nil/memory/redis-no-pwd/redis-with-pwd)

Generated:
- zz_generated.deepcopy.go (controller-gen v0.17.3)
- toolhive.stacklok.dev_mcpremoteproxies.yaml CRD schema (controller-gen v0.17.3)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size/M Medium PR: 300-599 lines changed label May 9, 2026
@codecov

codecov Bot commented May 9, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 69.73%. Comparing base (7d1ebf5) to head (20cfd13).

Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5237      +/-   ##
==========================================
+ Coverage   69.71%   69.73%   +0.02%     
==========================================
  Files         645      645              
  Lines       65598    65651      +53     
==========================================
+ Hits        45729    45782      +53     
- Misses      16522    16523       +1     
+ Partials     3347     3346       -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 9, 2026
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels May 9, 2026
@aron-muon aron-muon marked this pull request as ready for review May 9, 2026 06:27
@yrobla yrobla requested a review from Copilot May 14, 2026 08:25

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Wires Redis-backed session storage into MCPRemoteProxy so multi-replica proxy deployments behind load balancers without client-IP affinity can share session state, closing the parity gap with MCPServer (#4368) and VirtualMCPServer (#4367).

Changes:

  • Adds MCPRemoteProxySpec.SessionStorage (mirrors the existing SessionStorageConfig schema) and regenerates DeepCopy/CRD/docs.
  • Adds populateScalingConfigForRemoteProxy to copy address/db/keyPrefix into runner.ScalingConfig.SessionRedis (password intentionally excluded).
  • Adds buildRedisPasswordEnvVarForRemoteProxy to inject THV_SESSION_REDIS_PASSWORD via SecretKeyRef on the proxy Deployment, with corresponding unit tests.

Reviewed changes

Copilot reviewed 8 out of 9 changed files in this pull request and generated 1 comment.

Show a summary per file
File Description
cmd/thv-operator/api/v1beta1/mcpremoteproxy_types.go New optional SessionStorage field.
cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go Generated DeepCopy for the new field.
cmd/thv-operator/controllers/mcpremoteproxy_runconfig.go Populates ScalingConfig.SessionRedis from spec.
cmd/thv-operator/controllers/mcpremoteproxy_runconfig_test.go Unit tests for the new populator.
cmd/thv-operator/controllers/mcpremoteproxy_deployment.go Injects Redis password env var via SecretKeyRef.
cmd/thv-operator/controllers/mcpremoteproxy_deployment_test.go Unit tests covering env-var injection cases.
deploy/charts/operator-crds/files/crds/toolhive.stacklok.dev_mcpremoteproxies.yaml Regenerated CRD with sessionStorage subschema.
deploy/charts/operator-crds/templates/toolhive.stacklok.dev_mcpremoteproxies.yaml Template counterpart of the regenerated CRD.
docs/operator/crd-api.md Doc additions for the new field.
Files not reviewed (1)
  • cmd/thv-operator/api/v1beta1/zz_generated.deepcopy.go: Language not supported

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +170 to +181
// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
populateScalingConfigForRemoteProxy(runConfig, proxy)

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4c065e7.

@yrobla yrobla left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Functional Review — Wire SessionStorage into MCPRemoteProxy

Context

The fix addresses a real architectural gap. When MCPRemoteProxy runs with >1 replica behind a load balancer that lacks client-IP affinity (AWS ALB across AZs, GCP NEGs, etc.), every non-initialize request fails with Session not found because the transparent proxy validates Mcp-Session-Id against pod-local in-memory state. MCPServer and VirtualMCPServer already had the Redis-backed escape hatch; MCPRemoteProxy was the only component missing it.

What changed and does it make sense?

The PR adds three wiring pieces:

  1. MCPRemoteProxySpec.SessionStorage — same optional field shape used by MCPServer/VirtualMCPServer. Additive and backward-compatible. ✅
  2. populateScalingConfigForRemoteProxy — writes non-sensitive Redis params into runner.ScalingConfig.SessionRedis in the RunConfig ConfigMap. ✅
  3. buildRedisPasswordEnvVarForRemoteProxy — injects THV_SESSION_REDIS_PASSWORD as a SecretKeyRef pod env var, so the password never lands in the ConfigMap. ✅

The security split (connection params in ConfigMap, password as pod secret) matches the established pattern and is correct.

Bug: populateScalingConfigForRemoteProxy is called after PopulateMiddlewareConfigs

In mcpserver_runconfig.go the analogous call order is:

// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }

PopulateMiddlewareConfigsaddRateLimitMiddleware reads config.ScalingConfig.SessionRedis and errors when nil and rate limiting is configured. The PR reverses this in mcpremoteproxy_runconfig.go. MCPRemoteProxy has no rateLimiting field today so this is benign now, but it diverges from the documented invariant and will break the moment rate limiting is added to MCPRemoteProxy. See inline comment for suggested fix.

Design note: sessionAffinity / sessionStorage interaction needs documentation

MCPRemoteProxySpec.SessionAffinity defaults to ClientIP. The two fields now coexist without guidance:

sessionAffinity sessionStorage Result
ClientIP (default) nil / memory ✅ works (pod-sticky routing)
ClientIP redis Redis wired but Service stickiness makes it a no-op
None redis ✅ intended HA configuration
None nil / memory Session not found on any cross-pod request

The last row is the exact failure mode this PR fixes — but users can still hit it if they set sessionAffinity: None without configuring Redis. The SessionStorage field comment should mention setting sessionAffinity: None for the HA case, or a kubebuilder XValidation could enforce it at admission time (similar to how MCPServer validates that rate limiting requires Redis).

Alternative approaches

  • Consistent hashing at the ingress (hash Mcp-Session-Id to a specific pod): avoids Redis but requires custom LB config outside ToolHive's control.
  • Session migration between pods: more complex with no clear benefit over Redis here.

Redis is the right call — same decision already made for MCPServer and VirtualMCPServer, and the operational overhead is low (one Redis shared across all three CRD types).

Summary

Finding Severity Action
populateScalingConfigForRemoteProxy called after PopulateMiddlewareConfigs Medium — latent bug Reorder above PopulateMiddlewareConfigs
No guidance on sessionAffinity + sessionStorage interaction Low — docs/UX Add cross-reference in field comments
Implementation correctness, security split, API compatibility No action needed

// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
populateScalingConfigForRemoteProxy(runConfig, proxy)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ordering bug (latent): populateScalingConfigForRemoteProxy must be called before PopulateMiddlewareConfigs, not after.

PopulateMiddlewareConfigsaddRateLimitMiddleware reads config.ScalingConfig.SessionRedis and returns an error when it is nil and a rate-limit config is present. The equivalent call in mcpserver_runconfig.go explicitly documents this:

// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfig(runConfig, m)
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil { ... }

MCPRemoteProxy has no rateLimiting field today so this is harmless right now, but the moment rate limiting is added (a natural follow-on given MCPServer already has it) this ordering will cause a confusing runtime error. Please swap the two blocks:

Suggested change
populateScalingConfigForRemoteProxy(runConfig, proxy)
// Populate ScalingConfig.SessionRedis from spec.sessionStorage so the
// proxy runner has the address/db/keyPrefix needed to construct a
// shared Redis-backed session store. The Redis password is intentionally
// excluded here — it is injected as the THV_SESSION_REDIS_PASSWORD env
// var by buildRedisPasswordEnvVar in mcpremoteproxy_deployment.go.
// Must run before PopulateMiddlewareConfigs because rate limiting reads SessionRedis.
populateScalingConfigForRemoteProxy(runConfig, proxy)
// Populate middleware configs from the configuration fields
// This ensures that middleware_configs is properly set for serialization
if err := runner.PopulateMiddlewareConfigs(runConfig); err != nil {
return nil, fmt.Errorf("failed to populate middleware configs: %w", err)
}
return runConfig, nil

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 4c065e7 — applied the suggestion as-is; populateScalingConfigForRemoteProxy now runs before PopulateMiddlewareConfigs, mirroring the MCPServer ordering.

// across replicas.
//
// Mirrors MCPServer.spec.sessionStorage and VirtualMCPServer.spec.sessionStorage.
// +optional

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider documenting the sessionAffinity + sessionStorage interaction.

These two fields now coexist without any cross-reference, which creates a footgun. When sessionStorage.provider == redis the correct HA setup also requires sessionAffinity: None — otherwise the k8s Service's ClientIP stickiness makes the Redis store redundant. Conversely, sessionAffinity: None without Redis is the broken configuration this PR was written to fix.

At minimum, the SessionStorage field comment should say: "When using the Redis provider, also set sessionAffinity: None so the Service routes requests round-robin and all replicas rely on the shared session store rather than pod-local state."

A stronger option is a kubebuilder XValidation that requires sessionAffinity == "None" when sessionStorage.provider == "redis", catching the misconfiguration at admission time.

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Documented in 9364db0 — both field comments now cross-reference each other (regenerated into the CRD schema and crd-api.md). Went with docs-only rather than the CEL rule: ClientIP + redis is still a valid combination (the shared store preserves sessions when a pod dies and affinity re-pins the client), so rejecting it at admission felt too strict.

aron-muon and others added 3 commits June 12, 2026 07:28
Resolve import conflict in mcpremoteproxy_deployment.go: keep both the
session import (this branch) and the wirefmt import (upstream stacklok#5239).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Address review feedback on stacklok#5237: PopulateMiddlewareConfigs ->
addRateLimitMiddleware reads ScalingConfig.SessionRedis and errors
when it is nil while a rate-limit config is present. MCPRemoteProxy
has no rate limiting yet, but the reversed ordering diverged from
the documented invariant in mcpserver_runconfig.go and would break
as soon as rate limiting is wired in. Mirror the MCPServer ordering.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
Address review feedback on stacklok#5237: the two fields coexisted with no
cross-reference. Redis-backed sessionStorage only delivers HA when
sessionAffinity is None, and sessionAffinity None without Redis is
the exact Session-not-found failure this feature fixes. Spell out
both directions in the field comments so the guidance surfaces in
the CRD schema and kubectl explain.

Generated with controller-gen v0.17.3 and crd-ref-docs.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added size/M Medium PR: 300-599 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jun 12, 2026
@aron-muon aron-muon requested a review from yrobla June 12, 2026 18:32
aron-muon and others added 2 commits June 12, 2026 19:42
MCPRemoteProxy was the only scalable type without a replicas field:
the operator hardcoded 1 on create, forcing operators to pin counts
with a min=max HPA workaround. Mirror the MCPServer/VirtualMCPServer
semantics so the HA story matches sessionStorage:

- Replicas *int32 on the spec; nil leaves Deployment.Spec.Replicas
  unset (apiserver default 1, HPA-compatible), non-nil is operator
  authoritative — set on create, drift-checked in
  deploymentNeedsUpdate, synced on update while nil preserves the
  live count.
- Surface the shared SessionStorageWarning condition when
  replicas > 1 without Redis session storage (status condition
  parity with MCPServer and VirtualMCPServer).
- Tests cover create, drift detection, the ensureDeployment
  preserve-vs-sync paths against a fake cluster, and the condition
  matrix. CRDs and docs regenerated (controller-gen v0.17.3).

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/M Medium PR: 300-599 lines changed labels Jun 12, 2026
@aron-muon aron-muon changed the title Wire SessionStorage into MCPRemoteProxy for HA support Add replicas and sessionStorage to MCPRemoteProxy for HA Jun 12, 2026
@aron-muon

Copy link
Copy Markdown
Contributor Author

Pushed d6264234fe: added spec.replicas alongside spec.sessionStorage, mirroring the VirtualMCPServer semantics — nil leaves scaling to an HPA/external controller (the reconciler never overwrites the live count), non-nil is operator-authoritative with drift reconciliation, and the shared SessionStorageWarning condition flags replicas > 1 without Redis. Previously the operator hardcoded replicas: 1, so multi-replica deployments required pinning the count with an external min=max HPA workaround. Also merged latest main. PR title/description updated to reflect the fuller HA scope.

@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jun 12, 2026
aron-muon added a commit to aron-muon/toolhive that referenced this pull request Jun 12, 2026
Document what this fork is: upstream main plus stripped CI publishing
to ghcr.io/aron-muon, and the MCPRemoteProxy HA patch pending upstream
PR stacklok#5237. Sets expectations that feature work goes upstream first.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
aron-muon added a commit to aron-muon/toolhive that referenced this pull request Jun 12, 2026
Carry spec.replicas + spec.sessionStorage for MCPRemoteProxy on the
fork until the upstream PR merges.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
The delete-workload spec hit its 60s Eventually timeout — the same
flaky spec previously tracked upstream. This PR only touches
operator code and generated CRDs, none of which is compiled into
the thv binary the api-workloads e2e suite exercises.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>
@github-actions github-actions Bot added size/L Large PR: 600-999 lines changed and removed size/L Large PR: 600-999 lines changed labels Jun 12, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

size/L Large PR: 600-999 lines changed

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants