fix(upstream): require N consecutive health-check timeouts before marking Error #470

Merged: Dumbris merged 1 commit into main from fix/health-check-flap on May 20, 2026

Dumbris (Member) commented May 15, 2026

Summary

The complement to #469. Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

  1. Tools list goes empty after every toggle. Frontend's syncAfterToolToggle() re-fetches immediately after a tool enable/disable. If a health check timed out in that window, StateView returned no tools and the UI showed "No tools available" until the next reconnect (~30-60s later).
  2. Scary "Server Error" alert. Combined with #469's MCPX_UNKNOWN_UNCLASSIFIED code, a single miss painted a red banner that paid no real-world dividend: by the next 30s tick the server was Ready again.

Approach

Add a small consecutiveHealthFailures counter to the managed Client.

  • Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold = 3 consecutive misses (~90 s) before flipping Error.
  • Hard errors (connection refused, no such host, network unreachable, connection reset, broken pipe) bypass the counter and trigger Error on the first miss — those are real outages and waiting helps no one.
  • A successful health check resets the counter to zero.
  • A fresh Connect() success resets the counter so reconnect cycles don't carry stale debt.
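
The four rules above can be sketched as a tiny state machine. This is a minimal illustration, not the code from this PR: the `client`, `state`, and `outcome` types and the `onHealthCheck`/`onConnect` method names are hypothetical stand-ins; only `consecutiveHealthFailures` and `healthCheckFailureThreshold` come from the description.

```go
package main

// healthCheckFailureThreshold mirrors the value named in the PR description.
const healthCheckFailureThreshold = 3

// state is a hypothetical stand-in for the managed client's connection state.
type state int

const (
	stateReady state = iota
	stateError
)

// outcome is a hypothetical classification of a single health-check result.
type outcome int

const (
	outcomeOK        outcome = iota // the check succeeded
	outcomeTransient                // deadline exceeded, timeout, context canceled
	outcomeHard                     // connection refused, no such host, and similar
)

// client is a hypothetical, stripped-down managed client.
type client struct {
	state                     state
	consecutiveHealthFailures int
}

// onHealthCheck applies one health-check outcome to the client.
func (c *client) onHealthCheck(o outcome) {
	switch o {
	case outcomeOK:
		// A successful check resets the counter to zero.
		c.consecutiveHealthFailures = 0
	case outcomeHard:
		// Hard errors bypass the counter and flip Error on the first miss.
		c.state = stateError
	case outcomeTransient:
		// Transient errors only flip Error once the threshold is reached.
		c.consecutiveHealthFailures++
		if c.consecutiveHealthFailures >= healthCheckFailureThreshold {
			c.state = stateError
		}
	}
}

// onConnect models a successful Connect(): the client is Ready again and
// no stale failure count carries over into the new connection.
func (c *client) onConnect() {
	c.state = stateReady
	c.consecutiveHealthFailures = 0
}
```

In this sketch two transient misses leave the client Ready with a counter of 2, and a third consecutive miss flips it to Error; a success in between starts the count over.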

The isTransientHealthCheckError helper is a strict subset of the existing isConnectionError predicate — it doesn't change which errors are considered connection failures, only whether they get the multi-failure tolerance.

Test plan

  • go test ./internal/upstream/managed/ -race — green, including the five new tests (four behavior cases plus the helper matrix):
    • TestHealthCheck_TransientTimeoutToleratedBelowThreshold — counter ticks but state stays Ready until threshold
    • TestHealthCheck_HardErrorTriggersImmediateError — connection-refused → Error on first miss
    • TestHealthCheck_SuccessResetsCounter — recovery wipes the slate
    • TestHealthCheck_ResetOnConnect — reconnect starts fresh
    • TestIsTransientHealthCheckError — matrix of error → category
  • go build ./... clean

What this does NOT do

It doesn't bump the per-call timeout (still 5s) or change the tick (still 30s). The time to flag a real outage that surfaces only as timeouts is now ~90s instead of ~30s; hard errors still flip Error on the first miss. That is an acceptable trade for eliminating false flap alerts on slow remote upstreams.

🤖 Generated with Claude Code


cloudflare-workers-and-pages Bot commented May 15, 2026

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: 64ebf76
Status: ✅  Deploy successful!
Preview URL: https://0c3d7f78.mcpproxy-docs.pages.dev
Branch Preview URL: https://fix-health-check-flap.mcpproxy-docs.pages.dev


…king Error

Slow remote upstreams (notably hf.co/mcp under load) routinely miss a
single 5-second health-check window without actually being down. The
previous code flipped the server to Error on the very first miss, which
caused two visible bugs:

1. The Web UI's tools list went empty for ~30-60s every time the user
   toggled a tool, because the post-toggle re-fetch hit the State=Error
   window where StateView returns no tools.
2. Combined with the unclassified-error code (PR #469), the user saw a
   red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no
   real-world dividend — by the next 30s tick the server was Ready again.

Add a small consecutive-failure counter to the managed Client. Transient
errors (deadline exceeded, timeout, context canceled) require
healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard
errors (connection refused, no such host, network unreachable, etc.)
bypass the counter and trigger Error on the first miss — those are real
outages and waiting helps no one. A successful health check or a fresh
Connect() resets the counter to zero.

Tests cover all four behaviors: tolerated transient, immediate hard,
success-resets, and connect-resets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@Dumbris Dumbris force-pushed the fix/health-check-flap branch from e57adae to 64ebf76 Compare May 20, 2026 05:52
@github-actions

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/health-check-flap

Available Artifacts

  • archive-darwin-amd64 (26 MB)
  • archive-darwin-arm64 (23 MB)
  • archive-linux-amd64 (15 MB)
  • archive-linux-arm64 (13 MB)
  • archive-windows-amd64 (26 MB)
  • archive-windows-arm64 (23 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (20 MB)
  • installer-dmg-darwin-arm64 (18 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26144177596 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

@Dumbris Dumbris merged commit 0cbdb89 into main May 20, 2026
23 checks passed