fix(upstream): require N consecutive health-check timeouts before marking Error by Dumbris · Pull Request #470 · smart-mcp-proxy/mcpproxy-go

Dumbris · 2026-05-15T07:40:04Z

Summary

The complement to #469. Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

Tools list goes empty after every toggle. The frontend's syncAfterToolToggle() re-fetches immediately after a tool enable/disable. If a health check timed out in that window, StateView returned no tools and the UI showed "No tools available" until the next reconnect (~30-60s later).
Scary "Server Error" alert. Combined with the MCPX_UNKNOWN_UNCLASSIFIED code (see #469, "fix(diagnostics): classify HTTP timeouts and string-wrapped 5xx as known codes"), a single miss painted a red banner that paid no real-world dividend: by the next 30s tick the server was Ready again.

Approach

Add a small consecutiveHealthFailures counter to the managed Client.

Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold = 3 consecutive misses (~90 s) before flipping Error.
Hard errors (connection refused, no such host, network unreachable, connection reset, broken pipe) bypass the counter and trigger Error on the first miss — those are real outages and waiting helps no one.
A successful health check resets the counter to zero.
A fresh Connect() success resets the counter so reconnect cycles don't carry stale debt.
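
The four rules above can be sketched as a small state machine. This is a minimal illustration, not the code in this PR: consecutiveHealthFailures and healthCheckFailureThreshold come from the description, while State, Outcome, OnConnect and RecordHealthCheck are hypothetical names introduced for the example, and the locking a real managed Client would presumably need is left out.

```go
package main

import "fmt"

// healthCheckFailureThreshold mirrors the value named in the PR description.
const healthCheckFailureThreshold = 3

// State is a hypothetical stand-in for the managed client's connection state.
type State int

const (
	StateReady State = iota
	StateError
)

func (s State) String() string {
	if s == StateError {
		return "Error"
	}
	return "Ready"
}

// Outcome is a hypothetical classification of one health-check result.
type Outcome int

const (
	OutcomeOK        Outcome = iota
	OutcomeTransient         // deadline exceeded, timeout, context canceled
	OutcomeHard              // connection refused, no such host, ...
)

// Client is a stripped-down, hypothetical managed client (no locking).
type Client struct {
	state                     State
	consecutiveHealthFailures int
}

// OnConnect models a fresh Connect() success: the counter starts from zero.
func (c *Client) OnConnect() {
	c.state = StateReady
	c.consecutiveHealthFailures = 0
}

// RecordHealthCheck applies one health-check outcome and returns the new state.
func (c *Client) RecordHealthCheck(o Outcome) State {
	switch o {
	case OutcomeOK:
		// A successful check wipes the accumulated misses.
		c.consecutiveHealthFailures = 0
	case OutcomeHard:
		// Hard errors bypass the counter entirely.
		c.state = StateError
	case OutcomeTransient:
		c.consecutiveHealthFailures++
		if c.consecutiveHealthFailures >= healthCheckFailureThreshold {
			c.state = StateError
		}
	}
	return c.state
}

func main() {
	c := &Client{}
	c.OnConnect()
	// The first two transient misses are tolerated; the third flips the state.
	for i := 1; i <= healthCheckFailureThreshold; i++ {
		fmt.Printf("miss %d -> state=%v\n", i, c.RecordHealthCheck(OutcomeTransient))
	}
}
```

Keeping the counter on the client, rather than in the health-check loop, lets both the success path and Connect() reset it without sharing extra state.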

The isTransientHealthCheckError helper is a strict subset of the existing isConnectionError predicate — it doesn't change which errors are considered connection failures, only whether they get the multi-failure tolerance.

Test plan

go test ./internal/upstream/managed/ -race is green, including the five new tests (four behavior cases plus the classifier matrix):
- TestHealthCheck_TransientTimeoutToleratedBelowThreshold — counter ticks but state stays Ready until threshold
- TestHealthCheck_HardErrorTriggersImmediateError — connection-refused → Error on first miss
- TestHealthCheck_SuccessResetsCounter — recovery wipes the slate
- TestHealthCheck_ResetOnConnect — reconnect starts fresh
- TestIsTransientHealthCheckError — matrix of error → category
go build ./... clean

What this does NOT do

It doesn't bump the per-call timeout (still 5s) or change the tick (still 30s). For an outage that shows up only as timeouts, the time until the server is flagged Error is now ~90s instead of ~30s, an acceptable trade for eliminating false flap alerts on slow remote upstreams. Hard errors still flag Error on the first miss.

🤖 Generated with Claude Code

cloudflare-workers-and-pages · 2026-05-15T07:41:25Z

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit:	`64ebf76`
Status:	✅ Deploy successful!
Preview URL:	https://0c3d7f78.mcpproxy-docs.pages.dev
Branch Preview URL:	https://fix-health-check-flap.mcpproxy-docs.pages.dev


…king Error

Slow remote upstreams (notably hf.co/mcp under load) routinely miss a single 5-second health-check window without actually being down. The previous code flipped the server to Error on the very first miss, which caused two visible bugs:

1. The Web UI's tools list went empty for ~30-60s every time the user toggled a tool, because the post-toggle re-fetch hit the State=Error window where StateView returns no tools.
2. Combined with the unclassified-error code (PR #469), the user saw a red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no real-world dividend: by the next 30s tick the server was Ready again.

Add a small consecutive-failure counter to the managed Client. Transient errors (deadline exceeded, timeout, context canceled) require healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard errors (connection refused, no such host, network unreachable, etc.) bypass the counter and trigger Error on the first miss; those are real outages and waiting helps no one. A successful health check or a fresh Connect() resets the counter to zero.

Tests cover all four behaviors: tolerated transient, immediate hard, success-resets, and connect-resets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

github-actions · 2026-05-20T06:03:36Z

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/health-check-flap

Available Artifacts

archive-darwin-amd64 (26 MB)
archive-darwin-arm64 (23 MB)
archive-linux-amd64 (15 MB)
archive-linux-arm64 (13 MB)
archive-windows-amd64 (26 MB)
archive-windows-arm64 (23 MB)
frontend-dist-pr (0 MB)
installer-dmg-darwin-amd64 (20 MB)
installer-dmg-darwin-arm64 (18 MB)

How to Download

Option 1: GitHub Web UI (easiest)

Go to the workflow run page linked above
Scroll to the bottom "Artifacts" section
Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26144177596 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris force-pushed the fix/health-check-flap branch from e57adae to 64ebf76 on May 20, 2026 05:52

Dumbris merged commit 0cbdb89 into main May 20, 2026
23 checks passed

Dumbris merged 1 commit into main from fix/health-check-flap
