fix(diagnostics): classify HTTP timeouts and string-wrapped 5xx as known codes #469

Merged: Dumbris merged 1 commit into main from fix/classifier-http-timeout on May 20, 2026

@Dumbris Dumbris commented May 15, 2026

Summary

Investigating why disabling tools in the huggingface server surfaced a scary "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert, I found the toggle itself works (200 OK), but the upstream's HTTP transport occasionally times out against hf.co/mcp. When that happens during the post-toggle re-fetch, the error reached the user wrapped as plain text — the classifier's typed paths (errors.Is, *statusError) never fired, so it fell through to UNCLASSIFIED.

This PR adds two recovery rules in classifyHTTP (the timeout rule has a typed and a string-matching variant):

  • context.DeadlineExceeded on http transport → new MCPX_HTTP_TIMEOUT (severity: warn — transient by nature)
  • substring "context deadline exceeded" on http transport → same code (catches the string-wrapped form the upstream layer emits)
  • "request failed with status NNN" / "notification failed with status NNN" → routed through the existing DiagnoseHTTPStatus(int) so 401/403/404/5xx all get their typed codes (no more "UNCLASSIFIED" for plain-string 504s)

Tests are taken verbatim from ~/Library/Logs/mcpproxy/server-hugginface.log so the rules cover the exact wire shape the field produces.

Why this matters

MCPX_UNKNOWN_UNCLASSIFIED renders as red "Server Error" with no remediation. MCPX_HTTP_TIMEOUT and MCPX_HTTP_5XX already have catalog entries with calmer messaging ("usually transient", "upstream status page link"), and HTTPTimeout is SeverityWarn so the UI can choose a yellow/degraded badge instead of a red error.

This is half of the user-visible fix discussed in the investigation thread. The other half — don't flap the server to Error state on a single 5-second health-check miss — is being addressed in a separate PR.

Test plan

  • go test ./internal/diagnostics/ -race — all green, including the three new cases (TestClassify_HTTP_Timeout, TestClassify_HTTP_TimeoutStringWrapped, TestClassify_HTTP_StatusFromText)
  • go build ./... clean
  • go test ./internal/runtime/supervisor/ -race — green (no churn on attach path)

🤖 Generated with Claude Code

cloudflare-workers-and-pages Bot commented May 15, 2026

Deploying mcpproxy-docs with Cloudflare Pages

Latest commit: 1d3e4cd
Status: ✅  Deploy successful!
Preview URL: https://cf2e7cb2.mcpproxy-docs.pages.dev
Branch Preview URL: https://fix-classifier-http-timeout.mcpproxy-docs.pages.dev

github-actions Bot commented May 15, 2026

📦 Build Artifacts

Workflow Run: View Run
Branch: fix/classifier-http-timeout

Available Artifacts

  • archive-darwin-amd64 (26 MB)
  • archive-darwin-arm64 (23 MB)
  • archive-linux-amd64 (15 MB)
  • archive-linux-arm64 (13 MB)
  • archive-windows-amd64 (26 MB)
  • archive-windows-arm64 (23 MB)
  • frontend-dist-pr (0 MB)
  • installer-dmg-darwin-amd64 (20 MB)
  • installer-dmg-darwin-arm64 (18 MB)

How to Download

Option 1: GitHub Web UI (easiest)

  1. Go to the workflow run page linked above
  2. Scroll to the bottom "Artifacts" section
  3. Click on the artifact you want to download

Option 2: GitHub CLI

gh run download 26144226306 --repo smart-mcp-proxy/mcpproxy-go

Note: Artifacts expire in 14 days.

Dumbris pushed a commit that referenced this pull request May 20, 2026
…king Error

Slow remote upstreams (notably hf.co/mcp under load) routinely miss a
single 5-second health-check window without actually being down. The
previous code flipped the server to Error on the very first miss, which
caused two visible bugs:

1. The Web UI's tools list went empty for ~30-60s every time the user
   toggled a tool, because the post-toggle re-fetch hit the State=Error
   window where StateView returns no tools.
2. Combined with the unclassified-error code (PR #469), the user saw a
   red "Server Error / MCPX_UNKNOWN_UNCLASSIFIED" alert that paid no
   real-world dividend — by the next 30s tick the server was Ready again.

Add a small consecutive-failure counter to the managed Client. Transient
errors (deadline exceeded, timeout, context canceled) require
healthCheckFailureThreshold=3 misses (~90s) before flipping Error. Hard
errors (connection refused, no such host, network unreachable, etc.)
bypass the counter and trigger Error on the first miss — those are real
outages and waiting helps no one. A successful health check or a fresh
Connect() resets the counter to zero.

Tests cover all four behaviors: tolerated transient, immediate hard,
success-resets, and connect-resets.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
…own codes

The HTTP transport adapter wraps context.DeadlineExceeded and non-2xx
responses as plain "transport error: ..." strings before bubbling them up.
The typed errors.Is / statusError paths the classifier relied on never
fire for those, so every hf.co timeout or 504 surfaced to the UI as
MCPX_UNKNOWN_UNCLASSIFIED ("Server Error") even though the cause was a
well-known transient upstream condition.

Add two recovery rules in classifyHTTP:
- context.DeadlineExceeded on http transport -> new MCPX_HTTP_TIMEOUT
  (severity: warn — usually transient, not a config bug)
- substring "context deadline exceeded" on http transport -> same
- "request failed with status NNN" / "notification failed with status NNN"
  -> route through DiagnoseHTTPStatus so 401/403/404/5xx all get their
  existing typed codes

Tests cover both the typed and stringified forms taken verbatim from
production server-hugginface.log.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
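
The status-from-text rule in the commit message above could be sketched as a small extraction step. This is an illustration under assumptions: `statusFromText` and `extractStatus` are hypothetical names, the sample messages in `main` are made up to match the two shapes quoted in the commit message, and the hand-off to DiagnoseHTTPStatus is only indicated in a comment because its return type is not shown in this PR.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// statusFromText matches the two plain-text shapes named in the commit
// message and captures the three-digit status code.
var statusFromText = regexp.MustCompile(`(?:request|notification) failed with status (\d{3})\b`)

// extractStatus pulls the HTTP status out of a string-wrapped error message.
// The second result is false when the message has neither shape.
func extractStatus(msg string) (int, bool) {
	m := statusFromText.FindStringSubmatch(msg)
	if m == nil {
		return 0, false
	}
	code, err := strconv.Atoi(m[1])
	if err != nil {
		return 0, false
	}
	return code, true
}

func main() {
	// Illustrative messages, not copied from a real log.
	for _, msg := range []string{
		"transport error: request failed with status 504",
		"transport error: notification failed with status 401",
		"transport error: context deadline exceeded",
	} {
		if code, ok := extractStatus(msg); ok {
			// In the PR, the code would be passed on to DiagnoseHTTPStatus(int).
			fmt.Println("status", code)
		} else {
			fmt.Println("no status in message")
		}
	}
}
```

Reusing the existing status-to-code mapping this way means the text path and the typed path cannot drift apart: both end at the same function.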
@Dumbris Dumbris force-pushed the fix/classifier-http-timeout branch from fc13dfe to 1d3e4cd Compare May 20, 2026 05:53
@Dumbris Dumbris merged commit fd78619 into main May 20, 2026
23 checks passed
Dumbris added a commit that referenced this pull request May 20, 2026
…king Error (#470)

Co-authored-by: Claude Code <noreply@anthropic.com>