fix: retry transient HTTP 400 errors from upstream providers #1

Open
BlueBoobyAI wants to merge 5 commits into main from fix/retry-400-transient

BlueBoobyAI (Owner) commented Jun 14, 2026

Problem

Claude Code sessions crash multiple times per day when proxied through free-claude-code to DeepSeek V4 Flash:

Invalid request sent to provider. Request ID: req_d10fcb8d9fb7

The user sees a fatal error in their terminal:

Provider request failed unexpectedly.
Request ID: req_7317b1104a14

Root cause: DeepSeek occasionally returns HTTP 400 on transient internal hiccups (not real request bugs). The proxy's retryable_upstream_status() function at providers/rate_limit.py:28 only treats HTTP 429 and 5xx as retryable — 400 passes through immediately, raising InvalidRequestError → session crash → user restarts Claude Code.

A /messages POST that is rejected with a 400 produces no completion and is not billed, so retrying it is safe.

Fix

providers/rate_limit.py — add if status == 400: return 400 to both branches of retryable_upstream_status():

# Before: 400 falls through to return None → immediate crash
if isinstance(exc, httpx.HTTPStatusError):
    status = exc.response.status_code
    if _upstream_http_retryable(status):  # only 429 + 5xx
        return status
    return None  # ← 400 dies here

# After: 400 enters the existing exponential-backoff retry loop
if isinstance(exc, httpx.HTTPStatusError):
    status = exc.response.status_code
    if _upstream_http_retryable(status):
        return status
    if status == 400:
        return 400  # ← transient 400 → retry
    return None

Important safety detail: HTTP 400 retries do NOT call set_blocked() on the shared GlobalRateLimiter. Unlike 429/5xx (which signal upstream congestion worth a global pause), a transient 400 is a per-request hiccup. A genuine bad request (wrong model name) retries with individual backoff but does not stall concurrent requests.
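The distinction above can be sketched as a small classifier. This is a minimal illustration, not the proxy's actual code: `RetryDecision`, `classify_status`, and the `block_globally` flag are hypothetical names. Only the status rules come from this description (429 and 5xx retry with a global pause, 400 retries without one, everything else surfaces immediately).

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class RetryDecision:
    """Hypothetical result type: how to handle a failed upstream call."""

    status: int
    block_globally: bool  # True means pause all requests (set_blocked-style)


def classify_status(status: int) -> Optional[RetryDecision]:
    """Return a retry decision, or None when the error should surface at once."""
    if status == 429 or 500 <= status <= 599:
        # Congestion signals: retry and pause the shared limiter.
        return RetryDecision(status, block_globally=True)
    if status == 400:
        # Possibly transient: retry this request only, no global pause.
        return RetryDecision(status, block_globally=False)
    return None  # other 4xx (401, 404, ...) are not retried
```

Keeping the global-pause flag separate from the retry decision is what lets a persistently failing 400 back off on its own without slowing unrelated requests.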

Additional fixes per review:

  • retryable_upstream_status docstring updated to document 400 behavior
  • Log label changed from "Upstream server error (400)" to "Transient bad request (400)" — 400 is a client error, not server error
  • _upstream_http_retryable docstring notes 400 is intentionally excluded (separate branch to skip set_blocked)

Safety Evidence

All 1440 existing tests pass. New tests added:

Unit tests (test_provider_rate_limit.py)

  • test_execute_with_retry_400_retried_then_exhausts — httpx HTTPStatusError with 400: asserts 3 calls (1 initial + 2 retries)
  • test_execute_with_retry_400_then_200_recovers — transient 400 then 200: asserts call_count == 2 and result == "ok"
  • test_execute_with_retry_openai_400_retried_then_exhausts — openai.BadRequestError with 400: asserts 3 calls

Integration test (test_anthropic_messages_429_retry.py)

  • test_transient_400_is_retried_then_exhausts — real execute_with_retry, 5 send calls, SSE error envelope with "Invalid request sent to provider."
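The behaviour these tests pin down can be reproduced with a stand-in retry loop. The sketch below is an assumption-laden illustration, not the project's `execute_with_retry`: `UpstreamStatusError`, `retry_call`, and `make_flaky` are hypothetical names, and `max_retries=2` is chosen only to match the "1 initial + 2 retries" count in the unit tests.

```python
import asyncio


class UpstreamStatusError(Exception):
    """Stand-in for an HTTP client error carrying a status code (hypothetical)."""

    def __init__(self, status: int) -> None:
        super().__init__(f"upstream returned {status}")
        self.status = status


async def retry_call(fn, max_retries: int = 2, base_delay: float = 0.0):
    """Call fn(); on a 400, retry up to max_retries times with doubling delay."""
    for attempt in range(max_retries + 1):
        try:
            return await fn()
        except UpstreamStatusError as exc:
            # Re-raise anything that is not a 400, or a 400 on the last attempt.
            if exc.status != 400 or attempt == max_retries:
                raise
            await asyncio.sleep(base_delay * 2**attempt)


def make_flaky(failures: int):
    """Build a coroutine function that raises 400 `failures` times, then returns "ok"."""
    state = {"calls": 0}

    async def call():
        state["calls"] += 1
        if state["calls"] <= failures:
            raise UpstreamStatusError(400)
        return "ok"

    return call, state


# Transient 400 followed by success: two calls in total, result "ok".
call, state = make_flaky(failures=1)
result = asyncio.run(retry_call(call))
assert (result, state["calls"]) == ("ok", 2)
```

With `make_flaky(failures=10)` the same loop gives up after three calls and re-raises the 400, which is the "retried then exhausts" case.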

BlueBoobyAI and others added 5 commits June 14, 2026 10:01
DeepSeek and other providers occasionally return HTTP 400 on transient
internal failures (not a real request bug). The retry gate explicitly
excluded 400, so these bypassed the retry loop and killed the session.

Adding 400 to retryable_upstream_status() lets transient 400s enter the
existing exponential-backoff retry loop (5 attempts, 2s base, 60s cap).
Real 400s (malformed requests) simply retry to the same 400 — an extra
fast request with no billing impact.
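
For a sense of scale, the schedule implied by those parameters can be computed directly. This is a rough sketch under stated assumptions: the delay doubles on each retry, no jitter is applied, and the loop sleeps only between attempts. None of that is confirmed here, and `backoff_delays` is a hypothetical helper.

```python
def backoff_delays(attempts: int = 5, base: float = 2.0, cap: float = 60.0) -> list[float]:
    """Delays slept between consecutive attempts (one fewer than attempts).

    Assumes plain doubling with a hard cap and no jitter.
    """
    return [min(cap, base * 2**i) for i in range(attempts - 1)]


delays = backoff_delays()
print(delays)       # [2.0, 4.0, 8.0, 16.0]
print(sum(delays))  # 30.0
```

Under these assumptions a request that keeps returning 400 would wait about 30 seconds in total before the error surfaces, so the cost of retrying a genuine bad request is latency rather than billing.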

Same pattern as AWS SDK's RetryMode.ADAPTIVE — classify transient service
failures as retryable regardless of status code.

Three new tests in test_provider_rate_limit:
- test_execute_with_retry_400_retried_then_exhausts — asserts 3 calls
- test_execute_with_retry_400_then_200_recovers — asserts recovery
- test_execute_with_retry_openai_400_retried_then_exhausts — asserts 3 calls via openai SDK

One updated test in test_anthropic_messages_429_retry:
- test_transient_400_is_retried_then_exhausts — real execute_with_retry, 5 send calls, SSE error envelope with "Invalid request sent to provider."

All 1440 tests pass with these changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

… requests

A genuine bad request (wrong model name, malformed prompt) should not
block all concurrent proxy requests during retry backoff. Only 429 and
5xx signal upstream congestion worth a global pause.

Also fixes duplicate @pytest.mark.asyncio decorator on the renamed test,
and bumps version to 1.2.42 per AGENTS.md requirements (version + uv.lock).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- retryable_upstream_status docstring now mentions 400 (no reactive block)
- Log label for 400 changed from "Upstream server error (400)" to
  "Transient bad request (400)" — 400 is a client error, not server error
- _upstream_http_retryable docstring notes 400 is intentionally excluded
  (it lives in a separate branch to skip set_blocked)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

- Add isinstance(exc, openai.BadRequestError): return 400 before the
  generic openai.APIError branch (BadRequestError is a subclass, so
  it would pass through the generic branch only if status_code attr
  is present — defensive ordering)
- Use 0.5s base_delay for 400 retries vs 2s for 429/5xx (a transient
  DeepSeek hiccup resolves in <500ms; 2s was unnecessarily slow)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
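
The two changes in this last commit, subclass-first ordering and a shorter base delay for 400, can be sketched with stand-in exception classes. The classes below only mimic the subclass relationship described above and are not the openai SDK's types. `retryable_status` and `base_delay_for` are hypothetical names, and the 0.5s and 2s values are taken from the commit message.

```python
from typing import Optional


class APIError(Exception):
    """Stand-in for a generic SDK error (hypothetical, not openai.APIError)."""

    status_code: Optional[int] = None


class BadRequestError(APIError):
    """Stand-in for the SDK's 400 error; a subclass of the generic error."""

    status_code = 400


class RateLimitError(APIError):
    """Stand-in for the SDK's 429 error."""

    status_code = 429


def retryable_status(exc: Exception) -> Optional[int]:
    """Check the specific subclass first so a 400 never depends on the generic branch."""
    if isinstance(exc, BadRequestError):
        return 400
    if isinstance(exc, APIError):
        status = getattr(exc, "status_code", None)
        if status is not None and (status == 429 or 500 <= status <= 599):
            return status
    return None


def base_delay_for(status: int) -> float:
    """Shorter first delay for a 400, longer for congestion signals."""
    return 0.5 if status == 400 else 2.0
```

Because `BadRequestError` is a subclass of `APIError`, swapping the two `isinstance` checks would route a 400 through the generic branch, where it is only handled if the status attribute happens to be set.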