fix: retry transient HTTP 400 errors from upstream providers #1
Open
BlueBoobyAI wants to merge 5 commits into
DeepSeek and other providers occasionally return HTTP 400 on transient internal failures (not a real request bug). The retry gate explicitly excluded 400, so these responses bypassed the retry loop and killed the session.

Adding 400 to retryable_upstream_status() lets transient 400s enter the existing exponential-backoff retry loop (5 attempts, 2s base, 60s cap). Real 400s (malformed requests) simply fail again with the same 400 on each retry: a few extra fast requests with no billing impact.

This follows the same idea as the AWS SDK's RetryMode.ADAPTIVE: classify transient service failures as retryable regardless of status code.
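For a sense of how long a retried 400 can hold a request, the schedule implied by "5 attempts, 2s base, 60s cap" can be sketched as follows. This is an illustration only: it assumes plain doubling with no jitter, and `backoff_delays` is a hypothetical helper, not a function from this repository.

```python
def backoff_delays(
    max_attempts: int = 5,
    base_delay: float = 2.0,
    max_delay: float = 60.0,
) -> list[float]:
    """Return the sleep before each retry (one fewer delay than attempts).

    Assumption: the delay doubles per retry and is clamped to max_delay;
    the real loop may add jitter or honor Retry-After headers.
    """
    return [min(base_delay * 2**i, max_delay) for i in range(max_attempts - 1)]


if __name__ == "__main__":
    delays = backoff_delays()
    print(delays)       # [2.0, 4.0, 8.0, 16.0]
    print(sum(delays))  # 30.0 seconds of waiting before the final attempt
```

Under these assumptions a request that keeps returning 400 waits about 30 seconds in total before the error is surfaced.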
Three new tests in test_provider_rate_limit:
- test_execute_with_retry_400_retried_then_exhausts: asserts 3 calls
- test_execute_with_retry_400_then_200_recovers: asserts recovery
- test_execute_with_retry_openai_400_retried_then_exhausts: asserts 3 calls via openai SDK

One updated test in test_anthropic_messages_429_retry:
- test_transient_400_is_retried_then_exhausts: real execute_with_retry, 5 send calls, SSE error envelope with "Invalid request sent to provider."

All 1440 tests pass with these changes.

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
… requests

A genuine bad request (wrong model name, malformed prompt) should not block all concurrent proxy requests during retry backoff. Only 429 and 5xx signal upstream congestion worth a global pause.

Also fixes a duplicate @pytest.mark.asyncio decorator on the renamed test, and bumps the version to 1.2.42 per AGENTS.md requirements (version + uv.lock).

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
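The split between "retryable" and "worth a global pause" described in this commit might look roughly like the sketch below. All names here (`is_retryable`, `should_block_globally`, `FakeLimiter`, `on_retryable_failure`) are hypothetical stand-ins; only `set_blocked` echoes a method name mentioned in this PR, and its real signature is not shown in the source.

```python
def is_retryable(status: int) -> bool:
    """400, 429 and 5xx all enter the retry loop."""
    return status in (400, 429) or 500 <= status <= 599


def should_block_globally(status: int) -> bool:
    """Only congestion signals (429, 5xx) pause every concurrent request."""
    return status == 429 or 500 <= status <= 599


class FakeLimiter:
    """Stand-in for the shared rate limiter; records requested pauses."""

    def __init__(self) -> None:
        self.blocked_for: list[float] = []

    def set_blocked(self, seconds: float) -> None:
        self.blocked_for.append(seconds)


def on_retryable_failure(status: int, delay: float, limiter: FakeLimiter) -> None:
    """Apply the global pause only when the status signals congestion."""
    if should_block_globally(status):
        limiter.set_blocked(delay)
    # For 400 the caller still sleeps for `delay`, but only for this request.


if __name__ == "__main__":
    limiter = FakeLimiter()
    on_retryable_failure(400, 0.5, limiter)  # per-request backoff only
    on_retryable_failure(429, 2.0, limiter)  # global pause recorded
    print(limiter.blocked_for)  # [2.0]
```

Keeping the two predicates separate means a request with a wrong model name can exhaust its own retries without slowing any other session.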
- retryable_upstream_status docstring now mentions 400 (no reactive block)
- Log label for 400 changed from "Upstream server error (400)" to "Transient bad request (400)", since 400 is a client error, not a server error
- _upstream_http_retryable docstring notes 400 is intentionally excluded (it lives in a separate branch to skip set_blocked)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
- Add isinstance(exc, openai.BadRequestError): return 400 before the generic openai.APIError branch. BadRequestError is a subclass, so the generic branch would classify it only if the status_code attribute is present; checking it first is defensive ordering.
- Use 0.5s base_delay for 400 retries vs 2s for 429/5xx (a transient DeepSeek hiccup resolves in <500ms; 2s was unnecessarily slow)

Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>
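The ordering concern can be illustrated without the openai package installed. The classes below are local stand-ins that mirror the SDK's hierarchy (BadRequestError derives from APIStatusError, which derives from APIError); `status_from_exception` and `base_delay_for` are hypothetical names, not the functions changed in this PR.

```python
from typing import Optional


class APIError(Exception):
    """Stand-in for the generic SDK error; carries no status_code itself."""


class APIStatusError(APIError):
    """Stand-in for errors built from an HTTP response."""

    def __init__(self, message: str, status_code: int) -> None:
        super().__init__(message)
        self.status_code = status_code


class BadRequestError(APIStatusError):
    """Stand-in for the SDK's HTTP 400 error."""

    def __init__(self, message: str = "bad request") -> None:
        super().__init__(message, 400)


def status_from_exception(exc: Exception) -> Optional[int]:
    # Most specific class first: this branch does not rely on the attribute.
    if isinstance(exc, BadRequestError):
        return 400
    # Generic branch: only yields a status when status_code is present.
    if isinstance(exc, APIError):
        return getattr(exc, "status_code", None)
    return None


def base_delay_for(status: int) -> float:
    """0.5s base for 400, 2s for 429/5xx, per the commit message above."""
    return 0.5 if status == 400 else 2.0


if __name__ == "__main__":
    print(status_from_exception(BadRequestError()))       # 400
    print(status_from_exception(APIError("no status")))   # None
    print(base_delay_for(400), base_delay_for(429))       # 0.5 2.0
```

Because isinstance matches subclasses, placing the generic check first would route every BadRequestError through the attribute lookup; the explicit branch keeps the 400 classification independent of it.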
Problem
Claude Code sessions crash multiple times per day when proxied through free-claude-code to DeepSeek V4 Flash:
The user sees a fatal error in their terminal:
Root cause: DeepSeek occasionally returns HTTP 400 on transient internal hiccups (not real request bugs). The proxy's `retryable_upstream_status()` function at `providers/rate_limit.py:28` only treats HTTP 429 and 5xx as retryable, so a 400 passes through immediately and raises `InvalidRequestError`. The session crashes and the user has to restart Claude Code.

The `/messages` POST is idempotent and 400s aren't billed, so retrying is safe.

Fix
`providers/rate_limit.py`: add `if status == 400: return 400` to both branches of `retryable_upstream_status()`.

Important safety detail: HTTP 400 retries do NOT call `set_blocked()` on the shared `GlobalRateLimiter`. Unlike 429/5xx (which signal upstream congestion worth a global pause), a transient 400 is a per-request hiccup. A genuine bad request (wrong model name) retries with individual backoff but does not stall concurrent requests.

Additional fixes per review:
- `retryable_upstream_status` docstring updated to document 400 behavior
- Log label for 400 changed from "Upstream server error (400)" to "Transient bad request (400)", since 400 is a client error, not a server error
- `_upstream_http_retryable` docstring notes 400 is intentionally excluded (separate branch to skip `set_blocked`)

Safety Evidence
All 1440 existing tests pass. New tests added:
Unit tests (`test_provider_rate_limit.py`)
- `test_execute_with_retry_400_retried_then_exhausts`: httpx HTTPStatusError with 400; asserts 3 calls (1 initial + 2 retries)
- `test_execute_with_retry_400_then_200_recovers`: transient 400 then 200; asserts `call_count == 2` and result == `"ok"`
- `test_execute_with_retry_openai_400_retried_then_exhausts`: openai.BadRequestError with 400; asserts 3 calls

Integration test (`test_anthropic_messages_429_retry.py`)
- `test_transient_400_is_retried_then_exhausts`: real `execute_with_retry`, 5 send calls, SSE error envelope with "Invalid request sent to provider."
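For readers without the repository at hand, the two behaviors the unit tests check (recovery after one transient 400, and exhaustion after a fixed number of calls) can be reproduced with a small self-contained sketch. `execute_with_retry_sketch` and `TransientBadRequest` are hypothetical stand-ins written for this illustration; they are not the project's `execute_with_retry` or its exception types.

```python
import asyncio


class TransientBadRequest(Exception):
    """Stand-in for an upstream error classified as a retryable 400."""


async def execute_with_retry_sketch(fn, max_attempts: int = 3, base_delay: float = 0.0):
    """Call fn until it succeeds or max_attempts calls have been made."""
    for attempt in range(max_attempts):
        try:
            return await fn()
        except TransientBadRequest:
            if attempt == max_attempts - 1:
                raise  # retries exhausted: surface the last error
            await asyncio.sleep(base_delay * 2**attempt)


def run_recovery_case():
    """One 400, then success: expect two calls and the result 'ok'."""
    call_count = 0

    async def flaky():
        nonlocal call_count
        call_count += 1
        if call_count == 1:
            raise TransientBadRequest("transient 400")
        return "ok"

    result = asyncio.run(execute_with_retry_sketch(flaky))
    return result, call_count


def run_exhaustion_case():
    """Every call fails: expect 3 calls (1 initial + 2 retries), then the error."""
    call_count = 0

    async def always_400():
        nonlocal call_count
        call_count += 1
        raise TransientBadRequest("persistent 400")

    try:
        asyncio.run(execute_with_retry_sketch(always_400))
    except TransientBadRequest:
        return call_count
    return -1  # unreachable if the loop re-raises as intended


if __name__ == "__main__":
    print(run_recovery_case())    # ('ok', 2)
    print(run_exhaustion_case())  # 3
```

The sketch uses a zero base delay so it runs instantly; the real tests presumably patch or shorten the sleep in a similar way, though the source does not say how.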