Commit 2087b68

lesnik512 and claude authored
feat(circuit-breaker): time-based failure-rate trip mode (0.13.0) (#69)
* feat(circuit-breaker): add time-bucketed _RollingWindow recorder
* feat(circuit-breaker): thread rate-mode config + validation
* feat(circuit-breaker): rate-over-window trip mode
* test(circuit-breaker): assert rate-mode circuit.opened attributes
* docs(circuit-breaker): document rate mode; 0.13.0 release notes
* docs(planning): add the circuit-breaker-rate-mode change bundle

  Design + plan for the opt-in time-based failure-rate trip mode, and the Active Index entry. Bundle stays active/draft until merge.
* docs(circuit-breaker): document rate mode in the module docstring
* chore(planning): archive the circuit-breaker-rate-mode bundle (#69)

  Ship bookkeeping for PR #69: fill the 0.13.0 release-notes PR number, mark the bundle shipped (pr: 69), move it from changes/active/ to changes/archive/, flip its Index line to Archived, and trim the deferred "CircuitBreaker v2" item to the still-open axes (count-based window, manual control + state).
* docs(circuit-breaker): clarify mode-switch precedence and cross-mode validation

  Address PR #69 review: note that failure_rate_threshold is the sole mode switch (both-set → rate wins, not an error), and that window_seconds / minimum_calls are validated in both modes even when inert in classic.

Co-authored-by: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
1 parent 2b7d9b7 commit 2087b68

11 files changed

Lines changed: 1389 additions & 32 deletions

architecture/resilience.md

Lines changed: 2 additions & 0 deletions
@@ -14,6 +14,8 @@
`AsyncCircuitBreaker` and sync `CircuitBreaker` are a classic consecutive-failure circuit breaker: the circuit opens after `failure_threshold` consecutive counted failures, fast-fails while OPEN, admits one probe after `reset_timeout` (HALF_OPEN), and closes again after `success_threshold` consecutive probe successes; a probe failure re-opens it. A *counted failure* is a `NetworkError`, an httpware `TimeoutError`, or a `StatusError` whose `status_code` is in the effective failure set (default: all 5xx, 500–599); 4xx including 429 count as successes, and any other exception type propagates unchanged without affecting circuit state. When the breaker refuses a request — OPEN, or HALF_OPEN with the single probe slot already taken — it raises `CircuitOpenError` and never forwards to `next`; the error's `retry_after` carries the seconds until the next probe will be admitted, or `None` when a concurrent probe is already in flight. A breaker instance is sharable across clients (one shared circuit); a sync instance cannot be shared with an async one.

The classic consecutive-failure mode is the default and unchanged. An opt-in time-based failure-rate mode is available: set `failure_rate_threshold` (a float in `(0, 1]`) to switch. In rate mode the circuit opens when the observed failure rate over a rolling `window_seconds` window (default `30.0` s) meets or exceeds the threshold, but only once `minimum_calls` outcomes have been observed in that window (default `20`). The presence of `failure_rate_threshold` is the sole mode switch: when it is set, the breaker is in rate mode and `failure_threshold` is ignored (setting both is not an error — rate mode wins). `window_seconds` and `minimum_calls` are validated at construction in both modes even though they are inert in classic mode, so an invalid value is rejected eagerly regardless of mode. Half-open recovery (`reset_timeout`, `success_threshold`, the single-probe admission) is identical to classic mode. The event names (`circuit.opened`, `circuit.rejected`, `circuit.half_open`, `circuit.closed`) are the same in both modes; in rate mode the `circuit.opened` event carries extra attributes — `failure_rate`, `failure_rate_threshold`, `window_seconds`, `observed_calls` — and its message is `"circuit opened — failure rate threshold reached"`.
`AsyncTimeout` is an async-only middleware that bounds the total wall-clock for the whole inner pipeline (most importantly across an `AsyncRetry` loop, whose attempts and backoff sleeps `httpx2` cannot bound). It is not a per-call timeout — `httpx2`'s connect/read/write/pool timeouts are the right tool for a single outbound call, and `AsyncTimeout` does not duplicate them. It rejects a non-finite or non-positive `timeout` at construction, and on expiry raises httpware `TimeoutError`. There is no sync `Timeout`: a sync total-deadline cannot interrupt a blocking call mid-flight, and `httpx2` already covers sync per-call timeouts. Sync callers configure `httpx2`'s timeouts directly.

The recommended (documented, not enforced) composition order is `AsyncTimeout → AsyncCircuitBreaker → AsyncBulkhead → AsyncRetry → terminal`. With the breaker outside retry, an open circuit short-circuits the entire retry loop and the breaker counts one outcome per fully-exhausted retry sequence rather than per attempt.

docs/resilience.md

Lines changed: 19 additions & 0 deletions
@@ -191,6 +191,25 @@ Emitted on logger `httpware.circuit_breaker`:
| `circuit.half_open` | Reset timeout elapsed; circuit transitions OPEN → HALF_OPEN |
| `circuit.closed` | Success threshold reached; circuit transitions HALF_OPEN → CLOSED |

### Time-based failure-rate mode

By default the circuit breaker trips on `failure_threshold` *consecutive* counted failures. This can miss partial degradation: a downstream that alternates success and failure (a steady 50% error rate) never forms a failure streak longer than one, so with any `failure_threshold` above 1 the circuit stays closed while half the traffic fails.

For that pattern, switch to rate mode by passing `failure_rate_threshold`:

```python
from httpware.middleware.resilience import AsyncCircuitBreaker


breaker = AsyncCircuitBreaker(
    failure_rate_threshold=0.5,  # open at ≥50% failures
    window_seconds=30.0,         # over a rolling 30s window
    minimum_calls=20,            # but only once 20+ calls are observed
)
```

When `failure_rate_threshold` is set, the breaker watches the rolling `window_seconds` window (default `30.0` s) and opens once the failure rate meets or exceeds the threshold, provided at least `minimum_calls` (default `20`) outcomes have been observed in that window. Classic mode is the default; `failure_threshold` is ignored in rate mode. Half-open recovery works identically in both modes. The sync `CircuitBreaker` constructor accepts the same parameters.

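As an illustration of why the two trip tests disagree on an alternating pattern, here is a standalone sketch. It does not import httpware, it ignores the time window (all calls are assumed to land inside one window), and the classic `failure_threshold = 5` is an assumed value chosen for the example.

```python
from itertools import cycle, islice


def max_consecutive_failures(outcomes):
    """Length of the longest run of failures (False) in the sequence."""
    longest = current = 0
    for ok in outcomes:
        current = 0 if ok else current + 1
        longest = max(longest, current)
    return longest


def failure_rate(outcomes):
    """Fraction of outcomes that are failures (False)."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


# 40 calls alternating success (True) and failure (False).
outcomes = list(islice(cycle([True, False]), 40))

failure_threshold = 5          # assumed classic setting, for illustration
failure_rate_threshold = 0.5   # values from the example above
minimum_calls = 20

# Classic test: the longest failure streak is 1, so it never reaches 5.
classic_trips = max_consecutive_failures(outcomes) >= failure_threshold

# Rate test: enough calls observed, and 20 of 40 failed.
rate_trips = (
    len(outcomes) >= minimum_calls
    and failure_rate(outcomes) >= failure_rate_threshold
)

print(classic_trips, rate_trips)
```

Under these assumptions the consecutive-failure test stays negative while the rate test is satisfied, which is the gap rate mode is meant to close.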
### Sharing

Pass the same instance to multiple clients to enforce one shared circuit across them. A `CircuitBreaker` (sync) cannot be shared with an `AsyncCircuitBreaker`; they use different concurrency primitives.

planning/README.md

Lines changed: 1 addition & 0 deletions
@@ -74,6 +74,7 @@ _None._

### Archived (shipped)

- **[circuit-breaker-rate-mode](changes/archive/2026-06-16.02-circuit-breaker-rate-mode/design.md)** (#69, 2026-06-16): Added an opt-in time-based failure-rate trip mode to the circuit breaker (classic stays default). Shipped 0.13.0; closed deferred item "CircuitBreaker v2 (a)".
- **[per-verb-with-response](changes/archive/2026-06-16.01-per-verb-with-response/design.md)** (#68, 2026-06-16): Added `get_with_response` … `request_with_response` siblings (required `response_model`, returns `(Response, T)`) to both clients. Shipped 0.12.0; closed the deferred "Per-verb-with-response siblings" item.
- **[custom-decoder-guide](changes/archive/2026-06-15.01-custom-decoder-guide/change.md)** (#67, 2026-06-15): Docs: a "write your own `ResponseDecoder`" guide for Seam B, mirroring `docs/middleware.md`. Closed deferred item G6.
- **[audit-doc-fixes](changes/archive/2026-06-14.06-audit-doc-fixes/change.md)** (#66, 2026-06-14): Closed the [deep-audit](audits/2026-06-14-deep-audit.md) doc-accuracy findings: `Client.stream()` docs, terminal-call attribution, the four auto-raise sites, the pydantic upper bound, and root import paths.
Lines changed: 180 additions & 0 deletions
@@ -0,0 +1,180 @@
---
status: shipped
date: 2026-06-16
slug: circuit-breaker-rate-mode
supersedes: null
superseded_by: null
pr: 69
outcome: Shipped 0.13.0, an opt-in time-based failure-rate trip mode (failure_rate_threshold + window_seconds + minimum_calls) on both breakers; classic stays default. Closed the "CircuitBreaker v2 (a)" deferred item; count-based windows, slow-call axis, and manual control + state remain deferred.
---

# Design: CircuitBreaker v2, time-based failure-rate trip mode

## Summary

Add an additive, opt-in **time-based failure-rate** trip mode to
`AsyncCircuitBreaker` / `CircuitBreaker`. The classic consecutive-failure model
stays the default and is byte-unchanged; nothing trips differently unless the
caller sets `failure_rate_threshold`. Rate mode opens the circuit when the
failure rate over a rolling time window meets the threshold, once a minimum call
volume is observed. Ships as 0.13.0.

## Motivation

The 0.10.0 breaker ships only the classic model: open after N *consecutive*
counted failures. That cannot catch *partial* degradation: a steady 50% error
rate that alternates success/fail never reaches a consecutive streak, so the
breaker never trips while half the traffic is failing. This was deferred to v2
in the 0.10.0 spec, with the config deliberately shaped so a rate mode is purely
additive (see [`deferred.md`](../../deferred.md) → "CircuitBreaker v2").

The verified comparison in `deferred.md` (2026-06-13) shows rate-over-window is
the mainstream model for service-level breakers: Hystrix (time-bucketed),
Polly v8 (time-based only), and Envoy/Istio outlier detection (time intervals)
are all time-based; Resilience4j defaults to count-based but offers both. We
choose **time-based** because the mental model matches the HTTP domain ("trip if
at least 50% of calls failed in the last 30s"), it degrades sanely under variable
traffic (a count-based window can hold hour-old outcomes when traffic is low),
and it is consistent with the existing wall-clock `reset_timeout`.

## Non-goals

- **Count-based windows.** Deferred; the config leaves room to add a window-type
  selector later if anyone asks.
- **Slow-call rate axis.** Resilience4j-only; redundant with `AsyncTimeout`.
- **Manual control / read-only `state` introspection** (deferred item b). Stays
  parked as YAGNI; independent design axis.
- **Rate-based half-open recovery.** Half-open stays identical to v1 in both
  modes (consecutive `success_threshold` probe successes); this is simpler, and
  the trip mode is the only behavioral change.

## Design

### 1. Opt-in config shape

`failure_rate_threshold` is the mode switch on both wrappers' `__init__`:

```python
AsyncCircuitBreaker(
    failure_rate_threshold=0.5,   # None (default) = classic; set = rate mode
    window_seconds=30.0,          # rolling window duration (default 30.0)
    minimum_calls=20,             # floor before the rate is evaluated (default 20)
    # unchanged, shared by both modes:
    reset_timeout=30.0,
    success_threshold=1,
    failure_status_codes=None,
)
```

- **Shared across modes:** `reset_timeout`, `success_threshold` (half-open
  recovery), `failure_status_codes` (the counted-failure set; 429/4xx remain
  successes).
- **Classic-only:** `failure_threshold`. In rate mode it is **silently ignored**
  (documented). The two thresholds don't conflict (the mode is selected solely
  by whether `failure_rate_threshold` is `None`), so no raise-on-both guard is
  added.
- **Validation** (in `_CircuitBreakerState.__init__`, alongside the existing
  checks): when `failure_rate_threshold is not None`, require
  `0.0 < failure_rate_threshold <= 1.0`; in both modes, require
  `window_seconds > 0` and `minimum_calls >= 1`. New message constants follow
  the existing `_FAILURE_THRESHOLD_INVALID` pattern.

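The validation rules above can be sketched as a standalone function. This is a minimal illustration, not the code in `_CircuitBreakerState.__init__`: the function name `validate_breaker_config`, the returned mode string, and the error message texts are all hypothetical.

```python
def validate_breaker_config(
    failure_rate_threshold=None,
    window_seconds=30.0,
    minimum_calls=20,
):
    """Return "rate" or "classic"; raise ValueError on an invalid value.

    Hypothetical helper: name, return value, and messages are assumptions.
    """
    # The threshold is only checked when it is set (that is, in rate mode).
    if failure_rate_threshold is not None and not (
        0.0 < failure_rate_threshold <= 1.0
    ):
        raise ValueError("failure_rate_threshold must be in (0, 1]")
    # These two are checked in both modes, even though classic ignores them.
    # `not x > 0` also rejects NaN, which `x <= 0` would let through.
    if not window_seconds > 0:
        raise ValueError("window_seconds must be > 0")
    if minimum_calls < 1:
        raise ValueError("minimum_calls must be >= 1")
    # Presence of the threshold is the sole mode switch.
    return "classic" if failure_rate_threshold is None else "rate"


print(validate_breaker_config())       # classic
print(validate_breaker_config(0.5))    # rate
```

Checking `window_seconds` and `minimum_calls` unconditionally means a bad value fails at construction even for a breaker that never enters rate mode.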
### 2. Time-based rolling-bucket window

A new internal `_RollingWindow` (or inline state on `_CircuitBreakerState`):
`window_seconds` divided into a fixed **10 buckets** (`_BUCKET_COUNT`), each a
`[successes, failures]` pair tagged with the time-slot it represents. Bucket
width = `window_seconds / 10`.

Recording an outcome (synchronous, no await):
1. `slot = floor(self._now() / bucket_width)`.
2. `index = slot % _BUCKET_COUNT`. If the bucket at `index` carries a stale slot
   tag (`!= slot`), reset it to `[0, 0]` and retag; this evicts data older than
   one full window in O(`_BUCKET_COUNT`), independent of call volume.
3. Increment the bucket's success or failure count.

Rate computation sums `(successes, failures)` across buckets whose slot tag is
within the last `_BUCKET_COUNT` slots (live), giving `total` and `failures`;
`rate = failures / total` when `total > 0`. Eviction-on-read drops buckets that
fell out of the window since the last write.

All bucket reads/writes happen inside the same synchronous critical section the
breaker already uses (async: lock-free under one event loop; sync:
`threading.Lock`), and `_now()` is read inside that section.

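The recording and read steps above can be sketched as a small class. This is an illustrative reading of the design, not the shipped `_RollingWindow`: the class name `RollingWindowSketch`, the `[slot, successes, failures]` bucket layout, the `clear` method name, and the `time.monotonic` default clock are assumptions, and locking is left out.

```python
import math
import time

BUCKET_COUNT = 10  # mirrors the design's fixed bucket count


class RollingWindowSketch:
    """Time-bucketed success/failure recorder (illustrative only)."""

    def __init__(self, window_seconds, now=time.monotonic):
        self._bucket_width = window_seconds / BUCKET_COUNT
        self._now = now
        # Each bucket is [slot_tag, successes, failures]; None = never used.
        self._buckets = [[None, 0, 0] for _ in range(BUCKET_COUNT)]

    def _current_slot(self):
        return math.floor(self._now() / self._bucket_width)

    def record(self, failed):
        slot = self._current_slot()
        bucket = self._buckets[slot % BUCKET_COUNT]
        if bucket[0] != slot:
            # Stale tag: the ring slot is being reused, so drop old counts.
            bucket[:] = [slot, 0, 0]
        bucket[2 if failed else 1] += 1

    def totals(self):
        """Return (total, failures) over buckets that are still live."""
        slot = self._current_slot()
        oldest_live = slot - BUCKET_COUNT + 1
        total = failures = 0
        for tag, ok, bad in self._buckets:
            # Eviction-on-read: skip buckets that aged out since last write.
            if tag is not None and oldest_live <= tag <= slot:
                total += ok + bad
                failures += bad
        return total, failures

    def clear(self):
        self._buckets = [[None, 0, 0] for _ in range(BUCKET_COUNT)]


# Usage with a pinned clock, in the spirit of the `_now` test hook.
clock = [0.0]
window = RollingWindowSketch(30.0, now=lambda: clock[0])
window.record(failed=True)
clock[0] = 31.0            # the failure's bucket has left the live range
window.record(failed=False)
print(window.totals())
```

Because liveness is decided per bucket, the window covers between nine and ten bucket widths of history at any instant; that granularity is the cost of keeping both writes and reads independent of call volume.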
### 3. State-machine integration: mode changes only the CLOSED trip test

The trip mode affects exactly one decision: when to open from CLOSED. Everything
else is shared.

- **CLOSED, rate mode:** `on_success` and `on_failure` record the outcome into
  the window (a counted failure increments failures; a success increments
  successes). After recording, if `total >= minimum_calls` **and**
  `rate >= failure_rate_threshold`, open the circuit. The classic consecutive
  counters are not used in rate mode.
- **CLOSED, classic mode:** unchanged: consecutive-failure counter, open at
  `failure_threshold`.
- **OPEN → HALF_OPEN → CLOSED:** identical for both modes: lazy probe after
  `reset_timeout`, `success_threshold` consecutive probe successes close it, one
  probe failure re-opens. On transition to CLOSED, the window is cleared (all
  buckets reset) so recovery starts from a clean slate.
- **`release_probe` and non-counted exceptions** never touch the window,
  consistent with today (programming errors can't trip the breaker).

This logic lives entirely in the shared `_CircuitBreakerState`, so
`AsyncCircuitBreaker` and `CircuitBreaker` reach parity with no per-wrapper code
(the wrappers' `__init__` just forward the three new params).

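The rate-mode CLOSED trip test described above reduces to a two-part predicate. The sketch below is a standalone illustration; the function name `should_open` and its signature are hypothetical, and it takes the window totals as plain integers rather than reading them from breaker state.

```python
def should_open(total, failures, *, failure_rate_threshold, minimum_calls):
    """Rate-mode trip test for a CLOSED circuit (illustrative only)."""
    # Volume floor: do not evaluate the rate on too few outcomes.
    # The explicit zero check keeps the division below safe.
    if total == 0 or total < minimum_calls:
        return False
    return failures / total >= failure_rate_threshold


settings = dict(failure_rate_threshold=0.5, minimum_calls=20)

# A 100%-failure burst below the floor does not open the circuit.
print(should_open(19, 19, **settings))
# Twenty alternating outcomes (10 failures) meet the floor and the threshold.
print(should_open(20, 10, **settings))
```

Evaluating this after every recorded outcome means the circuit opens on the first call that satisfies both conditions, with no separate timer.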
### 4. Observability

Event names are unchanged (`circuit.opened`, `circuit.rejected`,
`circuit.half_open`, `circuit.closed`); the stable observability surface is
preserved. In rate mode, `circuit.opened` carries rate attributes instead of the
classic ones: `failure_rate`, `failure_rate_threshold`, `window_seconds`,
`observed_calls` (the `total`). Classic mode keeps emitting `failure_threshold` +
`failures`. `circuit.rejected`/`half_open`/`closed` are unchanged.

## Testing

Deterministic tests with a pinned `_now` callable (the existing constructor
already accepts `_now`), sync + async mirrors:

- **Trips at threshold:** with `minimum_calls` met and `rate >=
  failure_rate_threshold`, the circuit opens; an alternating 50% pattern that
  never trips the classic breaker DOES trip rate mode.
- **Volume floor:** below `minimum_calls`, a 100%-failure burst does NOT open.
- **Time eviction:** failures recorded, then `_now` advanced past
  `window_seconds`, then fresh successes; old failures age out and the rate
  reflects only the live window.
- **Classic unchanged:** existing breaker tests stay green (no behavior drift
  when `failure_rate_threshold is None`).
- **Half-open in rate mode:** open → probe after `reset_timeout` →
  `success_threshold` successes close → window cleared (a subsequent single
  failure doesn't immediately re-trip).
- **Validation:** out-of-range `failure_rate_threshold`, non-positive
  `window_seconds`, `minimum_calls < 1` raise `ValueError`.
- **Hypothesis prop** (`test_circuit_breaker_props.py` companion) for the
  rolling-window recorder: arbitrary interleavings of outcomes and time advances
  never miscount the live-window totals or evict live data.

`just test` green; `just lint` clean.

## Risk

- **Window-eviction correctness (medium × high).** Off-by-one in slot tagging or
  the modulo ring could count stale data or drop live data. Mitigated by the
  Hypothesis prop on the recorder and explicit time-advance tests; the standard
  slot-tag-and-retag pattern is well understood.
- **Concurrency (low × high).** Recording stays a synchronous mutation, so the
  async lock-free atomicity invariant and the sync `threading.Lock` both still
  hold; there are no new await points. Eviction reads `_now()` inside the
  critical section. This matches the `deferred.md` concurrency note.
- **Config confusion (low × low).** `failure_threshold` being ignored in rate
  mode could surprise; mitigated by docstring + `architecture/resilience.md`
  wording.

## Out of scope

Count-based windows; slow-call axis; manual control + `state`; rate-based
half-open; any change to classic-mode behavior, `AsyncTimeout`, or the
composition-order recommendation.
