| 1 | +--- |
| 2 | +status: shipped |
| 3 | +date: 2026-06-16 |
| 4 | +slug: circuit-breaker-rate-mode |
| 5 | +supersedes: null |
| 6 | +superseded_by: null |
| 7 | +pr: 69 |
| 8 | +outcome: Shipped 0.13.0 — opt-in time-based failure-rate trip mode (failure_rate_threshold + window_seconds + minimum_calls) on both breakers; classic stays default. Closed the "CircuitBreaker v2 (a)" deferred item; count-based windows, slow-call axis, and manual control + state remain deferred. |
| 9 | +--- |
| 10 | + |
| 11 | +# Design: CircuitBreaker v2 — time-based failure-rate trip mode |
| 12 | + |
| 13 | +## Summary |
| 14 | + |
| 15 | +Add an additive, opt-in **time-based failure-rate** trip mode to |
| 16 | +`AsyncCircuitBreaker` / `CircuitBreaker`. The classic consecutive-failure model |
| 17 | +stays the default and is byte-unchanged; nothing trips differently unless the |
| 18 | +caller sets `failure_rate_threshold`. Rate mode opens the circuit when the |
| 19 | +failure rate over a rolling time window meets the threshold, once a minimum call |
| 20 | +volume is observed. Ships as 0.13.0. |
| 21 | + |
| 22 | +## Motivation |
| 23 | + |
| 24 | +The 0.10.0 breaker ships only the classic model: open after N *consecutive* |
| 25 | +counted failures. That cannot catch *partial* degradation — a steady 50% error |
| 26 | +rate that alternates success/fail never reaches a consecutive streak, so the |
| 27 | +breaker never trips while half the traffic is failing. This was deferred to v2 |
| 28 | +in the 0.10.0 spec, with the config deliberately shaped so a rate mode is purely |
| 29 | +additive (see [`deferred.md`](../../deferred.md) → "CircuitBreaker v2"). |
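
The gap can be checked with a few lines of arithmetic. This is a minimal sketch, not the library's code: `classic_trips` and `failure_rate` are hypothetical helpers, and the threshold of 5 is an arbitrary illustrative value.

```python
from typing import Sequence


def classic_trips(outcomes: Sequence[bool], failure_threshold: int) -> bool:
    """Hypothetical model of the classic rule: open after N consecutive failures."""
    streak = 0
    for ok in outcomes:
        # Any success resets the consecutive-failure streak.
        streak = 0 if ok else streak + 1
        if streak >= failure_threshold:
            return True
    return False


def failure_rate(outcomes: Sequence[bool]) -> float:
    """Fraction of failed calls in a non-empty sample."""
    return sum(1 for ok in outcomes if not ok) / len(outcomes)


# 40 calls alternating success/failure: half the traffic is failing.
alternating = [i % 2 == 0 for i in range(40)]

print(classic_trips(alternating, failure_threshold=5))
print(failure_rate(alternating))
```

The longest failure streak in the alternating sample is 1, so the consecutive rule cannot fire for any `failure_threshold` above 1, while the rate over the same sample is exactly one half.
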
| 30 | + |
| 31 | +The verified comparison in `deferred.md` (2026-06-13) shows rate-over-window is |
| 32 | +the mainstream model for service-level breakers: Hystrix (time-bucketed), |
| 33 | +Polly v8 (time-based only), and Envoy/Istio outlier detection (time intervals) |
| 34 | +are all time-based; Resilience4j defaults to count-based but offers both. We |
| 35 | +choose **time-based** because the mental model matches the HTTP domain ("trip if |
| 36 | +at least 50% of calls failed in the last 30s"), it degrades sanely under variable |
| 37 | +traffic (a count-based window can hold hour-old outcomes when traffic is low), |
| 38 | +and it is consistent with the existing wall-clock `reset_timeout`. |
| 39 | + |
| 40 | +## Non-goals |
| 41 | + |
| 42 | +- **Count-based windows.** Deferred; the config leaves room to add a window-type |
| 43 | + selector later if anyone asks. |
| 44 | +- **Slow-call rate axis.** Resilience4j-only; redundant with `AsyncTimeout`. |
| 45 | +- **Manual control / read-only `state` introspection** (deferred item b). Stays |
| 46 | + parked as YAGNI; independent design axis. |
| 47 | +- **Rate-based half-open recovery.** Half-open stays identical to v1 in both |
| 48 | + modes (consecutive `success_threshold` probe successes) — simpler, and the |
| 49 | + trip mode is the only behavioral change. |
| 50 | + |
| 51 | +## Design |
| 52 | + |
| 53 | +### 1. Opt-in config shape |
| 54 | + |
| 55 | +`failure_rate_threshold` is the mode switch on both wrappers' `__init__`: |
| 56 | + |
| 57 | +```python |
| 58 | +AsyncCircuitBreaker( |
| 59 | + failure_rate_threshold=0.5, # None (default) = classic; set = rate mode |
| 60 | + window_seconds=30.0, # rolling window duration (default 30.0) |
| 61 | + minimum_calls=20, # floor before the rate is evaluated (default 20) |
| 62 | + # unchanged, shared by both modes: |
| 63 | + reset_timeout=30.0, |
| 64 | + success_threshold=1, |
| 65 | + failure_status_codes=None, |
| 66 | +) |
| 67 | +``` |
| 68 | + |
| 69 | +- **Shared across modes:** `reset_timeout`, `success_threshold` (half-open |
| 70 | + recovery), `failure_status_codes` (the counted-failure set — 429/4xx remain |
| 71 | + successes). |
| 72 | +- **Classic-only:** `failure_threshold`. In rate mode it is **silently ignored** |
| 73 | + (documented). The two thresholds don't conflict — the mode is selected solely |
| 74 | + by whether `failure_rate_threshold` is `None` — so no raise-on-both guard is |
| 75 | + added. |
| 76 | +- **Validation** (in `_CircuitBreakerState.__init__`, alongside the existing |
| 77 | + checks): when `failure_rate_threshold is not None`, require |
| 78 | + `0.0 < failure_rate_threshold <= 1.0`; require `window_seconds > 0`; require |
| 79 | + `minimum_calls >= 1`. New message constants follow the existing |
| 80 | + `_FAILURE_THRESHOLD_INVALID` pattern. |
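
The validation rules above could look roughly like the following. This is a sketch under assumptions: the function name `validate_rate_config` and the three message constants are hypothetical (only `_FAILURE_THRESHOLD_INVALID` is named in this spec), and it assumes `window_seconds` and `minimum_calls` are checked only when rate mode is selected.

```python
from typing import Optional

# Hypothetical constants, named after the existing _FAILURE_THRESHOLD_INVALID pattern.
_FAILURE_RATE_THRESHOLD_INVALID = "failure_rate_threshold must be in (0.0, 1.0]"
_WINDOW_SECONDS_INVALID = "window_seconds must be > 0"
_MINIMUM_CALLS_INVALID = "minimum_calls must be >= 1"


def validate_rate_config(
    failure_rate_threshold: Optional[float],
    window_seconds: float,
    minimum_calls: int,
) -> None:
    """Raise ValueError if the rate-mode parameters are out of range."""
    if failure_rate_threshold is None:
        # Classic mode: assumed here that the rate-mode parameters are not checked.
        return
    if not 0.0 < failure_rate_threshold <= 1.0:
        raise ValueError(_FAILURE_RATE_THRESHOLD_INVALID)
    if not window_seconds > 0:
        raise ValueError(_WINDOW_SECONDS_INVALID)
    if minimum_calls < 1:
        raise ValueError(_MINIMUM_CALLS_INVALID)
```

Note that `1.0` is accepted (trip only when every call in the window failed) while `0.0` is rejected, since a zero threshold would open the circuit on any traffic once the volume floor is met.
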
| 81 | + |
| 82 | +### 2. Time-based rolling-bucket window |
| 83 | + |
| 84 | +A new internal `_RollingWindow` (or inline state on `_CircuitBreakerState`) |
| 85 | +divides `window_seconds` into a fixed **10 buckets** (`_BUCKET_COUNT`), each a |
| 86 | +`[successes, failures]` pair tagged with the time slot it represents. Bucket |
| 87 | +width is `window_seconds / _BUCKET_COUNT`. |
| 88 | + |
| 89 | +Recording an outcome (synchronous, no await): |
| 90 | +1. `slot = floor(self._now() / bucket_width)`. |
| 91 | +2. `index = slot % _BUCKET_COUNT`. If the bucket at `index` carries a stale slot |
| 92 | + tag (`!= slot`), reset it to `[0, 0]` and retag. A stale tag means the bucket's |
| 93 | + data has fallen out of the live window, so the reset evicts it in O(1) per write; |
| 93 | + the read-side sum is O(`_BUCKET_COUNT`), independent of call volume. |
| 94 | +3. Increment the bucket's success or failure count. |
| 95 | + |
| 96 | +Rate computation sums `(successes, failures)` across buckets whose slot tag is |
| 97 | +within the last `_BUCKET_COUNT` slots (live), giving `total` and `failures`; |
| 98 | +`rate = failures / total` when `total > 0`. Eviction-on-read drops buckets that |
| 99 | +fell out of the window since the last write. |
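
The recording steps and the rate computation above can be sketched as a small slot-tagged ring. This is illustrative only: the class name `RollingWindow`, its method names, the three-element bucket layout, and the `time.monotonic` default are assumptions, not the shipped implementation.

```python
import math
import time
from typing import Callable, List, Tuple

_BUCKET_COUNT = 10


class RollingWindow:
    """Fixed ring of slot-tagged buckets (hypothetical sketch)."""

    def __init__(self, window_seconds: float, now: Callable[[], float] = time.monotonic) -> None:
        self._bucket_width = window_seconds / _BUCKET_COUNT
        self._now = now
        # Each bucket is [slot_tag, successes, failures]; tag -1 marks an unused bucket.
        self._buckets: List[List[int]] = [[-1, 0, 0] for _ in range(_BUCKET_COUNT)]

    def _slot(self) -> int:
        return math.floor(self._now() / self._bucket_width)

    def record(self, success: bool) -> None:
        slot = self._slot()
        bucket = self._buckets[slot % _BUCKET_COUNT]
        if bucket[0] != slot:
            # Stale tag: the bucket belongs to an earlier trip around the ring.
            bucket[0], bucket[1], bucket[2] = slot, 0, 0
        bucket[1 if success else 2] += 1

    def totals(self) -> Tuple[int, int]:
        """Return (total, failures) summed over live buckets only."""
        oldest_live = self._slot() - _BUCKET_COUNT + 1
        successes = failures = 0
        for tag, ok, bad in self._buckets:
            if tag >= oldest_live:  # eviction-on-read: skip buckets outside the window
                successes += ok
                failures += bad
        return successes + failures, failures

    def clear(self) -> None:
        """Reset every bucket, e.g. on transition back to CLOSED."""
        for bucket in self._buckets:
            bucket[0], bucket[1], bucket[2] = -1, 0, 0
```

A pinned clock makes the eviction visible:

```python
clock = [0.0]
window = RollingWindow(30.0, now=lambda: clock[0])
for _ in range(3):
    window.record(success=False)
clock[0] = 30.0  # the slot holding the three failures is no longer live
print(window.totals())
```
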
| 100 | + |
| 101 | +All bucket reads/writes happen inside the same synchronous critical section the |
| 102 | +breaker already uses (async: lock-free under one event loop; sync: |
| 103 | +`threading.Lock`), and `_now()` is read inside that section. |
| 104 | + |
| 105 | +### 3. State-machine integration — mode changes only the CLOSED trip test |
| 106 | + |
| 107 | +The trip mode affects exactly one decision: when to open from CLOSED. Everything |
| 108 | +else is shared. |
| 109 | + |
| 110 | +- **CLOSED, rate mode:** `on_success` and `on_failure` record the outcome into |
| 111 | + the window (a counted failure increments failures; a success increments |
| 112 | + successes). After recording, if `total >= minimum_calls` **and** |
| 113 | + `rate >= failure_rate_threshold`, open the circuit. The classic consecutive |
| 114 | + counters are not used in rate mode. |
| 115 | +- **CLOSED, classic mode:** unchanged — consecutive-failure counter, open at |
| 116 | + `failure_threshold`. |
| 117 | +- **OPEN → HALF_OPEN → CLOSED:** identical for both modes — lazy probe after |
| 118 | + `reset_timeout`, `success_threshold` consecutive probe successes close it, one |
| 119 | + probe failure re-opens. On transition to CLOSED, the window is cleared (all |
| 120 | + buckets reset) so recovery starts from a clean slate. |
| 121 | +- **`release_probe` and non-counted exceptions** never touch the window — |
| 122 | + consistent with today (programming errors can't trip the breaker). |
| 123 | + |
| 124 | +This logic lives entirely in the shared `_CircuitBreakerState`, so |
| 125 | +`AsyncCircuitBreaker` and `CircuitBreaker` reach parity with no per-wrapper code |
| 126 | +(the wrappers' `__init__` just forward the three new params). |
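
The CLOSED-state trip test in rate mode reduces to a two-part predicate. A minimal sketch, assuming `total` and `failures` come from the live window; the function name `should_open` is hypothetical.

```python
def should_open(
    total: int,
    failures: int,
    minimum_calls: int,
    failure_rate_threshold: float,
) -> bool:
    """Rate-mode trip test: volume floor first, then rate against the threshold."""
    if total == 0 or total < minimum_calls:
        # Below the volume floor the rate is not evaluated at all.
        return False
    return failures / total >= failure_rate_threshold


# A 100%-failure burst below the floor does not open the circuit.
print(should_open(total=19, failures=19, minimum_calls=20, failure_rate_threshold=0.5))
# Exactly at the floor and exactly at the threshold does.
print(should_open(total=20, failures=10, minimum_calls=20, failure_rate_threshold=0.5))
```

Because the comparison is `>=`, a rate that exactly meets the threshold opens the circuit.
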
| 127 | + |
| 128 | +### 4. Observability |
| 129 | + |
| 130 | +Event names are unchanged (`circuit.opened`, `circuit.rejected`, |
| 131 | +`circuit.half_open`, `circuit.closed`) — the stable observability surface is |
| 132 | +preserved. In rate mode, `circuit.opened` carries rate attributes instead of the |
| 133 | +classic ones: `failure_rate`, `failure_rate_threshold`, `window_seconds`, |
| 134 | +`observed_calls` (the `total`). Classic mode keeps emitting `failure_threshold` + |
| 135 | +`failures`. `circuit.rejected`/`half_open`/`closed` are unchanged. |
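
The mode-dependent `circuit.opened` payload might be assembled along these lines. The attribute names are taken from the paragraph above; the helper `opened_attributes` and its signature are hypothetical, and how the event is actually emitted is not shown here.

```python
from typing import Any, Dict, Optional


def opened_attributes(
    *,
    failure_rate_threshold: Optional[float],
    window_seconds: float,
    total: int,
    failures: int,
    failure_threshold: int,
) -> Dict[str, Any]:
    """Build the attribute dict for a circuit.opened event (illustrative only)."""
    if failure_rate_threshold is not None:
        # Rate mode: total >= minimum_calls >= 1 whenever the circuit opens,
        # so the division is safe at this point.
        return {
            "failure_rate": failures / total,
            "failure_rate_threshold": failure_rate_threshold,
            "window_seconds": window_seconds,
            "observed_calls": total,
        }
    # Classic mode: `failures` is the consecutive-failure count.
    return {"failure_threshold": failure_threshold, "failures": failures}
```
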
| 136 | + |
| 137 | +## Testing |
| 138 | + |
| 139 | +Deterministic tests with a pinned `_now` callable (the existing constructor |
| 140 | +already accepts `_now`), sync + async mirrors: |
| 141 | + |
| 142 | +- **Trips at threshold:** with `minimum_calls` met and `rate >= |
| 143 | + failure_rate_threshold`, the circuit opens; an alternating 50% pattern that |
| 144 | + never trips the classic breaker DOES trip rate mode. |
| 145 | +- **Volume floor:** below `minimum_calls`, a 100%-failure burst does NOT open. |
| 146 | +- **Time eviction:** failures recorded, then `_now` advanced past |
| 147 | + `window_seconds`, then fresh successes — old failures age out and the rate |
| 148 | + reflects only the live window. |
| 149 | +- **Classic unchanged:** existing breaker tests stay green (no behavior drift |
| 150 | + when `failure_rate_threshold is None`). |
| 151 | +- **Half-open in rate mode:** open → probe after `reset_timeout` → |
| 152 | + `success_threshold` successes close → window cleared (a subsequent single |
| 153 | + failure doesn't immediately re-trip). |
| 154 | +- **Validation:** out-of-range `failure_rate_threshold`, non-positive |
| 155 | + `window_seconds`, `minimum_calls < 1` raise `ValueError`. |
| 156 | +- **Hypothesis prop** (`test_circuit_breaker_props.py` companion) for the |
| 157 | + rolling-window recorder: arbitrary interleavings of outcomes and time advances |
| 158 | + never miscount the live-window totals or evict live data. |
| 159 | + |
| 160 | +`just test` green; `just lint` clean. |
| 161 | + |
| 162 | +## Risk |
| 163 | + |
| 164 | +- **Window-eviction correctness (medium × high).** Off-by-one in slot tagging or |
| 165 | + the modulo ring could count stale data or drop live data. Mitigated by the |
| 166 | + Hypothesis prop on the recorder and explicit time-advance tests; the standard |
| 167 | + slot-tag-and-retag pattern is well understood. |
| 168 | +- **Concurrency (low × high).** Recording stays a synchronous mutation, so the |
| 169 | + async lock-free atomicity invariant and the sync `threading.Lock` both still |
| 170 | + hold — no new await points. Eviction reads `_now()` inside the critical |
| 171 | + section. This matches the `deferred.md` concurrency note. |
| 172 | +- **Config confusion (low × low).** `failure_threshold` being ignored in rate |
| 173 | + mode could surprise callers; mitigated by docstring + `architecture/resilience.md` |
| 174 | + wording. |
| 175 | + |
| 176 | +## Out of scope |
| 177 | + |
| 178 | +Count-based windows; slow-call axis; manual control + `state`; rate-based |
| 179 | +half-open; any change to classic-mode behavior, `AsyncTimeout`, or the |
| 180 | +composition-order recommendation. |