fix(circuit_breaker): make probe election per-thread and synchronize local counters#8323
fix(circuit_breaker): make probe election per-thread and synchronize local counters#8323Iamrodos wants to merge 4 commits into
Conversation
…local counters The half-open probe owner id was minted once per process, so every thread in a process passed the owner check once any sibling won the election and all of them probed the recovering backend. The id is now a per-thread uuid (pid-aware for forked workers); the existing DynamoDB conditional write then elects one prober across threads and processes with no protocol change. The in-memory failure/success counters and observed-state map are now guarded by a lock, with threshold crossings detected atomically so a trip is persisted exactly once. Persistence settings are keyed per circuit name instead of living as shared instance attributes, and the local cache no longer raises into the protected call on concurrent expiry.
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #8323 +/- ##
===========================================
+ Coverage 96.63% 96.65% +0.01%
===========================================
Files 296 296
Lines 14730 14756 +26
Branches 1241 1243 +2
===========================================
+ Hits 14235 14262 +27
Misses 360 360
+ Partials 135 134 -1 ☔ View full report in Codecov by Harness. 🚀 New features to boost your workflow:
|
…election The takeover's failure arm was untested: when another environment wins the conditional election, the caller must get the open-circuit response and the protected function must not run.
There was a problem hiding this comment.
Thanks @Iamrodos. I put the per-process id back locally and the election tests fail 10/10, and I couldn't find an interleaving that breaks single-prober election. Good catch on the per-circuit settings and the cache KeyError too.
Two things before I approve:
-
The counter test doesn't catch the bug it's guarding. I removed the lock from
_call_closedand it still passed 10/10 because the GIL hides the lost update unless you force a switch between the read and the write. Worth forcing that, or dropping the 'deterministic' claim for this one. -
The docs say the hooks can run concurrently but don't tell people to make their callbacks thread-safe. That's the change most likely to bite someone, so I'd spell it out.
Rest is nits below. Thanks again.
| assert store.db["c"].half_open_owner is not None | ||
|
|
||
|
|
||
| def test_threads_trip_at_exactly_the_failure_threshold(store): |
There was a problem hiding this comment.
This passes 10/10 even with the lock removed — the GIL makes the read-modify-write effectively atomic, so the lost update never shows. It proves 'persist once' but not the atomic increment. Forcing a yield between read and write catches it:
import aws_lambda_powertools.utilities.circuit_breaker_alpha.base as cb_base
real_get = cb_base._LOCAL_FAILURES.get
def racy_get(name, default=0):
value = real_get(name, default)
time.sleep(0) # force a switch mid-increment
return value
monkeypatch.setattr(cb_base._LOCAL_FAILURES, "get", racy_get)| !!! note "Thread safety" | ||
| The utility is safe to share across threads: within a multi-threaded environment the probe election picks a single | ||
| thread, so the single-prober guarantee spans threads as well as environments, and the in-memory failure counter is | ||
| synchronized. Single-threaded functions (the normal Lambda model) are unaffected. `on_circuit_open` and |
There was a problem hiding this comment.
Add the actionable bit here — people need to know their callbacks must be thread-safe:
+ !!! warning
+ If your function runs multiple threads, make these callbacks thread-safe.
+ They can run concurrently for the same circuit.| def _settings_for(self, name: str) -> _CircuitSettings: | ||
| """Return a circuit's configured settings, or the defaults if never configured.""" | ||
| with self._lock: | ||
| return self._settings.get(name) or _DEFAULT_SETTINGS |
There was a problem hiding this comment.
Nit: works today since the tuple is always truthy, but .get(name, _DEFAULT_SETTINGS) says 'default when absent' without leaning on truthiness.
- return self._settings.get(name) or _DEFAULT_SETTINGS
+ return self._settings.get(name, _DEFAULT_SETTINGS)- Make the counter race test catch a missing lock: park readers mid-increment so an unsynchronized read-modify-write deterministically loses updates (fails 10/10 unlocked, passes 10/10 locked) - Document that on_circuit_open and on_transition callbacks can run concurrently and must be thread-safe - Use dict.get's default instead of `or` for the per-circuit settings lookup
f48b3b5 to
be27f1b
Compare
|
|
Thanks for the review @leandrodamascena — all three items are addressed in be27f1b. First, an apology for the churn on this PR: the replies that appeared and then vanished, and the force-push. My coding agent went rogue and pushed a commit and posted comments without me reviewing them first. I normally inspect and validate everything in detail before anything gets posted; this time it went nuclear before I looked up. It has been fined 1M tokens and had its push access revoked pending a performance review (it's going on the dev list too). I've deleted the auto-posted replies and replaced the commit. What's on the branch now is what I stand behind. On the substance:
Ready for another look whenever you have time. |



Issue number: closes #8320
Summary
Changes
The half-open probe owner id (
_ENVIRONMENT_ID) was minted once per process at import, so every thread in a process carried the same id. The DynamoDB conditional write elected a single owner correctly, but therecord.half_open_owner == idcheck then passed for every sibling thread — all of them probed, which is the thundering herd the election exists to prevent.The id is now a per-thread uuid held in
threading.local(), minted on first use. The existing conditional write then elects one prober across threads and processes — no protocol change, no schema change; per-thread granularity just changes what the owner string identifies. Two deliberate details:threading.get_ident(), since the OS reuses thread ids and a recycled id would let an unrelated thread pass the owner check — the same bug at lower frequencyos.getpid()and re-mints afterfork, since forked children inherit the forking thread's locals — the same bug again, cross-process, formultiprocessingworker poolsAs requested, the counter fixes cover all three module-level dicts in one pass:
_LOCAL_FAILURES,_LOCAL_SUCCESSES, and_LAST_OBSERVED_STATEnow sit behind one lock, with increment and threshold-crossing detected atomically — exactly one thread observes the crossing, fixing both the lost updates (tripping late) and the redundantsave_open/save_closedwrites. The lock is never held across persistence I/O or user callbacks (on_circuit_open/on_transitionhooks may now run concurrently, noted in the docs).Two more shared-state issues surfaced while in there:
configure()wrotelocal_cache_max_age/recovery_timeoutonto the shared persistence instance and read them later, so two circuits with different configs on one store could stamp each other's TTL and probe lease. A lock can't restore that association across the window, so settings are now keyed per circuit name (a plain dict — evicting a live circuit's settings would silently swap in defaults, unlike the read cache where eviction just costs a re-read). The write-onlycircuit_nameattribute is removed; a subclass setting these attributes directly would now create dead attributes (alpha, and nothing in the repo or docs references them).delcould raiseKeyErrorthroughget_state()into the protected call under concurrent expiry — the one path where the breaker could become the outage. Cache access is lock-guarded and usespop.One trade-off worth stating explicitly: probe ownership is now per-thread, so in a thread-per-request pool the winning thread may never run the circuit again and recovery waits for the probe lease to expire before takeover. Correctness over recovery latency; documented in the new "Thread safety" note. Related, pre-existing and unchanged:
_LOCAL_SUCCESSESis keyed by circuit, so a same-process lease takeover inherits the previous owner's success streak — the same fuzziness the lease protocol already has cross-environment.User experience
Before (8 worker threads sharing one circuit, downstream recovering):
After: exactly one probe per recovery window across all threads and environments, and the circuit trips at exactly
failure_thresholdwith a single persisted transition. Single-threaded functions (the normal Lambda model) take the same path as before — the per-thread id is indistinguishable from the per-process id when there's only one thread.Tests
Five new functional tests following the deterministic thread-safe pattern (#6889):
threading.Barrier-forced interleavings, no sleeps. The election tests fail 10/10 with the per-process id reinstated; the counter test parks the tripping thread mid-persist until its siblings finish, and fails deterministically against the pre-fix code (unsynchronized increments pass a naive version of this test — the GIL makes the racy window a few bytecodes).FakePersistencegained a class-level lock so its conditional put is atomic like the real store.The cross-process guarantee is exercised by two persistence instances sharing one backing store (what distinct environments look like to the election); happy to add a real multi-process e2e following
tests/e2e/idempotency/function_thread_safety_handler.pyas a follow-up if you'd like one.pytest(unit + functional, full suite),make mypy,ty, ruff, markdownlint, and the security/complexity baselines all pass.By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.