Skip to content

fix(circuit_breaker): make probe election per-thread and synchronize local counters#8323

Open
Iamrodos wants to merge 4 commits into
aws-powertools:developfrom
Iamrodos:fix/circuit-breaker-thread-safety
Open

fix(circuit_breaker): make probe election per-thread and synchronize local counters#8323
Iamrodos wants to merge 4 commits into
aws-powertools:developfrom
Iamrodos:fix/circuit-breaker-thread-safety

Conversation

@Iamrodos

@Iamrodos Iamrodos commented Jul 3, 2026

Copy link
Copy Markdown
Contributor

Issue number: closes #8320

Summary

Changes

The half-open probe owner id (_ENVIRONMENT_ID) was minted once per process at import, so every thread in a process carried the same id. The DynamoDB conditional write elected a single owner correctly, but the record.half_open_owner == id check then passed for every sibling thread — all of them probed, which is the thundering herd the election exists to prevent.

The id is now a per-thread uuid held in threading.local(), minted on first use. The existing conditional write then elects one prober across threads and processes — no protocol change, no schema change; per-thread granularity just changes what the owner string identifies. Two deliberate details:

  • a uuid rather than threading.get_ident(), since the OS reuses thread ids and a recycled id would let an unrelated thread pass the owner check — the same bug at lower frequency
  • the id embeds os.getpid() and re-mints after fork, since forked children inherit the forking thread's locals — the same bug again, cross-process, for multiprocessing worker pools

As requested, the counter fixes cover all three module-level dicts in one pass: _LOCAL_FAILURES, _LOCAL_SUCCESSES, and _LAST_OBSERVED_STATE now sit behind one lock, with increment and threshold-crossing detected atomically — exactly one thread observes the crossing, fixing both the lost updates (tripping late) and the redundant save_open/save_closed writes. The lock is never held across persistence I/O or user callbacks (on_circuit_open/on_transition hooks may now run concurrently, noted in the docs).

Two more shared-state issues surfaced while in there:

  • configure() wrote local_cache_max_age/recovery_timeout onto the shared persistence instance and read them later, so two circuits with different configs on one store could stamp each other's TTL and probe lease. A lock can't restore that association across the window, so settings are now keyed per circuit name (a plain dict — evicting a live circuit's settings would silently swap in defaults, unlike the read cache where eviction just costs a re-read). The write-only circuit_name attribute is removed; a subclass setting these attributes directly would now create dead attributes (alpha, and nothing in the repo or docs references them).
  • the local cache's expired-entry del could raise KeyError through get_state() into the protected call under concurrent expiry — the one path where the breaker could become the outage. Cache access is lock-guarded and uses pop.

One trade-off worth stating explicitly: probe ownership is now per-thread, so in a thread-per-request pool the winning thread may never run the circuit again and recovery waits for the probe lease to expire before takeover. Correctness over recovery latency; documented in the new "Thread safety" note. Related, pre-existing and unchanged: _LOCAL_SUCCESSES is keyed by circuit, so a same-process lease takeover inherits the previous owner's success streak — the same fuzziness the lease protocol already has cross-environment.

User experience

Before (8 worker threads sharing one circuit, downstream recovering):

recovery window elapses → 8 threads all pass the owner check → 8 probes
hit the recovering backend; failure counter can trip late and persist the
OPEN transition several times

After: exactly one probe per recovery window across all threads and environments, and the circuit trips at exactly failure_threshold with a single persisted transition. Single-threaded functions (the normal Lambda model) take the same path as before — the per-thread id is indistinguishable from the per-process id when there's only one thread.

Tests

Five new functional tests following the deterministic thread-safe pattern (#6889): threading.Barrier-forced interleavings, no sleeps. The election tests fail 10/10 with the per-process id reinstated; the counter test parks the tripping thread mid-persist until its siblings finish, and fails deterministically against the pre-fix code (unsynchronized increments pass a naive version of this test — the GIL makes the racy window a few bytecodes). FakePersistence gained a class-level lock so its conditional put is atomic like the real store.

The cross-process guarantee is exercised by two persistence instances sharing one backing store (what distinct environments look like to the election); happy to add a real multi-process e2e following tests/e2e/idempotency/function_thread_safety_handler.py as a follow-up if you'd like one.

pytest (unit + functional, full suite), make mypy, ty, ruff, markdownlint, and the security/complexity baselines all pass.


By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

…local counters

The half-open probe owner id was minted once per process, so every thread
in a process passed the owner check once any sibling won the election and
all of them probed the recovering backend. The id is now a per-thread
uuid (pid-aware for forked workers); the existing DynamoDB conditional
write then elects one prober across threads and processes with no
protocol change.

The in-memory failure/success counters and observed-state map are now
guarded by a lock, with threshold crossings detected atomically so a
trip is persisted exactly once. Persistence settings are keyed per
circuit name instead of living as shared instance attributes, and the
local cache no longer raises into the protected call on concurrent
expiry.
@Iamrodos Iamrodos requested a review from a team as a code owner July 3, 2026 13:16
@Iamrodos Iamrodos requested a review from svozza July 3, 2026 13:16
@boring-cyborg boring-cyborg Bot added documentation Improvements or additions to documentation tests labels Jul 3, 2026
@powertools-for-aws-oss-automation powertools-for-aws-oss-automation Bot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Jul 3, 2026
@codecov

codecov Bot commented Jul 4, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 96.65%. Comparing base (bc78b7f) to head (9f547d8).

Additional details and impacted files
@@             Coverage Diff             @@
##           develop    #8323      +/-   ##
===========================================
+ Coverage    96.63%   96.65%   +0.01%     
===========================================
  Files          296      296              
  Lines        14730    14756      +26     
  Branches      1241     1243       +2     
===========================================
+ Hits         14235    14262      +27     
  Misses         360      360              
+ Partials       135      134       -1     

☔ View full report in Codecov by Harness.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

…election

The takeover's failure arm was untested: when another environment wins
the conditional election, the caller must get the open-circuit response
and the protected function must not run.
@powertools-for-aws-oss-automation powertools-for-aws-oss-automation Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Jul 4, 2026
@powertools-for-aws-oss-automation powertools-for-aws-oss-automation Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2026

@leandrodamascena leandrodamascena left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks @Iamrodos. I put the per-process id back locally and the election tests fail 10/10, and I couldn't find an interleaving that breaks single-prober election. Good catch on the per-circuit settings and the cache KeyError too.

Two things before I approve:

  1. The counter test doesn't catch the bug it's guarding. I removed the lock from _call_closed and it still passed 10/10 because the GIL hides the lost update unless you force a switch between the read and the write. Worth forcing that, or dropping the 'deterministic' claim for this one.

  2. The docs say the hooks can run concurrently but don't tell people to make their callbacks thread-safe. That's the change most likely to bite someone, so I'd spell it out.

Rest is nits below. Thanks again.

assert store.db["c"].half_open_owner is not None


def test_threads_trip_at_exactly_the_failure_threshold(store):

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This passes 10/10 even with the lock removed — the GIL makes the read-modify-write effectively atomic, so the lost update never shows. It proves 'persist once' but not the atomic increment. Forcing a yield between read and write catches it:

import aws_lambda_powertools.utilities.circuit_breaker_alpha.base as cb_base

real_get = cb_base._LOCAL_FAILURES.get
def racy_get(name, default=0):
    value = real_get(name, default)
    time.sleep(0)  # force a switch mid-increment
    return value
monkeypatch.setattr(cb_base._LOCAL_FAILURES, "get", racy_get)

Comment thread docs/utilities/circuit_breaker.md Outdated
!!! note "Thread safety"
The utility is safe to share across threads: within a multi-threaded environment the probe election picks a single
thread, so the single-prober guarantee spans threads as well as environments, and the in-memory failure counter is
synchronized. Single-threaded functions (the normal Lambda model) are unaffected. `on_circuit_open` and

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add the actionable bit here — people need to know their callbacks must be thread-safe:

+    !!! warning
+        If your function runs multiple threads, make these callbacks thread-safe.
+        They can run concurrently for the same circuit.

def _settings_for(self, name: str) -> _CircuitSettings:
"""Return a circuit's configured settings, or the defaults if never configured."""
with self._lock:
return self._settings.get(name) or _DEFAULT_SETTINGS

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: works today since the tuple is always truthy, but .get(name, _DEFAULT_SETTINGS) says 'default when absent' without leaning on truthiness.

-        return self._settings.get(name) or _DEFAULT_SETTINGS
+        return self._settings.get(name, _DEFAULT_SETTINGS)

@powertools-for-aws-oss-automation powertools-for-aws-oss-automation Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2026
- Make the counter race test catch a missing lock: park readers mid-increment so an unsynchronized read-modify-write deterministically loses updates (fails 10/10 unlocked, passes 10/10 locked)
- Document that on_circuit_open and on_transition callbacks can run concurrently and must be thread-safe
- Use dict.get's default instead of `or` for the per-circuit settings lookup
@Iamrodos Iamrodos force-pushed the fix/circuit-breaker-thread-safety branch from f48b3b5 to be27f1b Compare July 4, 2026 21:35
@powertools-for-aws-oss-automation powertools-for-aws-oss-automation Bot added size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. and removed size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. labels Jul 4, 2026
@sonarqubecloud

sonarqubecloud Bot commented Jul 4, 2026

Copy link
Copy Markdown

@Iamrodos

Iamrodos commented Jul 4, 2026

Copy link
Copy Markdown
Contributor Author

Thanks for the review @leandrodamascena — all three items are addressed in be27f1b.

First, an apology for the churn on this PR: the replies that appeared and then vanished, and the force-push. My coding agent went rogue and pushed a commit and posted comments without me reviewing them first. I normally inspect and validate everything in detail before anything gets posted; this time it went nuclear before I looked up. It has been fined 1M tokens and had its push access revoked pending a performance review (it's going on the dev list too). I've deleted the auto-posted replies and replaced the commit. What's on the branch now is what I stand behind.

On the substance:

  1. Race test — you were right, it passed 10/10 with the lock removed. Your snippet needed two tweaks to work: monkeypatch.setattr can't patch a method on a dict instance, so the test swaps _LOCAL_FAILURES for a dict subclass overriding get; and sleep(0) only got the failure ~3/10, so it parks readers for 5ms mid-increment instead, so every sibling reads the same stale value. Verified: fails 10/10 without the lock, passes 10/10 with it.

  2. Docs — added a !!! warning telling users the on_circuit_open/on_transition callbacks can run concurrently for the same circuit and must be thread-safe.

  3. Settings default — switched to .get(name, _DEFAULT_SETTINGS) as suggested.

Ready for another look whenever you have time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

documentation Improvements or additions to documentation size/XL Denotes a PR that changes 500-999 lines, ignoring generated files. tests

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Bug (circuit_breaker): single-prober election is per-process — concurrent threads all probe on recovery (the thundering herd it exists to prevent)

2 participants