Summary
Add a circuit breaker mechanism to Fedify's activity delivery pipeline. When a remote server repeatedly fails to receive activities, Fedify should stop hammering it with retries and instead hold outbound activities until the server shows signs of recovery.
Problem
Fedify's current retry logic treats every delivery failure the same way: it retries with exponential backoff until the retry limit is exhausted, regardless of whether the remote server is temporarily slow or has been unreachable for weeks. This has a few consequences:
- Worker queues accumulate a long tail of hopeless retry attempts to dead instances, consuming resources and obscuring genuinely actionable failures.
- A server with many followers on a struggling instance generates a disproportionate amount of noise.
- There is no way to distinguish “this delivery failed once” from “this server has not responded in two weeks.”
Proposed solution
Circuit breaker states
The circuit breaker for each remote host transitions between three states:
- Closed — normal operation; deliveries proceed as usual.
- Open — the host is considered unreachable; deliveries are held rather than attempted.
- Half-open — a recovery probe is sent; if it succeeds, the circuit closes, otherwise it reopens.
Default transition conditions
| Transition |
Condition |
| Closed → Open |
5 consecutive failures within a 10-minute window |
| Open → Half-open |
30 minutes have elapsed since the circuit opened |
| Half-open → Closed |
Probe delivery succeeds |
| Half-open → Open |
Probe delivery fails |
These defaults are intentionally conservative. A short-lived outage should not trip the circuit.
Configuration
All parameters are overridable via createFederation():
const federation = createFederation<void>({
kv: ...,
queue: ...,
circuitBreaker: {
failureThreshold: 5,
failureWindow: Temporal.Duration.from({ minutes: 10 }),
recoveryDelay: Temporal.Duration.from({ minutes: 30 }),
},
});
Passing circuitBreaker: false disables the feature entirely for users who prefer to manage retry behavior themselves.
Activity handling when the circuit is open
Activities destined for an open-circuit host are not discarded. Instead, they are requeued with a deferred delivery time corresponding to the next half-open probe window. If the circuit remains open past a configurable TTL (default: 7 days), held activities are dropped and the permanent failure handler is invoked, consistent with how exhausted retries are handled today.
State persistence
Circuit breaker state is stored per remote host in the KvStore. This has two benefits:
- State survives process restarts. Without this, a server restart would reset all open circuits, immediately resuming delivery attempts against hosts that were already known to be unreachable.
- In deployments with multiple worker nodes, all nodes share the same circuit state, preventing one node from opening a circuit while another continues attempting delivery.
The KV key structure follows the existing Fedify conventions, e.g. ["_fedify", "circuit", "mastodon.social"].
Observability
When the OpenTelemetry metrics support (tracked in #619) is in place, circuit breaker state transitions will emit span events on the active outbox span:
activitypub.circuit_breaker.open — with attributes activitypub.remote.host and activitypub.circuit_breaker.failure_count
activitypub.circuit_breaker.half_open — with attribute activitypub.remote.host
activitypub.circuit_breaker.closed — with attributes activitypub.remote.host and activitypub.circuit_breaker.recovery_duration_ms
A counter metric activitypub.circuit_breaker.state_change with a activitypub.circuit_breaker.state attribute (open, half_open, closed) will also be recorded, making it straightforward to alert on a sudden spike in circuit openings.
onCircuitBreakerStateChange callback
For users not using OpenTelemetry, a callback hook provides an integration point for custom logging or alerting:
const federation = createFederation<void>({
kv: ...,
queue: ...,
circuitBreaker: {
onStateChange(remoteHost, previousState, newState) {
logger.warn(
`Circuit breaker for ${remoteHost}: ${previousState} → ${newState}`
);
},
},
});
Dependency
This feature depends on #619 (OpenTelemetry metrics and span events) for the observability layer, though the core circuit breaker logic can be implemented independently.
Scope
Changes are limited to @fedify/fedify. The KvStore interface requires no modification; the circuit breaker uses the existing API.
Summary
Add a circuit breaker mechanism to Fedify's activity delivery pipeline. When a remote server repeatedly fails to receive activities, Fedify should stop hammering it with retries and instead hold outbound activities until the server shows signs of recovery.
Problem
Fedify's current retry logic treats every delivery failure the same way: it retries with exponential backoff until the retry limit is exhausted, regardless of whether the remote server is temporarily slow or has been unreachable for weeks. This has a few consequences:
Proposed solution
Circuit breaker states
The circuit breaker for each remote host transitions between three states:
Default transition conditions
These defaults are intentionally conservative. A short-lived outage should not trip the circuit.
Configuration
All parameters are overridable via
createFederation():Passing
circuitBreaker: falsedisables the feature entirely for users who prefer to manage retry behavior themselves.Activity handling when the circuit is open
Activities destined for an open-circuit host are not discarded. Instead, they are requeued with a deferred delivery time corresponding to the next half-open probe window. If the circuit remains open past a configurable TTL (default: 7 days), held activities are dropped and the permanent failure handler is invoked, consistent with how exhausted retries are handled today.
State persistence
Circuit breaker state is stored per remote host in the
KvStore. This has two benefits:The KV key structure follows the existing Fedify conventions, e.g.
["_fedify", "circuit", "mastodon.social"].Observability
When the OpenTelemetry metrics support (tracked in #619) is in place, circuit breaker state transitions will emit span events on the active outbox span:
activitypub.circuit_breaker.open— with attributesactivitypub.remote.hostandactivitypub.circuit_breaker.failure_countactivitypub.circuit_breaker.half_open— with attributeactivitypub.remote.hostactivitypub.circuit_breaker.closed— with attributesactivitypub.remote.hostandactivitypub.circuit_breaker.recovery_duration_msA counter metric
activitypub.circuit_breaker.state_changewith aactivitypub.circuit_breaker.stateattribute (open,half_open,closed) will also be recorded, making it straightforward to alert on a sudden spike in circuit openings.onCircuitBreakerStateChangecallbackFor users not using OpenTelemetry, a callback hook provides an integration point for custom logging or alerting:
Dependency
This feature depends on #619 (OpenTelemetry metrics and span events) for the observability layer, though the core circuit breaker logic can be implemented independently.
Scope
Changes are limited to
@fedify/fedify. TheKvStoreinterface requires no modification; the circuit breaker uses the existing API.