Circuit Breaker

A Circuit Breaker blocks calls to a failing dependency so that the caller’s resources are not tied up waiting on it. When failed calls pile up on the caller’s threads and connections, they degrade the caller itself, and one dependency’s outage cascades along the call chain.

The trip criterion and the recovery method look like separate decisions, but when the two sides do not align, the system oscillates between trip and recovery. A precise trip combined with a simplistic recovery is the canonical example of the cycling anti-pattern.

State Model

A Circuit Breaker has three states.

Closed: Normal operation. Calls pass through.
Open: Tripped. Calls fail immediately (fail-fast).
Half-Open: Recovery probe. Only limited calls pass through to test the dependency.

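Transition triggers sit at three points: Closed → Open (trip criterion), Open → Half-Open (recovery probe criterion), and Half-Open → Closed/Open (recovery verification criterion). Most breaker libraries follow this model; proxy-level mechanisms such as Envoy’s outlier detection approximate it with host ejection and timed re-admission instead of an explicit Half-Open state.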
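The three states and their transition points can be written down as a small table. This is a minimal sketch rather than any library's implementation; the names State, TRANSITIONS, and transition are made up for illustration.

```python
from enum import Enum


class State(Enum):
    CLOSED = "closed"        # normal operation, calls pass through
    OPEN = "open"            # tripped, calls fail immediately
    HALF_OPEN = "half_open"  # recovery probe, limited calls pass through


# Allowed transitions, keyed by the current state (hypothetical table).
TRANSITIONS = {
    State.CLOSED: {State.OPEN},                   # trip criterion
    State.OPEN: {State.HALF_OPEN},                # recovery probe criterion
    State.HALF_OPEN: {State.CLOSED, State.OPEN},  # recovery verification criterion
}


def transition(current: State, target: State) -> State:
    """Return the target state, or raise if the model does not allow the move."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.name} -> {target.name}")
    return target
```

Note that Open never moves straight to Closed in this table: every recovery passes through Half-Open, even if the verification there is trivial.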

Trip Triggers

Three criteria commonly drive the Closed → Open transition.

Failure rate based: Trips when the failure rate in a sliding window exceeds the threshold. It fits stable traffic where statistical judgment carries weight. The window has to be large enough not to swing with noise.

Latency or slow-call based: Trips on the ratio of calls whose response time exceeds the threshold. It addresses the case where the dependency is alive but slow. Responses still come back, so the failure rate stays low, but each call holds the caller’s threads and connections longer, and the caller runs out of resources just as it would under outright failures. It fits environments where fail-fast matters.

Count based: Trips when consecutive failures exceed the threshold. It is the simplest and reacts the fastest. It is the first candidate when traffic is low and statistical judgment is hard, or when the failure event itself is a clearer signal than response time.
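The three triggers differ mainly in what they count. The sketch below is illustrative only: the class names, the min_calls guard, and the choice to trip once the streak reaches the limit are assumptions, not taken from a specific library.

```python
from collections import deque


class FailureRateTrigger:
    """Trips when the failure rate over the last `window` calls exceeds `threshold`."""

    def __init__(self, window: int, threshold: float, min_calls: int):
        self.outcomes = deque(maxlen=window)  # True = counted against the ratio
        self.threshold = threshold
        self.min_calls = min_calls  # assumed guard: do not judge on too few samples

    def record(self, failed: bool) -> bool:
        self.outcomes.append(failed)
        if len(self.outcomes) < self.min_calls:
            return False
        return sum(self.outcomes) / len(self.outcomes) > self.threshold


class SlowCallTrigger(FailureRateTrigger):
    """Same window logic, but a call counts against the ratio when it is slow."""

    def __init__(self, window: int, threshold: float, min_calls: int,
                 slow_after_s: float):
        super().__init__(window, threshold, min_calls)
        self.slow_after_s = slow_after_s

    def record_duration(self, duration_s: float) -> bool:
        return self.record(duration_s > self.slow_after_s)


class ConsecutiveFailureTrigger:
    """Trips once `limit` failures occur in a row; any success resets the streak."""

    def __init__(self, limit: int):
        self.limit = limit
        self.streak = 0

    def record(self, failed: bool) -> bool:
        self.streak = self.streak + 1 if failed else 0
        return self.streak >= self.limit
```

The count based trigger needs no window and no minimum sample size, which is why it reacts fastest and suits low traffic.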

The trigger is decided by the shape of the failure signal, but the protection that tripping was meant to deliver only becomes complete when the recovery strategy is designed alongside it.

Recovery Strategies

Two criteria commonly drive the Open → Half-Open → Closed transition.

Timeout based: Transitions to Half-Open automatically after a set time. It attempts recovery by looking at time alone, without checking how the dependency actually responds. It fits dependencies with a self-recovery pattern (temporary GC pressure, short network drops, and the like).

Gradual pass rate: Lets a subset of calls through in Half-Open and watches the success rate. Above the threshold the breaker returns to Closed; below it, the breaker returns to Open. This verifies the recovery instead of assuming it. It is more accurate but more complex to implement, and it places a small probe load on the dependency.

Pick Timeout for simplicity and gradual for accuracy. The two strategies are not mutually exclusive: a hybrid that enters Half-Open on a timeout and then uses the pass rate to decide Closed/Open is common in practice. Which one fits is decided together with the trip trigger, as a pair.
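The hybrid can be sketched as follows, covering only the recovery side. The class name, the parameters, and the injected `now` timestamps are hypothetical, and the sketch assumes calls are sequential, so each allowed probe is recorded before the next one is admitted.

```python
class HybridRecovery:
    """Open -> Half-Open on a timeout, then Closed/Open on the probe success rate."""

    def __init__(self, wait_s: float, probes: int, min_success_rate: float):
        self.wait_s = wait_s
        self.probes = probes
        self.min_success_rate = min_success_rate
        self.state = "open"   # the sketch starts in the tripped state
        self.opened_at = 0.0
        self.results = []

    def trip(self, now: float) -> None:
        """Enter Open and forget earlier probe results."""
        self.state, self.opened_at, self.results = "open", now, []

    def allow(self, now: float) -> bool:
        """Decide whether a call may pass at time `now`."""
        if self.state == "open" and now - self.opened_at >= self.wait_s:
            self.state = "half_open"  # timeout elapsed: start probing
        if self.state == "half_open":
            return len(self.results) < self.probes  # only limited calls pass
        return self.state == "closed"

    def record(self, success: bool, now: float) -> None:
        """Record a probe outcome and settle the state once all probes are in."""
        if self.state != "half_open":
            return
        self.results.append(success)
        if len(self.results) == self.probes:
            rate = sum(self.results) / self.probes
            if rate >= self.min_success_rate:
                self.state = "closed"
            else:
                self.trip(now)  # recovery not verified: back to Open
```

A pure Timeout strategy is the degenerate case where the probe budget is tiny and the bar is low, which is why the two strategies sit on one spectrum instead of being alternatives.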

Pair Matrix

The combination of trip trigger and recovery strategy decides a Circuit Breaker’s identity. Some combinations fit; some do not.

Trip Trigger	Recovery Strategy	Fit	Notes
Failure rate	Gradual pass rate	Dominant	Both share a statistical frame
Latency / slow-call	Gradual pass rate	Dominant	Re-verifies whether latency has cleared
Count based	Timeout	Acceptable	Both favor simple, fast reaction
Failure rate	Timeout	Risky	Precise trip ↔ simplistic recovery asymmetry → cycling
Latency / slow-call	Timeout	Risky	Full load resumes without checking whether latency has cleared

Failure rate × Gradual is dominant because both sides judge statistically and stay consistent. With the trip criterion as the failure rate in a window and the recovery criterion as the success rate of Half-Open passes, entry and recovery share the same statistical frame.

Latency × Gradual is also dominant. When the dependency is alive but slow, the recovery point has to verify whether it is still slow. Resuming full load on a timeout alone leaves a high chance of tripping back to Open as soon as the still-slow calls fill the window again. For the same reason, Latency × Timeout also belongs in the risky category.

Count × Timeout is an acceptable combination. Both sides favor simplicity and quick reaction, so they line up with operational simplicity. For dependencies with a self-recovery pattern in an environment that tolerates short cycles, it is enough.

Failure rate × Timeout is risky. The trip side decides statistically and cautiously, while the recovery side resumes full load on time alone, without verification. If the dependency has not recovered, the failure rate crosses the threshold again immediately and the breaker returns to Open. The result is meaningless cycling — the breaker oscillates between Closed and Open while the caller pays the fail-fast cost on each cycle. This cycling shows up when the precision on the trip side and the precision on the recovery side do not match.
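The cycling can be made visible with a toy simulation. Everything here is an assumption for illustration: one call per tick, a dependency that fails every call until the outage ends, a timeout recovery that goes straight back to full load, and a gradual recovery that requires every probe to succeed.

```python
def simulate(recovery: str, outage_ticks: int = 100, total_ticks: int = 140,
             window: int = 10, threshold: float = 0.5,
             wait: int = 5, probes: int = 2) -> dict:
    """One call per tick; the dependency fails every call before `outage_ticks`."""
    state, opened_at = "closed", 0
    outcomes = []        # outcomes since the breaker last closed (True = failure)
    probe_results = []   # outcomes of the current Half-Open probes
    trips = failed_calls = 0

    for tick in range(total_ticks):
        if state == "open":
            if tick - opened_at < wait:
                continue  # fail fast: the dependency is not called
            # Timeout elapsed: either trust time alone or start probing.
            state = "closed" if recovery == "timeout" else "half_open"
            outcomes, probe_results = [], []

        failed = tick < outage_ticks  # this call reaches the dependency
        failed_calls += failed

        if state == "half_open":
            probe_results.append(failed)
            if len(probe_results) == probes:
                if any(probe_results):
                    state, opened_at = "open", tick  # recovery not verified
                else:
                    state, outcomes = "closed", []
            continue

        outcomes.append(failed)
        recent = outcomes[-window:]
        if len(recent) == window and sum(recent) / window > threshold:
            state, opened_at = "open", tick
            trips += 1  # a Closed -> Open trip

    return {"trips": trips, "failed_calls": failed_calls}


print(simulate("timeout"))
print(simulate("gradual"))
```

With these made-up parameters, the timeout run reports 7 Closed → Open trips and 72 failed calls, while the gradual run reports 1 trip and 40 failed calls. The gradual breaker still re-opens during the outage, but it does so from Half-Open after two probes instead of from Closed after a full window of failures.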

Tool Mapping

Real tools have settled on specific combinations of the two.

Tool	Default Trip Trigger	Default Recovery	Why It Settled
Resilience4j (Java)	Failure rate + slow-call	Gradual pass rate	Statistical precision for business-unit protection
Polly v8 (.NET)	Failure rate (FailureRatio)	Timeout (BreakDuration)	.NET resilience standard integration
Istio / Envoy	Count based (consecutive 5xx)	Timeout (ejection time)	The sidecar has no business context

Resilience4j defaults to the failure rate + gradual pair because method-level protection in business logic needs precise triggering and verified recovery together. The actual behavior combines both recovery strategies: waitDurationInOpenState sets how long the breaker stays Open before it moves to Half-Open (Timeout based), and the failure rate across the permittedNumberOfCallsInHalfOpenState probe calls decides Closed/Open (gradual pass rate).

Polly v8 defaults to the failure rate + timeout pair; as the standard resilience library for .NET, it made statistical judgment the default. The FailureRatio threshold combined with MinimumThroughput filters noise before tripping, and BreakDuration sets when the breaker moves to Half-Open afterward. In v7 the basic circuit breaker counted consecutive failures and the ratio-based variant was a separate advanced circuit breaker; v8 kept only the ratio-based form.

Istio / Envoy default to count + timeout because in the sidecar environment, consecutive 5xx responses are the clearest failure signal for an external call. The sidecar lacks business context, so it trips on a simple signal instead of a statistical judgment. Envoy’s outlier detection settings consecutive_5xx and base_ejection_time expose that pair directly.

All three tools have settled on the pair that fits their environment. Choosing a tool becomes a matter of picking the row whose pair fits your own environment, among the ones already settled.

Bulkhead

A Circuit Breaker alone cannot block a cascade. If calls to different dependencies share one resource pool (threads, connections), the calls already stuck on the failing dependency keep holding their slots even after the breaker trips. The resource pool can run out before the trip takes effect.

The Bulkhead pattern solves this by isolating resources per dependency. Allocating an independent thread pool (or semaphore slot) to each dependency means one dependency’s failure does not consume resources used by other dependency calls. Circuit Breaker and Bulkhead are used together as a combined pattern. Applying only one leaves cascade blocking incomplete.
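A semaphore-based bulkhead can be sketched in a few lines. The Bulkhead and BulkheadFull names and the dependency keys are hypothetical; the sketch rejects a call when no slot is free instead of queueing it, which is one possible policy among several.

```python
import threading


class BulkheadFull(Exception):
    """Raised when all of a dependency's slots are in use."""


class Bulkhead:
    """Caps concurrent calls to one dependency with its own semaphore."""

    def __init__(self, max_concurrent: int):
        self._slots = threading.BoundedSemaphore(max_concurrent)

    def call(self, fn, *args, **kwargs):
        # Non-blocking acquire: reject instead of queueing behind a slow dependency.
        if not self._slots.acquire(blocking=False):
            raise BulkheadFull()
        try:
            return fn(*args, **kwargs)
        finally:
            self._slots.release()


# One bulkhead per dependency; the dependency names are placeholders.
bulkheads = {"payments": Bulkhead(2), "search": Bulkhead(2)}
```

Because each dependency owns its semaphore, calls stuck on "payments" can exhaust only the two "payments" slots, and "search" calls keep their own capacity.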

Decision Order

Designing a Circuit Breaker is a paired decision, not a single one. First, pick the trip trigger from the shape of the failure signal (failure rate / latency / count). Then pick the recovery strategy as a pair with that trigger (simplicity → Timeout, accuracy → gradual). Finally, combine the breaker with a Bulkhead to isolate resources per dependency.

A Circuit Breaker’s trip trigger and recovery strategy cannot be decided separately. If the trip side is designed with precision, the recovery side needs matching verification; if the trip is kept simple, a simple recovery cycle is enough. The weight on both sides has to match for a Circuit Breaker to deliver the protection it was designed for.

References

Rate Limiting — The previous post that covers Rate Limit with the same two-decision constraint relationship between protection layer and algorithm.
Ad system outage retrospective — shared dependencies and a single point of failure — A cascade case where Circuit Breaker and Bulkhead were both needed.
