
Circuit Breaker

Preventing cascade failures

The Electrical Breaker Analogy

Your house has circuit breakers:

  • Normal: Electricity flows
  • Overload: Breaker trips, stops flow
  • Prevents: Fire, damaged appliances

Instead of letting damage spread, the breaker cuts off the problem.

The circuit breaker pattern works the same way for software. When a service is failing, stop calling it to prevent cascade failures.


What Problem Does It Solve?

Without Circuit Breaker

Service A → Service B (unhealthy)

Request 1 → Wait... Wait... Timeout
Request 2 → Wait... Wait... Timeout
Request 3 → Wait... Wait... Timeout

Requests pile up waiting on B, exhausting Service A's threads and connections until A itself becomes slow or unresponsive.
Now Service C, which calls A, is stuck too...

Cascade failure across the system!

With Circuit Breaker

Service A → Circuit Breaker → Service B (unhealthy)

Request 1 → Timeout → Failure recorded
Request 2 → Timeout → Failure recorded
Request 3 → Timeout → Circuit opens

Request 4 → Fails fast (no waiting)
Request 5 → Fails fast

Circuit open = Don't even try.
Service A is more likely to stay responsive.

The Three States

    ┌────────┐   failure threshold   ┌────────┐
    │ CLOSED │──────────────────────→│  OPEN  │←────┐
    └────────┘                       └────┬───┘     │
        ▲                                 │         │
        │ success                 timeout │         │
        │ threshold               expires │ failure │
        │                                 ▼         │
        │                        ┌───────────┐      │
        └────────────────────────┤ HALF-OPEN ├──────┘
                                 └───────────┘

CLOSED (Normal Operation)

Requests flow through normally.
Track success and failure counts.

If failure threshold reached → OPEN

OPEN (Failing Fast)

Requests fail immediately.
Don't call the unhealthy service.
Return error or fallback.

After timeout period → HALF-OPEN

HALF-OPEN (Testing Recovery)

Allow limited requests through.
Testing if service recovered.

If success → CLOSED (recovered!)
If failure → OPEN (still unhealthy)
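The state machine above can be sketched in a few dozen lines of Python. This is a minimal, single-threaded illustration (no locking, no cap on concurrent half-open probes), and every name in it is made up for the example:

```python
import time

class CircuitBreaker:
    """Minimal circuit breaker sketch: CLOSED -> OPEN -> HALF-OPEN -> CLOSED."""
    CLOSED, OPEN, HALF_OPEN = "closed", "open", "half-open"

    def __init__(self, failure_threshold=3, recovery_timeout=30.0,
                 success_threshold=2, clock=time.monotonic):
        self.failure_threshold = failure_threshold  # failures before opening
        self.recovery_timeout = recovery_timeout    # seconds to stay open
        self.success_threshold = success_threshold  # successes needed to close
        self.clock = clock                          # injectable for testing
        self.state = self.CLOSED
        self.failures = 0
        self.successes = 0
        self.opened_at = None

    def call(self, fn, *args, **kwargs):
        if self.state == self.OPEN:
            if self.clock() - self.opened_at >= self.recovery_timeout:
                self.state = self.HALF_OPEN  # timeout expired: test recovery
                self.successes = 0
            else:
                raise RuntimeError("circuit open: failing fast")
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == self.HALF_OPEN:
            self._open()  # still unhealthy: back to OPEN
        else:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self._open()

    def _on_success(self):
        if self.state == self.HALF_OPEN:
            self.successes += 1
            if self.successes >= self.success_threshold:
                self.state = self.CLOSED  # recovered
                self.failures = 0
        else:
            self.failures = 0

    def _open(self):
        self.state = self.OPEN
        self.opened_at = self.clock()
```

The injectable clock makes the timeout transition testable without real waiting; a production breaker would also need thread safety and a limit on how many probe requests flow through in half-open.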

Configuration Options

Failure Threshold

How many failures before opening?

Low (3 failures) → Quick to protect, but sensitive
High (10 failures) → More tolerant, slower protection

Tune based on your service's normal error rate.

Timeout Duration

How long before testing recovery?

Short cooldown → Quick recovery detection
Long cooldown → Less pressure on recovering service

Balance between recovery speed and load.

Success Threshold

Half-open: How many successes to close?

Fewer successes → Fast, but riskier
More successes → More confidence the service is healthy
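The three knobs above can be grouped into one config object. A minimal sketch; the names and default values here are illustrative, not recommendations:

```python
from dataclasses import dataclass

@dataclass
class BreakerConfig:
    failure_threshold: int = 5         # CLOSED -> OPEN after this many failures
    recovery_timeout_s: float = 30.0   # time spent OPEN before trying HALF-OPEN
    success_threshold: int = 2         # HALF-OPEN -> CLOSED after this many successes
```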

What Happens When Open?

Option 1: Return Error

"Service temporarily unavailable"

Client knows something is wrong.
Can retry later.

Option 2: Fallback

Return cached data.
Return default values.
Return degraded functionality.

User experience maintained!

Option 3: Alternative Service

Call backup service.
Use secondary provider.

Ensure business continuity.
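Options 1 and 2 can be layered: try the live call, fall back to cached data, and only then return an error. A sketch where `fetch_live` stands in for a breaker-guarded call and `CACHE` is an illustrative in-process store:

```python
CACHE = {}

def fetch_price_with_fallback(product_id, fetch_live):
    try:
        price = fetch_live(product_id)
        CACHE[product_id] = price  # refresh cache on success
        return {"price": price, "source": "live"}
    except Exception:
        if product_id in CACHE:
            # Option 2: stale but usable data
            return {"price": CACHE[product_id], "source": "cache"}
        # Option 1: tell the client something is wrong
        return {"error": "Service temporarily unavailable"}
```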

Real-World Example

E-commerce Checkout

Checkout → Payment Service (down)

Without breaker:
  Every checkout hangs until it times out.
  Checkout service overwhelmed.
  Entire site becomes slow.

With breaker:
  Some payment attempts fail.
  Circuit opens.
  "Payment unavailable, try again later"
  Checkout service is more likely to stay healthy.
  Other site features work fine.

Metrics to Monitor

Track These

Circuit state (closed/open/half-open)
Failure count
Success count
Request count
Fallback count
Time in open state

Alert On

Circuit opens (service is failing)
Circuit stays open long (service not recovering)
High fallback rate (degraded experience)

Best Practices

1. Separate Circuits per Dependency

Avoid sharing a single circuit across unrelated dependencies.

Payment Service → Circuit A
Inventory Service → Circuit B

Payment down? Inventory still works.
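One way to keep circuits separate is a small registry that lazily creates one breaker per dependency name. A sketch: the breaker class itself is supplied by the caller as a factory, so it works with any implementation:

```python
class BreakerRegistry:
    """One independent circuit per dependency, created on first use."""

    def __init__(self, factory):
        self._factory = factory   # callable returning a new breaker
        self._breakers = {}

    def for_dependency(self, name):
        if name not in self._breakers:
            self._breakers[name] = self._factory()
        return self._breakers[name]
```

"payments" and "inventory" then get independent failure counts, so one circuit opening never blocks calls to the other.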

2. Meaningful Fallbacks

Not just "Error!"

Cached data: "Prices as of 5 min ago"
Degraded: "We'll confirm availability by email"
Queue: "Order received, processing async"

3. Health Endpoint

The service should expose a health endpoint.
The circuit can check it before sending requests.
This speeds up recovery detection.

4. Tune Thresholds

Monitor and adjust:
  Too sensitive? Opens too often.
  Too tolerant? Doesn't protect.

Base on actual failure patterns.

Common Mistakes

1. Too Aggressive Settings

Opens on 1 failure?
Normal network blips trigger it.
Service "unavailable" when it's fine.

2. No Fallback

Circuit opens → Error → User frustrated

Have a fallback plan when it makes sense (even if it's just a clear error message and retry guidance).

3. Ignoring Timeouts

Circuit doesn't help if requests don't have timeouts.
Set reasonable timeouts first!
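When a client library does not support timeouts directly, one hedged way to bound any call is to run it on a worker thread and wait with a deadline. Note this caps the caller's wait, not the work itself (the worker thread keeps running after a timeout), so it is a sketch, not a cancellation mechanism:

```python
from concurrent.futures import ThreadPoolExecutor, TimeoutError as CallTimeout

_executor = ThreadPoolExecutor(max_workers=4)

def call_with_timeout(fn, timeout_s, *args, **kwargs):
    # Submit the call to a worker thread and wait at most timeout_s seconds.
    # On timeout, CallTimeout is raised -- a failure the breaker can count.
    future = _executor.submit(fn, *args, **kwargs)
    return future.result(timeout=timeout_s)
```

Prefer the client's native timeout options when they exist; they release the underlying connection instead of abandoning it.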

4. Not Monitoring

Circuit opening repeatedly?
Root cause: the service is actually unhealthy!
Fix the service, not just the circuit.

Libraries and Tools

Language     Library
--------     -------
Java         Resilience4j, Hystrix (legacy)
JavaScript   opossum
Python       pybreaker
Go           gobreaker
.NET         Polly

FAQ

Q: Circuit breaker vs retry?

Retry: Try again after a failure (often with backoff).
Circuit breaker: Stop or limit attempts when a dependency appears unhealthy.

Use both! Retry for transient issues, breaker for extended outages.

Tip: retries without limits/backoff can create "retry storms" that make outages worse.
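A sketch of bounded retries with exponential backoff and jitter, which avoids the retry storms mentioned above (names are illustrative; `sleep` is injectable so the loop is testable without waiting):

```python
import random
import time

def retry_with_backoff(fn, attempts=3, base_delay=0.1, sleep=time.sleep):
    # Bounded attempts plus jittered exponential backoff keep a herd of
    # clients from hammering a struggling service in lockstep.
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the error
            delay = base_delay * (2 ** attempt) * (1 + random.random())
            sleep(delay)
```

Wrap the breaker-guarded call as `fn`: while the circuit is open, each attempt fails fast, so the loop exhausts quickly instead of hanging for several timeouts.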

Q: How is this different from rate limiting?

Rate limiting: Limit requests to a healthy service.
Circuit breaker: Stop requests to an unhealthy service.

Q: Should every call have a circuit breaker?

External dependencies: Often a good idea.
Internal, same-process calls: Often not needed.

Q: What about partial failures?

Some endpoints fail, others work? Consider per-endpoint circuits.


Summary

Circuit breaker prevents cascade failures by stopping calls to failing services, allowing them to recover.

Key Takeaways:

  • Three states: Closed → Open → Half-Open
  • Fails fast when service is down
  • Prevents resource exhaustion
  • Provides fallback options
  • Configure threshold, timeout, success count
  • Monitor circuit state changes
  • Use with retry for resilience

Circuit breakers keep failures contained and give unhealthy services room to recover!
