The Electrical Breaker Analogy
Your house has circuit breakers:
- Normal: Electricity flows
- Overload: Breaker trips, stops flow
- Prevents: Fire, damaged appliances
Instead of letting damage spread, the breaker cuts off the problem.
The circuit breaker pattern works the same way for software. When a service is failing, stop calling it to prevent cascade failures.
What Problem Does It Solve?
Without Circuit Breaker
Service A → Service B (unhealthy)
Request 1 → Wait... Wait... Timeout
Request 2 → Wait... Wait... Timeout
Request 3 → Wait... Wait... Timeout
As requests pile up waiting, Service A's threads and connections are exhausted, and Service A itself becomes slow or unresponsive.
Now Service C that calls A is stuck...
Cascade failure across the system!
With Circuit Breaker
Service A → Circuit Breaker → Service B (unhealthy)
Request 1 → Timeout → Failure recorded
Request 2 → Timeout → Failure recorded
Request 3 → Timeout → Circuit opens
Request 4 → Fails fast (no waiting)
Request 5 → Fails fast
Circuit open = Don't even try.
Service A is more likely to stay responsive.
The Three States
┌────────┐  threshold reached  ┌───────────┐
│ CLOSED │────────────────────→│   OPEN    │←─────────┐
└────────┘                     └─────┬─────┘          │
     ▲                               │ timeout        │
     │                               │ expires        │ failure
     │ success                       ▼                │
     │                         ┌───────────┐          │
     └─────────────────────────│ HALF-OPEN │──────────┘
                               └───────────┘
CLOSED (Normal Operation)
Requests flow through normally.
Track success and failure counts.
If failure threshold reached → OPEN
OPEN (Failing Fast)
Requests fail immediately.
Don't call the unhealthy service.
Return error or fallback.
After timeout period → HALF-OPEN
HALF-OPEN (Testing Recovery)
Allow limited requests through.
Testing if service recovered.
If success → CLOSED (recovered!)
If failure → OPEN (still unhealthy)
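The three states can be sketched as a small class. This is a minimal illustrative implementation, not any particular library's API; names like `CircuitOpenError` and `cooldown_seconds` are invented for the example.

```python
import time

class CircuitOpenError(Exception):
    """Raised when the circuit is open and the call is rejected fast."""

class CircuitBreaker:
    def __init__(self, failure_threshold=3, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold
        self.cooldown_seconds = cooldown_seconds
        self.failure_count = 0
        self.state = "CLOSED"
        self.opened_at = None

    def call(self, func, *args, **kwargs):
        if self.state == "OPEN":
            if time.monotonic() - self.opened_at >= self.cooldown_seconds:
                self.state = "HALF_OPEN"   # cooldown over: allow a trial call
            else:
                raise CircuitOpenError("failing fast: circuit is open")
        try:
            result = func(*args, **kwargs)
        except Exception:
            self._on_failure()
            raise
        self._on_success()
        return result

    def _on_failure(self):
        if self.state == "HALF_OPEN":
            self._trip()                   # trial call failed: back to OPEN
        else:
            self.failure_count += 1
            if self.failure_count >= self.failure_threshold:
                self._trip()

    def _on_success(self):
        self.state = "CLOSED"              # healthy again (or still healthy)
        self.failure_count = 0

    def _trip(self):
        self.state = "OPEN"
        self.opened_at = time.monotonic()
```

After three consecutive failures the breaker trips to OPEN and rejects calls without invoking the function; once the cooldown elapses, the next call becomes the HALF-OPEN trial.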
Configuration Options
Failure Threshold
How many failures before opening?
Low (3 failures) → Quick to protect, but sensitive
High (10 failures) → More tolerant, slower protection
Tune based on your service's normal error rate.
Timeout Duration
How long before testing recovery?
Short cooldown → Quick recovery detection
Long cooldown → Less pressure on recovering service
Balance between recovery speed and load.
Success Threshold
Half-open: How many successes to close?
Fewer successes → Fast, but riskier
More successes → More confidence the service is healthy
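The three knobs above can live in one config object. Field names and defaults here are illustrative, not from any specific library:

```python
from dataclasses import dataclass

@dataclass
class BreakerConfig:
    failure_threshold: int = 5      # failures in CLOSED before tripping OPEN
    cooldown_seconds: float = 30.0  # how long OPEN lasts before HALF-OPEN
    success_threshold: int = 2      # consecutive HALF-OPEN successes to close

# A latency-sensitive payment call might trade tolerance for protection:
payment_config = BreakerConfig(failure_threshold=3, cooldown_seconds=10.0)
```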
What Happens When Open?
Option 1: Return Error
"Service temporarily unavailable"
Client knows something is wrong.
Can retry later.
Option 2: Fallback
Return cached data.
Return default values.
Return degraded functionality.
User experience maintained!
Option 3: Alternative Service
Call backup service.
Use secondary provider.
Ensure business continuity.
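Option 2 (fallback) can be sketched as a wrapper that serves cached data when the breaker rejects the call. `CircuitOpenError`, the cache, and `fetch_price_live` are all assumed names for this example; the live fetch is stubbed to always be rejected.

```python
class CircuitOpenError(Exception):
    pass

CACHE = {"price:sku-42": 19.99}  # last value fetched successfully

def fetch_price_live(sku):
    # Stand-in for a breaker-protected call while the circuit is open.
    raise CircuitOpenError("circuit is open")

def get_price(sku):
    try:
        return fetch_price_live(sku)
    except CircuitOpenError:
        # Degraded but useful: "prices as of a few minutes ago"
        return CACHE.get(f"price:{sku}")
```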
Real-World Example
E-commerce Checkout
Checkout → Payment Service (down)
Without breaker:
Every checkout hangs until it times out.
Checkout service overwhelmed.
Entire site becomes slow.
With breaker:
Some payment attempts fail.
Circuit opens.
"Payment unavailable, try again later"
Checkout service is more likely to stay healthy.
Other site features work fine.
Metrics to Monitor
Track These
Circuit state (closed/open/half-open)
Failure count
Success count
Request count
Fallback count
Time in open state
Alert On
Circuit opens (service is failing)
Circuit stays open long (service not recovering)
High fallback rate (degraded experience)
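One way to keep these counters is a small metrics object next to the breaker, which an exporter (Prometheus, StatsD, etc.) can scrape. Field names are illustrative:

```python
from dataclasses import dataclass

@dataclass
class BreakerMetrics:
    state: str = "CLOSED"
    request_count: int = 0
    success_count: int = 0
    failure_count: int = 0
    fallback_count: int = 0
    opened_at: float = 0.0  # set when the circuit trips; used to derive time in open state

    def record(self, outcome):
        self.request_count += 1
        if outcome == "success":
            self.success_count += 1
        elif outcome == "failure":
            self.failure_count += 1
        elif outcome == "fallback":
            self.fallback_count += 1
```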
Best Practices
1. Separate Circuits per Dependency
Avoid sharing a single circuit across unrelated dependencies.
Payment Service → Circuit A
Inventory Service → Circuit B
Payment down? Inventory still works.
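One breaker per dependency can be managed with a small registry, so a payment outage cannot open the inventory circuit. A sketch with invented names; `CircuitBreaker` is stubbed down to a state holder:

```python
class CircuitBreaker:
    def __init__(self):
        self.state = "CLOSED"

class BreakerRegistry:
    def __init__(self):
        self._breakers = {}

    def for_dependency(self, name):
        # Lazily create exactly one breaker per named dependency.
        return self._breakers.setdefault(name, CircuitBreaker())

registry = BreakerRegistry()
registry.for_dependency("payment").state = "OPEN"  # payment is down...
# ...but inventory's circuit is independent and stays CLOSED.
```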
2. Meaningful Fallbacks
Not just "Error!"
Cached data: "Prices as of 5 min ago"
Degraded: "We'll confirm availability by email"
Queue: "Order received, processing async"
3. Health Endpoint
Service should expose health endpoint.
Circuit can check before requests.
Quicker recovery detection.
4. Tune Thresholds
Monitor and adjust:
Too sensitive? Opens too often.
Too tolerant? Doesn't protect.
Base on actual failure patterns.
Common Mistakes
1. Too Aggressive Settings
Opens on 1 failure?
Normal network blips trigger it.
Service "unavailable" when it's fine.
2. No Fallback
Circuit opens → Error → User frustrated
Have a fallback plan when it makes sense (even if it's just a clear error message and retry guidance).
3. Ignoring Timeouts
Circuit doesn't help if requests don't have timeouts.
Set reasonable timeouts first!
4. Not Monitoring
Circuit opening repeatedly?
Root cause: the service is actually unhealthy!
Fix the service, not just the circuit.
Libraries and Tools
| Language | Library |
|---|---|
| Java | Resilience4j, Hystrix (legacy) |
| JavaScript | opossum |
| Python | pybreaker |
| Go | gobreaker |
| .NET | Polly |
FAQ
Q: Circuit breaker vs retry?
Retry: Try again after a failure, often with backoff.
Circuit breaker: Stop or limit attempts when a dependency appears unhealthy.
Use both! Retry for transient issues, breaker for extended outages.
Tip: retries without limits/backoff can create "retry storms" that make outages worse.
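The combination can be sketched as a bounded retry loop with exponential backoff that also checks the breaker before each attempt. `breaker_is_open` stands in for a real breaker's state check:

```python
import time

def call_with_retry(func, breaker_is_open, max_attempts=3, base_delay=0.1):
    for attempt in range(1, max_attempts + 1):
        if breaker_is_open():
            raise RuntimeError("circuit open: not retrying")
        try:
            return func()
        except Exception:
            if attempt == max_attempts:
                raise
            # Exponential backoff: 0.1s, 0.2s, 0.4s, ... and a hard
            # attempt cap, to avoid retry storms during an outage.
            time.sleep(base_delay * (2 ** (attempt - 1)))
```

Retries absorb a one-off transient failure; the breaker check stops the loop from hammering a dependency that is known to be down.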
Q: How is this different from rate limiting?
Rate limiting: Limits requests to a healthy service.
Circuit breaker: Stops requests to an unhealthy service.
Q: Should every call have a circuit breaker?
External dependencies: Often a good idea.
Internal, same-process calls: Often not needed.
Q: What about partial failures?
Some endpoints fail, others work? Consider per-endpoint circuits.
Summary
Circuit breaker prevents cascade failures by stopping calls to failing services, allowing them to recover.
Key Takeaways:
- Three states: Closed → Open → Half-Open
- Fails fast when service is down
- Prevents resource exhaustion
- Provides fallback options
- Configure threshold, timeout, success count
- Monitor circuit state changes
- Use with retry for resilience
Circuit breakers let failing services fail gracefully!