The Backup Band Analogy
A lead singer performing solo:
Without backup:
- Singer gets sick = Show canceled
- All eyes on one person
- No margin for error
With backup singers:
- Lead singer sick = Backup takes the spotlight
- Harmonies enrich the sound
- Show goes on!
Database replication is your backup band. Copies of your data on multiple servers ensure the show goes on even when a server fails.
Why Replicate?
Single Server Problems:
┌─────────────────────────────────────────┐
│              ONE DATABASE               │
│    All reads | All writes | All risk    │
└─────────────────────────────────────────┘
                     ↓
      Server dies = potential outage
Replicated Solution:
┌──────────┐   ┌──────────┐   ┌──────────┐
│ Primary  │   │ Replica  │   │ Replica  │
│  (copy)  │   │  (copy)  │   │  (copy)  │
└──────────┘   └──────────┘   └──────────┘
                     ↓
        One dies = Others continue
Key Benefits
| Benefit | How It Helps |
|---|---|
| High Availability | Server dies? Replica takes over |
| Read Performance | Spread reads across replicas |
| Geographic Reach | Data closer to global users |
| Disaster Recovery | Helps with redundancy (not a full backup) |
| Maintenance | Update replicas one at a time |
Note: replication helps you survive some server failures, but it doesn't protect you from every kind of data loss (for example: accidental deletes, bad deploys, or corrupted data). Backups and point-in-time recovery are still important.
Replication Architectures
Primary-Replica (Leader-Follower)
            ┌─────────────┐
            │   Primary   │ ← All WRITES go here
            │  (Leader)   │
            └──────┬──────┘
                   │ Replication (changes flow down)
       ┌───────────┼───────────┐
       ↓           ↓           ↓
   ┌───────┐   ┌───────┐   ┌───────┐
   │Replica│   │Replica│   │Replica│ ← READS can go here
   │   1   │   │   2   │   │   3   │
   └───────┘   └───────┘   └───────┘
How it works:
- All writes typically go to the primary
- The primary records each change in a log
- Replicas receive (or pull) those changes and apply them in the same order
- Reads can go to any replica
Best for: Most applications, read-heavy workloads
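The steps above can be sketched as a tiny in-memory model. This is only an illustration, not how any particular database implements replication: the `Primary` and `Replica` classes, the `pull` method, and the list-based log are all names assumed for the example.

```python
from dataclasses import dataclass, field


@dataclass
class Primary:
    """Accepts writes and records each one in an append-only change log."""
    data: dict = field(default_factory=dict)
    log: list = field(default_factory=list)  # ordered (key, value) changes

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


@dataclass
class Replica:
    """Read-only copy that replays the primary's log from its last position."""
    data: dict = field(default_factory=dict)
    applied: int = 0  # number of log entries already applied

    def pull(self, primary):
        # Apply only the entries this replica has not seen yet, in log order.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

    def read(self, key):
        return self.data.get(key)


primary = Primary()
replica = Replica()
primary.write("user:1", "Ada")
primary.write("user:1", "Ada L.")

stale = replica.read("user:1")  # nothing pulled yet, so the replica has no value
replica.pull(primary)
fresh = replica.read("user:1")  # the replica has now caught up
```

Until `pull` runs, the replica knows nothing about the writes, which is the same gap that shows up as replication lag in a real system.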
Multi-Primary (Multi-Leader)
┌──────────┐      ┌──────────┐
│ Primary  │ ←──→ │ Primary  │
│    US    │      │    EU    │
└──────────┘      └──────────┘
      ↓                 ↓
┌──────────┐      ┌──────────┐
│ Replica  │      │ Replica  │
└──────────┘      └──────────┘
Both primaries accept writes.
Changes sync between them.
Pros: No single write bottleneck, geographic writes
Cons: Conflict resolution complexity
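To make "conflict resolution" concrete, here is a minimal sketch of one common strategy, last-write-wins. It is only one option among several, and it assumes the primaries' clocks are roughly in sync. The `Version` type and the `resolve` and `merge` helpers are hypothetical names for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # time of the write (assumes roughly synchronized clocks)
    origin: str       # which primary accepted the write; used as a tie-breaker


def resolve(a: Version, b: Version) -> Version:
    """Last-write-wins: keep the newer write; break ties by origin name."""
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


def merge(local: dict, remote: dict) -> dict:
    """Merge another primary's state into ours, resolving conflicting keys."""
    merged = dict(local)
    for key, incoming in remote.items():
        merged[key] = resolve(merged[key], incoming) if key in merged else incoming
    return merged


# The same profile was edited on both primaries before they synced.
us = {"profile:7": Version("Lives in NYC", 100.0, "us")}
eu = {"profile:7": Version("Lives in Paris", 105.0, "eu")}

us_after = merge(us, eu)
eu_after = merge(eu, us)
```

Both primaries end up with the same value, but the older write is silently discarded. That trade-off is the complexity the "Cons" line refers to.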
Synchronous vs Asynchronous Replication
Synchronous (Strong Consistency)
Client: "Write this data"
↓
Primary: "Got it, writing..."
↓
Primary: "Waiting for replica confirmation..."
↓
Replica: "I have it too!"
↓
Primary → Client: "Success!"
Write isn't confirmed until replica has it.
Pros: Replicas stay closely in sync; reduces the chance of losing acknowledged writes during failover
Cons: Slower writes; if required replicas are down or slow, writes can block or fail
Asynchronous (Eventual Consistency)
Client: "Write this data"
↓
Primary: "Got it, writing..."
↓
Primary → Client: "Success!" (immediate)
↓
Primary → Replica: "Here's the change" (later)
Pros: Fast writes; primary doesn't wait for replicas to acknowledge
Cons: Replicas may lag behind (stale reads possible); a primary crash can lose the most recent writes that weren't replicated yet
Comparison
| Aspect | Synchronous | Asynchronous |
|---|---|---|
| Write Speed | Slower | Faster |
| Data Safety | Higher | Risk of loss |
| Availability | Replica failure can block writes | Replica failure doesn't block writes |
| Use When | Data is critical | Performance matters |
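The difference in data safety can be shown with a small simulation. This is a simplified sketch with one primary and one replica held in memory. The `Cluster` class and its `fail_over` method are assumptions for illustration, and a real crash is messier than clearing a list.

```python
class Node:
    def __init__(self):
        self.data = {}


class Cluster:
    def __init__(self, synchronous):
        self.synchronous = synchronous
        self.primary = Node()
        self.replica = Node()
        self.pending = []  # changes acknowledged to the client but not yet shipped

    def write(self, key, value):
        self.primary.data[key] = value
        if self.synchronous:
            # Synchronous: the replica has the change before we acknowledge.
            self.replica.data[key] = value
        else:
            # Asynchronous: acknowledge now, ship the change later.
            self.pending.append((key, value))
        return "Success!"

    def fail_over(self):
        """Primary crashes: unshipped changes are gone, the replica takes over."""
        self.pending.clear()
        self.primary = self.replica
        self.replica = Node()


sync_cluster = Cluster(synchronous=True)
sync_cluster.write("order:1", "paid")
sync_cluster.fail_over()
sync_survived = "order:1" in sync_cluster.primary.data

async_cluster = Cluster(synchronous=False)
async_cluster.write("order:1", "paid")
async_cluster.fail_over()
async_survived = "order:1" in async_cluster.primary.data
```

In both runs the client was told "Success!", but only the synchronous cluster still has the order after the primary is lost.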
Replication Lag
The delay between a write on primary and its visibility on replicas.
Timeline:
Write to Primary (user updates profile)
Replica 1 applies change later
Replica 2 applies change later
Replica 3 applies change later
User reads from Replica 3 before it catches up:
→ Sees OLD data (change hasn't arrived yet)
Managing Lag
Read-Your-Writes:
User writes → Force their reads to primary
They see their own changes immediately
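One simple way to approximate this is to pin a user's reads to the primary for a short window after they write. The sketch below assumes a 5-second window is longer than typical lag. `ReadYourWritesRouter` and its methods are hypothetical names, and the clock is injectable so the behavior can be checked without waiting.

```python
import time


class ReadYourWritesRouter:
    def __init__(self, pin_seconds=5.0, clock=time.monotonic):
        self.pin_seconds = pin_seconds
        self.clock = clock
        self.last_write = {}  # user_id -> time of that user's most recent write

    def record_write(self, user_id):
        self.last_write[user_id] = self.clock()

    def target_for_read(self, user_id):
        wrote_at = self.last_write.get(user_id)
        if wrote_at is not None and self.clock() - wrote_at < self.pin_seconds:
            return "primary"  # recent writer: replicas may not have the change yet
        return "replica"


# A fake clock stands in for real time in this demonstration.
now = [0.0]
router = ReadYourWritesRouter(pin_seconds=5.0, clock=lambda: now[0])

before_write = router.target_for_read("alice")
router.record_write("alice")
right_after = router.target_for_read("alice")
now[0] = 10.0
much_later = router.target_for_read("alice")
```

Only the user who wrote is routed to the primary, so other users' reads keep going to replicas.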
Monotonic Reads:
Track the most recent data version (for example, a log position) each user has seen
Avoid showing older data than they've already seen
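A minimal sketch of monotonic reads, assuming each replica can report how far it has applied the primary's log as an integer position. `MonotonicSession` and `pick_replica` are hypothetical names, and falling back to the primary is just one possible policy.

```python
def pick_replica(replica_positions, min_position):
    """Return a replica that has applied at least min_position, else None."""
    for name, position in replica_positions.items():
        if position >= min_position:
            return name
    return None


class MonotonicSession:
    """Remembers the newest log position this user has already been shown."""

    def __init__(self):
        self.last_seen = 0

    def read_from(self, replica_positions):
        name = pick_replica(replica_positions, self.last_seen)
        if name is None:
            return "primary"  # no replica is fresh enough for this user
        self.last_seen = replica_positions[name]
        return name


session = MonotonicSession()
first = session.read_from({"replica-1": 42, "replica-2": 40})
second = session.read_from({"replica-2": 40})  # only a staler replica is available
third = session.read_from({"replica-2": 45})   # replica-2 has caught up
```

The second read skips `replica-2` because it is behind what the user already saw, so the user never appears to travel back in time.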
Read from Primary for Critical Data:
Account balance → Primary
Notifications → Replica (OK if slightly stale)
Failover: When Primary Dies
1. DETECTION
Monitoring detects the primary is unreachable
Cluster decides the primary is unhealthy
2. ELECTION
A leader election chooses a new primary
(often preferring a replica that is most up-to-date)
3. PROMOTION
Replica 1 becomes new Primary
DNS/routing updated
4. RECOVERY
Other replicas follow new Primary
Old Primary (when recovered) becomes Replica
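The detection, election, and promotion steps can be sketched as follows. This assumes a single coordinator that already knows each replica's health and log position. Real clusters usually need a consensus protocol to agree on this, which the sketch leaves out. `ReplicaStatus`, `elect_new_primary`, and `fail_over` are hypothetical names.

```python
from dataclasses import dataclass


@dataclass
class ReplicaStatus:
    name: str
    healthy: bool
    applied_position: int  # how much of the old primary's log was applied


def elect_new_primary(replicas):
    """Prefer the healthy replica that is most up-to-date."""
    candidates = [r for r in replicas if r.healthy]
    if not candidates:
        raise RuntimeError("no healthy replica available for promotion")
    return max(candidates, key=lambda r: r.applied_position)


def fail_over(primary_healthy, replicas):
    """Return the name of the promoted replica, or None if no failover is needed."""
    if primary_healthy:
        return None
    return elect_new_primary(replicas).name


replicas = [
    ReplicaStatus("replica-1", healthy=True, applied_position=1000),
    ReplicaStatus("replica-2", healthy=True, applied_position=980),
    ReplicaStatus("replica-3", healthy=False, applied_position=1005),
]
promoted = fail_over(primary_healthy=False, replicas=replicas)
```

`replica-3` has the most data but is unhealthy, so it is skipped. Choosing the most up-to-date healthy replica keeps the number of lost writes as small as possible.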
Failover Types
| Type | Speed | Data Safety | Use Case |
|---|---|---|---|
| Manual | Slower | Higher | Critical review needed |
| Automatic | Faster | Risk of split-brain | High availability |
| Semi-Auto | In-between | Middle ground | Many production setups |
Common Patterns
Read Replicas for Scale
Application is read-heavy
Without replicas:
Primary handles reads + writes → can become overloaded
With read replicas:
Reads are spread across replicas
→ Load distributed
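A read/write splitter is one way to spread the load. The sketch below is deliberately naive: it treats any statement starting with `SELECT` as a read and rotates through replicas round-robin. `ReadWriteSplitter` and the server names are assumptions for illustration.

```python
import itertools


class ReadWriteSplitter:
    def __init__(self, primary, replicas):
        self.primary = primary
        self._replicas = itertools.cycle(replicas)  # endless round-robin iterator

    def route(self, sql):
        # Naive classification: statements starting with SELECT are reads.
        is_read = sql.lstrip().upper().startswith("SELECT")
        return next(self._replicas) if is_read else self.primary


splitter = ReadWriteSplitter("primary", ["replica-1", "replica-2"])
targets = [
    splitter.route("SELECT * FROM users"),
    splitter.route("UPDATE users SET name = 'Ada' WHERE id = 1"),
    splitter.route("SELECT * FROM orders"),
    splitter.route("SELECT 1"),
]
```

A production router would also need to handle transactions and reads that must be fresh, which should stay on the primary.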
Geographic Distribution
Primary in US-East
├── Replica in US-West (low latency for US West users)
├── Replica in EU-West (low latency for European users)
└── Replica in AP-South (low latency for Asian users)
Each user reads from nearest replica.
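Routing to the nearest replica can be as simple as picking the lowest measured latency, while writes still travel to the primary region. The region names follow the layout above. The latency figures are made up for the example and `route` is a hypothetical helper.

```python
PRIMARY_REGION = "us-east"


def route(operation, latencies_ms):
    """Send writes to the primary region and reads to the lowest-latency replica."""
    if operation == "write":
        return PRIMARY_REGION
    return min(latencies_ms, key=latencies_ms.get)


# Illustrative round-trip times as seen by a user in Paris (not real measurements).
paris_latencies = {"us-west": 150.0, "eu-west": 12.0, "ap-south": 110.0}

read_target = route("read", paris_latencies)
write_target = route("write", paris_latencies)
```

Reads get faster for distant users, but their writes still pay the round trip to the primary.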
Common Mistakes
1. Assuming Replicas Are Instant
User: Updates profile → Refreshes page → Old profile!
Why? Read went to replica that hadn't received update yet.
Fix: Read-your-writes consistency or read from primary for own data.
2. Not Monitoring Lag
Lag creeps up
Eventually: replicas serve badly stale reads and fall too far behind to be safe failover targets
Fix: Alert when lag exceeds threshold.
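A lag check might look like the sketch below. It assumes you can obtain the commit time of the primary's latest change and of the latest change each replica has applied. How you get those values depends on your database. The 30-second threshold and the function names are illustrative.

```python
def replication_lag_seconds(primary_commit_time, replica_applied_time):
    """Lag is how far the replica's latest applied change trails the primary's."""
    return max(0.0, primary_commit_time - replica_applied_time)


def replicas_over_threshold(primary_commit_time, replica_applied_times,
                            threshold_seconds=30.0):
    """Return {replica_name: lag} for every replica lagging beyond the threshold."""
    lagging = {}
    for name, applied in replica_applied_times.items():
        lag = replication_lag_seconds(primary_commit_time, applied)
        if lag > threshold_seconds:
            lagging[name] = lag
    return lagging


alerts = replicas_over_threshold(
    primary_commit_time=1000.0,
    replica_applied_times={"replica-1": 999.6, "replica-2": 905.0, "replica-3": 988.0},
)
```

Here only `replica-2` is 95 seconds behind and would trigger an alert. The other two are within the threshold.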
3. Forgetting About Failover Testing
"We have replication, we're covered!"
(Not tested: Does automated failover actually work?)
Fix: Chaos testing - kill primary on purpose, verify recovery.
4. All Replicas in Same Datacenter
Datacenter loses power → Primary AND all replicas down
Fix: Geographic distribution - replicas in different regions.
FAQ
Q: Sharding vs Replication?
Replication: Same data on multiple servers (copies)
Sharding: Different data on different servers (splits)
Often used together: Each shard has its own replicas.
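As a rough sketch of how the two combine: a key is first hashed to pick a shard, then the request goes to that shard's primary for writes or to one of its replicas for reads. The two-shard layout, the server names, and the helper functions are all assumptions for this example.

```python
import hashlib

# Hypothetical layout: two shards, each with its own primary and replicas.
SHARDS = [
    {"primary": "shard0-primary", "replicas": ["shard0-replica-1", "shard0-replica-2"]},
    {"primary": "shard1-primary", "replicas": ["shard1-replica-1", "shard1-replica-2"]},
]


def shard_for(key):
    """Sharding: a stable hash of the key decides which shard owns it."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % len(SHARDS)


def server_for(key, is_write):
    """Replication: within the shard, writes hit the primary, reads a replica."""
    shard = SHARDS[shard_for(key)]
    return shard["primary"] if is_write else shard["replicas"][0]


write_server = server_for("user:42", is_write=True)
read_server = server_for("user:42", is_write=False)
```

Both requests for `user:42` land on the same shard. Only the choice between primary and replica differs.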
Q: How many replicas should I have?
Typically 2-3 for high availability. More for global distribution or extreme read loads. Each replica adds infrastructure cost.
Q: What's replication lag tolerance?
Depends on application. Social media: seconds OK. Banking: milliseconds matter. Real-time gaming: lag might be unacceptable.
Q: Can I write to replicas?
Primary-Replica: No, only reads from replicas.
Multi-Primary: Yes, with conflict resolution.
Q: What happens to in-flight writes during failover?
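Writes the primary acknowledged but had not yet replicated may be lost with async replication. Sync replication greatly reduces that risk at the cost of write performance. Writes that were never acknowledged typically need to be retried by the client.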
Summary
Database replication copies data across servers, providing high availability, read scaling, and redundancy that supports (but does not replace) disaster recovery.
Key Takeaways:
- Primary handles writes, replicas serve reads
- Synchronous: consistent but slower
- Asynchronous: faster but potential lag
- Monitor replication lag actively
- Test failover before you need it
- Common in production systems where availability matters
- Often combined with sharding for full scalability
Replication is your database's insurance policy - invest in it before disaster strikes!