📋 Database Replication

Copying data to multiple servers

The Backup Band Analogy

A lead singer performing solo:

Without backup:

  • Singer gets sick = Show canceled
  • All eyes on one person
  • No margin for error

With backup singers:

  • Lead singer sick = Backup takes the spotlight
  • Harmonies enrich the sound
  • Show goes on!

Database replication is your backup band. Copies of your data on multiple servers ensure the show goes on even when a server fails.


Why Replicate?

Single Server Problems:
┌─────────────────────────────────────────┐
│            ONE DATABASE                  │
│    All reads | All writes | All risk    │
└─────────────────────────────────────────┘
      ↓
   Server dies = potential outage

Replicated Solution:
┌──────────┐  ┌──────────┐  ┌──────────┐
│ Primary  │  │ Replica  │  │ Replica  │
│  (copy)  │  │  (copy)  │  │  (copy)  │
└──────────┘  └──────────┘  └──────────┘
      ↓
   One dies = Others continue

Key Benefits

Benefit            How It Helps
-----------------  -----------------------------------------
High Availability  Server dies? Replica takes over
Read Performance   Spread reads across replicas
Geographic Reach   Data closer to global users
Disaster Recovery  Helps with redundancy (not a full backup)
Maintenance        Update replicas one at a time

Note: replication helps you survive some server failures, but it doesn't protect you from every kind of data loss (for example: accidental deletes, bad deploys, or corrupted data). Backups and point-in-time recovery are still important.


Replication Architectures

Primary-Replica (Leader-Follower)

         ┌─────────────┐
         │   Primary   │ ← All WRITES go here
         │   (Leader)  │
         └──────┬──────┘
                │ Replication (changes flow down)
    ┌───────────┼───────────┐
    ↓           ↓           ↓
┌───────┐   ┌───────┐   ┌───────┐
│Replica│   │Replica│   │Replica│ ← READS can go here
│   1   │   │   2   │   │   3   │
└───────┘   └───────┘   └───────┘

How it works:

  • Typically all writes go to primary
  • Primary logs changes
  • Replicas pull and apply changes
  • Reads can go to any replica

Best for: Most applications, read-heavy workloads
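The how-it-works steps above can be sketched in a few lines of Python. This is a minimal in-memory illustration, not how any particular database implements replication; the `Primary` and `Replica` classes and the list-based log are assumptions made for the example.

```python
class Primary:
    """Accepts writes and records each change in an append-only log."""

    def __init__(self):
        self.data = {}
        self.log = []  # (key, value) changes, in write order

    def write(self, key, value):
        self.data[key] = value
        self.log.append((key, value))


class Replica:
    """Pulls changes from the primary's log and applies them in order."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def pull(self, primary):
        # Apply only the entries this replica has not seen yet.
        for key, value in primary.log[self.applied:]:
            self.data[key] = value
        self.applied = len(primary.log)

    def read(self, key):
        return self.data.get(key)


primary = Primary()
replicas = [Replica(), Replica(), Replica()]

primary.write("user:1", "Ada")      # all writes go to the primary
for replica in replicas:
    replica.pull(primary)           # replicas catch up from the log

print(replicas[2].read("user:1"))   # reads can go to any replica
```

Until a replica calls `pull`, its `read` returns the old state, which is the lag discussed later in this article.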

Multi-Primary (Multi-Leader)

┌──────────┐      ┌──────────┐
│ Primary  │ ←──→ │ Primary  │
│   US     │      │   EU     │
└──────────┘      └──────────┘
     ↓                 ↓
┌──────────┐      ┌──────────┐
│ Replica  │      │ Replica  │
└──────────┘      └──────────┘

Both primaries accept writes.
Changes sync between them.

Pros: No single write bottleneck, geographic writes
Cons: Conflict resolution complexity
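To make "conflict resolution" concrete, here is a sketch of one common strategy, last-write-wins. The article does not prescribe a strategy, so treat this as one option among several; the `Version` type, its fields, and the example values are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Version:
    value: str
    timestamp: float  # time of the write as seen by the originating primary
    origin: str       # name of the primary that accepted the write


def resolve(a: Version, b: Version) -> Version:
    """Pick a winner between two conflicting versions of the same key.

    The later timestamp wins. Ties are broken by origin name so that both
    primaries reach the same answer regardless of argument order.
    """
    return max(a, b, key=lambda v: (v.timestamp, v.origin))


us = Version("alice@old.example", timestamp=100.0, origin="US")
eu = Version("alice@new.example", timestamp=101.5, origin="EU")

winner = resolve(us, eu)
print(winner.value)
```

Last-write-wins is simple, but it silently discards the losing write and depends on the primaries' clocks being reasonably close, which is part of the complexity listed under Cons.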


Synchronous vs Asynchronous Replication

Synchronous (Strong Consistency)

Client: "Write this data"
    ↓
Primary: "Got it, writing..."
    ↓
Primary: "Waiting for replica confirmation..."
    ↓
Replica: "I have it too!"
    ↓
Primary → Client: "Success!"

Write isn't confirmed until replica has it.

Pros: Replicas stay closely in sync; reduces the chance of losing acknowledged writes during failover
Cons: Slower writes; if required replicas are down or slow, writes can block or fail

Asynchronous (Eventual Consistency)

Client: "Write this data"
    ↓
Primary: "Got it, writing..."
    ↓
Primary → Client: "Success!" (immediate)
    ↓
Primary → Replica: "Here's the change" (later)

Pros: Fast writes; primary doesn't wait for replicas to acknowledge
Cons: Replicas may lag behind (stale reads possible); a primary crash can lose the most recent writes that weren't replicated yet
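The two conversations above differ only in when the primary says "Success!". The sketch below models that difference in memory. The class names, the `synchronous` flag, and `ship_pending` are invented for illustration; real databases expose this choice through their own configuration.

```python
class Replica:
    def __init__(self):
        self.data = {}

    def apply(self, key, value):
        self.data[key] = value
        return True  # acknowledgement sent back to the primary


class Primary:
    def __init__(self, replicas, synchronous):
        self.data = {}
        self.replicas = replicas
        self.synchronous = synchronous
        self.pending = []  # changes not yet shipped (async mode only)

    def write(self, key, value):
        self.data[key] = value
        if self.synchronous:
            # Confirm only after every replica has acknowledged.
            acks = [r.apply(key, value) for r in self.replicas]
            return all(acks)
        # Async: confirm right away and ship the change later.
        self.pending.append((key, value))
        return True

    def ship_pending(self):
        for key, value in self.pending:
            for r in self.replicas:
                r.apply(key, value)
        self.pending.clear()


sync_replica, async_replica = Replica(), Replica()
sync_primary = Primary([sync_replica], synchronous=True)
async_primary = Primary([async_replica], synchronous=False)

sync_primary.write("balance", 100)
async_primary.write("balance", 100)

sync_seen = dict(sync_replica.data)           # replica already has the write
async_seen_before = dict(async_replica.data)  # still empty: change not shipped
async_primary.ship_pending()
async_seen_after = dict(async_replica.data)   # replica has caught up
```

If the asynchronous primary crashed before `ship_pending` ran, the write it had already confirmed would exist nowhere else, which is the data-loss risk listed under Cons.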

Comparison

Aspect        Synchronous                Asynchronous
------------  -------------------------  --------------------
Write Speed   Slower                     Faster
Data Safety   Higher                     Risk of loss
Availability  Replica failure = blocked  Replica failure = OK
Use When      Data is critical           Performance matters

Replication Lag

The delay between a write on primary and its visibility on replicas.

Timeline:
Write to Primary (user updates profile)
Replica 1 applies change later
Replica 2 applies change later
Replica 3 applies change later

User reads from Replica 3 before it catches up:
  → Sees OLD data (change hasn't arrived yet)

Managing Lag

Read-Your-Writes:
  User writes → Force their reads to primary
  They see their own changes immediately

Monotonic Reads:
  Track the newest version of the data each user has seen
  Never serve that user a read from a replica that is behind that version

Read from Primary for Critical Data:
  Account balance → Primary
  Notifications → Replica (OK if slightly stale)
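The three techniques above can share one mechanism: remember the newest log position a user has written or read, and only serve them from a node at least that fresh. The following is a sketch under that assumption; `Node`, `Session`, and the integer `position` are hypothetical stand-ins for whatever version marker a real database provides.

```python
class Node:
    def __init__(self, name, position=0):
        self.name = name
        self.position = position  # how many changes this node has applied


class Session:
    """Tracks the newest log position one user has written or observed."""

    def __init__(self):
        self.min_position = 0

    def after_write(self, primary):
        # Read-your-writes: later reads must include this user's own write.
        self.min_position = max(self.min_position, primary.position)

    def choose(self, primary, replicas, critical=False):
        chosen = primary  # fallback, and the only choice for critical data
        if not critical:
            for replica in replicas:
                if replica.position >= self.min_position:
                    chosen = replica
                    break
        # Monotonic reads: never go back before what was already seen.
        self.min_position = max(self.min_position, chosen.position)
        return chosen


primary = Node("primary", position=10)
r1 = Node("replica-1", position=10)
r2 = Node("replica-2", position=7)
session = Session()

first = session.choose(primary, [r2, r1]).name    # any replica is acceptable

primary.position = 11                             # the user writes
session.after_write(primary)
second = session.choose(primary, [r2, r1]).name   # no replica has the write yet

r1.position = 11                                  # replica-1 catches up
third = session.choose(primary, [r2, r1]).name
fourth = session.choose(primary, [r2, r1], critical=True).name
```

Falling back to the primary keeps reads correct when every replica is behind, at the cost of extra primary load during periods of high lag.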

Failover: When Primary Dies

1. DETECTION
  Monitoring detects the primary is unreachable
  Cluster decides the primary is unhealthy

2. ELECTION
  A leader election chooses a new primary
  (often preferring a replica that is most up-to-date)

3. PROMOTION
   Replica 1 becomes new Primary
   DNS/routing updated

4. RECOVERY
   Other replicas follow new Primary
   Old Primary (when recovered) becomes Replica
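The election step's preference for the most up-to-date replica can be expressed as a small selection function. This sketch covers only that rule. In a real cluster the nodes must also agree on the result (typically through a quorum or consensus protocol), which is not shown. The function name and the input shape are assumptions for the example.

```python
def elect_new_primary(replicas):
    """Pick the healthy replica that has applied the most changes.

    `replicas` maps a replica name to (is_healthy, applied_position).
    Returns the chosen name, or None if no healthy replica exists.
    """
    healthy = {name: pos for name, (ok, pos) in replicas.items() if ok}
    if not healthy:
        return None
    # Highest position first, then name for a deterministic tie-break.
    return min(healthy, key=lambda name: (-healthy[name], name))


cluster = {
    "replica-1": (True, 120),
    "replica-2": (True, 118),
    "replica-3": (False, 125),  # most data, but unreachable
}
print(elect_new_primary(cluster))  # replica-1
```

Choosing the replica with the highest applied position minimizes how many recent writes are lost when replication is asynchronous.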

Failover Types

Type       Speed       Data Safety          Use Case
---------  ----------  -------------------  ----------------------
Manual     Slower      Higher               Critical review needed
Automatic  Faster      Risk of split-brain  High availability
Semi-Auto  In-between  Middle ground        Many production setups

Common Patterns

Read Replicas for Scale

Application is read-heavy

Without replicas:
  Primary handles reads + writes → can become overloaded

With read replicas:
  Reads are spread across replicas
  → Load distributed
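One simple way to spread reads is round-robin: hand each read to the next replica in turn. The sketch below assumes a fixed list of replica names and ignores replica health and lag; `ReadBalancer` is a hypothetical helper, not part of any library.

```python
import itertools
from collections import Counter


class ReadBalancer:
    """Cycles through replicas so each one gets an equal share of reads."""

    def __init__(self, replicas):
        if not replicas:
            raise ValueError("at least one replica is required")
        self._cycle = itertools.cycle(list(replicas))

    def next_replica(self):
        return next(self._cycle)


balancer = ReadBalancer(["replica-1", "replica-2", "replica-3"])
served = Counter(balancer.next_replica() for _ in range(6))
print(served)  # each replica served 2 of the 6 reads
```

In practice this job is often done by a load balancer or the database driver rather than application code.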

Geographic Distribution

Primary in US-East
├── Replica in US-West (low latency for US West users)
├── Replica in EU-West (low latency for European users)
└── Replica in AP-South (low latency for Asian users)

Each user reads from nearest replica.
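"Nearest" usually means lowest network latency rather than shortest distance on a map. As a rough sketch, assuming the application already has a latency measurement from the user to each replica region (the numbers below are invented for illustration):

```python
REPLICAS = ["US-West", "EU-West", "AP-South"]


def nearest_replica(latency_ms):
    """Return the replica region with the lowest measured latency."""
    return min(REPLICAS, key=lambda region: latency_ms[region])


# Hypothetical measurements for a user in Europe.
user_latency = {"US-West": 150.0, "EU-West": 12.0, "AP-South": 130.0}
print(nearest_replica(user_latency))  # EU-West
```

Writes in this layout still travel to the primary in US-East, so only reads benefit from the shorter path.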

Common Mistakes

1. Assuming Replicas Are Instant

User: Updates profile → Refreshes page → Old profile!

Why? Read went to replica that hadn't received update yet.
Fix: Read-your-writes consistency or read from primary for own data.

2. Not Monitoring Lag

Lag creeps up
Eventually: replicas become much less useful and reads get stale

Fix: Alert when lag exceeds threshold.
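A lag alert boils down to comparing each replica's position with the primary's. The sketch below measures lag as a count of unapplied changes; real systems often report it in seconds or bytes instead. The function name, the input shape, and the threshold value are assumptions for the example.

```python
def lagging_replicas(primary_position, replica_positions, max_lag):
    """Return {replica name: lag} for replicas more than max_lag changes behind."""
    return {
        name: primary_position - position
        for name, position in replica_positions.items()
        if primary_position - position > max_lag
    }


alerts = lagging_replicas(
    primary_position=1000,
    replica_positions={"replica-1": 998, "replica-2": 940, "replica-3": 1000},
    max_lag=50,
)
print(alerts)  # {'replica-2': 60}
```

A check like this would typically run on a schedule and feed whatever alerting system is already in place.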

3. Forgetting About Failover Testing

"We have replication, we're covered!"
(Not tested: Does automated failover actually work?)

Fix: Chaos testing - kill primary on purpose, verify recovery.

4. All Replicas in Same Datacenter

Datacenter loses power → Primary AND all replicas down

Fix: Geographic distribution - replicas in different regions.

FAQ

Q: Sharding vs Replication?

Replication: Same data on multiple servers (copies)
Sharding: Different data on different servers (splits)

Often used together: Each shard has its own replicas.
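To show how the two combine, the sketch below hashes a key to pick a shard, then sends writes to that shard's primary and reads to one of its replicas. The shard layout, server names, and the hash-modulo scheme are all assumptions chosen for brevity.

```python
import hashlib

# Hypothetical layout: two shards, each with its own primary and replicas.
SHARDS = [
    {"primary": "shard0-primary", "replicas": ["shard0-r1", "shard0-r2"]},
    {"primary": "shard1-primary", "replicas": ["shard1-r1", "shard1-r2"]},
]


def shard_for(key):
    """Map a key to a shard using a stable hash of the key."""
    digest = hashlib.sha256(key.encode("utf-8")).digest()
    return SHARDS[int.from_bytes(digest[:8], "big") % len(SHARDS)]


def route(key, is_write):
    """Writes go to the shard's primary; reads go to one of its replicas."""
    shard = shard_for(key)
    return shard["primary"] if is_write else shard["replicas"][0]


print(route("user:42", is_write=True))
print(route("user:42", is_write=False))
```

Sharding decides which group of servers owns a key; replication decides how many copies exist inside that group.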

Q: How many replicas should I have?

Typically 2-3 for high availability. More for global distribution or extreme read loads. Each replica adds infrastructure cost.

Q: What's replication lag tolerance?

Depends on application. Social media: seconds OK. Banking: milliseconds matter. Real-time gaming: lag might be unacceptable.

Q: Can I write to replicas?

Primary-Replica: No, only reads from replicas.
Multi-Primary: Yes, with conflict resolution.

Q: What happens to in-flight writes during failover?

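With async replication, writes that the primary acknowledged but had not yet replicated may be lost. Sync replication protects acknowledged writes at the cost of write latency. Writes that were never acknowledged should be retried by the client.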


Summary

Database replication copies data across servers, providing high availability and read scaling, and supporting disaster recovery alongside backups.

Key Takeaways:

  • Primary handles writes, replicas serve reads
  • Synchronous: consistent but slower
  • Asynchronous: faster, but replicas can lag and the most recent writes can be lost on failover
  • Monitor replication lag actively
  • Test failover before you need it
  • Common in production systems where availability matters
  • Often combined with sharding for full scalability

Replication is your database's insurance policy - invest in it before disaster strikes!
