🍕 Database Sharding

Splitting data across servers

The Library Branches Analogy

One massive library becomes hard to manage:

  • Too many visitors
  • Shelves overflowing
  • Staff overwhelmed

Solution: Split into branches!

Central Library (overwhelmed)
      ↓ Split by topic
┌─────────────┬─────────────┬─────────────┐
│ Branch A    │ Branch B    │ Branch C    │
│ Fiction     │ Non-Fiction │ Reference   │
│ 10K books   │ 10K books   │ 10K books   │
└─────────────┴─────────────┴─────────────┘

Each branch handles its own section.
Together, they hold everything.
Visitors go to the branch they need.

Database sharding splits data across multiple databases (shards).


Why Sharding Matters

The Single Database Problem

One database, growing (illustrative numbers):
  Year 1: smaller dataset + low QPS → Fast!
  Year 2: bigger dataset + more QPS → Slower
  Year 3: very large + very high QPS → hardware pressure
  Year 4: keeps growing → operational pain 🔥

Vertical Scaling (The Old Way)

"Just buy a bigger server!"

More RAM and bigger disks → $$$$$

Problems:
  - Expensive hardware
  - Can still be a single point of failure unless you add replication/failover
  - Eventually hits limits

Horizontal Scaling with Sharding

"Split the data across multiple servers!"

┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard A  │ │ Shard B  │ │ Shard C  │ │ Shard D  │
│ data     │ │ data     │ │ data     │ │ data     │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
     ↓           ↓           ↓           ↓
  load        load        load        load

Same total data, distributed load.
Add more shards as you grow.

Sharding Strategies

1. Range-Based Sharding

Divide data by ranges of the shard key.

Shard by user_id:

Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
Shard 4: user_id 3,000,001+

User 500,000 → Shard 1
User 1,500,000 → Shard 2

Pros:

  • Easy to understand
  • Range queries efficient (users 1-1000 all on same shard)

Cons:

  • Hot spots (new users all on latest shard)
  • Uneven distribution if ranges differ in activity
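A range router can be sketched in a few lines. The boundaries below mirror the illustrative ranges above; a real system would load them from configuration:

```python
# Minimal range-based shard router (boundaries are illustrative).
import bisect

# Upper bounds (inclusive) for shards 1-3; shard 4 takes everything above.
RANGE_UPPER_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    """Return the 1-based shard number for a user_id."""
    # bisect_left counts boundaries strictly below user_id, so an id
    # exactly on a boundary (e.g. 1,000,000) stays on the lower shard.
    return bisect.bisect_left(RANGE_UPPER_BOUNDS, user_id) + 1

print(range_shard(500_000))    # shard 1
print(range_shard(1_500_000))  # shard 2
print(range_shard(5_000_000))  # shard 4
```

Because the boundaries are sorted, lookup is a binary search, and a range query like "users 1-1000" touches only the shards whose ranges overlap it.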

2. Hash-Based Sharding

Use a hash function to determine shard.

shard_id = hash(user_id) % num_shards

User 12345: hash("12345") = 98765432
            98765432 % 4 = 0 → Shard 0

User 67890: hash("67890") = 54321098
            54321098 % 4 = 2 → Shard 2

Pros:

  • Even distribution (no hot spots)
  • Load balanced naturally

Cons:

  • Range queries must fan out to all shards
  • Resharding is complex (changing num_shards remaps most keys)
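A minimal hash-routing sketch. One caveat worth noting: Python's built-in hash() is randomized per process, so a stable hash (MD5 here) is needed for every application server to route the same key to the same shard:

```python
# Hash-based shard routing sketch. A stable hash (MD5) is used instead
# of Python's built-in hash(), which is randomized per process --
# routing must be deterministic across all application servers.
import hashlib

NUM_SHARDS = 4

def hash_shard(key: str) -> int:
    """Map a key to a shard in 0..NUM_SHARDS-1."""
    digest = hashlib.md5(key.encode()).hexdigest()
    return int(digest, 16) % NUM_SHARDS

# Every server computes the same shard for the same key:
print(hash_shard("12345"))  # always the same value, in 0..3
```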

3. Directory-Based Sharding

Lookup table maps keys to shards.

Lookup Service:
┌───────────┬─────────┐
│ Key       │ Shard   │
├───────────┼─────────┤
│ user_123  │ Shard 2 │
│ user_456  │ Shard 1 │
│ user_789  │ Shard 3 │
└───────────┴─────────┘

Query: "Where is user_456?"
Answer: "Shard 1"

Pros:

  • Maximum flexibility
  • Easy to move data between shards

Cons:

  • Lookup table is a bottleneck
  • Must be highly available
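A directory router is essentially a dictionary lookup. In production the table would live in a highly available store; the plain dict and names below are a stand-in:

```python
# Directory-based routing sketch: an explicit lookup table maps each
# key to its shard. Shard names and the default policy are illustrative.
directory = {
    "user_123": "shard-2",
    "user_456": "shard-1",
    "user_789": "shard-3",
}

def directory_shard(key: str, default: str = "shard-1") -> str:
    # Unknown keys could be assigned round-robin or to the least-loaded
    # shard; a fixed default is just a placeholder choice here.
    return directory.get(key, default)

def move_key(key: str, new_shard: str) -> None:
    # Rebalancing is a table update (plus the actual data copy).
    directory[key] = new_shard

print(directory_shard("user_456"))  # shard-1
```

The flexibility is visible in move_key: relocating a key is a metadata change, which is exactly why the lookup table itself becomes the critical component to keep fast and available.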

Choosing a Shard Key

The shard key determines data distribution. Critical decision!

Good Shard Key Properties

✓ High Cardinality
  Many unique values = even distribution
  Good: user_id (millions of unique values)
  Bad: country (only ~200 values)

✓ Even Distribution
  Data and queries spread equally
  No "hot" shards

✓ Matches Query Patterns
  Related data queried together stays together
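The cardinality point can be checked directly. The sketch below assumes simple MD5-modulo routing and shows that a low-cardinality key can never spread data over more shards than it has distinct values:

```python
# Quick skew check (illustrative): a shard key can use at most as many
# shards as it has distinct values.
import hashlib

NUM_SHARDS = 8

def shard_of(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16) % NUM_SHARDS

countries = ["US", "DE", "JP"]              # cardinality 3
user_ids = [str(i) for i in range(10_000)]  # cardinality 10,000

print(len({shard_of(c) for c in countries}))  # at most 3 of 8 shards used
print(len({shard_of(u) for u in user_ids}))   # all 8 shards used
```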

Common Shard Key Examples

┌───────────────┬─────────────────┬─────────────────────────────────┐
│ Application   │ Shard Key       │ Why                             │
├───────────────┼─────────────────┼─────────────────────────────────┤
│ Social Media  │ user_id         │ User's data accessed together   │
│ E-Commerce    │ seller_id       │ Seller's products on same shard │
│ SaaS Platform │ tenant_id       │ Customer data isolated          │
│ Chat App      │ conversation_id │ Conversation history together   │
│ Gaming        │ player_id       │ Player state on same shard      │
└───────────────┴─────────────────┴─────────────────────────────────┘

Real-World Examples

Common Industry Patterns

Many multi-tenant products shard by tenant_id/workspace_id.
Many consumer products shard by user_id.

Goal: keep "data you fetch together" on the same shard.

Sharding Challenges

Cross-Shard Queries

Query: "Find all users who bought Product X"

Without sharding:
  SELECT * FROM users
  JOIN orders ON users.id = orders.user_id
  WHERE orders.product = 'X'
  → Fast, single database

With sharding:
  Query Shard 1, 2, 3, 4...
  Merge results in application
  → Slow, network overhead

Solutions:

  • Denormalize data
  • Pre-compute aggregations
  • Design queries to hit single shard
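The scatter-gather pattern described above can be sketched with in-memory stand-ins for the shards (contents are illustrative):

```python
# Scatter-gather sketch: fan the query out to every shard, then merge
# results in the application. Dicts stand in for real shard databases.
from concurrent.futures import ThreadPoolExecutor

shards = {
    0: [{"user_id": 1, "product": "X"}],
    1: [{"user_id": 7, "product": "Y"}],
    2: [{"user_id": 9, "product": "X"}],
    3: [],
}

def query_shard(shard_id: int, product: str) -> list:
    # Stand-in for a real per-shard SQL query.
    return [row for row in shards[shard_id] if row["product"] == product]

def find_buyers(product: str) -> list:
    # Query all shards in parallel, then merge. The latency floor is the
    # slowest shard, and total work grows with the number of shards.
    with ThreadPoolExecutor() as pool:
        results = pool.map(lambda s: query_shard(s, product), shards)
    return [row for part in results for row in part]

print(sorted(r["user_id"] for r in find_buyers("X")))  # [1, 9]
```

Even parallelized, every such query pays one network round trip per shard plus a merge step, which is why single-shard query design is the stated goal.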

Resharding (Adding Shards)

Before: 3 shards with hash(key) % 3
After:  4 shards with hash(key) % 4

Problem: hash("12345") % 3 = 1
         hash("12345") % 4 = 2
         → Data needs to move!

Solutions (common approaches):
  - Consistent hashing / hash rings: reduce movement when adding/removing shards
  - Virtual shards (many small logical shards mapped to fewer physical nodes)
  - Range splitting with planned rebalancing
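The benefit of consistent hashing can be measured directly. In the sketch below, nodes are placed on a hash ring (with virtual nodes to smooth the distribution) and a key goes to the first node clockwise from its hash; adding a fourth node then moves only the keys landing in its new arcs, not most of the keyspace as with hash(key) % N:

```python
# Consistent-hashing ring sketch with virtual nodes.
import bisect
import hashlib

def _h(value: str) -> int:
    return int(hashlib.md5(value.encode()).hexdigest(), 16)

class HashRing:
    def __init__(self, nodes, vnodes=100):
        # Each node gets `vnodes` points on the ring to smooth skew.
        self._ring = sorted(
            (_h(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [p for p, _ in self._ring]

    def lookup(self, key: str) -> str:
        # First ring point at or after the key's hash, wrapping around.
        idx = bisect.bisect(self._points, _h(key)) % len(self._ring)
        return self._ring[idx][1]

ring3 = HashRing(["shard-A", "shard-B", "shard-C"])
before = {k: ring3.lookup(k) for k in map(str, range(1000))}

ring4 = HashRing(["shard-A", "shard-B", "shard-C", "shard-D"])
moved = sum(before[k] != ring4.lookup(k) for k in before)
print(f"{moved} of 1000 keys moved")  # roughly 1/4, not ~3/4
```

With naive modulo hashing, going from 3 to 4 shards remaps about three quarters of all keys; here only the share claimed by the new node moves.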

Maintaining Transactions

User transfers money from Account A to Account B

If accounts on different shards:
  - A single-shard database transaction may not be enough
  - Options include distributed transactions (e.g., 2PC) in some systems,
    or application-level patterns like sagas/compensation + eventual consistency
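A compensation-style (saga-like) transfer can be sketched at the application level. The dicts below stand in for accounts on two different shards, and all names are hypothetical:

```python
# Application-level compensation sketch for a cross-shard transfer.
# Plain dicts stand in for per-shard account stores.
def transfer(accounts_a, accounts_b, src, dst, amount):
    """Move `amount` from src (on shard A) to dst (on shard B),
    undoing the debit if the credit fails."""
    if accounts_a[src] < amount:
        raise ValueError("insufficient funds")
    accounts_a[src] -= amount          # step 1: debit on shard A
    try:
        accounts_b[dst] += amount      # step 2: credit on shard B
    except Exception:
        accounts_a[src] += amount      # compensate: undo the debit
        raise

shard_a = {"acct_1": 100}
shard_b = {"acct_2": 0}
transfer(shard_a, shard_b, "acct_1", "acct_2", 40)
print(shard_a, shard_b)  # {'acct_1': 60} {'acct_2': 40}
```

Between the debit and the credit there is a window where the money is in neither account, which is exactly the eventual-consistency trade-off such patterns accept.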

Sharding vs Partitioning

┌────────────┬─────────────────────────────────────────────────────┬──────────────────────────────────────────┐
│ Aspect     │ Partitioning                                        │ Sharding                                 │
├────────────┼─────────────────────────────────────────────────────┼──────────────────────────────────────────┤
│ Scope      │ Usually within one logical database                 │ Across multiple database instances/nodes │
│ Where      │ Can be on one server (or within a clustered system) │ Typically different nodes/instances      │
│ Complexity │ Lower                                               │ Higher                                   │
└────────────┴─────────────────────────────────────────────────────┴──────────────────────────────────────────┘

Partitioning:
┌─────────────────────────────────────┐
│            ONE DATABASE             │
│   ┌─────────┬─────────┬─────────┐   │
│   │ Part A  │ Part B  │ Part C  │   │
│   └─────────┴─────────┴─────────┘   │
└─────────────────────────────────────┘

Sharding:
┌───────────┐   ┌───────────┐   ┌───────────┐
│ Database 1│   │ Database 2│   │ Database 3│
│   Shard   │   │   Shard   │   │   Shard   │
└───────────┘   └───────────┘   └───────────┘

Common Mistakes

1. Sharding Too Early

"We need to shard for scale!"
(Currently have a small user base)

PostgreSQL handles millions of rows just fine.
Shard when you have actual problems.

2. Poor Shard Key Choice

Shard key: created_date

Result: All new data → latest shard (hot spot)
        Old shards sit idle

Better: user_id or tenant_id (spread across shards)

3. Ignoring Cross-Shard Operations

"We'll just JOIN across shards"

Reality: per-shard latency adds up quickly as shard count grows
         Network calls, data transfer, result merging

Design to avoid cross-shard queries.

4. Not Planning for Resharding

"4 shards should be enough forever"

Growth happens. Plan how you'll add shards.
Use consistent hashing from the start.

FAQ

Q: When should I shard?

When a single database (plus indexing/caching/replication) can no longer meet your load, storage, or latency requirements. Exact thresholds vary a lot—avoid sharding preemptively.

Q: Can I shard later?

Yes, but it's painful. Requires data migration, application changes, and careful testing. Easier if designed for it early.

Q: How is sharding different from replication?

Sharding: different data on different servers (splits).
Replication: same data on multiple servers (copies).
Often used together: each shard has replicas.

Q: What about joins across shards?

Avoid them. Denormalize data, use application-level joins, or accept that cross-shard operations are expensive.

Q: What databases support sharding?

Some systems have built-in sharding / distributed partitioning (e.g., MongoDB, Cassandra, CockroachDB). For MySQL, tools like Vitess provide sharding via a middleware/control plane. Many other databases can be sharded at the application layer with proper routing.


Summary

Database sharding splits data across multiple databases for horizontal scaling, enabling systems to grow beyond single-server limits.

Key Takeaways:

  • Distribute data using a shard key
  • Strategies: range-based, hash-based, directory-based
  • Keep related data on the same shard
  • Cross-shard operations are expensive
  • Plan for even distribution
  • Often combined with replication
  • Don't shard until you need to

Sharding can unlock much larger scale — but with significant complexity.
