The Library Branches Analogy
One massive library becomes hard to manage:
- Too many visitors
- Shelves overflowing
- Staff overwhelmed
Solution: Split into branches!
Central Library (overwhelmed)
↓ Split by topic
┌─────────────┬─────────────┬─────────────┐
│  Branch A   │  Branch B   │  Branch C   │
│  Fiction    │ Non-Fiction │  Reference  │
│  10K books  │  10K books  │  10K books  │
└─────────────┴─────────────┴─────────────┘
Each branch handles its own section.
Together, they hold everything.
Visitors go to the branch they need.
Database sharding splits data across multiple databases (shards).
Why Sharding Matters
The Single Database Problem
One database, growing (illustrative numbers):
Year 1: smaller dataset + low QPS → Fast!
Year 2: bigger dataset + more QPS → Slower
Year 3: very large + very high QPS → hardware pressure
Year 4: keeps growing → operational pain 🔥
Vertical Scaling (The Old Way)
"Just buy a bigger server!"
More RAM and bigger disks → $$$$$
Problems:
- Expensive hardware
- Can still be a single point of failure unless you add replication/failover
- Eventually hits limits
Horizontal Scaling with Sharding
"Split the data across multiple servers!"
┌──────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐
│ Shard A  │ │ Shard B  │ │ Shard C  │ │ Shard D  │
│   data   │ │   data   │ │   data   │ │   data   │
└──────────┘ └──────────┘ └──────────┘ └──────────┘
     ↓            ↓            ↓            ↓
    load         load         load         load
Same total data, distributed load.
Add more shards as you grow.
Sharding Strategies
1. Range-Based Sharding
Divide data by ranges of the shard key.
Shard by user_id:
Shard 1: user_id 1 - 1,000,000
Shard 2: user_id 1,000,001 - 2,000,000
Shard 3: user_id 2,000,001 - 3,000,000
Shard 4: user_id 3,000,001+
User 500,000 → Shard 1
User 1,500,000 → Shard 2
Pros:
- Easy to understand
- Range queries are efficient (users 1-1000 all live on the same shard)
Cons:
- Hot spots (new users all on latest shard)
- Uneven distribution if ranges differ in activity
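The range lookup above can be sketched with a sorted list of upper bounds and a binary search. The boundary values are illustrative, matching the ranges shown:

```python
import bisect

# Inclusive upper bounds for shards 1-3; the last shard is open-ended.
# Boundaries are illustrative, matching the ranges above.
RANGE_BOUNDS = [1_000_000, 2_000_000, 3_000_000]

def range_shard(user_id: int) -> int:
    """Return the 1-based shard number that owns this user_id."""
    return bisect.bisect_left(RANGE_BOUNDS, user_id) + 1

print(range_shard(500_000))    # → 1
print(range_shard(1_500_000))  # → 2
```

Because the bounds are sorted, a range scan over contiguous IDs touches a contiguous slice of shards rather than all of them.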
2. Hash-Based Sharding
Use a hash function to determine shard.
shard_id = hash(user_id) % num_shards
User 12345: hash("12345") = 98765432
98765432 % 4 = 0 → Shard 0
User 67890: hash("67890") = 54321098
54321098 % 4 = 2 → Shard 2
Pros:
- Even distribution (no hot spots)
- Load balanced naturally
Cons:
- Range queries across all shards
- Resharding is complex (changing the shard count changes every key's hash % N assignment)
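A minimal sketch of the `hash(user_id) % num_shards` formula above. One assumption worth hedging: the hash must be stable across process restarts, so this uses a persistent digest rather than Python's built-in `hash()`, which is randomized per process for strings:

```python
import hashlib

NUM_SHARDS = 4  # illustrative shard count

def hash_shard(key: str, num_shards: int = NUM_SHARDS) -> int:
    """Route a key to a shard using a stable hash.

    A persistent digest (MD5 here) is used instead of Python's built-in
    hash(), which is randomized per process for strings -- shard
    assignments must survive application restarts.
    """
    digest = hashlib.md5(key.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards

print(hash_shard("12345"))  # always the same shard for the same key
```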
3. Directory-Based Sharding
Lookup table maps keys to shards.
Lookup Service:
┌───────────┬─────────┐
│ Key       │ Shard   │
├───────────┼─────────┤
│ user_123  │ Shard 2 │
│ user_456  │ Shard 1 │
│ user_789  │ Shard 3 │
└───────────┴─────────┘
Query: "Where is user_456?"
Answer: "Shard 1"
Pros:
- Maximum flexibility
- Easy to move data between shards
Cons:
- The lookup service sits on every request path and can become a bottleneck
- It must be highly available; if it goes down, nothing can be routed
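An in-memory stand-in for the lookup service above (in production this table would live in a replicated, highly available store). The entries mirror the table:

```python
# In-memory stand-in for the lookup service; entries mirror the table
# above. A real deployment keeps this in a replicated, HA store.
directory = {"user_123": 2, "user_456": 1, "user_789": 3}

def lookup_shard(key: str) -> int:
    return directory[key]

def move_key(key: str, new_shard: int) -> None:
    # After physically copying the rows, just flip the pointer --
    # this is the flexibility that range/hash schemes lack.
    directory[key] = new_shard

print(lookup_shard("user_456"))  # → 1
```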
Choosing a Shard Key
The shard key determines data distribution. Critical decision!
Good Shard Key Properties
✓ High Cardinality
Many unique values = even distribution
Good: user_id (millions of unique values)
Bad: country (only ~200 values)
✓ Even Distribution
Data and queries spread equally
No "hot" shards
✓ Matches Query Patterns
Related data queried together stays together
Common Shard Key Examples
| Application | Shard Key | Why |
|---|---|---|
| Social Media | user_id | User's data accessed together |
| E-Commerce | seller_id | Seller's products on same shard |
| SaaS Platform | tenant_id | Customer data isolated |
| Chat App | conversation_id | Conversation history together |
| Gaming | player_id | Player state on same shard |
Real-World Examples
Common Industry Patterns
Many multi-tenant products shard by tenant_id/workspace_id.
Many consumer products shard by user_id.
Goal: keep "data you fetch together" on the same shard.
Sharding Challenges
Cross-Shard Queries
Query: "Find all users who bought Product X"
Without sharding:
SELECT * FROM users
JOIN orders ON users.id = orders.user_id
WHERE orders.product = 'X'
→ Fast, single database
With sharding:
Query Shard 1, 2, 3, 4...
Merge results in application
→ Slow, network overhead
Solutions:
- Denormalize data
- Pre-compute aggregations
- Design queries to hit single shard
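The "query every shard, merge in the application" pattern above is often called scatter-gather. A toy sketch, with illustrative in-memory lists standing in for the shards:

```python
# Illustrative in-memory "shards"; real code would issue one query
# per shard (ideally in parallel) and merge the results.
shards = {
    0: [{"user_id": 1, "product": "X"}],
    1: [{"user_id": 7, "product": "Y"}],
    2: [{"user_id": 9, "product": "X"}],
}

def find_buyers(product: str) -> list[int]:
    results = []
    for shard_id, orders in shards.items():      # scatter: hit every shard
        results.extend(o["user_id"] for o in orders
                       if o["product"] == product)
    return sorted(results)                       # gather: merge in the app

print(find_buyers("X"))  # → [1, 9]
```

Note that latency is governed by the slowest shard, which is why the section recommends designing queries to hit a single shard instead.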
Resharding (Adding Shards)
Before: 3 shards with hash(key) % 3
After: 4 shards with hash(key) % 4
Problem: hash("12345") % 3 = 2
         hash("12345") % 4 = 0
→ Data needs to move!
Solutions (common approaches):
- Consistent hashing / hash rings: reduce movement when adding/removing shards
- Virtual shards (many small logical shards mapped to fewer physical nodes)
- Range splitting with planned rebalancing
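A toy consistent-hash ring with virtual nodes, showing why it helps with resharding: adding a fourth shard moves only about a quarter of the keys, instead of the ~3/4 that a mod-N change would move. The class and shard names are illustrative:

```python
import bisect
import hashlib

def stable_hash(key: str) -> int:
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring with virtual nodes (illustrative only)."""

    def __init__(self, shards, vnodes=100):
        # Each physical shard owns many points ("virtual nodes") on the ring.
        self._ring = sorted((stable_hash(f"{s}#{i}"), s)
                            for s in shards for i in range(vnodes))
        self._points = [p for p, _ in self._ring]

    def shard_for(self, key: str) -> str:
        # A key belongs to the first ring point at or after its hash.
        idx = bisect.bisect(self._points, stable_hash(key)) % len(self._ring)
        return self._ring[idx][1]

before = HashRing(["shard-a", "shard-b", "shard-c"])
after = HashRing(["shard-a", "shard-b", "shard-c", "shard-d"])
keys = [f"user_{i}" for i in range(1000)]
moved = sum(before.shard_for(k) != after.shard_for(k) for k in keys)
print(f"{moved} of 1000 keys moved")  # roughly a quarter, vs ~750 with hash % N
```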
Maintaining Transactions
User transfers money from Account A to Account B
If accounts on different shards:
- A single-shard database transaction may not be enough
- Options include distributed transactions (e.g., 2PC) in some systems,
or application-level patterns like sagas/compensation + eventual consistency
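A minimal sketch of the saga/compensation shape for the transfer above, with all names illustrative. Real sagas also need a durable log and idempotent steps; this only shows compensate-on-failure:

```python
# Minimal saga-style transfer across two shards (all names illustrative).
# Real systems also need a durable saga log and idempotent steps; this
# only shows the compensate-on-failure shape.
def transfer(shard_a: dict, shard_b: dict, src: str, dst: str, amount: int):
    shard_a[src] -= amount                # step 1: debit on shard A
    try:
        if dst not in shard_b:
            raise KeyError(dst)
        shard_b[dst] += amount            # step 2: credit on shard B
    except Exception:
        shard_a[src] += amount            # compensate: undo the debit
        raise

accounts_a = {"alice": 100}
accounts_b = {"bob": 50}
transfer(accounts_a, accounts_b, "alice", "bob", 30)
print(accounts_a, accounts_b)  # → {'alice': 70} {'bob': 80}
```

Between the debit and the compensation, other readers can observe Account A without the money: that window is the eventual consistency the text mentions.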
Sharding vs Partitioning
| Aspect | Partitioning | Sharding |
|---|---|---|
| Scope | Usually within one logical database | Across multiple database instances/nodes |
| Where | Can be on one server (or within a clustered system) | Typically different nodes/instances |
| Complexity | Lower | Higher |
Partitioning:
┌─────────────────────────────────────┐
│            ONE DATABASE             │
│   ┌─────────┬─────────┬─────────┐   │
│   │ Part A  │ Part B  │ Part C  │   │
│   └─────────┴─────────┴─────────┘   │
└─────────────────────────────────────┘
Sharding:
┌───────────┐ ┌───────────┐ ┌───────────┐
│ Database 1│ │ Database 2│ │ Database 3│
│   Shard   │ │   Shard   │ │   Shard   │
└───────────┘ └───────────┘ └───────────┘
Common Mistakes
1. Sharding Too Early
"We need to shard for scale!"
(Currently have a small user base)
PostgreSQL handles millions of rows just fine.
Shard when you have actual problems.
2. Poor Shard Key Choice
Shard key: created_date
Result: All new data → latest shard (hot spot)
Old shards sit idle
Better: user_id or tenant_id (spread across shards)
3. Ignoring Cross-Shard Operations
"We'll just JOIN across shards"
Reality: per-shard latency adds up quickly as shard count grows
Network calls, data transfer, result merging
Design to avoid cross-shard queries.
4. Not Planning for Resharding
"4 shards should be enough forever"
Growth happens. Plan how you'll add shards.
Use consistent hashing from the start.
FAQ
Q: When should I shard?
When a single database (plus indexing/caching/replication) can no longer meet your load, storage, or latency requirements. Exact thresholds vary a lot—avoid sharding preemptively.
Q: Can I shard later?
Yes, but it's painful. Requires data migration, application changes, and careful testing. Easier if designed for it early.
Q: How is sharding different from replication?
Sharding: different data on different servers (splits).
Replication: same data on multiple servers (copies).
Often used together: each shard has its own replicas.
Q: What about joins across shards?
Avoid them. Denormalize data, use application-level joins, or accept that cross-shard operations are expensive.
Q: What databases support sharding?
Some systems have built-in sharding / distributed partitioning (e.g., MongoDB, Cassandra, CockroachDB). For MySQL, tools like Vitess provide sharding via a middleware/control plane. Many other databases can be sharded at the application layer with proper routing.
Summary
Database sharding splits data across multiple databases for horizontal scaling, enabling systems to grow beyond single-server limits.
Key Takeaways:
- Distribute data using a shard key
- Strategies: range-based, hash-based, directory-based
- Keep related data on the same shard
- Cross-shard operations are expensive
- Plan for even distribution
- Often combined with replication
- Don't shard until you need to
Sharding can unlock much larger scale — but with significant complexity.