The Health Dashboard Analogy
Hospital patient monitors show:
- Heart rate (too fast? too slow?)
- Blood pressure (too high? too low?)
- Oxygen levels (dropping?)
- Temperature (fever?)
Doctors instantly know if something's wrong. Alarms sound before it's critical.
Monitoring gives your servers a health dashboard. Track vital signs, get alerted when things go wrong.
Why Monitoring Matters
Without Monitoring
User: "The site is super slow!"
You: "Really? Let me check..."
(SSH into servers, check manually...)
"Oh wow, CPU is at 100%!"
Hours of damage before you even knew!
With Monitoring
Alert: "CPU > 90% for a sustained period"
You: (Check dashboard, see spike, investigate)
"Found the runaway process, killing it now"
Minutes to resolution. Users barely noticed.
What Gets Monitored
Infrastructure Metrics
| Metric | What It Tells You | Warning Sign |
|---|---|---|
| CPU % | Processing load | > 80% sustained |
| Memory % | RAM usage | > 85% |
| Disk % | Storage space | > 80% |
| Network I/O | Data in/out | Unusual spikes |
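As a rough sketch, a few of these vitals can be sampled with Python's standard library alone. Load average and the `/` mount point are Unix assumptions here; memory sampling is platform-specific, so in practice a cross-platform library like `psutil` is the usual choice:

```python
import os
import shutil

def infra_snapshot(path="/"):
    """Sample a couple of infrastructure vitals (Unix-only stdlib calls)."""
    load1, load5, load15 = os.getloadavg()   # 1/5/15-min run-queue averages
    cpus = os.cpu_count() or 1
    disk = shutil.disk_usage(path)
    return {
        "load_per_cpu": load1 / cpus,             # > 1.0 means CPUs are saturated
        "disk_pct": 100 * disk.used / disk.total, # warn above ~80%
    }

snap = infra_snapshot()
```

A real agent would collect these on a schedule and ship them to a metrics backend rather than reading them ad hoc.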
Application Metrics
| Metric | What It Tells You | Warning Sign |
|---|---|---|
| Request rate | Traffic volume | Sudden drops |
| Error rate | Things failing | > 1% |
| Latency (p50, p95, p99) | Response time | Increasing trend |
| Throughput | Requests/second | Below baseline |
Business Metrics
- Signups per hour
- Orders per minute
- Revenue per hour
- Cart abandonment rate
Sometimes the BUSINESS metric catches problems first!
"Signups dropped 50%" → investigate technical cause
The RED Method (Services)
For monitoring services:
R - Rate
How many requests per second?
E - Error Rate
What percentage are failing?
D - Duration
How long do requests take? (p50, p95, p99)
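A minimal sketch of computing all three RED numbers from raw request records. The `(status, latency_ms)` input shape and the "status >= 500 counts as an error" rule are assumptions for illustration:

```python
from statistics import quantiles

def red_metrics(requests, window_seconds):
    """Rate, Errors, Duration from a list of (status, latency_ms) tuples."""
    rate = len(requests) / window_seconds
    errors = sum(1 for status, _ in requests if status >= 500)
    error_pct = 100 * errors / len(requests) if requests else 0.0
    latencies = sorted(ms for _, ms in requests)
    # quantiles(..., n=100) returns the 1st..99th percentile cut points
    cuts = quantiles(latencies, n=100)
    return {
        "rate_rps": rate,
        "error_pct": error_pct,
        "p50_ms": cuts[49],
        "p95_ms": cuts[94],
        "p99_ms": cuts[98],
    }
```

Percentiles (p95, p99) matter more than averages here: a healthy average can hide a miserable tail.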
Dashboard Example
┌────────────────────────────────────────┐
│ API Service - RED Metrics │
│ │
│ Rate: (requests/sec) │
│ Errors: (low / medium / high) │
│ Duration: (p50 / p95 / p99 latency) │
└────────────────────────────────────────┘
The USE Method (Resources)
For monitoring resources (CPU, memory, disk):
U - Utilization
What percentage busy?
S - Saturation
How much work is waiting (queue length)?
E - Errors
How many errors encountered?
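The three USE questions can be captured as a small record per resource, with a check that mirrors the priority order below (errors first, then saturation, then utilization). Thresholds here are illustrative, not prescriptive:

```python
from dataclasses import dataclass

@dataclass
class ResourceUSE:
    name: str
    utilization_pct: float  # U: how busy the resource is
    saturation: float       # S: queued work, e.g. run-queue length per CPU
    errors: int             # E: error events observed

def use_verdict(r: ResourceUSE) -> str:
    if r.errors > 0:
        return f"{r.name}: errors present - investigate hardware/driver"
    if r.saturation > 1.0:
        return f"{r.name}: saturated - work is queuing, users feel this"
    if r.utilization_pct > 80:
        return f"{r.name}: high utilization - headroom is shrinking"
    return f"{r.name}: ok"
```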
When to Worry
CPU Utilization: 90% → High, but not critical
CPU Saturation: High → Processes waiting, bad user experience!
CPU Errors: Any → Hardware problem?
Alerting
Setting Up Alerts
Define conditions:
- What metric to watch
- What threshold triggers alert
- How long before alerting
- Who gets notified
Example:
Metric: CPU usage
Threshold: > 90%
Duration: a sustained period
Notify: #oncall-slack, PagerDuty
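The "how long before alerting" condition is the key anti-noise ingredient. A minimal sketch of a threshold-plus-duration rule, with illustrative numbers (real systems express this declaratively, e.g. Prometheus's `for:` clause):

```python
import time

class ThresholdAlert:
    """Fire only after the metric stays above threshold for duration_s seconds.

    The duration requirement filters out brief spikes, a common source
    of alert fatigue.
    """
    def __init__(self, threshold, duration_s):
        self.threshold = threshold
        self.duration_s = duration_s
        self._breach_start = None

    def observe(self, value, now=None):
        now = time.monotonic() if now is None else now
        if value <= self.threshold:
            self._breach_start = None        # condition cleared: reset the timer
            return False
        if self._breach_start is None:
            self._breach_start = now         # breach just began
        return now - self._breach_start >= self.duration_s

alert = ThresholdAlert(threshold=90.0, duration_s=300)  # illustrative values
```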
Good vs Bad Alerts
Good Alert:
"API error rate > 5% for a sustained period"
Actionable! Someone should investigate.
Bad Alert:
"CPU briefly hit 80%"
Not actionable. Creates alert fatigue.
Rule: Alert on things that need human action.
Alert Fatigue
Too many alerts = people ignore them
"Oh, another CPU warning, it's probably fine..."
[Actual critical problem ignored]
[Outage]
Be ruthless about removing noisy alerts!
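Besides deleting noisy alerts outright, deduplication helps: one page per incident, not one per evaluation cycle. A sketch of a per-alert cooldown (the cooldown length is an illustrative choice):

```python
class AlertDeduper:
    """Suppress repeat notifications for the same alert within a cooldown."""
    def __init__(self, cooldown_s=900):
        self.cooldown_s = cooldown_s
        self._last_sent = {}

    def should_notify(self, alert_name, now):
        last = self._last_sent.get(alert_name)
        if last is not None and now - last < self.cooldown_s:
            return False                  # still in cooldown: stay quiet
        self._last_sent[alert_name] = now
        return True
```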
Dashboards
What to Display
┌─────────────────────────────────────────────────┐
│ Production Dashboard │
│ │
│ Request Rate: (requests/sec) [▁▂▃▄▅▆▇█▇▆▅▄▃▂▁] │
│ Error Rate: (low) [▁▁▁▁▁▂▁▁▁▁▁▁] │
│ P99 Latency: (target) [▂▂▃▃▂▂▂▃▇▃▂▂] │
│ │
│ Top 5 Errors: │
│ 1. Connection timeout (23) │
│ 2. Invalid input (12) │
│ 3. Rate limited (8) │
└─────────────────────────────────────────────────┘
Dashboard Tips
✓ Show key metrics at a glance
✓ Time ranges (last hour, day, week)
✓ Comparison to baseline/yesterday
✓ Drill-down capability
✓ Clear indication of good/bad
Popular Monitoring Tools
| Tool | Type | Often Used For |
|---|---|---|
| Prometheus | Metrics collection | Kubernetes-native |
| Grafana | Visualization | Dashboards |
| Datadog | All-in-one | Full observability |
| New Relic | APM | Application performance |
| PagerDuty | Alerting | On-call management |
| CloudWatch | AWS native | AWS environments |
The Observability Stack
Logging + Monitoring + Tracing = Observability
Logging: What happened? (events)
Monitoring: How is it performing? (metrics)
Tracing: How do requests flow? (distributed tracing)
All three together = full visibility.
SLIs, SLOs, and SLAs
Definitions
SLI (Service Level Indicator):
The metric you measure
"API latency p99"
SLO (Service Level Objective):
Your internal target
"p99 latency under a chosen target"
SLA (Service Level Agreement):
Your customer promise
"high uptime, and latency under an agreed target"
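The relationship between the three can be shown in a few lines: the SLI is what you measure, the SLO is the bar you hold it to. Numbers below are hypothetical:

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of requests that succeeded in the window."""
    return good_events / total_events if total_events else 1.0

def meets_slo(sli: float, slo_target: float) -> bool:
    """SLO check: did the measured SLI hit the internal target?"""
    return sli >= slo_target
```

For example, 999,500 good requests out of 1,000,000 is an SLI of 0.9995, which would meet a 0.999 (99.9%) availability SLO. The SLA is then the looser, externally promised version of the same target.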
Error Budgets
SLO: high availability (an agreed target)
Error budget: the small amount of downtime the SLO still permits within its measurement window
Use the budget wisely!
Deploy new features (might cause issues)
vs. freeze changes (burn no budget)
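The budget arithmetic itself is simple. A sketch, using the well-known example of a 99.9% availability target over a 30-day window (which allows about 43.2 minutes of downtime):

```python
def error_budget_minutes(slo_target: float, window_days: int = 30) -> float:
    """Total allowed downtime (minutes) in the window for an availability SLO."""
    total_minutes = window_days * 24 * 60
    return (1 - slo_target) * total_minutes

def budget_remaining(slo_target: float, downtime_minutes: float,
                     window_days: int = 30) -> float:
    """Budget left after the downtime already spent this window."""
    return error_budget_minutes(slo_target, window_days) - downtime_minutes
```

Teams commonly gate risky deploys on the remaining budget: plenty left, ship freely; nearly spent, slow down.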
Common Mistakes
1. Not Monitoring
"We'll add monitoring later"
Later = after the first outage you can't diagnose.
Add monitoring from the start.
2. Alerting on Everything
Alert fatigue → ignored alerts → missed real problems
3. No Historical Data
"Is today's traffic normal?"
Without history, you can't know!
Keep metrics for weeks/months/years.
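Stored history is what makes "is this normal?" answerable. One simple use of it is a z-score check against recent values; the 3-sigma threshold is a common but illustrative default:

```python
from statistics import mean, stdev

def is_anomalous(today_value, history, z_threshold=3.0):
    """Flag today's metric if it sits far outside historical variation."""
    if len(history) < 2:
        return False                 # not enough history to judge
    mu, sigma = mean(history), stdev(history)
    if sigma == 0:
        return today_value != mu
    return abs(today_value - mu) / sigma > z_threshold
```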
4. Dashboard Overload
50 metrics on one screen = can't see anything
Focus on what matters. Drill down when needed.
FAQ
Q: Monitoring vs Logging?
Monitoring: aggregated numbers (metrics, trends)
Logging: individual events (text records)
Use both!
Q: How often to collect metrics?
Every few seconds to about a minute is common. More frequent = more cost.
Q: What about costs?
Metrics storage and monitoring tools can get expensive at scale. Plan for it.
Q: Should I monitor in development?
Basic monitoring, yes. Catches performance problems before production.
Summary
Monitoring tracks system health through metrics and alerts you when things go wrong.
Key Takeaways:
- Track CPU, memory, disk, latency, errors
- RED for services, USE for resources
- Set alerts on actionable thresholds
- Avoid alert fatigue (be selective!)
- Use dashboards for visibility
- SLOs define your targets
- Logging + Monitoring + Tracing = Observability
Good monitoring means problems wake you up at 3am, not angry customers at 9am.