
📊 Monitoring

Heart rate monitor for servers

The Health Dashboard Analogy

Hospital patient monitors show:

  • Heart rate (too fast? too slow?)
  • Blood pressure (too high? too low?)
  • Oxygen levels (dropping?)
  • Temperature (fever?)

Doctors instantly know if something's wrong. Alarms sound before it's critical.

Monitoring gives your servers a health dashboard. Track vital signs, get alerted when things go wrong.


Why Monitoring Matters

Without Monitoring

User:  "The site is super slow!"
You:   "Really? Let me check..."
       (SSH into servers, check manually...)
       "Oh wow, CPU is at 100%!"

Hours of damage before you even knew!

With Monitoring

Alert: "CPU > 90% for a sustained period"
You:   (Check dashboard, see spike, investigate)
       "Found the runaway process, killing it now"

Minutes to resolution. Users barely noticed.

What Gets Monitored

Infrastructure Metrics

Metric        What It Tells You   Warning Sign
CPU %         Processing load     > 80% sustained
Memory %      RAM usage           > 85%
Disk %        Storage space       > 80%
Network I/O   Data in/out         Unusual spikes

Application Metrics

Metric                   What It Tells You   Warning Sign
Request rate             Traffic volume      Sudden drops
Error rate               Things failing      > 1%
Latency (p50, p95, p99)  Response time       Increasing trend
Throughput               Requests/second     Below baseline
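Latency percentiles like p50, p95, and p99 are computed by sorting a sample of request durations and picking the value at the given rank. A minimal sketch using the nearest-rank method (the sample latencies are hypothetical):

```python
import math

def percentile(samples, p):
    """Return the p-th percentile (0-100) of a list of latency samples,
    using the nearest-rank method."""
    ordered = sorted(samples)
    # Index of the first sample that covers p percent of the data.
    k = max(0, math.ceil(p / 100 * len(ordered)) - 1)
    return ordered[k]

# Hypothetical request durations in milliseconds; note the two outliers.
latencies_ms = [12, 15, 11, 250, 14, 13, 16, 12, 900, 14]

print("p50:", percentile(latencies_ms, 50))  # typical request
print("p99:", percentile(latencies_ms, 99))  # worst-case tail
```

This is why p99 matters: the median (p50) looks healthy while the slowest 1% of users are waiting on the 900 ms outlier.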

Business Metrics

  • Signups per hour
  • Orders per minute
  • Revenue per hour
  • Cart abandonment rate

Sometimes the BUSINESS metric catches problems first!
"Signups dropped 50%" → investigate technical cause
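A business-metric check like "signups dropped 50%" can be as simple as comparing the current value to a baseline. A minimal sketch, where the 50% threshold and the sample numbers are hypothetical choices, not universal rules:

```python
def business_metric_alert(current, baseline, drop_threshold=0.5):
    """Flag when a business metric falls below a fraction of its baseline.

    drop_threshold=0.5 means "alert on a 50% or larger drop" -- an
    illustrative policy, tune it to your own traffic patterns.
    """
    if baseline == 0:
        return False  # no baseline to compare against
    drop = (baseline - current) / baseline
    return drop >= drop_threshold

# Hypothetical: signups this hour vs. the same hour last week.
print(business_metric_alert(current=48, baseline=100))  # True: 52% drop
print(business_metric_alert(current=95, baseline=100))  # False: normal noise
```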

The RED Method (Services)

For monitoring services:

R - Rate
    How many requests per second?

E - Error Rate
    What percentage are failing?

D - Duration
    How long do requests take? (p50, p95, p99)

Dashboard Example

┌────────────────────────────────────────┐
│  API Service - RED Metrics             │
│                                        │
│  Rate:     (requests/sec)              │
│  Errors:   (low / medium / high)       │
│  Duration: (p50 / p95 / p99 latency)   │
└────────────────────────────────────────┘
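The three RED numbers on a dashboard like this can all be derived from a window of request records. A minimal sketch, assuming each request carries a duration and a success/failure flag (the synthetic traffic below is made up for illustration):

```python
from dataclasses import dataclass

@dataclass
class Request:
    duration_ms: float
    failed: bool

def red_metrics(requests, window_seconds):
    """Compute Rate, Error rate, and Duration percentiles over a window."""
    rate = len(requests) / window_seconds            # R: requests per second
    error_rate = sum(r.failed for r in requests) / len(requests)  # E
    durations = sorted(r.duration_ms for r in requests)
    p50 = durations[len(durations) // 2]             # D: median latency
    p99 = durations[int(len(durations) * 0.99)]      # D: tail latency
    return rate, error_rate, p50, p99

# 100 synthetic requests over a 10-second window; every 50th one fails.
reqs = [Request(duration_ms=10 + i, failed=(i % 50 == 0)) for i in range(100)]
rate, errors, p50, p99 = red_metrics(reqs, window_seconds=10)
print(f"Rate: {rate}/s, Errors: {errors:.0%}, p50: {p50}ms, p99: {p99}ms")
```

Real systems aggregate this with a metrics library rather than holding raw requests in memory, but the arithmetic is the same.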

The USE Method (Resources)

For monitoring resources (CPU, memory, disk):

U - Utilization
    What percentage busy?

S - Saturation
    How much work is waiting (queue length)?

E - Errors
    How many errors encountered?

When to Worry

CPU Utilization: 90% → High, but not critical
CPU Saturation: High → Processes waiting, bad user experience!
CPU Errors: Any → Hardware problem?
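The key insight of USE is that utilization and saturation are different signals: a resource can be busy but healthy, or only moderately busy yet making work wait. A minimal sketch of that classification, with illustrative thresholds (not standards):

```python
def use_check(busy_fraction, queue_length, error_count):
    """Classify a resource with the USE method.

    Thresholds are illustrative; tune them per resource.
    """
    findings = []
    if busy_fraction > 0.9:
        findings.append("high utilization")
    if queue_length > 0:
        findings.append("saturated: work is waiting")
    if error_count > 0:
        findings.append("errors: possible hardware fault")
    return findings

# A CPU can be 90% busy with an empty run queue (probably fine)...
print(use_check(busy_fraction=0.9, queue_length=0, error_count=0))
# ...or only 70% busy with a long run queue (users are waiting).
print(use_check(busy_fraction=0.7, queue_length=12, error_count=0))
```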

Alerting

Setting Up Alerts

Define conditions:
  - What metric to watch
  - What threshold triggers alert
  - How long before alerting
  - Who gets notified

Example:
  Metric: CPU usage
  Threshold: > 90%
  Duration: a sustained period
  Notify: #oncall-slack, PagerDuty
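The "how long before alerting" part is what keeps brief spikes from paging anyone: the condition must hold continuously for the whole duration before the alert fires. A minimal sketch of that logic (the threshold and window values are placeholders for whatever your alerting policy defines):

```python
class SustainedAlert:
    """Fire only when a metric stays above a threshold for a full window."""

    def __init__(self, threshold, window_seconds):
        self.threshold = threshold
        self.window = window_seconds
        self.breach_started = None  # timestamp when the breach began

    def observe(self, value, now):
        """Record one metric sample; return True if the alert should fire."""
        if value <= self.threshold:
            self.breach_started = None  # condition cleared; reset the clock
            return False
        if self.breach_started is None:
            self.breach_started = now   # breach begins
        return now - self.breach_started >= self.window

alert = SustainedAlert(threshold=90, window_seconds=300)
print(alert.observe(95, now=0))    # False: breach just started
print(alert.observe(95, now=300))  # True: sustained for the full window
```

A brief spike to 95% that recovers before the window elapses never pages anyone, which is exactly the "bad alert" case from the next section.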

Good vs Bad Alerts

Good Alert:
  "API error rate > 5% for a sustained period"
  Actionable! Someone should investigate.

Bad Alert:
  "CPU briefly hit 80%"
  Not actionable. Creates alert fatigue.

Rule: Alert on things that need human action.

Alert Fatigue

Too many alerts = people ignore them

"Oh, another CPU warning, it's probably fine..."
[Actual critical problem ignored]
[Outage]

Be ruthless about removing noisy alerts!

Dashboards

What to Display

┌─────────────────────────────────────────────────┐
│  Production Dashboard                            │
│                                                  │
│  Request Rate: (requests/sec) [▁▂▃▄▅▆▇█▇▆▅▄▃▂▁] │
│  Error Rate:   (low)         [▁▁▁▁▁▂▁▁▁▁▁▁]     │
│  P99 Latency:  (target)      [▂▂▃▃▂▂▂▃▇▃▂▂]     │
│                                                  │
│  Top 5 Errors:                                   │
│  1. Connection timeout (23)                      │
│  2. Invalid input (12)                           │
│  3. Rate limited (8)                             │
└─────────────────────────────────────────────────┘

Dashboard Tips

✓ Show key metrics at a glance
✓ Time ranges (last hour, day, week)
✓ Comparison to baseline/yesterday
✓ Drill-down capability
✓ Clear indication of good/bad

Common Tools

Tool         Type                Often Used For
Prometheus   Metrics collection  Kubernetes-native
Grafana      Visualization       Dashboards
Datadog      All-in-one          Full observability
New Relic    APM                 Application performance
PagerDuty    Alerting            On-call management
CloudWatch   AWS native          AWS environments

The Observability Stack

Logging + Monitoring + Tracing = Observability

Logging:    What happened? (events)
Monitoring: How is it performing? (metrics)
Tracing:    How do requests flow? (distributed tracing)

All three together = full visibility.

SLIs, SLOs, and SLAs

Definitions

SLI (Service Level Indicator):
  The metric you measure
  "API latency p99"

SLO (Service Level Objective):
  Your internal target
  "p99 latency under a chosen target"

SLA (Service Level Agreement):
  Your customer promise
  "high uptime, and latency under an agreed target"

Error Budgets

SLO: high availability (an agreed target)

Error budget: a small amount of downtime is allowed within the SLO window

Use the budget wisely!
  Deploy new features (might cause issues)
  vs. freeze changes (burn no budget)
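Turning an availability SLO into a concrete error budget is simple arithmetic. A minimal sketch, where the 99.9% SLO and 30-day window are hypothetical examples, not recommendations:

```python
def error_budget_minutes(slo, window_days):
    """Minutes of downtime allowed in the window for an availability SLO.

    slo is a fraction, e.g. 0.999 for a hypothetical "three nines" target.
    """
    total_minutes = window_days * 24 * 60
    return total_minutes * (1 - slo)

budget = error_budget_minutes(slo=0.999, window_days=30)
print(f"{budget:.1f} minutes of downtime allowed this window")
```

Each extra nine shrinks the budget tenfold, which is why tightening an SLO is an expensive decision.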

Common Mistakes

1. Not Monitoring

"We'll add monitoring later"

Later = after the first outage you can't diagnose.
Add monitoring from the start.

2. Alerting on Everything

Alert fatigue → ignored alerts → missed real problems

3. No Historical Data

"Is today's traffic normal?"

Without history, you can't know!
Keep metrics for weeks/months/years.

4. Dashboard Overload

50 metrics on one screen = can't see anything

Focus on what matters. Drill down when needed.

FAQ

Q: Monitoring vs Logging?

Monitoring: aggregated numbers (metrics, trends)
Logging:    individual events (text records)

Use both!

Q: How often to collect metrics?

Every few seconds to about a minute is common. More frequent = more cost.

Q: What about costs?

Metrics storage and monitoring tools can get expensive at scale. Plan for it.

Q: Should I monitor in development?

Basic monitoring, yes. Catches performance problems before production.


Summary

Monitoring tracks system health through metrics and alerts you when things go wrong.

Key Takeaways:

  • Track CPU, memory, disk, latency, errors
  • RED for services, USE for resources
  • Set alerts on actionable thresholds
  • Avoid alert fatigue (be selective!)
  • Use dashboards for visibility
  • SLOs define your targets
  • Logging + Monitoring + Tracing = Observability

Good monitoring means problems wake you up at 3am, not angry customers at 9am.

