Monitoring Anti-Patterns
- "No silver bullets", that is, it's impossible to find a single tool for monitoring.
- Use APM tools to monitor applications at the code level.
- Use modern server monitoring solutions to monitor the performance of cloud infrastructure.
- Use network monitoring tools for spanning tree topology changes or routing updates.
- Sometimes, you really do have to build one.
- "The single pane of class is a myth." Feed multiple dashboards by multiple tools.
- Monitoring is not a job, it's a skill, and it belongs to the whole team. Don't let everyone shirk the responsibility of monitoring by resting it solely on the shoulders of a single person.
- Checkbox monitoring is having monitoring systems for the sole sake of saying you have them: you have a monitoring system, but you still don't know what's going on, or you constantly ignore its alerts.
- UNDERSTAND what it is you're monitoring. Perform high-level checks that indicate something could be wrong, even though they don't necessarily tell you what is wrong.
- OS metrics are useful for diagnostics and performance analysis, but not useful for alerting.
- Collecting metrics at high granularity can catch issues; meanwhile, configure a retention policy that makes sense for each metric.
- Don't forget that the next step after monitoring is FIXING THE PROBLEMS. Monitoring doesn't fix a broken system.
- Automation is important: at scale, monitoring becomes an exercise in analyzing the aggregate of entire groups of systems rather than one or two.
Monitoring Design Patterns
- Composable Monitoring - use multiple specialized tools and couple them loosely together, forming a monitoring platform. A monitoring service has five primary facets:
- Data Collection
- Push-based vs. pull-based are the two approaches to data collection. SNMP and the /health endpoint pattern are pull-based. A pull model can be difficult to scale, since it requires a central system to keep track of all known clients, handle scheduling, and parse returned data. A push model is easier to scale in a distributed architecture. (A minimal pull-based /health sketch follows this list.)
- Metric representations: counters (ever-increasing values) and gauges (point-in-time values).
- Structured (e.g., JSON) vs. unstructured logs.
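A minimal sketch of the pull-based /health endpoint pattern, exposing one counter and one gauge, using only the Python standard library. The metric names and the JSON payload shape are illustrative assumptions, not a standard:

```python
# A poller (the "pull" side) hits this endpoint on a schedule and
# parses the JSON it gets back.
import json
import time
from http.server import BaseHTTPRequestHandler, HTTPServer

START_TIME = time.time()
REQUESTS_SERVED = 0  # counter: only ever increases


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        global REQUESTS_SERVED
        if self.path != "/health":
            self.send_error(404)
            return
        REQUESTS_SERVED += 1
        payload = {
            "status": "ok",
            "requests_served": REQUESTS_SERVED,            # counter
            "uptime_seconds": time.time() - START_TIME,    # gauge
        }
        body = json.dumps(payload).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("", 8080), HealthHandler).serve_forever()
```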
- Data Storage
- Use a Time Series Database (TSDB) for metrics. As the data gets older, roll up (or age out) multiple data points into a single data point (see the roll-up sketch after this list).
- Storing logs can get expensive. Compression & Retention policies can help.
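A minimal sketch of rolling up a time series: collapse raw points into coarser buckets by averaging. The bucket size and the choice of averaging are assumptions; real TSDBs typically also keep min/max/count per bucket:

```python
from statistics import mean


def roll_up(points, bucket_seconds):
    """points: list of (unix_timestamp, value), assumed sorted by time."""
    buckets = {}
    for ts, value in points:
        bucket = ts - (ts % bucket_seconds)  # start of the bucket window
        buckets.setdefault(bucket, []).append(value)
    return [(bucket, mean(values)) for bucket, values in sorted(buckets.items())]


# 10-second raw samples rolled up into 60-second averages.
raw = [(0, 1.0), (10, 2.0), (30, 3.0), (60, 10.0), (70, 20.0)]
print(roll_up(raw, 60))  # [(0, 2.0), (60, 15.0)]
```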
- Visualization
- Use dashboard products and frameworks, such as Grafana, ...
- The best dashboards focus on displaying the status of a single service or one product.
- Analytics and reporting
- Report Service-Level Availability (SLA), a.k.a. "the number of nines": availability = uptime / total time (see the sketch after this list).
- In a complex architecture, computing the SLA of each and every component the app depends on can be tricky once you account for redundancy. Instead, just calculate the availability of the service as a whole.
- 100% availability is unrealistic.
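A small sketch relating "nines" to allowed downtime, using the availability = uptime / total time formula from the notes above:

```python
def allowed_downtime_seconds(nines, period_seconds):
    availability = 1 - 10 ** -nines   # e.g. 3 nines -> 0.999
    return period_seconds * (1 - availability)


YEAR = 365 * 24 * 3600
for nines in (2, 3, 4, 5):
    hours = allowed_downtime_seconds(nines, YEAR) / 3600
    print(f"{nines} nines: ~{hours:.2f} hours of downtime per year")
# 2 nines: ~87.60 h, 3 nines: ~8.76 h, 4 nines: ~0.88 h, 5 nines: ~0.09 h
```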
- Alerting
- Not every metric needs to have a corresponding alert.
- Monitor from the User Perspective
- The best place to add monitoring first is at the point where users interact with the app.
- Start by monitoring HTTP response codes and request time (a.k.a. latency); a sketch follows this list.
- When adding metrics, ask yourself, "How will these metrics show me the user impact?"
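A minimal sketch of user-facing monitoring: a WSGI middleware that records the HTTP status code and latency of every request. The record_metric function is a hypothetical stand-in for whatever metrics client you use (StatsD, Prometheus, etc.):

```python
import time


def record_metric(name, value, tags):
    # Hypothetical stand-in: in practice, send to your metrics pipeline.
    print(f"{name}={value} {tags}")


class RequestMetricsMiddleware:
    """Wraps any WSGI app and records status code + latency per request."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        start = time.monotonic()
        path = environ.get("PATH_INFO", "")

        def capturing_start_response(status, headers, exc_info=None):
            # status is e.g. "200 OK"; keep just the numeric code
            record_metric("http.status", int(status.split()[0]), {"path": path})
            return start_response(status, headers, exc_info)

        try:
            return self.app(environ, capturing_start_response)
        finally:
            latency_ms = (time.monotonic() - start) * 1000
            record_metric("http.latency_ms", round(latency_ms, 2), {"path": path})
```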
- Buy, Not Build
- Opt for buying monitoring tools where possible, either SaaS services or FOSS monitoring tools, instead of building them yourself.
- It costs less compared to FTEs (full-time employees).
- You're (Probably) not an expert at architecting monitoring tools.
- Always be improving
Monitoring & Alerts
- Monitoring is the action of observing and checking the behavior and outputs of a system and its components over time.
- Alerting on raw data points can produce lots of false alarms, since system metrics tend to be spiky. A rolling average is often applied to smooth the data out (see the sketch below).
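A minimal sketch of smoothing a spiky metric with a rolling average before comparing it to a threshold; the window size and threshold here are illustrative assumptions:

```python
from collections import deque


class RollingAverageAlert:
    def __init__(self, window=5, threshold=90.0):
        self.samples = deque(maxlen=window)  # keeps only the last N samples
        self.threshold = threshold

    def observe(self, value):
        self.samples.append(value)
        avg = sum(self.samples) / len(self.samples)
        return avg > self.threshold  # True means "alert"


alert = RollingAverageAlert(window=3, threshold=90.0)
for cpu in (50, 99, 60, 95, 96, 97):  # one spike vs. a sustained rise
    print(cpu, alert.observe(cpu))
# The lone 99 spike never fires; the sustained 95/96/97 run does.
```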
- Good Alerts
- An alert means someone needs to wake up and act; an FYI is just a message.
- Some practices:
- Choose alert targets wisely: SMS or PagerDuty when a response or action is required immediately; Slack or IRC when awareness is needed but immediate action is not; logs for historical/diagnostic purposes. Just not email.
- Write runbooks. A runbook should answer: what the service is (introduction), who owns it, an architecture diagram, its dependencies, and the meaning of its metrics, logs, and alerts.
- Arbitrary static thresholds aren't the only way. Try moving averages, confidence bands, standard deviation, etc.
- Delete and tune alerts. This helps avoid alert fatigue.
- Use maintenance periods.
- Attempt self-healing first: let the monitoring system execute a script to fix the problem instead of notifying a human. If an auto-healing attempt doesn't resolve the problem, then send an alert (a sketch follows this list).
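A minimal sketch of "self-heal first, alert second". The service name, the remediation (a systemd restart), and page_human are all assumptions standing in for your own health check, remediation script, and paging integration:

```python
import subprocess
import time


def check_service():
    # Assumption: a systemd-managed service; exit code 0 means healthy.
    return subprocess.run(["systemctl", "is-active", "--quiet", "myapp"]).returncode == 0


def restart_service():
    subprocess.run(["systemctl", "restart", "myapp"])


def page_human(message):
    print(f"PAGE: {message}")  # stand-in for PagerDuty/SMS


def handle_failure():
    if check_service():
        return
    restart_service()          # try to fix it automatically first
    time.sleep(10)             # give the service time to come back
    if not check_service():    # only page if self-healing failed
        page_human("myapp is down and automatic restart did not recover it")
```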
Monitoring & On-Calls
- Fix false alarms, or at least cut them by an appreciable amount.
- Explicitly plan for system resiliency and stability work during team meetings / sprint planning.
- Build an on-call rotation.