Incident Management: Monitoring

When running a production system, one of the main responsibilities is being able to respond when things go wrong. This is especially important for newly-launched or rapidly-changing systems where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

Monitoring (a computer is aware of your system’s health)
An escalation path (when monitoring doesn’t work)
Alerting (how to notify when something breaks)
An on-call rotation (who to notify when something breaks)

This post is the first in a series on incident readiness, starting with some tips on how to implement the first capability: monitoring!

Gathering metrics from your system and representing them in a way that is easy to interpret is key to minimizing the time it will take to remediate an incident.

Types of Metrics

Top-Level Metrics (SLIs)

When starting the process of building monitoring for your service, ask yourself the following questions:

What does it mean to be ‘up’?
What aspects of my service do customers care about the most that they can observe themselves?

The answers should reveal a small set of metrics that can be used to quickly assess system health. In the SRE world, we call these Service Level Indicators (SLIs).

Examples of SLIs can be:

percentage of successful requests to a specific API endpoint over the past 30 days;
95th percentile of end-to-end latency for work submitted to a data processing pipeline over the past 7 days.
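As a rough illustration, both example SLIs can be computed from a list of request records. This is a minimal sketch, not a prescribed method: the Request type, its field names, and the choice to count only 5xx responses as failures are assumptions made for the example, and the percentile uses the simple nearest-rank definition.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    """Hypothetical request record; field names are assumptions."""
    status: int        # HTTP status code
    latency_ms: float  # end-to-end latency in milliseconds


def success_rate(requests):
    """Percentage of requests that did not fail with a 5xx status."""
    if not requests:
        # Assumption: a window with no traffic is reported as healthy.
        return 100.0
    ok = sum(1 for r in requests if r.status < 500)
    return 100.0 * ok / len(requests)


def percentile(values, pct):
    """Nearest-rank percentile: the smallest sample with at least
    pct percent of all samples at or below it."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = math.ceil(pct / 100 * len(ordered))
    return ordered[max(rank, 1) - 1]


sample = [
    Request(200, 120.0),
    Request(200, 95.0),
    Request(500, 900.0),
    Request(200, 110.0),
]
print(success_rate(sample))                              # 75.0
print(percentile([r.latency_ms for r in sample], 95))    # 900.0
```

In practice the records would come from whatever data you already have (access logs, load balancer metrics), filtered to the SLI’s time window.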

Defining SLIs and collecting the necessary data to calculate them is an iterative process. When starting, just use the data that you have available. As your overall monitoring matures, so can your SLIs.

Operational Metrics

Next, think of useful metrics that will provide a clear explanation for the top-level metrics. For example, app latency can be explained by:

Request rate
The aggregate CPU utilization of the app containers
Request queue length on the app containers
Disk I/O performance of the database servers
etc.

A good place to start is the “Four Golden Signals” from the Google SRE book:

Traffic
Latency
Errors
Saturation

A more exhaustive approach is the USE Method from Brendan Gregg:

“For every resource, check utilization, saturation, and errors.”
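The USE Method lends itself to a simple loop over resources. The sketch below is only an illustration: the ResourceSample type, the 80% utilization threshold, and the rule that any queued work counts as saturation are all assumptions chosen for the example, not values taken from the method itself.

```python
from dataclasses import dataclass


@dataclass
class ResourceSample:
    """Hypothetical snapshot of one resource; fields are assumptions."""
    name: str
    utilization: float  # fraction of time busy, 0.0 to 1.0
    saturation: float   # queued work, e.g. run-queue length
    errors: int         # error events in the sampling window


def use_findings(samples, max_utilization=0.8, max_saturation=0.0):
    """Return a human-readable finding for every check that fails."""
    findings = []
    for s in samples:
        if s.utilization > max_utilization:
            findings.append(f"{s.name}: utilization {s.utilization:.0%}")
        if s.saturation > max_saturation:
            findings.append(f"{s.name}: saturation {s.saturation:g}")
        if s.errors > 0:
            findings.append(f"{s.name}: {s.errors} errors")
    return findings


samples = [
    ResourceSample("cpu", utilization=0.93, saturation=4.0, errors=0),
    ResourceSample("disk", utilization=0.35, saturation=0.0, errors=2),
    ResourceSample("network", utilization=0.10, saturation=0.0, errors=0),
]
for finding in use_findings(samples):
    print(finding)
```

Here the CPU is flagged twice (utilization and saturation) and the disk once (errors), which is exactly the kind of shortlist that narrows an investigation.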

A well-curated collection of operational metrics will make it possible to quickly identify which resources are related to an incident.

Data Sources

Where specifically do top-level metrics and operational metrics come from? Here are the common sources:

Time Series

A time series is a sequence of quantitative measurements, each paired with a timestamp, taken periodically from your app and infrastructure and stored so that they can be queried later.

Metrics tend to be represented in the following ways:

Counter: a cumulative count of events. It only ever increases (apart from resetting to zero when the process restarts), and rates are calculated from the difference between samples. Useful for things like request or error volume.
Gauge: a value that can fluctuate between a lower and higher value. Useful for metrics like resource utilization or temperature.
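To make the counter idea concrete, here is a minimal sketch of deriving a per-second rate from raw counter samples. It assumes samples arrive as (timestamp in seconds, counter value) pairs and treats any drop in value as a restart that reset the counter to zero; real time series databases apply more careful logic, so read this as an illustration only.

```python
def counter_rate(samples):
    """Average per-second rate over a list of
    (timestamp_seconds, counter_value) samples."""
    if len(samples) < 2:
        return 0.0
    increase = 0.0
    for (_, prev), (_, curr) in zip(samples, samples[1:]):
        if curr >= prev:
            increase += curr - prev
        else:
            # Assumption: the counter restarted from zero, so the
            # whole current value is new since the reset.
            increase += curr
    elapsed = samples[-1][0] - samples[0][0]
    return increase / elapsed if elapsed > 0 else 0.0


# Request counter scraped every 60 seconds; the app restarts before t=120.
scrapes = [(0, 100), (60, 160), (120, 20), (180, 80)]
print(counter_rate(scrapes))  # 140 requests over 180 seconds
```

A gauge needs no such treatment: its latest value is already the quantity of interest.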

There are numerous open-source time series databases available. InfluxDB and Graphite, for example, are push-based systems that accept arbitrary metrics sent to them (InfluxDB over HTTP, Graphite over a simple plaintext protocol), making it very easy to get started.

Prometheus is a pull-based system that scrapes metrics over HTTP from instrumented applications or from helper agents known as exporters. It lets you attach key-value tags known as ‘labels’ to your data to aid in querying, whereas Graphite traditionally relies on dot-separated namespacing instead.
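The naming difference is easiest to see side by side. The sketch below renders the same measurement as a line in Graphite’s plaintext format (dotted path, value, timestamp) and as a sample in the Prometheus text exposition format (metric name plus key-value tags, which Prometheus calls labels). The metric names and label keys are made up for the example, and label values are not escaped, so this is illustrative rather than production-ready.

```python
def graphite_line(path_parts, value, timestamp):
    """Graphite plaintext format: '<dotted.path> <value> <timestamp>'."""
    return f"{'.'.join(path_parts)} {value} {timestamp}"


def prometheus_sample(name, labels, value):
    """Prometheus text format: name{key="value",...} value.
    Assumption: label values contain no characters that need escaping."""
    rendered = ",".join(f'{k}="{v}"' for k, v in sorted(labels.items()))
    return f"{name}{{{rendered}}} {value}"


# Hypothetical metric: total requests served by one host.
print(graphite_line(["prod", "api", "web01", "requests"], 1027, 1700000000))
print(prometheus_sample(
    "http_requests_total",
    {"env": "prod", "service": "api", "host": "web01"},
    1027,
))
```

With dotted paths, the position of each segment carries the meaning, so queries depend on a fixed naming scheme. With key-value tags, each dimension is named explicitly and can be filtered or aggregated on its own.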

Logs

Your apps and infrastructure do generate logs, right? Right?!

Logs are a great source of low-level data for investigating all kinds of problems, such as:

Application bugs (manifesting as exceptions or crashes)
Hardware failures
Security events

Similar to time series databases, there are log management systems available that enable long-term retention and querying, such as the ELK Stack, Graylog, and Loki. One thing to keep in mind is that for larger systems, it may be necessary to send and store only a percentage of log events in order to keep storage costs down (a technique known as ‘sampling’).
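One way to approach sampling is sketched below, under a few assumptions: errors are always kept, everything else is kept at a configurable percentage, and the decision is derived from a hash of a key such as a request ID so that all events for one request are kept or dropped together. The function name, the level names, and the 10% default are hypothetical.

```python
import zlib


def keep_event(level, key, sample_percent=10):
    """Decide whether a log event should be shipped to storage.

    level: severity name, e.g. "INFO" or "ERROR"
    key: a stable identifier such as a request ID (assumption)
    sample_percent: share of non-error events to keep, 0 to 100
    """
    if level in ("ERROR", "CRITICAL"):
        return True  # never drop the events most needed in an incident
    # CRC32 is deterministic, so the same key always gets the same answer.
    bucket = zlib.crc32(key.encode("utf-8")) % 100
    return bucket < sample_percent


kept = sum(keep_event("INFO", f"req-{i}") for i in range(1000))
print(f"kept {kept} of 1000 INFO events")
```

Hash-based sampling is only one option; random sampling is simpler, but it can leave you with partial logs for a given request.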

Traces

Traces are an advanced topic, but let’s discuss them anyway. They are quite useful in distributed systems based on Service-Oriented Architecture (SOA) or microservices. In essence, a trace records timing and other details at each step of a request’s end-to-end lifecycle as it propagates through the system, tied together by a unique request identifier.

That identifier can then be used to retrieve all of the events for that request to reveal all kinds of interesting data, such as which systems added the most latency or which services are hard dependencies for others.
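As a toy illustration of that lookup, the sketch below filters a list of spans by trace identifier and ranks services by time spent. The Span type and its fields are assumptions rather than any tracing system’s actual data model, and the durations are inclusive: a caller’s span contains the time spent in the services it calls.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass
class Span:
    """Hypothetical span: one unit of work done by one service."""
    trace_id: str
    service: str
    start_ms: int
    end_ms: int


def latency_by_service(spans, trace_id):
    """Total span duration per service for one trace, slowest first."""
    totals = defaultdict(int)
    for span in spans:
        if span.trace_id == trace_id:
            totals[span.service] += span.end_ms - span.start_ms
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)


spans = [
    Span("abc", "gateway", 0, 250),
    Span("abc", "auth", 10, 40),
    Span("abc", "orders", 45, 240),
    Span("abc", "db", 60, 220),
    Span("xyz", "gateway", 0, 30),
]
for service, total_ms in latency_by_service(spans, "abc"):
    print(service, total_ms)
```

In this made-up trace, most of the gateway’s 250 ms is spent waiting on orders, which in turn waits on the database. That is the kind of dependency chain a trace makes visible.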

Tracing requires a protocol for the trace data itself (e.g., Zipkin, Jaeger, OpenTelemetry), as well as a storage backend.

Dashboarding/Querying

All of this data is great, but how do you consume it to draw accurate conclusions about production?

Dashboarding apps allow you to query your data and create bar/line charts and tables. They can then be assembled into dashboards to create a ‘single pane of glass’ to enable quick assessment of system health. Many teams display their dashboards on large mounted monitors near their desks in an office setting.

Grafana is an excellent open-source webapp that does exactly this and supports many data sources.

Note that each data source has its own query language, which will require time to learn.

There are also fully integrated paid solutions such as Datadog or Splunk, but with those you’re locked into what their observability suite offers. Grafana offers a hosted platform as well.

SaaS or Self-Hosted?

A major question is: do you host and operate the monitoring stack yourself, or pay a SaaS provider to do it for you?

The answer is: it depends.

For example: if you work for a startup and the team is busy focusing on functional requirements, it may be more advantageous to pay a SaaS to take on this responsibility, especially if the amount of data to store and query is modest.

In contrast: if you are responsible for thousands of servers across multiple data centers, the cost of hosting and operating these services internally may be a fraction of what a vendor would charge.

The decision ultimately comes down to the impact on your team’s time versus the monetary cost of a managed service.

(Note: I built an Ansible playbook to build a single-host monitoring system containing Grafana, Prometheus, and Loki. If you’re new to monitoring systems, this is a great place to start!)

Conclusion

Setting up monitoring can feel like a daunting task. To keep it simple, start by building a few top-level metrics supported by a set of operational metrics. These metrics can be derived from time series and logs (and tracing in the case of SOA/microservice architectures).

An effective monitoring setup can make the difference between spending minutes or hours when troubleshooting an incident. I’ve helped numerous teams with maturing their monitoring, including those in Big Tech! If you are looking for quick results for your engineering team, let’s schedule an intro call!

(Image credit: Burak the Weekender)
