Incident Management: Alerting

So far, our Incident Management series has discussed the importance of monitoring and a solid escalation policy in the swift detection of production issues. Both depend on a third capability that we will go over today: alerting.

Alerting notifies the engineering team of problems in production so that it can respond appropriately and in a timely manner, based on their severity.

Alerts tend to fall into three categories:

Page: meets the definition of an emergency and requires immediate response
Ticket: requires best effort during business hours
Log: requires no response

Let’s discuss how to set up an alerting mechanism for your system that provides a high signal-to-noise ratio and a shorter time to troubleshoot and resolve the corresponding incidents. In this post, we will focus on pages, as they pertain most directly to incident response.

Author Detectors Based on Metrics

Detectors are simply code or configuration that periodically queries your monitoring system (or production directly) for conditions that aren’t desirable, such as downtime, high latency, or errors. Here are some examples of how to set up detectors across multiple open-source monitoring solutions:
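As a tool-agnostic illustration, here is a minimal sketch of a detector in Python. The `Detector` class, the `run_once` helper, the metric names, and the thresholds are all hypothetical; the lambdas stand in for real queries against your monitoring system.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class Detector:
    """Checks one undesirable condition against a metric source."""
    name: str
    query: Callable[[], float]  # hypothetical: returns the latest metric value
    threshold: float            # the detector fires when the value exceeds this

    def evaluate(self) -> Optional[str]:
        value = self.query()
        if value > self.threshold:
            return f"{self.name}: value {value:.3f} exceeds threshold {self.threshold:.3f}"
        return None


def run_once(detectors: List[Detector]) -> List[str]:
    """Evaluate every detector once and return the messages of those that fired."""
    alerts = []
    for detector in detectors:
        message = detector.evaluate()
        if message is not None:
            alerts.append(message)
    return alerts


# Stand-in metric sources; a real detector would query the monitoring system.
detectors = [
    Detector("p99_latency_seconds", lambda: 2.4, threshold=1.0),
    Detector("error_rate", lambda: 0.002, threshold=0.01),
]
fired = run_once(detectors)
```

A scheduler (a cron job or a simple loop with a sleep) would call `run_once` at a fixed interval and hand the resulting messages to whatever sends notifications.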

Alert On Symptoms, Not Causes

Which specific metrics should detectors query? Ideally, use the top-level metrics discussed in the monitoring article in this series. (From the SRE standpoint, we would use service level indicators.) Top-level metrics are the ones that are observable by, and important to, the customer. Sending alerts in this fashion is described as ‘alerting on symptoms’. For example, alert on the error rate of customer requests to your service (symptom) and not on health metrics for the underlying database (cause). The result of this practice is a single alert for a single symptom rather than many alerts for all of the potential underlying causes.
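To make the distinction concrete, here is a small Python sketch of a symptom-based paging condition. The function names and the 1% threshold are assumptions chosen for illustration, not values recommended by this series.

```python
def error_rate(failed_requests: int, total_requests: int) -> float:
    """Fraction of customer requests that failed in the evaluation window."""
    if total_requests == 0:
        return 0.0  # no traffic means no customer-visible errors
    return failed_requests / total_requests


def should_page(failed_requests: int, total_requests: int,
                max_error_rate: float = 0.01) -> bool:
    """Page only when the customer-facing symptom crosses the threshold."""
    return error_rate(failed_requests, total_requests) > max_error_rate
```

Note that nothing here inspects database CPU, replication lag, or any other cause-level metric. If the database is struggling but customers are still being served, `should_page` stays false; if any cause pushes the error rate over the threshold, one page goes out.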

Make Alerts Actionable Through Enrichment

Well-crafted alerts should be able to answer three questions:

What is the specific failure? (What?)
What is the impact of this failure on production and customers? (Why?)
How do I triage, troubleshoot, and respond to this alert? (How?)

‘Alert enrichment’ adds context to the alert that better answers the above questions. That can be achieved by embedding:

graphs or links to dashboards that can explain the symptoms. For example, you can explain the error rate of your app by visualizing the crash rate of your containers or rate of specific error types in your application logs.
log entries that contributed to the top-level metric (errors, exceptions, etc.) to aid in troubleshooting
links to runbooks on how to further troubleshoot and remediate
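The three questions and the enrichment items above can be sketched as a simple alert payload. This is only an illustration: the `Alert` fields, the `render` function, and the example.com URLs are hypothetical, and real paging tools each have their own payload format.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Alert:
    what: str  # the specific failure
    why: str   # the impact on production and customers
    how: str   # link to the runbook for triage and remediation
    dashboards: List[str] = field(default_factory=list)
    log_samples: List[str] = field(default_factory=list)


def render(alert: Alert, max_logs: int = 3) -> str:
    """Build the message body, capping log samples so the page stays readable."""
    lines = [f"WHAT: {alert.what}", f"WHY: {alert.why}", f"HOW: {alert.how}"]
    lines += [f"DASHBOARD: {url}" for url in alert.dashboards]
    lines += [f"LOG: {entry}" for entry in alert.log_samples[:max_logs]]
    return "\n".join(lines)


alert = Alert(
    what="Checkout error rate above threshold for 5 minutes",
    why="Customers cannot complete purchases",
    how="https://example.com/runbooks/checkout-errors",  # placeholder URL
    dashboards=["https://example.com/dashboards/checkout"],  # placeholder URL
    log_samples=["timeout calling payments", "timeout calling payments",
                 "connection reset", "timeout calling payments"],
)
rendered = render(alert)
```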

Test New Alerts Before Routing to the On-call

Before releasing new detectors/alerts, be sure to test and vet them first to ensure they don’t create unnecessary noise for the on-call. Failures to avoid:

The alert fires and clears repeatedly over time (‘flapping’)
The alert doesn’t correspond 1-to-1 with past confirmed emergencies (it fires when there was no emergency, or stays silent when there was one)
The alert doesn’t fire soon enough, creating a slow response time

Some alerting tools support testing of new changes against past time series to validate their quality. For those that don’t, simply route your alert to a chatroom or email address intended for testing. That will allow you to observe its behavior prior to adding it to the set of alerts that can notify the on-call.
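If your tool lacks this feature, a rough backtest is easy to approximate by hand. The sketch below replays a threshold detector over a past time series and reports the three failures listed above. The `firing_starts` and `backtest` functions, the data layout (timestamp, value), and the sample numbers are all assumptions made for illustration.

```python
from typing import Dict, List, Tuple


def firing_starts(series: List[Tuple[int, float]], threshold: float) -> List[int]:
    """Timestamps at which the alert transitions from clear to firing."""
    starts = []
    firing = False
    for ts, value in series:
        now_firing = value > threshold
        if now_firing and not firing:
            starts.append(ts)
        firing = now_firing
    return starts


def backtest(series: List[Tuple[int, float]], threshold: float,
             incident_starts: List[int], max_delay: int) -> Dict[str, object]:
    """Compare replayed alert firings against past confirmed incidents."""
    starts = firing_starts(series, threshold)
    matched = set()
    delays = {}
    for incident in incident_starts:
        candidates = [s for s in starts if incident <= s <= incident + max_delay]
        if candidates:
            first = candidates[0]
            delays[incident] = first - incident  # detection delay
            matched.add(first)
    return {
        "fire_count": len(starts),
        "missed_incidents": [i for i in incident_starts if i not in delays],
        "extra_fires": [s for s in starts if s not in matched],  # flapping or noise
        "delays": delays,
    }


# Error rate per minute; one confirmed incident started at minute 1.
series = [(0, 0.001), (1, 0.002), (2, 0.05), (3, 0.004), (4, 0.06),
          (5, 0.07), (6, 0.002), (7, 0.001), (8, 0.001), (9, 0.001)]
result = backtest(series, threshold=0.01, incident_starts=[1], max_delay=5)
```

In this sample the alert catches the incident one minute after it starts, but it also clears and fires again at minute 4, which shows up in `extra_fires`. Requiring the condition to hold for several consecutive samples before firing is one common way to reduce that kind of flapping.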

Set up Paging For Escalations

Finally, provide a simple mechanism to generate a page, via the escalation policy, for situations that alerting failed to detect. That can be achieved through:

Monitoring your ticket system for items filed with high urgency
Sending an email (see this document on email integration with PagerDuty alerting)

Regardless of the solution used, ensure that the ticket/alert is filed in a consistent and complete manner so that triage can immediately take place once the on-call gets involved.
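As one possible shape for the ticket-monitoring option, the sketch below selects high-urgency tickets and flags any that are missing the details the on-call needs for triage. The `Ticket` fields, the `"high"` urgency value, and the list of required fields are hypothetical; substitute whatever your ticket system actually exposes.

```python
from dataclasses import dataclass
from typing import List, Tuple

# Assumed minimum for triage; adjust to your own ticket template.
REQUIRED_FIELDS = ("summary", "impact", "reporter")


@dataclass
class Ticket:
    id: str
    urgency: str
    summary: str = ""
    impact: str = ""
    reporter: str = ""


def missing_fields(ticket: Ticket) -> List[str]:
    """Names of required fields that were left empty."""
    return [name for name in REQUIRED_FIELDS if not getattr(ticket, name).strip()]


def tickets_to_page(tickets: List[Ticket]) -> List[Tuple[Ticket, List[str]]]:
    """High-urgency tickets, each paired with the fields still missing."""
    return [(t, missing_fields(t)) for t in tickets if t.urgency == "high"]


tickets = [
    Ticket("T-1", "high", "Checkout down", "All customers", "alice"),
    Ticket("T-2", "low", "Typo on pricing page"),
    Ticket("T-3", "high", "Login slow"),
]
to_page = tickets_to_page(tickets)
```

The design choice here is to page on every high-urgency ticket even when it is incomplete, and to surface the gaps alongside it, so that a poorly filed ticket delays triage rather than the page itself.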

Conclusion

Effective alerting is made possible by consulting the right set of top-level metrics, providing a combination of descriptions, enrichment, and runbooks for each alert, and having a well-defined manual test process. These in combination will create alerts that are understandable, actionable, and easy to respond to.

Incident response is essential for any technology business but can also be really complicated and frustrating. Schedule a call with me if you would like some help tackling the complexity!

(Image based on photo from Terje Sollie)

CERTO MODO
