Last time in our Incident Management series we discussed how monitoring is essential to responding quickly when things go wrong with your app’s availability or performance.
However, monitoring won’t catch every failure. That is especially true for newly launched services, where monitoring is based on theory rather than experience.
How do you prepare for the situation where another team or even a customer (heaven forbid) reports a production issue?
The answer is a well-defined escalation policy that lets anyone in your organization reach your team effectively.
An escalation policy is simply a set of guidelines for what to do when a customer reports a production issue and the immediate on-call (e.g., a support engineer) cannot resolve it on their own.
The policy can be as simple as a wiki document, accessible to everyone, that explains which team to notify and how, depending on the nature of the incident.
Let’s discuss how to create an escalation policy for your team from scratch.
Create a Team Page
Assuming that there is a wiki that stores the company’s policies and procedures, create a page for your team. On it, make sure that the following three pieces of information are present:
- How to ask for help;
- The definition of an emergency;
- Your team’s responsibility (scope).
Together, these give whoever is escalating clear guidance on whether yours is the right team to reach and whether the issue is urgent enough to wake the on-call at 3 AM.
The instructions on how to ask for help will usually involve creating a ticket in your team’s issue queue, with special instructions for emergencies. Whatever solution you choose, emergencies must trigger an alert to your team’s on-call rotation (we’ll talk specifics about alerting and on-call rotations in future posts).
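To make the three pieces of information concrete, here is a minimal Python sketch of a team page treated as data, with a routing decision on top of it. Everything in it is a hypothetical illustration: the `TeamPage` and `Route` names, the `payments` team, its services, and its emergency conditions are assumptions, not part of any real tool.

```python
from dataclasses import dataclass
from enum import Enum


class Route(Enum):
    TICKET = "ticket"          # normal path: file in the team's issue queue
    PAGE_ON_CALL = "page"      # emergency path: alert the on-call rotation
    WRONG_TEAM = "wrong_team"  # the issue is outside this team's scope


@dataclass(frozen=True)
class TeamPage:
    name: str
    scope: frozenset[str]                 # services the team is responsible for
    emergency_conditions: frozenset[str]  # the team's definition of an emergency
    ticket_queue: str                     # how to ask for help when it is not urgent

    def route(self, service: str, condition: str) -> Route:
        """Decide how a reported issue should reach this team."""
        if service not in self.scope:
            return Route.WRONG_TEAM
        if condition in self.emergency_conditions:
            return Route.PAGE_ON_CALL
        return Route.TICKET


# Hypothetical team page for illustration only.
payments = TeamPage(
    name="payments",
    scope=frozenset({"checkout-api", "billing-worker"}),
    emergency_conditions=frozenset({"full_outage", "data_loss"}),
    ticket_queue="PAY",
)

print(payments.route("checkout-api", "full_outage"))  # emergency in scope
print(payments.route("checkout-api", "slow_report"))  # in scope, not urgent
print(payments.route("search-api", "full_outage"))    # out of scope
```

In practice the page is prose on a wiki rather than code, but writing it this way shows that each of the three sections answers one question in the routing decision.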
Centralize Policy in Case of Multiple Teams
For larger engineering organizations, the support engineer on-call will need to select which team(s) to escalate to. Create a centralized document that lists every team and links to its page. This allows the support engineer to quickly choose which team to escalate to depending on the incident.
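A centralized document like this is essentially a lookup table from affected services to teams. The sketch below shows one possible shape for it in Python; the team names, service names, and `wiki.example.com` URLs are placeholders invented for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectoryEntry:
    team: str
    page_url: str            # link to the team's own escalation page
    services: frozenset[str]  # services the team is responsible for


# Hypothetical directory; a real one would live on the company wiki.
DIRECTORY = [
    DirectoryEntry(
        team="payments",
        page_url="https://wiki.example.com/teams/payments",
        services=frozenset({"checkout-api", "billing-worker"}),
    ),
    DirectoryEntry(
        team="search",
        page_url="https://wiki.example.com/teams/search",
        services=frozenset({"search-api"}),
    ),
]


def teams_for(affected_services: set[str],
              directory: list[DirectoryEntry]) -> list[DirectoryEntry]:
    """Return every team that owns at least one of the affected services."""
    return [entry for entry in directory if entry.services & affected_services]


# An incident touching two services may require escalating to two teams.
for entry in teams_for({"checkout-api", "search-api"}, DIRECTORY):
    print(entry.team, entry.page_url)
```

Returning a list rather than a single team reflects the "team(s)" case: one incident can fall within the scope of several teams at once.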
Create Common Process/Tooling for Escalations
A potential risk is that the support engineer on-call will need to use different methods to notify each engineering team based on their unique instructions, which creates confusion and increases the time to resolution. This problem compounds if multiple teams require escalation.
Every engineering team should be reachable in the same way, using the same tools. On-call rotation tools such as PagerDuty, or full incident management tools such as FireHydrant or incident.io, are great solutions. Large companies (e.g., Big Tech) tend to develop their own internal tools for this purpose to make team selection easy.
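One way to picture a common process is a single entry point that behaves identically for every team. The Python sketch below assumes a hypothetical `Notifier` interface and uses an in-memory `RecordingNotifier` as a stand-in; it does not call the API of PagerDuty or any other real product, and a real setup would wrap whichever tool your organization has chosen.

```python
from dataclasses import dataclass, field
from typing import Protocol


class Notifier(Protocol):
    """Hypothetical interface that any paging or ticketing tool would be wrapped in."""

    def page(self, team: str, summary: str) -> None: ...
    def open_ticket(self, team: str, summary: str) -> None: ...


@dataclass
class RecordingNotifier:
    """In-memory stand-in for a real tool; it only records what was sent."""

    sent: list[tuple[str, str, str]] = field(default_factory=list)

    def page(self, team: str, summary: str) -> None:
        self.sent.append(("page", team, summary))

    def open_ticket(self, team: str, summary: str) -> None:
        self.sent.append(("ticket", team, summary))


def escalate(notifier: Notifier, teams: list[str],
             summary: str, emergency: bool) -> None:
    """Escalate to every listed team through the same path."""
    for team in teams:
        if emergency:
            notifier.page(team, summary)
        else:
            notifier.open_ticket(team, summary)


notifier = RecordingNotifier()
escalate(notifier, ["payments", "search"], "Checkout errors", emergency=True)
print(notifier.sent)
```

The support engineer calls `escalate` the same way no matter how many teams are involved, which is the property that keeps multi-team escalations from compounding confusion.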
Identify Monitoring Improvements
Customers should never be your monitoring system.
If your team receives an escalation, discuss as part of the postmortem process whether monitoring can be improved to catch that kind of failure. The ideal state is that your team never receives escalations because monitoring consistently detects problems and triggers the incident process.
We depend on our monitoring systems to quickly notify us when things go wrong to minimize MTTR and the general business impact.
It is important to prepare your team to effectively respond to incidents that monitoring doesn’t catch. That is achievable through a well-defined escalation policy, especially for larger organizations where operational responsibility is spread out.
I have over a decade of experience being on-call and making robust incident management processes for software engineering and IT teams. Reach out if you want to learn more!
(Image credit: Jan van der Wolf)