Does your team conduct postmortems as part of its incident response process? They are a great way to learn from failure and find opportunities to make your systems more reliable.
One piece of advice: make sure they are BLAMELESS.
Blamelessness creates an environment of psychological safety, so your team can be more forthcoming about the factors that triggered or contributed to the incident. Those factors can then be tracked and addressed.
In contrast, if team members feel they will be chastised or punished for revealing such details, they will hide them, and you lose the information you need to prevent a recurrence.
To keep postmortems blameless, they need to be moderated. Here are some practices that have worked for me:
- Say the following at the beginning of every meeting: “Welcome to today’s incident review. We are presenting X incidents. Remember: this review is blameless. We focus on how process and technology failed us, not people.” This sets the tone of the meeting for attendees right from the start.
- Pay attention to how the incident is being discussed:
  - If a team member apologizes during the meeting, gently remind them publicly that apologies aren’t necessary: the meeting is blameless, we are all human, and we are here to learn.
  - Similarly, if a team member directs blame at another person or team, interject and gently provide the same reminder. Note that this can happen unintentionally and subtly.
- Don’t settle for ‘human error’ as a ‘root cause’ for an incident. Ask probing questions to reveal the underlying failure. For example, if a database lost data because an engineer ran an UPDATE statement without a WHERE clause, explore why raw database commands were being run by hand in the first place. Is there a lack of automation, or are safety features missing from existing tooling?
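To make the “missing safety features” question concrete, here is a minimal Python sketch of the kind of guard a query tool could run before executing a statement. Everything in it is an assumption for illustration: `check_statement` and `UnsafeStatementError` are hypothetical names, and the check is a naive keyword match rather than a real SQL parser.

```python
import re

# Hypothetical guard, for illustration only. It is not part of any real
# database tool and does not parse SQL; it only looks for keywords.
_WRITES_ROWS = re.compile(r"^\s*(UPDATE|DELETE)\b", re.IGNORECASE)
_HAS_WHERE = re.compile(r"\bWHERE\b", re.IGNORECASE)


class UnsafeStatementError(Exception):
    """Raised when an UPDATE or DELETE has no WHERE clause."""


def check_statement(sql: str) -> str:
    """Return sql unchanged, or raise if it would touch every row.

    Limitation: a WHERE inside a string literal or subquery also passes,
    so treat this as a seatbelt, not a guarantee.
    """
    if _WRITES_ROWS.match(sql) and not _HAS_WHERE.search(sql):
        raise UnsafeStatementError(
            "UPDATE/DELETE without a WHERE clause; refusing to run it"
        )
    return sql
```

A tool would call `check_statement(sql)` before sending the statement to the database. The point for a postmortem is that the action item becomes “add a guard like this to our tooling” rather than “be more careful”.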
Being consistent in these practices creates an engineering culture where failure is seen as an opportunity to learn and improve, which will directly benefit your products and customers over time!
Is your team struggling to build a healthy and effective incident response process? Let’s talk. I’m passionate about making on-call life as pain-free as possible!
(Photo Credits: Rodolpho Zanardo)