This article continues the discussion on how your team can learn from failure after a production incident. While write-ups are very important in capturing and documenting what took place, the real value is created from an open and deliberate conversation with the team to identify the lessons learned and the improvements needed to create a more reliable system. That conversation is the post-mortem.
Post-mortems are the primary mechanism for teams to learn from failure. They are a great starting point for introducing DevOps/SRE culture to a team unaccustomed to it. Don’t be surprised if other teams start to adopt this practice when they see it being done!
I’ve written an article previously about how to make a post-mortem blameless, and this time we’ll be discussing the actual process and mechanics for organizing, moderating, and presenting one for your team or department.
Schedule A Recurring Meeting
Block off 30m-1h of the team’s calendar on a weekly basis for presenting and discussing post-mortems. When getting started, expect each incident to require up to 30 minutes for presentation and discussion. As your process matures, each incident can be covered in as little as 15 minutes.
It is very important that the entire team consistently attends this meeting, as it keeps everyone continuously in touch with the realities of the system and helps foster a culture of continuous improvement.
Select a Moderator
When introducing post-mortems to the team, it is important to choose the first moderator carefully. This person will run the meetings and set the tone and process for everyone else to follow.
Prime candidates are Site Reliability or DevOps Engineers, or a senior engineer who has an interest in reliability. They will of course need the ability to facilitate live discussions.
The moderator’s role is to make sure that the meeting is well planned, well executed, and creates value for the team.
Prepare For The Meeting
The moderator will have some prep work to do before the meeting.
- Select and schedule incidents to present. Does the incident have material that the team can learn from? Are there aspects of the incident that are inherently interesting? Will the right people be available to present and answer questions?
- Invite guests from other teams as needed. Examples of guests include members of other teams that were involved in the incident, or stakeholders that may have been affected.
- Help the presenter prepare for the meeting. Is their incident write-up clear and complete? Will the presenter be able to go through their material in the time allotted? Do they need to do a dry run to gather feedback?
Run The Meeting
The moderator will provide structure and guardrails for the meeting. Here is a format that has worked well in my personal experience:
- Start the actual agenda 5 minutes in. This gives a chance for attendees to join the call or find the meeting room and relax before the post-mortems are presented and discussed. Explicitly state on the call that you will begin 5 minutes into the meeting. People can feel tense when attending these meetings for the first time, so let attendees have a little bit of unstructured banter to relieve that stress.
- Introduce the agenda in a clear and consistent way. I tend to use the following: “Welcome to today’s post-mortem meeting. We have two incidents to present. Remember that post-mortems are a blameless process: we focus on process and technology, not people. Let’s start with incident TITLE, presented by NAME.” Starting the post-mortem meeting the same way every time is very important. It reminds everyone that the meeting is blameless and creates a setting for psychological safety from the start.
- Moderate Q&A. At the conclusion of the presentation, prompt the group to ask questions. Have the presenter answer questions in the order that they are asked. For video calls, the ‘raise hand’ feature is useful for this, and for hybrid office setups I tend to use a document to collect questions from attendees. Don’t be afraid to interrupt and table a conversation if it is taking too much time or is going off-topic. If time remains once all questions have been answered, allow for a more open discussion.
- Take Notes. Pay close attention to interesting talking points or suggestions made by the presenter or attendees, especially regarding process or technology gaps that weren’t already identified by the presenter.
File Follow-up Tasks
How do we ensure that lessons learned from the meeting translate to actual business value?
Using the notes from the meeting, file tasks in the ticketing system to track identified improvements to the reliability of the system. Things to keep in mind:
- Does the task prevent the incident from happening again or reduce its impact or duration?
- Does it have clear acceptance criteria?
- What should be the urgency of the task?
- Who is a subject matter expert that has additional context to make this task actionable?
It’s also recommended to send an email to attendees with the list of tasks filed, and make sure they are discussed in sprint/project planning to ensure they are prioritized.
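If your team files these tasks through a script or a template, the checklist above can be sketched roughly as follows. This is only an illustration in Python: the `FollowUpTask` class, its field names, and the urgency levels are all hypothetical and not tied to any particular ticketing system.

```python
from dataclasses import dataclass, field
from typing import List

# Hypothetical urgency levels; substitute whatever your ticketing system uses.
URGENCY_LEVELS = ("low", "medium", "high")


@dataclass
class FollowUpTask:
    """Illustrative follow-up task record, not tied to any real ticketing system."""
    title: str
    # Which of "recurrence", "impact", "duration" the task is meant to reduce.
    reduces: List[str] = field(default_factory=list)
    acceptance_criteria: List[str] = field(default_factory=list)
    urgency: str = ""
    subject_matter_expert: str = ""

    def gaps(self) -> List[str]:
        """Return the checklist questions this task does not yet answer."""
        problems = []
        if not set(self.reduces) & {"recurrence", "impact", "duration"}:
            problems.append("does not say how it prevents or reduces the incident")
        if not self.acceptance_criteria:
            problems.append("no acceptance criteria")
        if self.urgency not in URGENCY_LEVELS:
            problems.append("no urgency set")
        if not self.subject_matter_expert:
            problems.append("no subject matter expert named")
        return problems


# Example task drafted from meeting notes (made-up content).
task = FollowUpTask(
    title="Alert when queue depth exceeds threshold",
    reduces=["duration"],
    acceptance_criteria=["Alert pages on-call within 5 minutes of breach"],
    urgency="high",
)
print(task.gaps())  # ['no subject matter expert named']
```

A check like this is no substitute for judgment, but it can catch tasks that are filed without enough context to be actionable.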
Hand Off The Moderator Role To Others
Once the post-mortem meeting has matured enough, document the process and share the responsibility with the rest of the team on a rotating basis. This is a great way to get the team directly involved with the reliability process outside of on-call.
The presenter’s role is to share the impact, top-level narrative, relevant details, and lessons learned from an incident. The ideal presenter will be the person who was on-call for the incident, a subject-matter expert for the affected systems, or perhaps both!
Do The Write-up
Give yourself sufficient time to author the incident write-up, which will form the basis of your presentation. Share it with the moderator at least several days before the scheduled meeting so that you have the opportunity to address feedback.
Dry-run the Presentation
Are you going to summarize the top-level details from the write-up directly, or will you create a slide deck instead? Regardless of the medium used to deliver the presentation, it will be useful to rehearse it a couple of times to make sure that it fits in the time allotted and will be understandable by attendees. Protip: consider the audience you are presenting to. If there are attendees who aren’t members of your team, be sure to explain terms and concepts that would be foreign to them, such as team- or service-specific acronyms.
Present and Answer Questions
A few pieces of advice:
- Take a deep breath and remember that post-mortems are blameless! You are not on trial! Your insight is welcome and needed to improve the system so be as open as possible.
- You won’t be able to answer every question, and that’s ok! Make sure you have subject matter experts in the meeting with you so that you can delegate some tough questions to them.
- Slow down. If it’s your first time presenting an incident, you will likely be a little nervous and may start to talk fast.
Post-mortems are the primary vehicle for teams to learn from failure. The moderator and the presenter both have responsibilities before, during, and after the meeting to ensure that the teams fully benefit from the experience. Placing an emphasis on making this process blameless creates an environment that supports open communication and therefore discovery of reliability improvements.
I love post-mortems, and have moderated and presented many of them over the years. If you’re interested in introducing the practice to your team or department, schedule a call with me, and let’s get started!
(Image based on photo from Diva Plavalaguna)