Last time I shared my thoughts on blameless postmortems and how they create a safe space for revealing process and technology gaps contributing to past incidents.
Today I want to introduce another opportunity for teams to learn and improve: the ‘oncall retrospective’, which:
- Keeps the team in touch with the operational reality of their service(s);
- Reveals opportunities to improve the oncall experience.
I was introduced to this practice by Jos Visser while onboarding at Meta, which inspired me to make it my own and then implement it in every oncall team I’m on. Here’s how it works:
Once a week, hold a scheduled 30-minute meeting, ideally attended by everyone on the team. Designate someone to moderate the meeting and take notes (I call them the ‘scribe’).
The agenda will look like this:
First, the scribe will present top-level metrics from the previous week:
- If you have Service Level Objectives (and I hope you do :-) ), how did they perform?
- How many alerts did the oncall rotation receive? How many of them were critical (meaning: they were tied to an emergency)?
- How many helpdesk requests did the team get (if applicable)?
Compare these metrics with past periods to see if there is an improvement or regression in general oncall health. Ideally this data is automatically collected and made available on a dashboard to make that a trivial process.
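To make that comparison concrete, here is a minimal sketch of a week-over-week diff. The `WeeklyMetrics` fields and the `week_over_week` helper are names I made up for illustration, and the numbers are invented; in practice you would feed this from whatever your dashboard or alerting system exports.

```python
from dataclasses import dataclass, fields


@dataclass
class WeeklyMetrics:
    # Hypothetical fields; adapt them to the metrics your team tracks.
    slo_attainment_pct: float
    total_alerts: int
    critical_alerts: int
    helpdesk_requests: int


def week_over_week(previous: WeeklyMetrics, current: WeeklyMetrics) -> dict:
    """Return current minus previous for every metric."""
    return {
        f.name: getattr(current, f.name) - getattr(previous, f.name)
        for f in fields(WeeklyMetrics)
    }


# Invented example data for two consecutive weeks.
previous = WeeklyMetrics(99.5, 42, 5, 12)
current = WeeklyMetrics(99.0, 30, 2, 15)

delta = week_over_week(previous, current)
for name, change in delta.items():
    print(f"{name}: {change:+}")
```

A negative change in alerts is good news, while a negative change in SLO attainment is not, so the scribe still has to interpret each number rather than read the signs blindly.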
Next: The scribe will debrief the previous week’s oncalls on their experience. They should be able to present the following information:
- What incident(s) they were involved in, and their general impact(s).
- WIP handed off to the next shift (incidents, alerts, and support requests still in flight). (IMO: an oncall should work the alerts/support requests they received to completion. It’s totally reasonable to hand off incidents, however.)
- How painful oncall was, on a scale of 1-5 (1: no impact at all; 5: took up all of my time, including after hours).
At Meta we actually put this information in a weekly report to be read before the meeting.
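If you want that report in a structured form, one possible shape is sketched below. This is not the format we used at Meta; `ShiftDebrief`, its fields, and `average_pain` are hypothetical names, and the sample entries are invented.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class ShiftDebrief:
    # Hypothetical record of one oncall's week.
    oncall: str
    incidents: list[str]    # incidents they were involved in, with impact
    handed_off: list[str]   # work still in flight at handoff
    pain_score: int         # 1 (no impact at all) to 5 (all of my time)

    def __post_init__(self):
        # Reject scores outside the 1-5 scale described above.
        if not 1 <= self.pain_score <= 5:
            raise ValueError(f"pain_score must be 1-5, got {self.pain_score}")


def average_pain(debriefs: list[ShiftDebrief]) -> float:
    """Average pain score across the debriefed oncalls."""
    return mean(d.pain_score for d in debriefs)


# Invented example entries.
debriefs = [
    ShiftDebrief("alice", ["Checkout latency spike, 20 min"], [], 2),
    ShiftDebrief("bob", [], ["Disk alert on db-3 still open"], 4),
]
print(average_pain(debriefs))
```

Tracking the average pain score over several weeks gives you one more trend line to put next to the alert counts.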
Finally: discuss specifics about the shift’s experience, such as:
- Were there any alerts that were noisy or unactionable?
- Are the alerts’ runbooks out of date?
- Could an alert’s remediation be better performed with a tool, or automated away entirely?
- Is there a bug that we need to bump the priority on?
- How can we make this support request self-service?
This is the moment where the scribe is doing the most important work: taking notes on what can be improved for future oncall shifts.
Immediately after the meeting, the scribe converts those notes into tasks in the issue queue. I also suggest assigning the tasks to the oncall who was debriefed so that they may provide additional details.
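As a rough sketch of that conversion step, the helper below turns raw notes into task records ready to be filed. Everything here is an assumption for illustration: `notes_to_tasks`, the dictionary keys, and the `oncall-retro` label are made up, and the actual filing depends on your issue tracker.

```python
def notes_to_tasks(notes: list[str], assignee: str) -> list[dict]:
    """Turn the scribe's notes into task records, skipping blank lines."""
    tasks = []
    for note in notes:
        title = note.strip()
        if not title:
            continue  # ignore empty notes
        tasks.append({
            "title": title,
            "assignee": assignee,        # the oncall who was debriefed
            "labels": ["oncall-retro"],  # hypothetical label for filtering
        })
    return tasks


# Invented example notes from one retrospective.
tasks = notes_to_tasks(
    ["Tune noisy disk alert", "  ", "Update restart runbook"],
    assignee="alice",
)
for task in tasks:
    print(task["title"], "->", task["assignee"])
```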
If you and your team then prioritize and tackle those tasks, oncall shifts will be more pleasant and your team will be happier and more productive in doing the things that provide value!
Struggling with oncall burden on your team? Let’s talk!
(Photo Credits: fauxels)