
Thoughts On The First SEV0 Conference

Events

As systems grow larger and more complex, mastering incident response isn’t just a necessity; it’s critical for a tech company’s survival.

SEV0, hosted by incident.io in San Francisco a few days ago, tackled this head-on, bringing together thought leaders and practitioners to share best practices, hard-earned lessons, and bold new ideas in the world of incident management.

As you know, I’m pretty obsessive about the end-to-end process of incident response, so of course I had to attend - and it was super refreshing to meet a whole community of people who are just as passionate about it as I am.

Just an aside: my service SEV0.help and the SEV0.com conference were announced in the same week, so naturally I had to troll incident.io CEO Stephen Whitworth just a little bit. I even wore a sev0.help T-shirt to the conference!

My LinkedIn post

All in all, the talks gave tech leaders a solid refresher on incident management best practices while offering new insights for seasoned practitioners. The organizers also kept their promise not to do any sales pitches, which is pretty rare for a company-sponsored event.

Thoughts from the day’s agenda

Check the SEV0 website for the full event list and presenters/panelists!

Maintaining blameless incident culture when everyone knows whodunnit

Andrew Guenther from OpenAI provided a solid introduction to the philosophy around blameless postmortems while acknowledging the realities of human nature. This adds nuance to the decision of when to do them and who should do them.

Let’s start with when to do them. Back when Benjamin Treynor Sloss’ 2014 USENIX talk was the main reference for SRE practices, one of its tenets was to ‘perform a postmortem for every event’.

When I initially launched my SRE program at Acquia and tried to implement postmortems for every incident, they were quickly perceived as punitive by engineering teams. It was awesome that Andrew validated that experience.

Teams should be free to set their own criteria. For immature teams with a high volume of alerts, it might be reserved for major incidents only. However, I do recommend that teams aspire to postmortem every incident once the reliability posture matures.

In terms of who should do them for a given incident, he suggested whoever can teach the team the most. In my experience, that will be a combination of the on-call engineer who managed the incident (to discuss how it was handled) and the SME for the service or component that failed (to discuss how it was triggered).

Some additional points:

  • Ensure psychological safety in the group, yes- but allow venting in private. People will still feel angry, annoyed, and frustrated when incidents happen.
  • Incentivize people to participate in this work- have it count towards good performance ratings and promotion cases. (At Meta, these activities fell under the ‘engineering excellence’ section of a performance review.)
  • Celebrate when action items from postmortems are completed- especially if they prevented subsequent incidents.

Organization-aware incident response

Laurence Jones from host incident.io introduced the concept of using a service catalog as a means to make incidents a company-wide function.

One of the massive benefits of creating one is that it aids incident escalation: with a dataset covering every service at the company, you know exactly which on-call rotation to page, and teams are automatically incentivized to keep their information accurate so they don’t get paged needlessly.

With an up-to-date service catalog, you can then annotate incidents with services/teams that were involved, revealing common themes or pain points in past incidents based on time spent or by the number of events.
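
To make the mechanics concrete, here is a minimal sketch of what such a catalog boils down to - my own illustration in Python, not incident.io’s actual data model or API - a lookup from service to owning team and on-call rotation that both escalation and incident annotation can reuse:

```python
# Minimal sketch of a service catalog used for escalation and incident
# annotation. Names (Service, CATALOG, escalate, annotate_incident) are
# illustrative assumptions, not incident.io's actual data model or API.
from dataclasses import dataclass, field

@dataclass
class Service:
    name: str
    owning_team: str
    oncall_rotation: str                       # rotation to page for this service
    tags: list[str] = field(default_factory=list)

CATALOG = {
    "checkout-api": Service("checkout-api", "payments", "payments-oncall", ["tier-1"]),
    "search-index": Service("search-index", "discovery", "discovery-oncall", ["tier-2"]),
}

def escalate(failing_service: str) -> str:
    """Look up which on-call rotation to page when a service is implicated."""
    return CATALOG[failing_service].oncall_rotation

def annotate_incident(incident: dict, failing_service: str) -> dict:
    """Tag the incident with catalog metadata so past incidents can later be
    grouped by service or team to reveal common themes and pain points."""
    svc = CATALOG[failing_service]
    incident.setdefault("services", []).append(svc.name)
    incident.setdefault("teams", []).append(svc.owning_team)
    return incident
```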

I have seen these practices in action in Meta’s internal incident tooling.

He also discussed a novel concept: enriching the service catalog with Salesforce data - account managers, deal size in dollars, even customer sentiment - to understand which customers an incident affects. In my experience, this has previously required homegrown tooling. Having that information makes it possible to triage recovery on a per-customer basis and ensure the revenue team is aware of potential obstacles to contract renewals.

Building processes that survive contact with reality

This panel explored how the reality of incident response can fall short of ideals and how to address that. Highlights:

Colette Alexander from HashiCorp provided a brief introduction to Resilience Engineering. Seeing practitioners in the wild is rare, as the concepts behind Resilience Engineering require rigor beyond the simple ‘5 Whys’, and most engineers would rather focus on tools and technology.

One of the spiciest takes was the statement that we track postmortem follow-up tasks to ‘make us feel better’. From experience, it is a huge struggle to propose, capture, prioritize, and then perform the improvements identified from an incident. I wish they had explored how to address this problem further.

Finally, they discussed how mean time to resolution (MTTR), while intuitive for executives to understand, isn’t a good metric for assessing reliability or incident response capability. I’ll interject here and state that more engineering organizations should look at Service Level Objectives (SLOs), which directly measure performance against customer expectations and are affected by things like time-to-acknowledge or time-to-keyboard. Colette introduced the idea of ‘time-to-muster’: the time required to get the necessary people together to successfully remediate the incident.
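
To illustrate the difference, here is a rough sketch - my own numbers and code, not something presented in the talk - contrasting MTTR, which only averages how long incidents took to resolve, with an availability SLO evaluated directly against customer-visible requests:

```python
# Rough illustration (hypothetical numbers) of why an SLO speaks to customer
# impact where MTTR does not.

def mttr_minutes(incident_durations_min: list[float]) -> float:
    """Mean time to resolution: the average incident duration."""
    return sum(incident_durations_min) / len(incident_durations_min)

def availability(successful_requests: int, total_requests: int) -> float:
    """Fraction of customer requests served successfully over the SLO window."""
    return successful_requests / total_requests

# Three quick incidents make MTTR look healthy...
print(mttr_minutes([12, 8, 10]))                   # 10.0 minutes

# ...but if they hit peak traffic, the availability SLO tells another story.
slo_target = 0.999                                 # 99.9% objective
attainment = availability(9_950_000, 10_000_000)   # only 99.5% actually served
budget_remaining = (attainment - slo_target) / (1 - slo_target)
print(budget_remaining)                            # -4.0: error budget overspent 5x
```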

Stop, Drop, and SEV4: Why small incidents are a big deal

Derek Brown from Plaid gave a short and sweet talk exploring how to use low-impact incidents (SEV4) to exercise and hone your incident response and investigation process.

One of the big takeaways is the following: if you’re questioning whether to file an incident, just do it.

I also liked his introduction to Type 1 and 2 Decisions, or ‘one-way’ and ‘two-way’ decisions, respectively.

The 9.1 magnitude meltdown at Fukushima

Nickolas Means from MedScout provided a highly detailed and well-told narrative of the Fukushima Daiichi disaster- particularly how the many safeguards to prevent meltdown were circumvented by the tsunami- such as how the backup generators and batteries were below the flood line and therefore vulnerable.

It was also a great case study for the Westrum organizational model, particularly in comparing the ‘pathological’ and ‘generative’ team cultures exemplified by Japanese Prime Minister Naoto Kan and nuclear plant manager Masao Yoshida. It was a great reminder that leaders exert authority and influence in different proportions, and that control is ultimately an illusion.

My takeaway: had it not been for Yoshida’s leadership, and for deference to his experience rather than the PM’s authority, the disaster would have been orders of magnitude worse, dwarfing the scope of Chernobyl.

There is no such thing as a free lunch. How Slack runs their incident lunch exercise

In a more lighthearted tone, Scott Windels from Slack introduced how they do incident management training in the form of a tabletop exercise held during lunch.

The ‘incident’ is that lunch has not arrived for the event, and participants are tasked with ordering and picking it up while abiding by a set of constraints that keep the game interesting.

After his talk, he even handed out decks of cards containing random events and conditions that participants must navigate throughout the exercise.

I was so impressed with the idea that I’ll be adopting it as part of my Reliability Bootcamp when working with on-site teams.

Balancing reliability whilst scaling fast

This panel covered top-level values around incidents that I agree with, particularly on balancing reliability against the need to outpace the competition:

  • Some of the biggest incidents are caused by external dependencies. Sometimes the appropriate response is to ‘build your own’ to have full control over your reliability.
  • Sales and revenue organizations need to care about incidents as they affect the ability to grow the company. Collaboration between Tech and GTM makes you the right partner for your customers. (Oh, and btw: that’s what DevOps is all about- it’s not just about collaboration between operators and engineers, but the entire organization.)
  • Incidents, when handled well, can create positive outcomes in terms of building trust.

Culture over chaos: Fostering a positive On-call environment

Ryan Schroeder from Netflix wrapped up the day by providing a solid set of best practices from his 13 years of experience around how to best manage an on-call rotation.

One of the points that jumped out at me was being thoughtful about when to hand off the pager to the next on-call. Doing so during the day ensures an actual transfer of responsibility and context.

He also reminded me that performing retrospectives on the on-call experience in general is valuable. Pain points from the previous shift should be addressed, even if no incidents took place!

Conclusion

The SEV0 conference fills a huge need. As incidents have an ever-increasing impact on our lives, we need forums to discuss and advance how we manage them effectively and how we learn and grow from failure.

It was a great reminder that incidents are a multidisciplinary activity that requires constant practice and retrospection to do well.

If you’re passionate about evolving your incident response practices, keep an eye out for next year’s SEV0. They are just getting started!

I told CEO Stephen Whitworth that I’d be first in line to submit a talk next year when a call for papers is announced!
