oncall

On-Call Stories: Flying Blind

Anecdotes

19 Jul 20:00

Let’s try something new and recall one of my most memorable production incidents!

Earlier in my career, I managed an operations team at a medium-sized tech company. The main revenue-generating product consisted of thousands of EC2 instances, all depending on Puppet for configuration management.

Puppet, unlike more recent CM systems such as Ansible, used a centralized server to store config manifests and required authentication to apply them to clients. We configured our hosts...

Oncall Retrospectives

Best Practices

01 Feb 15:00

Last time I shared my thoughts on blameless postmortems and how they create a safe space for revealing process and technology gaps contributing to past incidents.

Today I want to introduce another opportunity for teams to learn and improve: the ‘oncall retrospective’, which:

Keeps the team in touch with the operational reality of their service(s);
Reveals opportunities to improve the oncall experience.

I was introduced to this practice by Jos Visser while...

Blameless Postmortems

Best Practices

27 Jan 10:00

Does your team conduct postmortems as part of their incident response process? It’s a great way to learn from failure and find opportunities to make your systems more reliable.

One piece of advice: make sure they are BLAMELESS.

This creates an environment of psychological safety, enabling your team to be more forthcoming about the factors that triggered or contributed to the incident, which allows those factors to be tracked and addressed.

In contrast, if your team feels...

CERTO MODO
