Parallel Distributed Shell

Tools

When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also benefit from observability platforms to automatically gather metrics from these resources.

However, there are times when both of these essential mechanisms are rendered ineffective during a production incident. How can an engineer troubleshoot and successfully implement a fix across a large number of hosts?

Enter the parallel distributed shell.

A parallel...
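
To make the idea concrete, here is a minimal sketch of the fan-out pattern these tools implement, using only Python's standard library. The host list and command are placeholders, and it assumes passwordless SSH access to each host; real tools such as pdsh or pssh add fan-out control, timeouts, and output aggregation on top of this.

    # Fan a single command out to many hosts over SSH and collect the
    # results in parallel. Hosts and command are placeholders.
    import subprocess
    from concurrent.futures import ThreadPoolExecutor

    HOSTS = ["web-01", "web-02", "web-03"]   # placeholder host list
    COMMAND = "uptime"                        # placeholder command

    def run_on_host(host):
        # BatchMode=yes fails fast instead of prompting for a password
        result = subprocess.run(
            ["ssh", "-o", "BatchMode=yes", host, COMMAND],
            capture_output=True, text=True, timeout=30,
        )
        return host, (result.stdout or result.stderr).strip()

    with ThreadPoolExecutor(max_workers=len(HOSTS)) as pool:
        for host, output in pool.map(run_on_host, HOSTS):
            print(f"{host}: {output}")

Running the SSH sessions in a pool is what matters at scale: checking or patching hundreds of hosts takes roughly as long as the slowest single host rather than the sum of all of them.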

Running Post-Mortems

Best Practices

This article continues the discussion on how your team can learn from failure after a production incident. Write-ups are important for capturing and documenting what took place, but the real value comes from an open and deliberate conversation with the team to identify the lessons learned and the improvements needed to build a more reliable system. That conversation is the post-mortem.

Post-mortems are the primary mechanism for teams to learn from failure....

Incident Write-ups

Best Practices

When is an incident considered ‘done’? Is it when the production impact has been addressed and the on-call goes back to bed? If that were true, teams would pass up a huge opportunity to learn from what the incident can teach them, and the on-call (and, more importantly, customers) would continue to have a sub-par experience from repeat incidents.

This post discusses the importance and process of the write-up, which is documenting an...

Incident Management: On-Call

Best Practices

In our Incident Management series, we’ve talked about how mature monitoring, escalation policies, and alerting enable a swift response when things go wrong in production. Let’s now talk about the people and processes that actually do the responding: the on-call rotation.

Simply put, an on-call rotation is a group of people who share the responsibility of being available to respond to emergencies on short notice, typically on a 24/7 basis. This practice...

Incident Management: Alerting

Best Practices

So far, our Incident Management series has discussed the importance of monitoring and a solid escalation policy for the swift detection of production issues. Both depend on a third capability that we will go over today: alerting.

Alerting notifies the engineering team so that it can respond to production problems appropriately and promptly, based on their severity.

Alerts tend to fall into three categories:

  • Page: meets the definition of an emergency and requires...
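
As a rough illustration of what severity-based routing can look like in code (the severity labels and destinations below are assumptions for the sketch, not this article's exact category definitions):

    # Rough sketch of severity-based alert routing. Labels and destinations
    # are illustrative assumptions.
    from dataclasses import dataclass

    @dataclass
    class Alert:
        name: str
        severity: str   # assumed labels: "critical", "warning", "info"

    def route(alert):
        # Placeholder destinations; a real system would call a paging or
        # ticketing API here.
        if alert.severity == "critical":
            return f"PAGE the on-call now: {alert.name}"
        if alert.severity == "warning":
            return f"OPEN a ticket for work hours: {alert.name}"
        return f"LOG for context only: {alert.name}"

    print(route(Alert("checkout p99 latency > 2s", "critical")))
    print(route(Alert("disk 80% full on web-03", "warning")))

The underlying idea is that severity decides how disruptive the notification is allowed to be, so only genuine emergencies wake someone up.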

Incident Management: Escalation Policy

Best Practices

Last time in our Incident Management series we discussed how monitoring is essential to responding quickly when things go wrong with your app’s availability or performance.

However, monitoring won’t be able to detect every failure. That is especially true for newly launched services, where monitoring is based on theory rather than experience.

How do you prepare for the situation where another team or even a customer (heaven forbid) reports a production issue?

The answer...

Incident Management: Monitoring

Best Practices

When you run a production system, one of your main responsibilities is being able to respond when things go wrong. This is especially important for newly launched or rapidly changing systems, where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

  • Monitoring (a computer is aware of your system’s health)
  • An escalation path (when monitoring doesn’t work)
  • Alerting (how to notify when something breaks)
  • An on-call rotation (who...
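
For a feel of what the first capability on that list means in practice, here is a bare-bones sketch using only Python's standard library; the endpoint, interval, and threshold are placeholders, and production monitoring systems layer time-series storage, dashboards, and alert routing on top of a loop like this.

    # Bare-bones health check: poll an endpoint and flag sustained failures.
    # Endpoint, interval, and threshold are placeholders.
    import time
    import urllib.request

    HEALTH_URL = "http://localhost:8080/healthz"   # placeholder endpoint
    CHECK_INTERVAL_SECONDS = 30
    FAILURE_THRESHOLD = 3   # consecutive failures before raising the flag

    failures = 0
    while True:
        try:
            with urllib.request.urlopen(HEALTH_URL, timeout=5) as resp:
                healthy = 200 <= resp.status < 300
        except OSError:   # covers URLError, HTTP errors, timeouts, refused connections
            healthy = False

        failures = 0 if healthy else failures + 1
        if failures >= FAILURE_THRESHOLD:
            # This is where alerting (covered later in the series) would
            # notify the on-call.
            print(f"UNHEALTHY: {failures} consecutive failed checks")

        time.sleep(CHECK_INTERVAL_SECONDS)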

Blameless Postmortems

Best Practices

Does your team conduct postmortems as part of their incident response process? It’s a great way to learn from failure and find opportunities to make your systems more reliable.

One piece of advice: make sure they are BLAMELESS.

This creates an environment of psychological safety, enabling your team to be more forthcoming about the factors that triggered or contributed to the incident, allowing those factors to be tracked and addressed.

In contrast, if your team feels...