In Defense of Shell Scripts

Best Practices

Throughout my career, I’ve been known as the engineer who solves a LOT of problems using shell scripts. With them, I’ve built:

  • a tool that creates and destroys test services at Meta to help new engineers learn how to use the company’s planetary-scale CD system
  • a parallel distributed shell that integrates with a company’s asset management system to generate the list of hosts to connect to
  • plugins for an ‘Ops API’ that enables...

How to Show Your Value In DevOps/SRE

Best Practices

Since you are reading this post, I am sure that you can relate to the classic plight of the IT, sysadmin, or Operations team: They are invisible until things go wrong.

For practitioners of DevOps and Site Reliability Engineering, that can also be true, especially for teams where the low-hanging fruit has already been addressed.

When the big outage happens, it’s all too common for management to have the knee-jerk reaction of asking questions like...

Automate Production In Three Steps!

Best Practices

An important process when running a production system is automating manual tasks. This is especially important for fast-growing companies, as the engineering team’s time can easily be eaten up by the toil involved in incident response, testing/releasing, etc., which prevents the team from implementing the features and improvements that enable further growth and revenue.

It is possible for a product to be a victim of its own success. Don’t let this happen to you!

This article provides...

Kanban Quickstart

Best Practices

This article introduces Kanban (看板), a very effective process for organizing your team’s work and driving improvements, especially if you are on an interrupt-driven team such as Site Reliability Engineering, Operations, IT, or Customer Support.

The essential part of the process is the kanban board, which consists of cards representing each work item. Cards move between columns, usually from left to right, with each column representing the state that the work item is in, such as:

  • Backlog
  • ...

Cross-Functional Collaboration

Best Practices

The most valuable and impactful work is done through others and not through the strivings of just one person. In the tech industry, creating customer value is a really complicated process and involves the efforts of different people, teams, and perspectives.

Consider a SaaS company: in order for it to be successful, groups like Engineering, Customer Success, Sales, Marketing, and Finance all need to exist and work in tandem to create a product that...

Running Post-Mortems

Best Practices

This article continues the discussion on how your team can learn from failure after a production incident. While write-ups are very important in capturing and documenting what took place, the real value is created from an open and deliberate conversation with the team to identify the lessons learned and the improvements needed to create a more reliable system. That conversation is the post-mortem.

Post-mortems are the primary mechanism for teams to learn from failure....

Incident Write-ups

Best Practices

When is an incident considered ‘done’? Is it when the production impact has been addressed and the on-call goes back to bed? If that were true, teams would pass up a huge opportunity to learn and improve from what the incident can teach them, and the on-call (and more importantly, customers) would continue to have a sub-par experience from repeat incidents.

This post discusses the importance and process of the write-up, which is documenting an...

Why Adopt DevOps & SRE?

Best Practices

This article explains why practices like DevOps and Site Reliability Engineering are essential for a successful technology business. Sure, they are touted as a way to change company culture and improve collaboration between teams, but what specific business value should you expect from investing in these capabilities?

Let’s start by remembering the goal of every business:

to make money by increasing throughput while simultaneously reducing inventory and operational expense.

In software development, let’s clarify the...

Incident Management: On-Call

Best Practices

In our Incident Management series, we’ve talked about how mature monitoring, escalation policies, and alerting enable a swift response when things go wrong in production. Let’s now talk about the people and processes that actually do the responding: the on-call rotation.

Simply put, an on-call rotation is a group of people that share the responsibility of being available to respond to emergencies on short notice, typically on a 24/7 basis. This practice...

Incident Management: Alerting

Best Practices

So far, our Incident Management series has discussed the importance of monitoring and a solid escalation policy in the swift detection of production issues. Both of them depend on a third capability that we will go over today: alerting.

Alerting notifies the engineering team about problems in production so that it can respond appropriately and promptly, based on their severity.

Alerts tend to fall into three categories:

  • Page: meets the definition of an emergency and requires...

Incident Management: Escalation Policy

Best Practices

Last time in our Incident Management series we discussed how monitoring is essential to responding quickly when things go wrong with your app’s availability or performance.

However, monitoring won’t be able to successfully detect every failure. That is especially true for newly-launched services where monitoring is based on theory and not experience.

How do you prepare for the situation where another team or even a customer (heaven forbid) reports a production issue?

The answer...

Incident Management: Monitoring

Best Practices

When running a production system, one of the main responsibilities is being able to respond when things go wrong. This is especially important for newly-launched or rapidly-changing systems where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

  • Monitoring (a computer is aware of your system’s health)
  • An escalation path (when monitoring doesn’t work)
  • Alerting (how to notify when something breaks)
  • An on-call rotation (who...

Running Successful Engagements

Best Practices

Previously we discussed several types of engagement models that SRE can use when collaborating with software engineering teams, as well as their tradeoffs. Let’s go over some ways in which SRE managers or team leads can successfully start and run an engagement!

To refresh, an SRE engagement can take the form of: taking on operational ownership of a service from an engineering team, embedding SREs on an engineering team, or providing a set of...

SRE Engagement Models

Best Practices

Last time we went over the basics of what it means to run an SRE team based on the original ideas that came from Google. Let’s talk about the ‘engagement model’, which describes the way that an individual SRE or team works with software engineering organizations to help them achieve their goals.

The SRE Workbook describes the various types of activities at length. In my experience, individual SRE engagements tend to fall into...

SRE Essentials

Best Practices

Interested in launching a Site Reliability Engineering (SRE) team? They have been gaining in popularity at tech companies for the past decade, and for good reason! They drive higher levels of operational maturity, remove sources of toil and incidents that slow the pace of feature delivery, and help make services more reliable (hence the name).

However, commissioning a team of engineers with the job title doesn’t guarantee you’ll reap the rewards! How do you...

Hidden Benefits Of SLOs

Best Practices

There are many articles online about Service Level Objectives (SLOs), particularly on the value they provide to customers as part of a Service Level Agreement (SLA).

Let’s discuss some of the benefits of SLOs that aren’t apparent at first glance.

Before we do, let’s quickly review the terminology from the source:

  • SLI: a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
  • SLO: is a service...

Production Readiness Review

Best Practices

Imagine: Your team has designed and developed the initial version of an amazing product with market fit, and you wish to offer it to paying customers as soon as possible. It’s time to prepare for launch!

Product launches exist on a razor’s edge between excitement and terror. They very much depend on first impressions that customers get when using your product:

  • If successful, you win the return on investment and the credibility needed to...

Oncall Retrospectives

Best Practices

Last time I shared my thoughts on blameless postmortems and how they create a safe space for revealing process and technology gaps contributing to past incidents.

Today I want to introduce another opportunity for teams to learn and improve: the ‘oncall retrospective’, which:

  • Keeps the team in touch with the operational reality of their service(s);
  • Reveals opportunities to improve the oncall experience.

I was introduced to this practice by Jos Visser while...

Blameless Postmortems

Best Practices

Does your team conduct postmortems as part of their incident response process? It’s a great way to learn from failure and find opportunities to make your systems more reliable.

One piece of advice: make sure they are BLAMELESS.

This creates an environment of psychological safety, enabling your team to be more forthcoming about the factors that triggered or contributed to the incident, so that those factors can be tracked and addressed.

In contrast, if your team feels...