Automating manual tasks is an important part of running a production system. This matters most for fast-growing companies, where the toil of incident response, testing, releasing, and similar work can easily eat up the engineering team’s time and keep them from implementing the features and improvements that enable further growth and revenue.
It is possible for a product to be a victim of its own success. Don’t let this happen to you!
This article provides a simple framework for iterating on sources of manual effort until software performs them automatically.
Where We Begin
To start, let’s describe where most teams are at the beginning of their automation journey. You can probably relate to this:
- The most experienced members of the team are often interrupted by manual tasks.
- They are the only ones on the team who do this work because the process exists only in their heads (tribal knowledge).
- Manual work is done inconsistently by the rest of the team, or not at all out of fear of causing a production incident.
- Therefore, the senior teammates become single points of failure and a bottleneck for getting work done.
How do we start to address this problem? We begin by documenting every manual task in a runbook: step-by-step instructions on how to do something.
In other words: runbooks are code for human beings.
Runbooks, especially in their earlier versions, can contain screenshots and even videos to make them easy for the team to follow.
The process is simple:
- Have senior teammates write a runbook for every undocumented task they are asked to do.
- If they are asked to do something that has already been documented in a runbook, delegate that work to someone else on the team.
- If anyone has trouble following a runbook, escalate to its author so they can improve it.
- Repeat until everyone on the team can go on vacation without any impact or delay to business operations.
Where should runbooks live? Two common options:
- Google Docs is a great place to get started. Documents keep a version history automatically, allow the embedding of media, and can be restricted to users who authenticate through SSO.
- GitHub/GitLab is also a useful place to store runbooks, especially when they are written in Markdown, which is automatically rendered to HTML in the web interface. Proposed changes can be discussed in a pull request, and each teammate can keep a local copy to consult in case the repository is unavailable.
At this stage, manual work is being done more consistently and readily thanks to runbooks. However:
- The time required for each task can still be significant.
- There is still a possibility of making mistakes.
What do we do next? It’s time to make our runbooks tool-assisted by taking sections from them and rewriting them as scripts. Ideally, all of the steps would be converted into a single script.
To decide which runbook(s) to automate first, analyze the issue queue. Which runbooks are being used the most? (You are tracking work in an issue queue, right?)
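As a rough illustration of that analysis, here is a minimal Python sketch that ranks runbooks by how often they appear in the issue queue. The issue records, field names (`runbook`, `minutes`), and runbook names are all made up for this example; a real queue would need to be exported or queried through whatever API your tracker provides.

```python
from collections import Counter

# Hypothetical export of the issue queue: each issue records which runbook
# (if any) was followed and how many minutes the work took.
issues = [
    {"id": 101, "runbook": "restart-worker", "minutes": 20},
    {"id": 102, "runbook": "rotate-api-key", "minutes": 45},
    {"id": 103, "runbook": "restart-worker", "minutes": 15},
    {"id": 104, "runbook": None, "minutes": 30},  # no runbook exists yet
    {"id": 105, "runbook": "restart-worker", "minutes": 25},
]


def rank_runbooks(issues):
    """Return (runbook, uses, total_minutes) tuples, most-used first."""
    uses = Counter()
    minutes = Counter()
    for issue in issues:
        name = issue["runbook"]
        if name is None:
            continue  # undocumented work: a candidate for a new runbook instead
        uses[name] += 1
        minutes[name] += issue["minutes"]
    # Sort by number of uses, breaking ties by total time spent.
    return sorted(
        ((name, uses[name], minutes[name]) for name in uses),
        key=lambda row: (row[1], row[2]),
        reverse=True,
    )


for name, count, total in rank_runbooks(issues):
    print(f"{name}: used {count} times, {total} minutes total")
```

The runbook at the top of the list is usually the best first candidate, although total time spent can be a better signal than raw count when a rarely used runbook is very slow to follow.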
Which programming languages are best for this purpose? Common scripting languages (Python, Bash, etc.) are the most useful, as they are easy to learn, have large communities, and offer many packages and libraries. For tools that need to be fast, consider a compiled language such as Rust or Go.
Once you have a tool written, change the runbook to reference when and how to use the tool for the steps that it replaces. Make sure that the runbook shows what the expected inputs/outputs should be.
These tools should provide appropriate input validation and error handling. Track and quickly resolve any bugs that are found.
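To make this concrete, here is a minimal sketch of what such a tool might look like in Python, using the standard library's `argparse`. The task (restarting a worker), the `worker-<number>` naming rule, and the `restart_worker` function are assumptions invented for this example; the real steps would come from your own runbook.

```python
import argparse
import re
import sys

# Hypothetical naming rule for this sketch: worker names look like "worker-12".
WORKER_NAME = re.compile(r"^worker-\d+$")


def worker_name(value):
    """argparse type that rejects anything that is not a valid worker name."""
    if not WORKER_NAME.match(value):
        raise argparse.ArgumentTypeError(f"invalid worker name: {value!r}")
    return value


def restart_worker(name, dry_run):
    """Placeholder for the real runbook steps (assumed, not a real API)."""
    if dry_run:
        return f"[dry-run] would restart {name}"
    # The real tool would call your own infrastructure here.
    raise NotImplementedError("wire this up to your own infrastructure")


def main(argv=None):
    parser = argparse.ArgumentParser(description="Restart a single worker.")
    parser.add_argument("worker", type=worker_name)
    parser.add_argument("--dry-run", action="store_true",
                        help="print what would happen without doing it")
    # Invalid input makes argparse print a usage message and exit with code 2.
    args = parser.parse_args(argv)
    try:
        print(restart_worker(args.worker, args.dry_run))
    except Exception as exc:
        # Fail loudly with a non-zero exit code so callers can detect it.
        print(f"error: {exc}", file=sys.stderr)
        return 1
    return 0
```

In a real script you would finish with `sys.exit(main())` under the usual `if __name__ == "__main__":` guard. A `--dry-run` flag like this one is a cheap way to build trust: teammates can see what the tool would do before letting it touch production.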
Continue automating away portions of each runbook until the whole task is performed consistently and safely with a single invocation of its corresponding tool. Any member of the team should be able to run these tools with full confidence.
Manual tasks are now done quickly thanks to a set of tools that we trust. Hurray!
However, the tools must still be run by a human with access to production. No matter how fast the script is, human involvement adds time. For example, when responding to an alert in the middle of the night, the on-call engineer needs to wake up, acknowledge the alert, find their laptop, log into production, read the alert details, then follow the corresponding runbook and run the tool.
We can do better. Ideally, these tasks should be performed instantly at the moment of need.
The final step is to convert these scripts into features of a full-fledged software service. Common patterns to achieve this are:
- Self-service interfaces: running code in response to a customer’s request
- Auto-remediation: event handling from the monitoring system
Like any other feature, make sure that they have sufficient testing, observability, and monitoring to ensure their health.
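As one possible shape for the auto-remediation pattern, the Python sketch below routes alerts from the monitoring system to registered handlers and falls back to paging a human when no handler exists or the handler fails. The alert format, the alert name `disk_almost_full`, and the `clean_tmp` and `page_on_call` functions are all hypothetical; every monitoring system defines its own payloads and integrations.

```python
# Registry mapping alert names to remediation functions.
# Alert names and payload fields are assumptions made for this sketch.
REMEDIATIONS = {}


def remediation(alert_name):
    """Register a function as the automatic response to a named alert."""
    def register(func):
        REMEDIATIONS[alert_name] = func
        return func
    return register


@remediation("disk_almost_full")
def clean_tmp(alert):
    # Stand-in for the logic that used to live in a script or runbook.
    return f"cleaned temp files on {alert['host']}"


def page_on_call(alert, reason):
    # Stand-in for a call to your paging system.
    return f"paged on-call for {alert['name']}: {reason}"


def handle_alert(alert):
    """Run the matching remediation, or escalate to a human."""
    handler = REMEDIATIONS.get(alert["name"])
    if handler is None:
        return page_on_call(alert, "no automated remediation")
    try:
        return handler(alert)
    except Exception as exc:
        # Automation failed: a human still needs to look at this.
        return page_on_call(alert, f"remediation failed: {exc}")


print(handle_alert({"name": "disk_almost_full", "host": "db-1"}))
print(handle_alert({"name": "cpu_high", "host": "web-3"}))
```

Keeping the human escalation path as the fallback means the on-call engineer is only woken up for the cases that the software cannot yet handle.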
The toil involved in operating a production system adds up as the number of customers increases, and it will eventually reduce your engineering team’s effectiveness. A deliberate approach to automation, progressing from runbooks to tools to full features of your services, lets you maintain the level of value delivery that your customers have come to expect.
This is the method I use to drive dramatic improvements for engineering teams at Big Tech and enterprise SaaS companies. Schedule time with me to learn more!
(Image credit: Pavel Danilyuk)