Interested in launching a Site Reliability Engineering (SRE) team? SRE teams have been gaining popularity at tech companies for the past decade, and for good reason! They drive higher levels of operational maturity, remove sources of toil and incidents that slow the pace of feature delivery, and help make services more reliable (hence the name).
However, commissioning a team of engineers with the job title doesn’t guarantee you’ll reap the rewards! How do you ensure your new team’s success?
Let’s start by going straight to the source and discussing the original SRE practices and how they differ from those of classic IT or Operations teams.
The following set of practices was presented in 2014 by Benjamin Treynor Sloss, the VP of SRE at Google:
- Hire only coders.
- Have an SLA for your service.
- Measure and report performance against the SLA.
- Use Error Budgets and gate launches on them.
- Have a common staffing pool for SRE and Developers.
- Have excess Ops work flow to the Dev team.
- Cap SRE operational load at 50%.
- Share 5% of Ops work with the Dev team.
- On-call teams should have at least 8 people at one location, or 6 people at each of multiple locations.
- Aim for a maximum of two events per on-call shift.
- Do a postmortem for every event.
- Postmortems are blameless and focus on process and technology, not people.
At Google, a Site Reliability Engineering team would assume almost all of the operational responsibility for services on behalf of a software engineering (SWE) team. So long as the service continued to meet the reliability targets defined by its SLOs and the operational load was sustainable, this partnership between SRE and SWE was equitable, freeing SWE time to write more features. If not, SRE handed the pager back to the SWE team and, in extreme cases, ended the engagement entirely.
This is a huge departure from the previous Ops team model, particularly in how it manages expectations and incentives around ops work. SRE can vote with its feet and decline to run a service that creates a poor work experience, which forces SWE teams to think about non-functional requirements as part of day-to-day work. A small amount of ops work also remains with SWE teams: just enough for them to stay in touch with the state of production. Finally, it removes the organizational dysfunction centered on the idea that ops is a ‘cost center’; instead, ops becomes part of the software engineering discipline.
Of course, most companies aren’t as large as Google and don’t have the budget to build entire SRE teams that operate in this fashion. What ultimately matters is that the following checks and balances are in place when SREs engage with a software engineering team:
- Operational responsibility is shared and managed via automation: the team is incentivized to eliminate sources of toil.
- Customer success is quantitatively measured: Service Level Objectives (SLOs) provide a clear picture of how the team’s actions affect production, which informs decision-making.
- Error budgets inform work prioritization and shipping new features: if production no longer performs at the level required for customer success, the team stops shipping new features until it does. Conversely, if the service is relatively healthy, the team is encouraged to take more risks.
- The team learns from failure in a blameless way: there is sufficient psychological safety to speak openly about the contributing factors to an incident.
- On-call rotations are properly staffed and humane: provide 24/7 availability for production incidents without creating a massive impact on the team’s work-life balance.
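To make the error budget idea above a little more concrete, here is a minimal Python sketch. It assumes a request-based availability SLO (for example, 99.9% of requests succeed over some window). The `ErrorBudget` class, its field names, and the "ship only while budget remains" rule are illustrative assumptions, not a standard API or the policy of any particular team.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ErrorBudget:
    """Hypothetical request-based error budget for one SLO window."""

    slo_target: float      # assumed SLO, e.g. 0.999 means 99.9% of requests succeed
    total_requests: int    # requests served during the window
    failed_requests: int   # requests that violated the SLO during the window

    @property
    def allowed_failures(self) -> float:
        # The budget is whatever the SLO leaves over: (1 - target) * traffic.
        return (1.0 - self.slo_target) * self.total_requests

    @property
    def remaining_fraction(self) -> float:
        # 1.0 means the budget is untouched; 0.0 or below means it is spent.
        allowed = self.allowed_failures
        if allowed == 0:
            return 0.0 if self.failed_requests else 1.0
        return 1.0 - self.failed_requests / allowed

    def can_ship_features(self) -> bool:
        # Assumed policy: keep shipping only while some budget remains.
        return self.remaining_fraction > 0.0


if __name__ == "__main__":
    budget = ErrorBudget(slo_target=0.999, total_requests=1_000_000, failed_requests=400)
    print(f"{budget.remaining_fraction:.0%} of the error budget remains")
    print("ship features" if budget.can_ship_features() else "focus on reliability")
```

A real policy would likely be more nuanced (burn-rate alerts, rolling windows, exceptions for security fixes), but even a simple check like this gives the team a shared, quantitative signal for when to slow down and when to take more risks.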
If your implementation has the above values built in and actively practiced, you are running an SRE team.
Next time we’ll discuss the different ‘engagement models’ for how SRE interacts with software engineering teams, giving you a range of options for making your products more reliable!
Want hands-on assistance with building an SRE competency at your company? Let’s schedule an introduction!
(Image credit: Marcus Spiske)