the Certain Way we do things

Assessing SLO Maturity

Events

Jul

10:00

Last year, I shared a framework for defining and using Effective SLOs, helping teams understand the health of their systems to guide real decision-making.

That’s great when you’re an SRE introducing SLOs for the first time. But what if you’re responsible for reliability across an entire organization? How do you assess whether a team’s SLOs actually set them up for success?

On August 20th at 14:00 ET, I’m teaming up with Nobl9 for...

SRE is Ops With Boundaries

Media

Feb

16:00

Join fellow SRE Paige Cruz and yours truly on an exploration of my history being on-call, using multiple generations of observability tools, and how to make the experience as painless as possible.

One of the major points of the discussion is how SRE sets boundaries around taking on-call burden on behalf of engineering teams in contrast to classic IT Operations teams.

I also share a funny story about the last page I received while...

2024 In Review

Announcements

Dec

12:00

2024 was a very eventful year in the tech industry, with Certo Modo being no exception to the rule.

The year started strong, with a full docket of engagements carried over from 2023. Many of these clients were former colleagues from my corporate days who needed a hand with culture, leadership and technical matters. Eventually, though, that stream of work tapered off. The major challenges my clients faced had been resolved, leaving them equipped to...

New Podcast: Reliability Rebels

Media

Nov

10:00

I’m pleased to announce that I’ve launched a new podcast: Reliability Rebels!

Over the past couple of years, I’ve had the privilege of being a guest on several tech podcasts (and will continue to do so), but I decided to create my own.

(And yes, I produced the intro music!)

I wanted to explore how people in tech sometimes have to challenge the status quo to improve their systems, as that was definitely my experience across...

Thoughts On The First SEV0 Conference

Events

Sep

14:00

As systems grow larger and more complex, mastering incident response isn’t just a necessity; it’s critical for a tech company’s survival.

SEV0, hosted by incident.io in San Francisco a few days ago, tackled this head-on, bringing together thought leaders and practitioners to share best practices, hard-earned lessons, and bold new ideas in the world of incident management.

As you know, I’m pretty obsessive about the end-to-end process of incident response, so of course...

Effective SLOs Workshop

Media

Sep

20:00

A few months back, I presented the Effective SLOs webinar, where we discussed how to select, implement, and iterate on Service Level Objectives (SLOs), a cornerstone of how we ensure the reliability of our systems.

(If you haven’t seen the recording yet, you can access it here.)

Today, I’m excited to announce the release of a companion workshop, which is available for download.

This workshop offers hands-on experience, guiding participants through the...

In Defense of Time Tracking

Best Practices

Sep

18:00

Time tracking gets a bad rap. It’s easy to forget, adds extra steps to an already busy schedule, and often feels like micromanagement. But hear me out—when used strategically, it can be a game-changer for engineering teams.

Let me be clear: I’m not suggesting that you track every minute of your day. In knowledge work, a lot of time is spent in thinking and discussion. Engineers should be assessed by the impact they deliver, not...

Announcing: sev0.help

Announcements

Aug

12:00

I’m pleased to announce a new service as part of Certo Modo’s portfolio: sev0.help.

sev0.help offers ‘on-call as a service’, providing instant access to experienced Site Reliability Engineers (SREs). Whenever you experience a major outage that your team can’t address on their own, simply page us, and within 15 minutes, an SRE will join your chatroom, video call, or conference bridge to deliver top-notch incident management and distributed systems troubleshooting, all at a...

Webinar: Effective SLOs

Events

21

May

16:00

Let’s get real: Service Level Objectives are hard to get right. They are indeed a transformative technique for making services reliable; however, implementing them involves many potential pitfalls and antipatterns that can lead to frustration. Let’s explore several that I’ve observed in my career!

Common SLO Pitfalls

SLOs can be hard to explain

It can be a challenge to clearly articulate what they are and the value they provide, and quoting...

An Open Letter To Product Management

Media

05

Apr

13:00

Hey, product managers!

I’m an engineer. We need to talk! (I promise not to spout technical jargon at you.)

Let’s be honest: our two groups don’t see eye to eye as much as we should. Perhaps now is a chance to change that!

To start, we (as engineers) understand that your job is to take the product’s vision (informed by customer desire) and bring it into reality. We get that it can be...

Webinar: Lean SRE

Events

18

Mar

16:00

When we think about Site Reliability Engineering, we tend to associate it with large tech companies that have the budget to build entire departments to improve production. I think that smaller organizations and startups sadly avoid adopting these practices due to that misconception.

I argue that SRE can be implemented by much smaller companies and yield significant benefits in reduced operational costs and time savings, freeing them to build a more compelling product.

Nothing is...

In Defense of Shell Scripts

Best Practices

26

Feb

16:00

Throughout my career, I’ve been known as the engineer who solves a LOT of problems using shell scripts. With them, I’ve built:

a tool that creates and destroys test services at Meta to help new engineers learn how to use the company’s planetary-scale CD system

a parallel distributed shell that integrates with a company’s asset management system to generate the list of hosts to connect to

plugins for an ‘Ops API’ that enables...

Slight Reliability Ep82: CI/CD

Media

13

Feb

14:00

Another appearance on the Slight Reliability Podcast! This time we go over the basics of CI/CD, change management, my experience running a Change Advisory Board (CAB), testing in prod, and how to treat your test/deploy infrastructure!

How to Show Your Value In DevOps/SRE

Best Practices

14

Dec

07:00

Since you are reading this post, I am sure that you can relate to the classic plight of the IT, sysadmin, or Operations team: They are invisible until things go wrong.

For practitioners of DevOps and Site Reliability Engineering, that can also be true, especially for teams where the low-hanging fruit has already been addressed.

When the big outage happens, it’s all too common for management to have the knee-jerk reaction of asking questions like...

Building DroneCI Pipelines

Tools

14

Nov

09:00

Last time I covered several tips on how to launch and operate a Drone CI installation. As promised, I will now reveal my hard-earned secrets on how to build, configure, and monitor DroneCI pipelines!

This assumes pipelines using the Docker runner, which is the common use case (and the most useful!).

Each pipeline step can use entirely different container images

Yes, this is implicit with the use of the Docker runner, however, take a...

Running DroneCI

Tools

29

Oct

16:00

In a previous post, I explored why Jenkins should no longer be the default choice for CI/CD for new software projects. This time, let’s discuss an alternative that I’ve gotten quite familiar with recently: Drone CI.

Drone is simply described as a ‘self-service Continuous Integration platform for busy development teams’. Configuring a CI pipeline is as simple as activating the repo in the web UI and committing a .drone.yml file in the project’s...

Slight Reliability Ep70: Meta SRE

Media

08

Oct

16:00

I return to the Slight Reliability Podcast to discuss my experience in Meta’s Production Engineering… and tell a story about how I almost burnt down a server room early in my career! Don’t miss this one!

Ansible Tips and Tricks

Tools

02

Oct

16:00

Configuration management is an essential competency when running production systems. It enables you to define the intended state of your servers as code rather than through manual effort, saving a lot of time in the process.

Throughout my career, I’ve used Ruby-based configuration management tools like Puppet and Chef; recently, however, I have started to use Ansible for client projects.

Ansible is accessible to newcomers as:

no programming experience is required;
...

It's Time to Stop Using Jenkins

Tools

18

Sep

19:00

Jenkins is an ‘open source automation server’ commonly used by many tech companies for Continuous Integration and Continuous Delivery of software projects. It was first released in 2005 under its original name, Hudson. Its large collection of available plugins (~1800!), particularly Pipeline, enables teams to automate common operations tasks such as building, testing, and releasing.

All of that being said: it’s time to consider alternatives, especially for new software projects. This article outlines the...

Automate Production In Three Steps!

Best Practices

06

Sep

16:00

An important process when running a production system is automating manual tasks. This is especially important for fast-growing companies, as the engineering team’s time can easily be eaten up by the toil involved in incident response, testing/releasing, etc., preventing them from implementing the features and improvements that enable further growth and revenue.

It is possible for a product to be a victim of its own success. Don’t let this happen to you!

This article provides...

Video: Beating Big Tech Coding Interviews

Media

23

Aug

16:00

On Aug 19th I presented this talk at the monthly Vegas Programmers Meetup. This is an excellent follow-up to the post “How to Get an SRE Role” as it goes in-depth on how to prepare for one of the most difficult parts of the process.

(Image Credit: This is Engineering)

Podcast Appearance: All Things Ops

Media

18

Aug

07:00

Another podcast! This week I’m a guest on All Things Ops from CheckMK!

(I used CheckMK years ago as it provided an improved interface and plugin system over stock Nagios.)

Host Elias Voelker and I discussed:

What makes the perfect Site Reliability Engineer?

The reasons for and benefits of a DevOps transformation

The most important tools for modern Site Reliability Engineering

Real behind-the-scenes stories of major outages

One of my most...

Cloud Lessons: Launching a K3S Cluster

Tools

03

Aug

16:00

I’m starting a new series where I share my experiences exploring cloud-native/platform engineering tools and technologies, starting with building the foundation: a Kubernetes installation in the cloud.

Why am I doing this?

to keep these skills sharp as I anticipate using them on client engagements!

to share lessons learned, for your benefit!

because it’s fun! 😀

Today’s mission: get a simple Kubernetes cluster online in the cloud, using infrastructure-as-code! Since I’m an...

Kanban Quickstart

Best Practices

28

Jul

15:00

This article introduces Kanban (看板), a very effective process for organizing your team’s work and driving improvements, especially if you are on an interrupt-driven team such as Site Reliability Engineering, Operations, IT, or Customer Support.

The essential part of the process is the kanban board, which consists of cards representing each work item. Cards are moved between columns representing the work item’s state, usually from left to right, such as:

Backlog
...

Podcast Appearance: Day Two Cloud

Media

20

Jul

07:00

I’m continuing my tour as a guest on tech podcasts! This time I’m on the Day Two Cloud podcast from Packet Pushers which focuses on the realities of cloud adoption.

I really enjoyed the conversation with hosts Ned Bellavance and Ethan Banks, who were both very insightful and funny!

Don’t miss this one as it was an action-packed discussion! Together, we covered:

What it means to be an SRE

How an SRE differs...

On-Call Stories: Flying Blind

Anecdotes

19

Jul

20:00

Let’s try something new and recall one of my most memorable production incidents!

Earlier in my career, I managed an operations team at a medium-sized tech company. The main revenue-generating product consisted of thousands of EC2 instances, all depending on Puppet for configuration management.

Puppet, unlike recent CM systems like Ansible, used a centralized server to store config manifests and required authentication in order to apply them to clients. We configured our hosts...

Podcast Appearance: Slight Reliability

Media

12

Jul

20:00

Another podcast guest appearance! This time I’m on the Slight Reliability podcast, which answers “what is site reliability engineering (SRE) really about?”.

(I’m on the road this week! Next week we’ll return to our usually-scheduled articles.)

In this episode, host Stephen Townshend and I cover a lot of ground including making ops work visible, measuring toil, the power of calculating the monetary value of work, getting developers on-call, the embedded model for SRE, SLOs,...

Podcast Appearance: Practical Operations

Media

28

Jun

10:00

This week I’m a guest on the Practical Operations podcast, which focuses on “systems, operations and scaling with a focus on real world use cases and solutions to common problems”.

We discuss my experience with DevOps transformations, running a Site Reliability Engineering team, and working as a consultant!

Episode 137 - Amin Astaneh

I highly recommend following this podcast as the hosts are very knowledgeable and are really entertaining to listen to!
...

Parallel Distributed Shell

Tools

22

Jun

18:00

When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also benefit from observability platforms to automatically gather metrics from these resources.

However, there are times when both of these essential mechanisms are rendered ineffective during a production incident. How can an engineer troubleshoot and implement a fix successfully across a large number of hosts?

Enter the parallel distributed shell.

A parallel...
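
As a rough illustration of the idea (a minimal sketch, not the tool described in this post), the Python below fans a single command out to many hosts concurrently and collects each host's exit code and output. The function names `ssh_argv` and `run_on_hosts` are hypothetical, and the sketch assumes key-based SSH access to every host is already configured.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, Dict, List, Tuple


def ssh_argv(host: str, command: str) -> List[str]:
    # Assumption: non-interactive, key-based SSH already works for each host.
    return ["ssh", "-o", "BatchMode=yes", host, command]


def run_on_hosts(
    hosts: List[str],
    command: str,
    build_argv: Callable[[str, str], List[str]] = ssh_argv,
    max_workers: int = 16,
    timeout: float = 30.0,
) -> Dict[str, Tuple[int, str]]:
    """Run `command` on every host in parallel; return {host: (exit_code, stdout)}."""

    def run_one(host: str) -> Tuple[str, Tuple[int, str]]:
        try:
            proc = subprocess.run(
                build_argv(host, command),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return host, (proc.returncode, proc.stdout.strip())
        except (OSError, subprocess.TimeoutExpired) as exc:
            # A host that cannot be reached must not block the others.
            return host, (-1, str(exc))

    # Threads are sufficient here: the work is waiting on remote processes.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, hosts))
```

A call such as `run_on_hosts(["web1", "web2"], "uptime")` (hostnames are placeholders) would return one result per host. Passing a different `build_argv` swaps the transport without touching the fan-out logic.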

Cross-Functional Collaboration

Best Practices

09

Jun

16:00

The most valuable and impactful work is done through others and not through the strivings of just one person. In the tech industry, creating customer value is a really complicated process and involves the efforts of different people, teams, and perspectives.

Consider a SaaS company: in order for it to be successful, groups like Engineering, Customer Success, Sales, Marketing, and Finance all need to exist and work in tandem to create a product that...

System Call Tracing

Tools

09

Jun

16:00

I want to introduce one of the most powerful techniques in our arsenal when supporting production systems: system call tracing. But first: what is a system call?

Simply put, system calls are how programs interact with the operating system to request and manage resources like memory, files, network sockets, and hardware devices.

System call tracing allows you to observe the behavior of running processes and how they use those resources in real time.
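
To make the idea of a system call concrete, here is a small Python sketch (an illustration under stated assumptions, not code from this post). It uses the thin wrappers in the standard `os` module, each of which asks the operating system to do the work. The function name `write_and_read_back` is hypothetical, and the syscall names in the comments are the typical Linux ones; the exact calls vary by platform.

```python
import os
import tempfile


def write_and_read_back(message: bytes) -> bytes:
    """Round-trip bytes through a temporary file using low-level os calls."""
    fd, path = tempfile.mkstemp()         # file creation: open/openat
    try:
        os.write(fd, message)             # write
        os.lseek(fd, 0, os.SEEK_SET)      # lseek back to the start of the file
        data = os.read(fd, len(message))  # read
    finally:
        os.close(fd)                      # close
        os.unlink(path)                   # unlink/unlinkat removes the file
    return data


if __name__ == "__main__":
    print(write_and_read_back(b"hello, kernel\n"))
```

Running a script like this under a tracer such as `strace` on Linux should show these requests as they reach the kernel, alongside the many other calls the interpreter makes on startup.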

Why is...

Video: SRE, Demystified

Media

05

Jun

10:00

On May 30th I presented this talk at the monthly Boston DevOps Meetup. It serves as an excellent introduction to the ideas and practices behind Site Reliability Engineering and provides food for thought when starting your own team. Enjoy!

(Image Credit: Kelvin Augustinus)

How To Get an SRE Role

Career

01

Jun

12:00

Are you a software engineer or an IT professional interested in transitioning to an SRE role? You’ve come to the right place! This article provides guidance on the skills and behaviors needed to successfully apply for an SRE position at medium-to-large-sized tech companies.

(This article was inspired by a discussion that took place in the Boston DevOps community chatroom.)

Before we begin, I want to immediately mention the “DevOps Roadmap” brought up in...

Running Post-Mortems

Best Practices

23

May

12:00

This article continues the discussion on how your team can learn from failure after a production incident. While write-ups are very important in capturing and documenting what took place, the real value is created from an open and deliberate conversation with the team to identify the lessons learned and the improvements needed to create a more reliable system. That conversation is the post-mortem.

Post-mortems are the primary mechanism for teams to learn from failure....

Incident Write-ups

Best Practices

12

May

12:00

When is an incident considered ‘done’? Is it when the production impact has been addressed and the on-call goes back to bed? If that were true, teams would pass up a huge opportunity to learn and improve from what the incident can teach them, and the on-call (and more importantly, customers) would continue to have a sub-par experience from repeat incidents.

This post discusses the importance and process of the write-up, which is documenting an...

Why Adopt DevOps & SRE?

Best Practices

05

May

10:00

This article explains why practices like DevOps and Site Reliability Engineering are essential for a successful technology business. Sure, they are touted as a way to change company culture and improve collaboration between teams, but what specific business value should you expect from investing in these capabilities?

Let’s start by remembering the goal of every business:

to make money by increasing throughput while simultaneously reducing inventory and operational expense.

In software development, let’s clarify the...

Incident Management: On-Call

Best Practices

28

Apr

10:00

In our Incident Management series, we’ve talked about how mature monitoring, escalation policies, and alerting enable a swift response when things go wrong in production. Let’s now talk about the people and processes that actually do the responding: the on-call rotation.

Simply put, an on-call rotation is a group of people that share the responsibility of being available to respond to emergencies on short notice, typically on a 24/7 basis. This practice...

Incident Management: Alerting

Best Practices

20

Apr

15:00

So far, our Incident Management series has discussed the importance of monitoring and a solid escalation policy in the swift detection of production issues. Both of them depend on a third capability that we will go over today: alerting.

Alerting notifies the engineering team so that it can respond to problems in production appropriately and promptly, based on their severity.

Alerts tend to fall into three categories:

Page: meets the definition of an emergency and requires...

Incident Management: Escalation Policy

Best Practices

13

Apr

14:00

Last time in our Incident Management series we discussed how monitoring is essential to responding quickly when things go wrong with your app’s availability or performance.

However, monitoring won’t be able to successfully detect every failure. That is especially true for newly-launched services where monitoring is based on theory and not experience.

How do you prepare for the situation where another team or even a customer (heaven forbid) reports a production issue?

The answer...

Incident Management: Monitoring

Best Practices

10

Apr

09:00

When running a production system, one of the main responsibilities is being able to respond when things go wrong. This is especially important for newly-launched or rapidly-changing systems where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

Monitoring (a computer is aware of your system’s health)

An escalation path (when monitoring doesn’t work)

Alerting (how to notify when something breaks)

An on-call rotation (who...

Running Successful Engagements

Best Practices

27

Mar

20:00

Previously we discussed several types of engagement models that SRE can use when collaborating with software engineering teams, as well as their tradeoffs. Let’s go over some ways in which SRE managers or team leads can successfully start and run an engagement!

To refresh, an SRE engagement can take the form of: taking on operational ownership of a service from an engineering team, embedding SREs on an engineering team, or providing a set of...

SRE Engagement Models

Best Practices

20

Mar

09:00

Last time we went over the basics of what it means to run an SRE team based on the original ideas that came from Google. Let’s talk about the ‘engagement model’, which describes the way that an individual SRE or team works with software engineering organizations to help them achieve their goals.

The SRE Workbook describes the various types of activities at length; in my experience, individual SRE engagements tend to fall into...

SRE Essentials

Best Practices

06

Mar

11:00

Interested in launching a Site Reliability Engineering (SRE) team? SRE teams have been gaining in popularity at tech companies for the past decade, and for good reason! They drive higher levels of operational maturity, remove sources of toil and incidents that slow the pace of feature delivery, and help make services more reliable (hence the name).

However, commissioning a team of engineers with the job title doesn’t guarantee you’ll reap the rewards! How do you...

Free Reliability Coaching

Announcements

06

Mar

11:00

As part of my reliability coaching service, I’ve made a bold decision: all new clients can schedule their first hour with me, absolutely free.

Here’s why I’m doing this:

On-call is a tough job.

I know engineers (including myself) who have spent many late nights, weekends, and even holidays away from their friends and family because they were busy on their computer or on a conference call addressing a production incident.

That...

Hidden Benefits Of SLOs

Best Practices

27

Feb

15:00

There are many articles online about Service Level Objectives (SLOs), particularly on the value they provide to customers as part of a Service Level Agreement (SLA).

Let’s discuss some of the benefits of SLOs that aren’t apparent at first glance.

Before we do, let’s quickly review the terminology from the source:

SLI: a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.

SLO: is a service...
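
As a small worked example of these terms (a sketch under stated assumptions, not taken from this post), the Python below computes one common SLI, the fraction of successful requests, and checks it against an SLO. It assumes the widely used reading of an SLO as a target value for an SLI. The function names and the request counts are hypothetical.

```python
def availability_sli(good_events: int, total_events: int) -> float:
    """SLI: the fraction of events that were 'good' (e.g. successful requests)."""
    if total_events == 0:
        # Assumption: with no traffic, treat the service as fully available.
        return 1.0
    return good_events / total_events


def meets_slo(sli: float, target: float) -> bool:
    """SLO check: is the measured SLI at or above the agreed target?"""
    return sli >= target


if __name__ == "__main__":
    # Hypothetical numbers: 999,500 successful requests out of 1,000,000.
    sli = availability_sli(999_500, 1_000_000)
    print(sli, meets_slo(sli, target=0.999))
```

With these illustrative numbers the measured availability is 99.95%, which satisfies a 99.9% target but would miss a 99.99% one.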

Production Readiness Review

Best Practices

20

Feb

15:00

Imagine: Your team has designed and developed the initial version of an amazing product with market fit, and you wish to offer it to paying customers as soon as possible. It’s time to prepare for launch!

Product launches exist on a razor’s edge between excitement and terror. They very much depend on first impressions that customers get when using your product:

If successful, you win the return on investment and the credibility needed to...

Observability In A Box

Tools

14

Feb

09:00

I believe we’re entering a golden age of observability: we can gather metrics from our applications and infrastructure, better interpret them with query languages and pretty dashboards, and get notifications in chatrooms and our oncall systems. All of this technology is at our fingertips, without any software licensing fees!

The challenge I see with these new tools is that they tend to assume ‘cloud-native’ infrastructure: the happy path for setup and configuration usually requires a container...

Emotional Intelligence

Soft Skills

09

Feb

11:00

When we discuss useful tools in the DevOps and SRE space, we tend to speak in terms of technology (e.g., observability, configuration management, container orchestration, CI/CD). These tools enable us to be successful by introducing reliability and efficiency to the systems that support our products.

They are ubiquitous; discussed in places like Hacker News, supported by large communities, have meetups and conferences, and the enterprise versions are aggressively sold and advertised… even in unlikely places...

Oncall Retrospectives

Best Practices

01

Feb

15:00

Last time I shared my thoughts on blameless postmortems and how they create a safe space for revealing process and technology gaps contributing to past incidents.

Today I want to introduce another opportunity for teams to learn and improve: the ‘oncall retrospective’, which:

Keeps the team in touch with the operational reality of their service(s);

Reveals opportunities to improve the oncall experience.

I was introduced to this practice by Jos Visser while...

Blameless Postmortems

Best Practices

27

Jan

10:00

Does your team conduct postmortems as part of their incident response process? It’s a great way to learn from failure and find opportunities to make your systems more reliable.

One piece of advice: make sure they are BLAMELESS.

This creates an environment of psychological safety, enabling your team to be more forthcoming about the factors that triggered or contributed to the incident, so that those factors can be tracked and addressed.

In contrast, if your team feels...

Launch!

Announcements

28

Dec

16:00

Starting today I’ve officially hung up my virtual shingle and started this consulting business to help software engineering teams operate their products/services more efficiently and with less pain.

The road to getting here was full of interesting twists and turns!

Just a couple of months ago I was a Production Engineering manager at Meta, helping run one of their most important internal products. It basically streamlined how teams built, launched, and operated simple backend services...
