I sat down with Better Stack’s podcast to discuss the AI velocity -> operational chaos thesis, my time at Meta, my SRE career in general, and a few tales from my life on the road!
Thanks Better Stack for an action-packed conversation!
Join fellow SRE Paige Cruz and yours truly on an exploration of my history being on-call, using multiple generations of observability tools, and how to make the experience as painless as possible.
One of the major points of the discussion is how SRE sets boundaries around taking on-call burden on behalf of engineering teams in contrast to classic IT Operations teams.
I also share a funny story about the last page I received while...
I’m pleased to announce that I’ve launched a new podcast: Reliability Rebels!
Over the past couple of years, I’ve had the privilege of being a guest on several tech podcasts (and will continue to do so); however, I decided to create my own.
(And yes, I produced the intro music!)
I wanted to explore how people in tech sometimes have to challenge the status quo to improve their systems, as that was definitely my experience across...
As systems grow larger and more complex, mastering incident response isn’t just a necessity; it’s critical for a tech company’s survival.
SEV0, hosted by incident.io in San Francisco a few days ago, tackled this head-on, bringing together thought leaders and practitioners to share best practices, hard-earned lessons, and bold new ideas in the world of incident management.
As you know, I’m pretty obsessive about the end-to-end process of incident response, so of course...
A few months back, I presented the Effective SLOs webinar, where we discussed how to select, implement, and iterate on Service Level Objectives (SLOs), a cornerstone of how we ensure the reliability of our systems.
(If you haven’t seen the recording yet, you can access it here.)
Today, I’m excited to announce the release of a companion workshop, which is available for download.
This workshop offers hands-on experience, guiding participants through the...
Hey, product managers!
I’m an engineer. We need to talk! (I promise not to spout technical jargon at you.)
Let’s be honest: our two groups don’t see eye to eye as much as we should. Perhaps now is a chance to change that!
To start, we (as engineers) understand that your job is to take the product’s vision (informed by customer desire) and bring it into reality. We get that it can be...
Another appearance on the Slight Reliability Podcast! This time we go over the basics of CI/CD, change management, my experience running a Change Advisory Board (CAB), testing in prod, and how to treat your test/deploy infrastructure!
I return to the Slight Reliability Podcast to discuss my experience in Meta’s Production Engineering… and tell a story about how I almost burnt down a server room early in my career! Don’t miss this one!
On Aug 19th I presented this talk at the monthly Vegas Programmers Meetup. This is an excellent follow-up to the post “How to Get an SRE Role” as it goes in-depth on how to prepare for one of the most difficult parts of the process.
(Image Credit: This is Engineering)
Another podcast! This week I’m a guest on All Things Ops from CheckMK!
(I used CheckMK years ago as it provided an improved interface and plugin system over stock Nagios.)
Host Elias Voelker and I discussed:
One of my most...
I’m continuing my tour as a guest on tech podcasts! This time I’m on the Day Two Cloud podcast from Packet Pushers which focuses on the realities of cloud adoption.
I really enjoyed the conversation with hosts Ned Bellavance and Ethan Banks, who were both very insightful and funny!
Don’t miss this one as it was an action-packed discussion! Together, we covered:
Another podcast guest appearance! This time I’m on the Slight Reliability podcast, which answers “what is site reliability engineering (SRE) really about?”.
(I’m on the road this week! Next week we’ll return to our regularly scheduled articles.)
In this episode, host Stephen Townshend and I cover a lot of ground including making ops work visible, measuring toil, the power of calculating the monetary value of work, getting developers on-call, the embedded model for SRE, SLOs,...
This week I’m a guest on the Practical Operations podcast, which focuses on “systems, operations and scaling with a focus on real world use cases and solutions to common problems”.
We discuss my experience in DevOps transformations, running a Site Reliability Engineering team, and my experience as a consultant!
I highly recommend following this podcast as the hosts are very knowledgeable and are really entertaining to listen to!
...
On May 30th I presented this talk at the monthly Boston DevOps Meetup. It serves as an excellent introduction to the ideas and practices behind Site Reliability Engineering and provides food for thought when starting your own team. Enjoy!
(Image Credit: Kelvin Augustinus)