Parallel Distributed Shell

Tools

When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also rely on observability platforms to gather metrics from these resources automatically.

During a production incident, however, both of these essential mechanisms can be rendered ineffective. How can an engineer then troubleshoot the problem and roll out a fix across a large number of hosts?

Enter the parallel distributed shell.

A parallel...

System Call Tracing

Tools

I want to introduce one of the most powerful techniques in our arsenal when supporting production systems: system call tracing. But first: what is a system call?

Simply put, system calls are how programs interact with the operating system to request and manage resources like memory, files, network sockets, and hardware devices.

System call tracing allows you to observe the behavior of running processes and how they use those resources in real time.
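
To make the idea concrete, here is a minimal Python sketch (not taken from the article) in which each step is a thin wrapper around an operating system call. The function name `roundtrip` and the payload are hypothetical and chosen only for illustration.

```python
import os
import tempfile


def roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file and read it back via os-level calls.

    Hypothetical helper for illustration: every step below asks the
    operating system to do the work on the program's behalf.
    """
    fd, path = tempfile.mkstemp()         # create and open a file, get a file descriptor
    try:
        os.write(fd, payload)             # ask the OS to write bytes to the descriptor
        os.lseek(fd, 0, os.SEEK_SET)      # move the file offset back to the start
        data = os.read(fd, len(payload))  # ask the OS to read the bytes back
    finally:
        os.close(fd)                      # release the descriptor
        os.unlink(path)                   # remove the temporary file
    return data


if __name__ == "__main__":
    print(roundtrip(b"hello, kernel"))
```

On Linux, running a script like this under a tracer such as `strace` should show the corresponding write, lseek, read, close, and unlink calls in among the interpreter's own startup activity.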

Why is...