
System Call Tracing


I want to introduce one of the most powerful techniques in our arsenal when supporting production systems: system call tracing. But first: what is a system call?

Simply put, system calls are how programs interact with the operating system to request and manage resources like memory, files, network sockets, and hardware devices.

System call tracing allows you to observe the behavior of running processes and how they use those resources in real time.
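To make this concrete, here is a minimal sketch (not taken from the article) of a program whose work is almost entirely system calls. It uses Python's low-level `os` functions, which are thin wrappers over the corresponding OS calls. The function name `roundtrip` is a hypothetical helper chosen for illustration.

```python
import os
import tempfile


def roundtrip(payload: bytes) -> bytes:
    """Write payload to a temporary file and read it back.

    Each os.* call below asks the kernel to do something on our behalf,
    so each one shows up as one or more system calls in a trace.
    """
    fd, path = tempfile.mkstemp()  # create and open a new file
    try:
        try:
            os.write(fd, payload)  # request: write bytes to the file
        finally:
            os.close(fd)  # request: release the file descriptor
        fd = os.open(path, os.O_RDONLY)  # request: open the file again
        try:
            return os.read(fd, len(payload))  # request: read the bytes back
        finally:
            os.close(fd)
    finally:
        os.unlink(path)  # request: delete the file


if __name__ == "__main__":
    print(roundtrip(b"hello syscalls"))
```

On Linux, assuming `strace` is installed and the script is saved as `roundtrip.py` (a hypothetical filename), running `strace -e trace=openat,read,write,close,unlink python3 roundtrip.py` should show these requests as they happen, mixed in with the calls the interpreter makes during startup.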

Why is...


Incident Management: Monitoring

Best Practices

When running a production system, one of your main responsibilities is being able to respond when things go wrong. This is especially important for newly launched or rapidly changing systems, where incidents are guaranteed, usually due to defect leakage or performance and scaling challenges.

Incident readiness typically involves the following capabilities:

  • Monitoring (a computer is aware of your system’s health)
  • An escalation path (when monitoring doesn’t work)
  • Alerting (how to notify when something breaks)
  • An on-call rotation (who...

Hidden Benefits Of SLOs

Best Practices

There are many articles online about Service Level Objectives (SLOs), particularly on the value they provide to customers as part of a Service Level Agreement (SLA).

Let’s discuss some of the benefits of SLOs that aren’t apparent at first glance.

Before we do, let’s quickly review the terminology from the source:

  • SLI: a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
  • SLO: is a service...

Observability In A Box


I believe we’re entering a golden age of observability: we can gather metrics from our applications and infrastructure, better interpret them with query languages and pretty dashboards, and get notifications in chatrooms and our on-call systems. All of this technology is at our fingertips, without any software licensing fees!

The challenge I see with these new tools is that they tend to assume ‘cloud-native’ infrastructure: the happy path for setup and configuration usually requires a container...