Parallel Distributed Shell

Tools

When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also benefit from observability platforms to automatically gather metrics from these resources.

However, there are times when both of these essential mechanisms are rendered ineffective during a production incident. How can an engineer troubleshoot and implement a fix successfully across a large number of hosts?

Enter the parallel distributed shell.

A parallel...

Incident Management: Monitoring

Best Practices

When running a production system, one of the main responsibilities is being able to respond when things go wrong. This is especially important for newly launched or rapidly changing systems, where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

  • Monitoring (a computer is aware of your system’s health)
  • An escalation path (when monitoring doesn’t work)
  • Alerting (how to notify when something breaks)
  • An on-call rotation (who...

Hidden Benefits Of SLOs

Best Practices

There are many articles online about Service Level Objectives (SLOs), particularly on the value they provide to customers as part of a Service Level Agreement (SLA).

Let’s discuss some of the benefits of SLOs that aren’t apparent at first glance.

Before we do, let’s quickly review the terminology from the source:

  • SLI: a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
  • SLO: is a service...

Observability In A Box

Tools

I believe we’re entering a golden age of observability: we can gather metrics from our applications and infrastructure, better interpret them with query languages and pretty dashboards, and get notifications in chatrooms and our on-call systems. All of this technology is at our fingertips, without any software licensing fees!

The challenge I see with these new tools is that they tend to assume ‘cloud-native’ infrastructure: the happy path for setup and configuration usually requires a container...