monitoring

Parallel Distributed Shell

Tools

Jun

18:00

When operating large-scale production systems, we rely on infrastructure-as-code (IaC) to keep the state of servers and cloud resources consistent over time. We also benefit from observability platforms to automatically gather metrics from these resources.

However, there are times when both of these essential mechanisms are rendered ineffective during a production incident. How can an engineer troubleshoot and implement a fix successfully across a large number of hosts?

Enter the parallel distributed shell.

A parallel...
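The excerpt stops before describing the tool itself, so what follows is only a minimal sketch of the general idea: run one command on many hosts at once and collect the results per host. It is not the tool the post goes on to cover. The function names, the SSH options chosen, and the default worker count and timeout are all assumptions for illustration, and the sketch presumes key-based SSH access to every host.

```python
import subprocess
from concurrent.futures import ThreadPoolExecutor


def ssh_argv(host, command):
    """Build the argv that runs `command` on `host` over SSH (assumes key-based auth)."""
    return ["ssh", "-o", "BatchMode=yes", "-o", "ConnectTimeout=5", host, command]


def run_on_hosts(hosts, command, build_argv=ssh_argv, max_workers=32, timeout=30):
    """Run `command` on every host concurrently; return {host: (exit_code, output)}."""

    def run_one(host):
        try:
            proc = subprocess.run(
                build_argv(host, command),
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            return host, (proc.returncode, (proc.stdout + proc.stderr).strip())
        except (subprocess.TimeoutExpired, OSError) as exc:
            # A slow or unreachable host must not block the rest of the fleet.
            return host, (None, f"failed: {exc}")

    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        return dict(pool.map(run_one, hosts))
```

Calling `run_on_hosts(["web1", "web2"], "uptime")` (hypothetical host names) would return one entry per host, with an exit code of `None` for hosts that timed out or could not be reached. The `build_argv` parameter exists so the transport can be swapped out, for example to test the fan-out logic locally without SSH.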

Incident Management: Monitoring

Best Practices

Apr

09:00

When running a production system, one of the main responsibilities is being able to respond when things go wrong. This is especially important for newly launched or rapidly changing systems where incidents are guaranteed, usually due to defect leakage or performance/scaling challenges.

Incident readiness typically involves the following capabilities:

Monitoring (a computer is aware of your system’s health)
An escalation path (when monitoring doesn’t work)
Alerting (how to notify when something breaks)
An on-call rotation (who...

Hidden Benefits Of SLOs

Best Practices

Feb

15:00

There are many articles online about Service Level Objectives (SLOs), particularly on the value they provide to customers as part of a Service Level Agreement (SLA).

Let’s discuss some of the benefits of SLOs that aren’t apparent at first glance.

Before we do, let’s quickly review the terminology from the source:

SLI: a service level indicator—a carefully defined quantitative measure of some aspect of the level of service that is provided.
SLO: is a service...
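To make the SLI definition above concrete, here is a small sketch of one common kind of indicator, the fraction of events that were good, checked against a target value. The function names, the choice of an availability-style ratio, and the handling of zero traffic are assumptions for illustration and do not come from the post.

```python
def availability_sli(good_events, total_events):
    """SLI: fraction of events that were good (for example, successful requests)."""
    if total_events == 0:
        # Assumption: a window with no traffic counts as fully healthy.
        return 1.0
    return good_events / total_events


def meets_slo(sli, target):
    """True when the measured indicator is at or above the objective's target."""
    return sli >= target
```

For example, 9,990 good requests out of 10,000 gives an SLI of 0.999, which meets a 0.999 target, while 9,980 out of 10,000 does not.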

Observability In A Box

Tools

14

Feb

09:00

I believe we’re entering a golden age of observability: we can gather metrics from our applications and infrastructure, better interpret them with query languages and pretty dashboards, and get notifications in chatrooms and our on-call systems. All of this technology is at our fingertips, without any software licensing fees!

The challenge I see with these new tools is that they tend to assume ‘cloud-native’ infrastructure: the happy path for setup and configuration usually requires a container...

CERTO MODO
