
System observability

WIP

🤔 Testing vs Observability vs Monitoring.

Testing is about verifying the correctness of a system, to ensure that it meets the functional and non-functional specifications in a requirements document. Testing has historically been done before release, on non-production environments. This is slowly changing: testing against live traffic has become less of an anti-pattern and more of a “design for failure” good practice.

Monitoring is about defining, collecting and aggregating a set of metrics that describe how the system performs, for example metrics about system load (CPU, memory), response time, number of 4xx/5xx errors per unit of time, etc. These metrics can be looked at in near real-time or queried afterwards. Actions, such as alarms, can be triggered whenever metrics go above or below an acceptable threshold. When defining metrics, we already start with a good idea of what we’re interested in monitoring (known unknowns).

Observability is a superset of monitoring and is more about situations where we don’t know what to look for in our system (unknown unknowns). For example, we’re aware that our end users are experiencing an issue and we want to understand what’s happening in the system at that particular moment, and why.

My analogy is that monitoring is like general medicine (‘Is the patient healthy?’) while observability is like forensics (‘Why is the patient having a seizure?’). Another is logging vs debugging/tracing.

👀 My favourite “observability manifesto” is from Distributed Systems Observability by Cindy Sridharan:

“In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained and evolved in acknowledgement of the following facts:

  • No complex system is ever fully healthy.
  • Distributed systems are pathologically unpredictable.
  • It’s impossible to predict the myriad states of partial failure various parts of the system might end up in.
  • Failure needs to be embraced at every phase, from system design to implementation, testing, deployment and finally, operation.
  • Ease of debugging is a cornerstone for the maintenance and evolution of robust systems.”

Both observability and monitoring are useful to:

  1. Understand the health of a system: Is the system functioning well, within acceptable load?
  2. Support business decision-making: Do we understand enough about spike patterns to inform marketing/sales?
  3. Help engineers debug production systems, in order to diagnose, solve and prevent problems: Is our payment service double-charging customers?
  4. Provide auditing for security certifications and pen-testing: Are our systems vulnerable to attacks?

The more automation and observability we have, the less 🚨 panic and fear 🚨 when a problem occurs, and the more ♻️ sustainable the on-call rotations ♻️, because we’ll know:

  1. Where to find the data we need for investigation.
  2. Who has access, responsibility and ownership of that data.
  3. How to react to incidents and escalate when a complex issue could impact customers.

It’s best to keep the code and tools for observability and monitoring separate from the main codebase, to avoid introducing side-effects and 🪲🐛.

💡 Implementing Monitoring and Observability

There are different ways to implement monitoring and observability:

  1. White box monitoring: based on metrics and signals exposed by the internals of the system.
  2. Black box monitoring: probing externally visible behaviour, the way a user would (see the sketch after this list).
  3. Instrumentation: code added to the application itself to emit logs, metrics and traces.
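For example, here is a minimal sketch of a black-box probe in Python, checking the system from the outside the way a user would. The URL and latency budget are hypothetical placeholders, not real endpoints or agreed thresholds.

```python
import time
import urllib.error
import urllib.request

# Hypothetical endpoint and latency budget, purely for illustration.
HEALTH_URL = "https://example.com/health"
LATENCY_BUDGET_SECONDS = 0.5

def probe(url: str = HEALTH_URL) -> dict:
    """Black-box check: hit the system from the outside, as a user would."""
    start = time.monotonic()
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            status = response.status
    except urllib.error.HTTPError as exc:
        status = exc.code          # server answered, but with an error status
    except urllib.error.URLError:
        status = None              # unreachable from the outside
    latency = time.monotonic() - start
    return {
        "status": status,
        "latency_seconds": round(latency, 3),
        "healthy": status == 200 and latency < LATENCY_BUDGET_SECONDS,
    }

if __name__ == "__main__":
    print(probe())
```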

The three pillars of observability

  1. Logs: immutable, timestamped record of events in the system, in plaintext, binary (for replication) or structured (JSON).

    Pros:
    • Provide highly granular information within a single service
    • Easy to generate
    • Many tools to collect, aggregate, filter and visualise them, e.g. Logstash, Kibana
    Cons:
    • Not easy to track across a distributed system
    • Enriched events can leak sensitive data
    • Add overhead if there are too many, or if they are uncompressed or written synchronously
    • Order is not always guaranteed, so you can’t rely on them to debug in real-time
  2. Traces: representation of related events that encode an end-to-end request through a distributed system.

    • A trace provides visibility into both the path and structure of a request identified by a global ID.
    • It’s represented as a directed graph of spans and edges called references.
    • When the execution of a request reaches a certain span (or hop), an async record is emitted with metadata to a collector which can reconstruct the entire flow of execution.
    Pros:
    • Help understand the lifecycle of a request
    • Less heavy than logs because of sampling
    • Enable debugging across services
    • Part of the data plane of service meshes
    Cons:
    • Hard to retrofit into an existing system
    • Harder to integrate with external frameworks and libraries
  3. Metrics: numeric representation of data measured over intervals of time

    Pros:
    • Can help predict system behaviour
    • Reflect historical trends
    • Great for visualising in dashboards
    • Application scale doesn’t affect their overhead and storage
    • Aggregation, sampling and summarisation give a better picture of overall system health than logs
    • Actionable, e.g. with alerts
    Cons:
    • Harder to scope per request
    • Hard to correlate with logs, unless using a UID (a high-cardinality label), which affects database indices
    • Insufficient to understand the lifecycle of a request across multiple subsystems
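To tie the three pillars above together, here is a minimal Python sketch. The handle_request function, field names and bucket boundaries are illustrative assumptions, not a real telemetry library: it emits a structured JSON log enriched with a propagated trace ID and records the request latency into a coarse in-process histogram.

```python
import json
import logging
import time
import uuid
from collections import Counter

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("checkout")

# Metric: a tiny in-process latency histogram (bucket upper bounds in ms).
LATENCY_BUCKETS_MS = (10, 100, 1000)
latency_histogram = Counter()

def handle_request(payload: dict) -> None:
    # Trace: a request-scoped ID, propagated (or created here) so that spans
    # emitted by every service in the call chain can later be stitched together.
    trace_id = payload.get("trace_id") or uuid.uuid4().hex
    start = time.monotonic()

    # ... business logic would go here ...

    elapsed_ms = (time.monotonic() - start) * 1000

    # Log: an immutable, structured (JSON) event carrying the trace ID, so it
    # can be correlated with traces and metrics afterwards.
    logger.info(json.dumps({
        "event": "request_handled",
        "trace_id": trace_id,
        "duration_ms": round(elapsed_ms, 2),
    }))

    # Metric: only a bucket counter grows, so the overhead stays constant
    # regardless of traffic volume.
    bucket = next((f"le_{b}ms" for b in LATENCY_BUCKETS_MS if elapsed_ms <= b), "le_inf")
    latency_histogram[bucket] += 1

handle_request({"trace_id": None})
print(dict(latency_histogram))
```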

“The goal of observability is not to collect logs, metrics and traces. It’s to build a culture of engineering based on facts and feedback.” - Brian Knox, DigitalOcean

🛠 Observability and monitoring tools

This is, by no means, an exhaustive list, just some of the tools I know of.

Signals

There are many ways to define signals to be monitored and acted upon, like the ones based on Google’s SRE practices:

  1. Four Golden Signals, by Rob Ewaschuk: Latency, Errors, Traffic and Saturation.
  2. USE Method, by Brendan Gregg: Utilisation, Saturation and Errors.
  3. RED Method, by Tom Wilkie: Request rate, Error rate and Duration of requests.
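As an illustration of the RED method above, here is a sketch using the Python prometheus_client library. The metric names and the handle_order stand-in are assumptions made for the example.

```python
import random
import time

from prometheus_client import Counter, Histogram, start_http_server

# Rate and Errors: one counter, labelled by status code.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["status"])
# Duration: a histogram of request latencies.
DURATION = Histogram("http_request_duration_seconds", "Request duration in seconds",
                     buckets=(0.01, 0.1, 0.5, 1.0, 5.0))

def handle_order() -> str:
    """Stand-in for a real request handler; fails roughly 10% of the time."""
    time.sleep(random.uniform(0.01, 0.3))
    return "500" if random.random() < 0.1 else "200"

if __name__ == "__main__":
    start_http_server(8000)  # exposes /metrics for Prometheus to scrape
    while True:
        with DURATION.time():                  # Duration
            status = handle_order()
        REQUESTS.labels(status=status).inc()   # Rate + Errors (via the status label)
```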

Ensure your system is designed to be observable and testable “in a realistic manner”.

  • The system is designed in such a way that actionable failures can be discovered during testing.
  • The system can be deployed incrementally.
  • The system can be rolled back (and forward) if some key metrics deviate from the baseline (see the sketch after this list).
  • Post-release, the system reports enough health and behaviour data so that it can be understood, debugged and evolved.
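A toy sketch of the rollback idea: compare a canary’s error rate against a baseline and decide whether to keep rolling forward. The thresholds and the should_roll_back function are hypothetical; in practice the baseline would come from your monitoring system.

```python
# Hypothetical thresholds; in practice the baseline would be the error rate
# observed before the release, queried from your monitoring system.
BASELINE_ERROR_RATE = 0.01   # 1% of requests failed before the rollout
ALLOWED_DEVIATION = 2.0      # tolerate up to 2x the baseline

def should_roll_back(canary_error_rate: float,
                     baseline: float = BASELINE_ERROR_RATE,
                     allowed_deviation: float = ALLOWED_DEVIATION) -> bool:
    """Return True if the canary's key metric deviates too far from the baseline."""
    return canary_error_rate > baseline * allowed_deviation

print(should_roll_back(0.005))  # False: keep rolling forward
print(should_roll_back(0.05))   # True: roll back
```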

Ensure your system monitors the Four Golden Signals.

For more information, read Google’s SRE book.

  • Latency: the time it takes to serve (or fail to serve) a request
  • Traffic: how much demand the system handles, e.g. requests per second
  • Errors: e.g. HTTP 5xx, 4xx or 200 but with the wrong response
  • Saturation: how ‘full’ the system is, based on the most constrained resource, e.g. memory or disk throughput

Ensure your system is monitored “in the simplest way, but no simpler”.

  • Instead of averaging, monitor the tail of latencies by bucketing them into a histogram, e.g. requests below 10ms, below 100ms, etc. (see the sketch after this list).
  • Set the right time-frame granularity for each metric, e.g. per-second CPU measurements might be too granular and costly.
  • Avoid collecting non-actionable signals (noise) and regularly trim the ones that are rarely exercised.
  • Aim to reduce the variability of your latency, since you can’t (and shouldn’t) try to reduce latency everywhere.
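A small sketch of why the first point matters, using made-up numbers: the mean hides a slow tail that a bucketed histogram (and p99) makes visible.

```python
# Made-up sample: 98% of requests take 10ms, 2% take 2 seconds.
latencies_ms = [10] * 980 + [2000] * 20

mean = sum(latencies_ms) / len(latencies_ms)
p99 = sorted(latencies_ms)[int(0.99 * len(latencies_ms))]

# Bucket the same data the way a monitoring system would.
buckets = {"<=10ms": 0, "<=100ms": 0, "<=1000ms": 0, ">1000ms": 0}
for value in latencies_ms:
    if value <= 10:
        buckets["<=10ms"] += 1
    elif value <= 100:
        buckets["<=100ms"] += 1
    elif value <= 1000:
        buckets["<=1000ms"] += 1
    else:
        buckets[">1000ms"] += 1

print(f"mean = {mean:.1f}ms")  # ~49.8ms, looks acceptable
print(f"p99  = {p99}ms")       # 2000ms, the tail users actually feel
print(buckets)                 # {'<=10ms': 980, ..., '>1000ms': 20}
```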

Goals and metrics

The Three Principles of Customer Reliability Engineering (CRE)

  1. Reliability is the most important feature.
  2. Users, not monitoring, decide reliability.
  3. How far each layer of engineering can take reliability:

    • Well-engineered software -> 99.9%
    • Well-engineered operations -> 99.99%
    • Well-engineered business -> 99.999%

Let’s just pause here and let the Google SRE wisdom sink in:

“We believe that architecting a system carefully and following best practices for reliability, like running in multiple globally distributed regions, means that the system has the potential to achieve three nines over a long time horizon.

To reach four nines, it’s not enough to have talented developers and well-engineered software. You also need an operations team whose primary goal is to sustain system reliability through both reactive, like well-trained incident response and proactive engineering, things like removing bottlenecks, automating processes, and isolating failure domains.

Everyone assumes that they’ll need five nines at first. After all, reliability is the most important feature of a system. But reaching these levels of reliability actually requires sacrificing many other aspects of the system, like flexibility and release velocity.

Every component must be automated such that changes are rolled out gradually and failures are detected quickly and rolled back without human involvement.

Each additional nine makes your system 10 times more reliable than before. But as a rough rule of thumb, it also costs your business 10 times more. Engineering a system to have reliability as its top priority means making many hard choices with wide ranging consequences to the business, and in most cases, the cost-benefit analysis just doesn’t add up.”
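To put the “each nine costs 10x” point in perspective, here is a quick back-of-the-envelope calculation of the yearly downtime budget each level allows (approximate year length, figures for illustration only).

```python
# Approximate yearly downtime budget implied by each level of availability.
MINUTES_PER_YEAR = 365.25 * 24 * 60

for label, availability in (("three nines", 0.999),
                            ("four nines ", 0.9999),
                            ("five nines ", 0.99999)):
    downtime = MINUTES_PER_YEAR * (1 - availability)
    print(f"{label} ({availability:.3%}): ~{downtime:.0f} minutes of downtime per year")

# three nines (99.900%): ~526 minutes (about 8.8 hours) per year
# four nines  (99.990%): ~53 minutes per year
# five nines  (99.999%): ~5 minutes per year
```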

Goals and metrics come from empathy for the user

Every goal should be set with the user journey in mind and empathy for their feelings and behaviours. What do I mean by empathy? Become or imagine you’re one of your users.

They’re searching for something to buy on your website. Their excitement is at its peak: they’re giddy like on Christmas Day, waiting to see the results (the “rewards of the hunt”) so they can choose from your products. Therefore, it makes sense to prioritise reducing the latency and response time of the search and browsing pages.

Once they’ve found what they’re looking for, their excitement and impatience start to fade, and by the time they’re on the checkout page they’re actually slower and more cautious, so it might be acceptable, and even unnoticeable to them, if the checkout page is slightly slower than the search one.

This, of course, is just an example. You really need to understand your users.

Alarms and fatigue

Alerts are about systems and automation as much as they are about humans.

The more we unburden the humans at work, the more they can focus on writing exciting code and innovative solutions - which is what high performing teams do, as opposed to putting out infrastructure fires.

Logging and auditing

Disaster recovery