System observability
WIP
🤔 Testing vs Observability vs Monitoring
Testing is about verifying the correctness of a system, ensuring that it meets the functional and non-functional specifications in a requirements document. Testing has historically been done before release and on non-production environments. This trend is slowly changing: fiddling with live traffic has become less of an anti-pattern and more of a "design for failure" good practice.
Monitoring is about defining, collecting and aggregating a set of metrics that describe how the system performs, for example system load (CPU, memory), response time, or the number of 4xx/5xx errors per unit of time. These metrics can be looked at in near real-time or queried afterwards. Actions such as alarms can be triggered whenever a metric goes below or above an acceptable threshold. When defining metrics, we start with an already good idea of what we're interested in monitoring (known unknowns).
Observability is a superset of monitoring and is more about situations where we don't know what to look for in our system (unknown unknowns). For example, we're aware that our end users are experiencing an issue and we want to understand what's happening in the system at that particular moment, and why.
My analogy is that monitoring is like general medicine ("Is the patient healthy?") while observability is like forensics ("Why is the patient having a seizure?"). Another is logging vs debugging/tracing.
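To make the monitoring side concrete, here is a minimal sketch of threshold-based alerting on a known metric. It is illustrative only: `fetch_error_rate` and `notify_on_call` are hypothetical placeholders for a real metrics store and paging system.

```python
# A minimal, hypothetical sketch of alerting on a "known unknown":
# alarm when the error rate crosses an acceptable threshold.

ERROR_RATE_THRESHOLD = 0.05  # alert if more than 5% of requests fail in the window


def fetch_error_rate(window_seconds: int = 60) -> float:
    """Placeholder: query your metrics store for 5xx responses / total requests."""
    return 0.08  # stubbed value so the sketch runs


def notify_on_call(message: str) -> None:
    """Placeholder: page whoever is on call (email, chat, incident tool, etc.)."""
    print(f"ALERT: {message}")


def check_error_rate() -> None:
    rate = fetch_error_rate()
    if rate > ERROR_RATE_THRESHOLD:
        notify_on_call(f"error rate {rate:.1%} is above {ERROR_RATE_THRESHOLD:.0%}")


check_error_rate()  # a real setup would run this on a schedule or inside the metrics backend
```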
📚 My favourite "observability manifesto" is from Distributed Systems Observability by Cindy Sridharan:
"In its most complete sense, observability is a property of a system that has been designed, built, tested, deployed, operated, monitored, maintained and evolved in acknowledgement of the following facts:
- No complex system is ever fully healthy.
- Distributed systems are pathologically unpredictable.
- It's impossible to predict the myriad states of partial failure various parts of the system might end up in.
- Failure needs to be embraced at every phase, from system design to implementation, testing, deployment and finally, operation.
- Ease of debugging is a cornerstone for the maintenance and evolution of robust systems."
Both observability and monitoring are useful to:
- Understand the health of a system: Is the system functioning well, within acceptable load?
- Support business decision-making: Do we understand enough about spike patterns to inform marketing/sales?
- Help engineers debug production systems, in order to diagnose, solve and prevent problems: Is our payment service double-charging customers?
- Provide auditing for security certifications and pen-testing: Are our systems vulnerable to attacks?
The more automation and observability, the less 🚨 panic and fear 🚨 when a problem occurs and the more ♻️ sustainable on-call rotations ♻️, because we'll know:
- Where to find the data we need for investigation.
- Who has access, responsibility and ownership of that data.
- How to react to incidents and escalate when a complex issue could impact customers.
It's best to keep the code and tools for observability and monitoring separate from the main codebase, to avoid introducing side effects and 🪲🐛.
💡 Implementing Monitoring and Observability
There are different ways to implement monitoring and observability:
- White box monitoring
- Black box monitoring
- Instrumentation
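As a rough, hypothetical sketch of the difference: black-box monitoring probes the system from the outside, as a user would, while white-box monitoring relies on instrumentation the system exposes about its own internals. The URL and counter below are made-up examples.

```python
import urllib.request


# Black-box monitoring: probe the system from the outside, like a user would.
def black_box_probe(url: str = "https://example.com/health") -> bool:
    try:
        with urllib.request.urlopen(url, timeout=2) as response:
            return response.status == 200
    except OSError:
        return False


# White-box monitoring via instrumentation: the application reports on its own
# internals, e.g. a counter that a /metrics endpoint would later expose.
requests_served = 0


def handle_request() -> None:
    global requests_served
    requests_served += 1  # instrumentation woven into the application code
    # ... actual request handling ...
```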
The three pillars of observability
- Logs: immutable, timestamped records of events in the system, in plaintext, binary (for replication) or structured (JSON) form.

  | Pros | Cons |
  |------|------|
  | Provide highly granular information in a single service | Not easy to track across a distributed system |
  | Easy to generate | Enriched events can leak sensitive data |
  | Many tools to collect, aggregate, filter and visualise, e.g. Logstash, Kibana | Add overhead if too many, uncompressed or synchronous |
  | | Order is not always guaranteed, so we can't rely on them to debug in real time |

- Traces: a representation of related events that encode an end-to-end request through a distributed system.
  - A trace provides visibility into both the path and the structure of a request, identified by a global ID.
  - It's represented as a directed graph of spans and edges called references.
  - When the execution of a request reaches a certain span (or hop), an async record is emitted with metadata to a collector, which can reconstruct the entire flow of execution.

  | Pros | Cons |
  |------|------|
  | Understand the lifecycle of a request | Hard to retrofit into an existing system |
  | Less heavy than logs because of sampling | Harder to integrate with external frameworks and libraries |
  | Debug across services | |
  | Part of the data plane of service meshes | |

- Metrics: numeric representations of data measured over intervals of time.

  | Pros | Cons |
  |------|------|
  | Can help predict system behaviour | Harder to scope per request |
  | Reflect historical trends | Hard to correlate with logs, unless using a UID (a high-cardinality label), which affects database indices |
  | Great for visualising in dashboards | Insufficient to understand the lifecycle of a request across multiple subsystems |
  | Application scale doesn't affect their overhead and storage | |
  | Aggregation, sampling and summarisation give a better picture of overall system health than logs | |
  | Actionable, e.g. with alerts | |
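To ground the three pillars, here is a minimal sketch of what each one can look like in application code. It assumes the `opentelemetry-api` and `prometheus_client` Python packages; the event fields, span names and metric names are made up for illustration.

```python
import json
import logging
import time

from opentelemetry import trace
from prometheus_client import Counter, Histogram

# 1. Logs: a structured (JSON) event, granular and easy to generate, but single-service.
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger("checkout")
logger.info(json.dumps({
    "event": "payment_authorised",
    "order_id": "o-123",      # beware: enriched events can leak sensitive data
    "amount_cents": 4999,
    "ts": time.time(),
}))

# 2. Traces: a span within an end-to-end request, tied together by a global trace ID.
tracer = trace.get_tracer(__name__)
with tracer.start_as_current_span("charge_card") as span:
    span.set_attribute("payment.provider", "example")
    # ... call the payment provider here ...

# 3. Metrics: cheap, aggregable numbers measured over time.
REQUESTS = Counter("http_requests_total", "Total HTTP requests", ["path", "status"])
LATENCY = Histogram("http_request_duration_seconds", "Request latency in seconds")

REQUESTS.labels(path="/checkout", status="200").inc()
LATENCY.observe(0.042)
```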
"The goal of observability is not to collect logs, metrics and traces. It's to build a culture of engineering based on facts and feedback." - Brian Knox, DigitalOcean
🛠️ Observability and monitoring tools
This is, by no means, an exhaustive list, just some of the tools I know of.
Signals
There are many ways to define signals to be monitored and acted upon, like the ones based on Google's SRE practices:
- Four Golden Signals, by Rob Ewaschuk: Latency, Errors, Traffic and Saturation.
- USE Methodology, by Brendan Gregg: Utilisation, Saturation and Errors
- RED Method, by Tom Wilkie: Request rate, Error rate and Duration of request.
Ensure your system is designed to be observable and testable "in a realistic manner".
- The system is designed in a way that actionable failures can be discovered in testing.
- The system can be deployed incrementally.
- The system can be rolled back (and forward) if some key metrics deviate from the baseline.
- Post-release, the system reports enough health and behaviour data so that it can be understood, debugged and evolved.
Ensure your system monitors the Four Golden Signals.
For more information, read Google's SRE book.
- Latency: the time it takes to serve (or fail to serve) a request
- Traffic: how much demand the system handles, e.g. requests per second
- Errors: e.g. HTTP 5xx, 4xx or 200 but with the wrong response
- Saturation: how "full" the system is, based on the most constrained resource, e.g. memory or disk throughput
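A hedged sketch of what the Four Golden Signals might look like as instrumentation, again using the `prometheus_client` package; the metric names and the choice of saturation source are my own illustrative picks, not a standard.

```python
from prometheus_client import Counter, Gauge, Histogram, start_http_server

# Latency: time spent serving (or failing to serve) a request.
REQUEST_LATENCY = Histogram("request_duration_seconds", "Time spent serving a request")

# Traffic: demand on the system; requests per second is the rate of this counter.
REQUESTS_TOTAL = Counter("requests_total", "Requests handled", ["path"])

# Errors: failed requests, including "200 but wrong response" if you can detect it.
ERRORS_TOTAL = Counter("request_errors_total", "Failed requests", ["code"])

# Saturation: how "full" the most constrained resource is (here, a worker queue).
QUEUE_SATURATION = Gauge("worker_queue_fill_ratio", "Queue depth divided by capacity")

start_http_server(8000)  # expose /metrics for the monitoring system to scrape


def handle(path: str) -> None:
    REQUESTS_TOTAL.labels(path=path).inc()
    with REQUEST_LATENCY.time():
        try:
            ...  # real request handling would go here
        except Exception:
            ERRORS_TOTAL.labels(code="500").inc()
            raise
```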
Ensure your system is monitored "in the simplest way, but no simpler".
- Instead of averaging, monitor the tail of latencies by bucketing them into a histogram, e.g. requests below 10ms, below 100ms, etc.
- Set the right time frame granularity for each metric, e.g. per-second CPU measurements might be too granular and costly.
- Avoid collecting non-actionable signals (noise) and regularly trim the ones that are rarely exercised.
- Aim to decrease the variability of your latency, since you can't (and shouldn't) decrease the latency itself everywhere.
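A tiny, self-contained illustration of why averaging hides the tail and how bucketing exposes it; the latency values and bucket boundaries are arbitrary examples.

```python
latencies_ms = [8, 9, 10, 12, 11, 9, 10, 950]  # one slow outlier

average = sum(latencies_ms) / len(latencies_ms)
print(f"average: {average:.0f} ms")  # ~127 ms, a number that describes no real request

# Bucket the same data the way a monitoring histogram would.
boundaries = [10, 100, 1000]  # "below 10ms", "below 100ms", "below 1s"
buckets = {b: sum(1 for latency in latencies_ms if latency <= b) for b in boundaries}
print(buckets)  # {10: 5, 100: 7, 1000: 8} -> 7 of 8 requests under 100ms, tail above it
```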
Goals and metrics
The Three Principles of Customer Reliability Engineering (CRE)
- Reliability is the most important feature.
- Users, not monitoring, decide reliability.
- What reliability should aim for:
  - Software -> 99.9%
  - Operations -> 99.99%
  - Business -> 99.999%
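Each extra nine translates into a much smaller error budget. A quick back-of-the-envelope calculation (standard availability arithmetic, nothing specific to CRE):

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600

for target in (0.999, 0.9999, 0.99999):
    allowed_downtime = (1 - target) * MINUTES_PER_YEAR
    print(f"{target:.3%} -> {allowed_downtime:.1f} minutes of downtime per year")

# 99.900% -> 525.6 minutes (~8.8 hours)
# 99.990% -> 52.6 minutes
# 99.999% -> 5.3 minutes
```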
Let's just pause here and let the Google SRE wisdom sink in:
"We believe that architecting a system carefully and following best practices for reliability, like running in multiple globally distributed regions, means that the system has the potential to achieve three nines over a long time horizon.
To reach four nines, it's not enough to have talented developers and well-engineered software. You also need an operations team whose primary goal is to sustain system reliability through both reactive (like well-trained incident response) and proactive engineering (things like removing bottlenecks, automating processes, and isolating failure domains).
Everyone assumes that they'll need five nines at first. After all, reliability is the most important feature of a system. But reaching these levels of reliability actually requires sacrificing many other aspects of the system, like flexibility and release velocity.
Every component must be automated such that changes are rolled out gradually and failures are detected quickly and rolled back without human involvement.
Each additional nine makes your system 10 times more reliable than before. But as a rough rule of thumb, it also costs your business 10 times more. Engineering a system to have reliability as its top priority means making many hard choices with wide-ranging consequences to the business, and in most cases, the cost-benefit analysis just doesn't add up."
Goals and metrics come from empathy for the user
Every goal should be set with the user journey in mind and with empathy for the users' feelings and behaviours. What do I mean by empathy? Become, or imagine you're, one of your users.
They're searching for something to buy on your website. Their excitement is at its peak, they're giddy like on Christmas Day, waiting to see the results ("rewards of the hunt") so they can choose from your products. Therefore, it makes sense to prioritise reducing the latency and response time of the search and browsing pages.
Once they've found what they're looking for, their excitement and impatience start to fade, and by the time they're on the checkout page they're actually slower and more cautious, so it might be acceptable, and even unnoticeable to them, if the checkout page is slightly slower than the search one.
This, of course, is just an example. You really need to understand your own users.
Alarms and fatigue
Alerts are about systems and automation as much as they are about humans.
The more we unburden the humans at work, the more they can focus on writing exciting code and innovative solutions, which is what high-performing teams do, as opposed to putting out infrastructure fires.