Decide the desirable system attributes
After collecting and clarifying the requirements from the product and engineering team, list the desired system attributes. Consider the trade-offs for each.
One trade-off is whether to design for scalability (high availability and throughput, low latency) or for strong consistency and transactional guarantees.
Another trade-off is between maintainability and speed of iteration. This depends on organisational and software delivery performance, e.g. team culture and how tech debt is handled, not only on the choice of technology.
1. Reliability
The system functions correctly in case of failures and malicious attacks with very little or no downtime. It does so by self-healing in case of network errors, infrastructure problems, malicious attacks and system load.
2. Scalability
The system handles increased load without degraded performance. It does so by automatically scaling out for peak demands and scaling in when demand decreases, in all necessary components like infrastructure, application services and data stores.
3. Maintainability
The system is simple to operate and evolve over time without major code refactorings. It does so with a coherent design, reusability of components, simplified monitoring, deployment and administration.
4. Efficiency
The system makes use of available resources and optimises cost. It does so by handling scaling in and out according to demand, predicting usage patterns and achieving economies of scale with techniques like reserved infrastructure and scale-to-zero.
5. Security
The system guarantees confidentiality, integrity and protection against malicious attacks. It does so by being vigilant to quickly detect and remediate vulnerabilities.
6. Observability
The internal state of the system can be inferred from knowledge of its external outputs. It does so by providing metrics, traces, logs and alerts in an aggregated way to ease troubleshooting, and by allowing real-life failures to be replicated and simulated.
7. Consistency (UX)
The system offers a consistent user experience on multiple platforms. When users switch devices, they can continue where they left off. It does so by maintaining the user state with the appropriate mechanisms like caching and real-time communication with client devices.
8. Idempotency
The system handles repeat requests without unwanted side-effects. For example, if a payment request times out, it can be retried without charging the customer twice. It does so by using idempotency keys for state-changing requests and using appropriate techniques like exponential backoff and random jitter for handling retries.
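A minimal sketch of idempotency keys combined with retries, assuming an in-memory service. `PaymentService`, `FlakyClient` and `retry_with_backoff` are hypothetical names used only for illustration, and the backoff variant shown (a random delay between zero and an exponentially growing ceiling) is one common choice rather than the only one.

```python
import random
import time


class PaymentService:
    """Hypothetical service that deduplicates charges by idempotency key."""

    def __init__(self, balance):
        self.balance = balance
        self._results = {}  # idempotency key -> result of the first attempt

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored result and has no further side effect.
        if idempotency_key in self._results:
            return self._results[idempotency_key]
        self.balance -= amount
        result = {"status": "charged", "amount": amount}
        self._results[idempotency_key] = result
        return result


class FlakyClient:
    """Simulates a network that loses the first `drop` responses
    after the server has already processed the request."""

    def __init__(self, service, drop):
        self.service = service
        self.drop = drop

    def charge(self, key, amount):
        result = self.service.charge(key, amount)
        if self.drop > 0:
            self.drop -= 1
            raise TimeoutError("response lost")
        return result


def retry_with_backoff(call, max_attempts=5, base=0.1, cap=2.0, sleep=time.sleep):
    """Retry `call` on TimeoutError with exponential backoff and random jitter."""
    for attempt in range(max_attempts):
        try:
            return call()
        except TimeoutError:
            if attempt == max_attempts - 1:
                raise
            # Random delay in [0, min(cap, base * 2**attempt)].
            sleep(random.uniform(0, min(cap, base * 2 ** attempt)))


service = PaymentService(balance=100)
client = FlakyClient(service, drop=2)
# The same key is reused on every retry; sleeping is skipped for the demo.
retry_with_backoff(lambda: client.charge("order-42", 30), sleep=lambda s: None)
print(service.balance)  # prints 70: charged once despite three attempts
```

The key point is that the client reuses the same key for every retry of the same logical request, so the server can recognise the duplicates.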
9. Transactionality
The system must handle Online Transaction Processing (OLTP).
A transaction is a logical unit made of multiple read and write operations.
Each transaction must end in one of two states:
- Committed to the database in case it succeeds
- Rolled back in case of failure
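The two outcomes can be sketched with Python's built-in `sqlite3` module, whose connection object commits on success and rolls back on an exception when used as a context manager. The `accounts` table and the `open_bank`, `transfer` and `balances` helpers are assumptions made for this example.

```python
import sqlite3


def open_bank():
    """Create an in-memory database with two hypothetical accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts (name TEXT PRIMARY KEY, balance INTEGER NOT NULL)"
    )
    with conn:
        conn.executemany(
            "INSERT INTO accounts VALUES (?, ?)", [("alice", 100), ("bob", 50)]
        )
    return conn


def transfer(conn, src, dst, amount):
    """Move `amount` from `src` to `dst` as a single transaction."""
    # `with conn` commits if the block completes and rolls back if it raises.
    with conn:
        conn.execute(
            "UPDATE accounts SET balance = balance - ? WHERE name = ?", (amount, src)
        )
        conn.execute(
            "UPDATE accounts SET balance = balance + ? WHERE name = ?", (amount, dst)
        )
        (balance,) = conn.execute(
            "SELECT balance FROM accounts WHERE name = ?", (src,)
        ).fetchone()
        if balance < 0:
            # Both updates above are undone by the rollback.
            raise ValueError("insufficient funds")


def balances(conn):
    return dict(conn.execute("SELECT name, balance FROM accounts ORDER BY name"))


conn = open_bank()
transfer(conn, "alice", "bob", 30)       # committed
try:
    transfer(conn, "alice", "bob", 500)  # rolled back
except ValueError:
    pass
print(balances(conn))  # prints {'alice': 70, 'bob': 80}
```

The failed transfer leaves no trace: the debit and the credit either both take effect or neither does.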
OLTP vs OLAP
OLTP systems offer strong ACID guarantees.
- Atomicity: all operations in a transaction succeed or all are rolled back
- Consistency: after each transaction the system is structurally sound
- Isolation: transactions don’t interfere with one another and appear to run sequentially
- Durability: committed transactions are permanent even after system failures
By contrast, most Online Analytical Processing (OLAP) systems offer BASE guarantees.
- Basically Available: the databases work most of the time
- Soft State: data stores don’t offer reading consistency across replicas
- Eventual Consistency: at some point the data stores become consistent
Consistency vs scalability
ACID vs BASE is mostly about consistency vs scalability.
Consistency describes whether the data on all nodes is the same at any point in time.
- Strong consistency guarantees that when we read data after we successfully complete a write we see the value just written.
- Weak consistency means that when the data is replicated on multiple nodes, depending which node we read from, we might not see that value immediately.
- Eventual consistency is a type of weak consistency which guarantees that if we wait a while, eventually the data on the nodes will converge to be the same.
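The difference between these levels can be illustrated with a toy primary/replica store. This is a sketch under simplifying assumptions: `ReplicatedStore` is a hypothetical class, replication is triggered by hand instead of running in the background, and a strong read is modelled as a read from the primary.

```python
class ReplicatedStore:
    """Toy store with one primary and one asynchronously updated replica."""

    def __init__(self):
        self.primary = {}
        self.replica = {}
        self._pending = []  # writes not yet applied on the replica

    def write(self, key, value):
        # The write is acknowledged as soon as the primary has it.
        self.primary[key] = value
        self._pending.append((key, value))

    def read_strong(self, key):
        # Reading from the primary always reflects the latest acknowledged write.
        return self.primary.get(key)

    def read_weak(self, key):
        # Reading from the replica may return stale data.
        return self.replica.get(key)

    def replicate(self):
        # Apply pending writes in order; afterwards both nodes have converged.
        for key, value in self._pending:
            self.replica[key] = value
        self._pending.clear()


store = ReplicatedStore()
store.write("x", 1)
print(store.read_strong("x"))  # prints 1
print(store.read_weak("x"))    # prints None: the replica has not caught up yet
store.replicate()
print(store.read_weak("x"))    # prints 1: the nodes have converged
```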
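A distributed system cannot guarantee both strong consistency and availability while a network partition is in progress. This is a consequence of the CAP theorem.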
The CAP Theorem
A distributed system can provide at most two of the following three guarantees at the same time:
- Consistency: every read receives the most recent write or an error
- Availability: every request receives a response
- Partition tolerance: the system operates even if messages are dropped between its nodes
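As a rough illustration of the choice a system faces during a partition, the sketch below models a two-node cluster that either refuses writes (favouring consistency) or accepts them locally (favouring availability). `TwoNodeCluster`, `Unavailable` and the `"CP"`/`"AP"` mode labels are assumptions for this example; real systems use quorums and conflict resolution that are not modelled here.

```python
class Unavailable(Exception):
    """Raised when a node refuses a request to protect consistency."""


class TwoNodeCluster:
    """Toy cluster of two replicas that can be cut off from each other."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.data = [{}, {}]      # one key-value map per node
        self.partitioned = False  # True when the nodes cannot talk to each other

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: the write reaches both replicas.
            for replica in self.data:
                replica[key] = value
        elif self.mode == "CP":
            # Consistency first: reject the write instead of letting replicas diverge.
            raise Unavailable("cannot reach peer; write refused")
        else:
            # Availability first: accept the write locally; replicas now differ.
            self.data[node][key] = value

    def read(self, node, key):
        return self.data[node].get(key)


ap = TwoNodeCluster("AP")
ap.write(0, "x", 1)
ap.partitioned = True
ap.write(0, "x", 2)                    # accepted by node 0 only
print(ap.read(0, "x"), ap.read(1, "x"))  # prints 2 1: available but inconsistent
```

In `"CP"` mode the same second write raises `Unavailable`, so both nodes keep returning the last value they agreed on.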