Decide the desirable system attributes

After collecting and clarifying the requirements from the product and engineering team, list the desired system attributes. Consider the trade-offs for each.

One trade-off is whether we want a system that is scalable (high availability and throughput, low latency) or one that is strongly consistent and transactional.

Another trade-off is between maintainability and speed of iteration. This depends on organisational and software delivery performance, such as team culture and how tech debt is handled, not only on the choice of technology.

1. Reliability

The system keeps functioning correctly, with little or no downtime, despite failures and malicious attacks. It does so by self-healing after network errors, infrastructure problems and load spikes.

2. Scalability

The system handles increased load without degraded performance. It does so by automatically scaling out for peak demands and scaling in when demand decreases, in all necessary components like infrastructure, application services and data stores.
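
As a rough illustration of scaling out and in, the sketch below computes a replica count from the current load. The function name, its parameters and the default limits are hypothetical; real autoscalers usually add smoothing and cooldown periods that are left out here.

```python
import math


def desired_replicas(load_rps, capacity_per_replica_rps, min_replicas=1, max_replicas=20):
    """Return how many replicas are needed to serve `load_rps` requests per second.

    All names and defaults are hypothetical, chosen only for illustration.
    """
    if capacity_per_replica_rps <= 0:
        raise ValueError("capacity_per_replica_rps must be positive")
    # Round up so the last partially loaded replica is still provisioned.
    needed = math.ceil(load_rps / capacity_per_replica_rps)
    # Clamp so the system never drops below the floor or exceeds the budget.
    return max(min_replicas, min(max_replicas, needed))
```

Setting min_replicas to 0 would model scale-to-zero, at the cost of a cold start when traffic returns.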

3. Maintainability

The system is simple to operate and evolve over time without major code refactorings. It does so with a coherent design, reusable components, and simplified monitoring, deployment and administration.

4. Efficiency

The system makes use of available resources and optimises cost. It does so by handling scaling in and out according to demand, predicting usage patterns and achieving economies of scale with techniques like reserved infrastructure and scale-to-zero.

5. Security

The system guarantees confidentiality, integrity and protection against malicious attacks. It does so by detecting and remediating vulnerabilities quickly.

6. Observability

The internal state of the system can be inferred from its external outputs. It does so by providing aggregated metrics, traces, logs and alerts to ease troubleshooting, and by allowing real-life failures to be replicated and simulated.

7. Consistency (UX)

The system offers a consistent user experience on multiple platforms. When users switch devices, they can continue where they left off. It does so by maintaining the user state with the appropriate mechanisms like caching and real-time communication with client devices.

8. Idempotency

The system handles repeated requests without unwanted side-effects. For example, if a payment request times out, it can be retried without charging the customer twice. It does so by using idempotency keys for state-changing requests and by retrying with techniques like exponential backoff and random jitter.
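
A minimal sketch of both ideas follows, assuming an in-memory service. PaymentService, retry_with_backoff and the key "order-42" are hypothetical names; a real system would persist the keys and expire them.

```python
import random
import time


class PaymentService:
    """Hypothetical in-memory service that deduplicates by idempotency key."""

    def __init__(self):
        self.results = {}  # idempotency key -> stored response
        self.charges = []  # every charge that was actually applied

    def charge(self, idempotency_key, amount):
        # A repeated key returns the stored response instead of charging again.
        if idempotency_key in self.results:
            return self.results[idempotency_key]
        self.charges.append(amount)
        response = {"status": "charged", "amount": amount}
        self.results[idempotency_key] = response
        return response


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=2.0, sleep=time.sleep):
    """Retry `operation` on ConnectionError with exponential backoff and jitter."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts - 1:
                raise  # out of attempts: surface the error to the caller
            # The delay ceiling doubles per attempt; the random draw is the jitter.
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))


# Simulate a response that is lost after the charge was already applied.
service = PaymentService()
calls = {"count": 0}


def flaky_charge():
    calls["count"] += 1
    response = service.charge("order-42", 100)
    if calls["count"] == 1:
        raise ConnectionError("response lost on the way back")
    return response


result = retry_with_backoff(flaky_charge, sleep=lambda seconds: None)
```

The retry reaches the service twice, but because both calls carry the same key only one charge is recorded.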

9. Transactionality

The system must handle Online Transaction Processing (OLTP).

A transaction is a logical unit made of multiple read and write operations.

Each transaction must end in one of two states:

Committed to the database in case it succeeds
Rolled back in case of failure
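
The two outcomes can be sketched with Python's built-in sqlite3 module. The accounts table, the open_bank and transfer helpers and the balances are made up for illustration.

```python
import sqlite3


def open_bank():
    """Create a hypothetical in-memory bank with two accounts."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE accounts ("
        "name TEXT PRIMARY KEY, "
        "balance INTEGER NOT NULL CHECK (balance >= 0))"
    )
    conn.execute("INSERT INTO accounts VALUES ('alice', 100), ('bob', 50)")
    conn.commit()
    return conn


def transfer(conn, sender, receiver, amount):
    """Move `amount` between accounts as one transaction; return True on commit."""
    try:
        with conn:  # commits if the block succeeds, rolls back if it raises
            conn.execute(
                "UPDATE accounts SET balance = balance + ? WHERE name = ?",
                (amount, receiver),
            )
            # Violates the CHECK constraint if the sender cannot afford it.
            conn.execute(
                "UPDATE accounts SET balance = balance - ? WHERE name = ?",
                (amount, sender),
            )
        return True
    except sqlite3.IntegrityError:
        return False


def balances(conn):
    """Return the current balances as a dictionary."""
    return dict(conn.execute("SELECT name, balance FROM accounts"))
```

In this sketch, a transfer that the sender cannot afford fails on the second statement, and the rollback also undoes the credit already applied to the receiver.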

OLTP vs OLAP

OLTP systems offer strong ACID guarantees.

Atomicity: all operations in a transaction succeed, or all of them are rolled back
Consistency: after each transaction the system is structurally sound
Isolation: transactions don’t interfere with one another and appear to run sequentially
Durability: committed transactions are permanent even after system failures

By contrast, many distributed data stores, including some used for Online Analytical Processing (OLAP), offer the weaker BASE guarantees.

Basically Available: the data stores respond to most requests, even during partial failures
Soft State: replicas are not guaranteed to return the same value for a read at a given moment
Eventual Consistency: at some point the data stores become consistent

Consistency vs scalability

ACID vs BASE is mostly about consistency vs scalability.

Consistency describes whether the data on all nodes is the same at any point in time.

Strong consistency guarantees that when we read data after we successfully complete a write we see the value just written.
Weak consistency means that when the data is replicated on multiple nodes, depending on which node we read from, we might not see that value immediately.
Eventual consistency is a type of weak consistency which guarantees that if we wait a while, eventually the data on the nodes will converge to be the same.
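
The three behaviours can be illustrated with a toy store that has one primary and asynchronously updated replicas. ReplicatedStore and its methods are hypothetical, and replication is triggered by hand so the lag is visible.

```python
class ReplicatedStore:
    """Toy key-value store: one primary plus lagging replicas (illustrative only)."""

    def __init__(self, replica_count=2):
        self.primary = {}
        self.replicas = [{} for _ in range(replica_count)]
        self.pending = []  # writes not yet applied to the replicas

    def write(self, key, value):
        # The write is acknowledged once the primary has it.
        self.primary[key] = value
        self.pending.append((key, value))

    def read_strong(self, key):
        # Reading from the primary always reflects the latest acknowledged write.
        return self.primary.get(key)

    def read_weak(self, key, replica=0):
        # A replica may lag behind the primary and return a stale value.
        return self.replicas[replica].get(key)

    def replicate(self):
        # Apply queued writes in order; afterwards all nodes have converged.
        for key, value in self.pending:
            for replica in self.replicas:
                replica[key] = value
        self.pending.clear()
```

Right after a write, read_strong returns the new value while read_weak may still return the old one. Once replicate has run, both agree, which is the convergence that eventual consistency promises.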

We cannot guarantee both strong consistency and full availability in a distributed system because of the CAP theorem.

The CAP Theorem

A distributed system can provide only two out of three guarantees:

Consistency: every read receives the most recent write or an error
Availability: every request receives a response
Partition tolerance: the system operates even if messages are dropped between its nodes
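
Since network partitions cannot be ruled out in practice, the choice usually comes down to consistency versus availability while a partition lasts. The toy two-node cluster below sketches that choice. The Cluster class and its "CP" and "AP" modes are hypothetical simplifications, not a real replication protocol.

```python
class Cluster:
    """Toy two-node cluster that favours consistency ("CP") or availability ("AP")."""

    def __init__(self, mode):
        if mode not in ("CP", "AP"):
            raise ValueError("mode must be 'CP' or 'AP'")
        self.mode = mode
        self.nodes = [{}, {}]
        self.partitioned = False  # True while the nodes cannot exchange messages

    def write(self, node, key, value):
        if not self.partitioned:
            # Healthy network: the write is replicated to every node.
            for replica in self.nodes:
                replica[key] = value
        elif self.mode == "AP":
            # Stay available: accept the write locally, so the nodes diverge.
            self.nodes[node][key] = value
        else:
            raise ConnectionError("CP: write refused during partition")

    def read(self, node, key):
        if self.partitioned and self.mode == "CP":
            # The node cannot prove its copy is the most recent, so it errors.
            raise ConnectionError("CP: read refused during partition")
        return self.nodes[node].get(key)
```

In "AP" mode every request gets a response, but a read from the other node can return a stale value. In "CP" mode requests fail until the partition heals, so no stale value is ever returned.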