Application services

Define the type of client-server communication.

AJAX Polling, HTTP Long-Polling, Server-Sent Events, WebSockets

The first generation of AJAX apps worked on the basic idea that the client repeatedly asks (or polls) the server for data, then waits until a response is returned. The problems are the HTTP overhead of empty responses and the difficulty of finding the right polling interval.
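As a rough illustration of the polling trade-off, here is a minimal sketch. The `poll` helper, its parameters and the fake server are assumptions made for this example; a real client would issue an HTTP request where `fetch` is called.

```python
import time
from typing import Callable, Optional, Tuple


def poll(fetch: Callable[[], Optional[str]],
         interval_s: float,
         max_attempts: int,
         sleep: Callable[[float], None] = time.sleep) -> Tuple[Optional[str], int]:
    """Ask `fetch` repeatedly until it returns data or attempts run out.

    Returns (data, empty_responses). empty_responses counts the round
    trips that carried no data, i.e. the wasted HTTP overhead.
    """
    empty_responses = 0
    for _ in range(max_attempts):
        data = fetch()  # stands in for one full request/response round trip
        if data is not None:
            return data, empty_responses
        empty_responses += 1
        # Too short an interval wastes requests, too long adds latency.
        sleep(interval_s)
    return None, empty_responses


# Hypothetical server that only has data on the fourth request.
responses = iter([None, None, None, "order shipped"])
result, wasted = poll(lambda: next(responses), interval_s=0.0, max_attempts=10)
print(result, wasted)  # order shipped 3
```

Three of the four round trips returned nothing, which is exactly the overhead that the later techniques try to remove.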

The next improvement is HTTP Long-Polling, where the client requests data just like above and the server keeps the connection ‘hanging’ until data becomes available. The problem is that the connection is eventually closed by a timeout, so the client must reconnect.
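The reconnect loop can be sketched as follows. This is a simulation under stated assumptions: a `queue.Queue` stands in for the server, the blocking `get` stands in for the held HTTP request, and `long_poll` is a hypothetical helper name.

```python
import queue
from typing import List


def long_poll(inbox: "queue.Queue[str]", hold_s: float, max_reconnects: int) -> List[str]:
    """Wait for one message, reconnecting whenever the held request times out."""
    log: List[str] = []
    for _ in range(max_reconnects):
        try:
            # Stand-in for a request the server keeps open for up to hold_s.
            message = inbox.get(timeout=hold_s)
        except queue.Empty:
            # The connection timed out without data: the client must reconnect.
            log.append("timeout, reconnecting")
            continue
        log.append(f"received: {message}")
        return log
    log.append("gave up")
    return log


inbox: "queue.Queue[str]" = queue.Queue()
print(long_poll(inbox, hold_s=0.05, max_reconnects=2))  # nothing arrives
inbox.put("price update")
print(long_poll(inbox, hold_s=0.05, max_reconnects=2))  # data is already waiting
```

Compared with plain polling, an empty response now only happens once per timeout window rather than once per interval.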

Server-Sent Events are a persistent uni-directional connection from a server to a client.
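To make the one-way stream more concrete, here is a sketch of a parser for a simplified subset of the `text/event-stream` format (only the `event` and `data` fields, plus comment lines). The function name `parse_sse` and the sample stream are assumptions for illustration; in a browser the built-in `EventSource` does this work.

```python
from typing import Dict, Iterable, Iterator, List


def parse_sse(lines: Iterable[str]) -> Iterator[Dict[str, str]]:
    """Yield one dict per event; a blank line marks the end of an event."""
    event_type = "message"  # default event type when no "event:" field is sent
    data: List[str] = []
    for raw in lines:
        line = raw.rstrip("\r\n")
        if line == "":
            if data:  # events without data are not dispatched
                yield {"event": event_type, "data": "\n".join(data)}
            event_type, data = "message", []
            continue
        if line.startswith(":"):
            continue  # comment line, often used as a keep-alive
        field, _, value = line.partition(":")
        if value.startswith(" "):
            value = value[1:]  # a single leading space is not part of the value
        if field == "event":
            event_type = value
        elif field == "data":
            data.append(value)


stream = [
    "event: price\n", "data: 101.5\n", "\n",
    ": keep-alive\n",
    "data: hello\n", "data: world\n", "\n",
]
for event in parse_sse(stream):
    print(event)
```

The client never writes to this stream; anything it wants to send still goes through ordinary HTTP requests.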

WebSockets provide a persistent, bi-directional, full-duplex connection between a client and a server over a single TCP connection. This removes the per-message HTTP overhead (SSL/TLS handshake, content negotiation, lots of headers).

Decide between a monolith or a microservice architecture.

Conway’s Law:

Any organisation that designs a system will inevitably produce a design whose structure is a copy of the organisation’s communication structure.

Would you fight 100 duck-sized horses or one horse-sized duck?

All systems evolve from simple to complex, as business needs grow. Most products start with simple use cases and small teams where speed of iteration is more important than designing a scalable architecture.

As the system complexity increases, opting for a modular architecture has obvious benefits:

Component decoupling
Increased security
Increased resilience to defects
Easier to experiment and evolve
Faster versioning
Faster build and deploy times
Fewer conflicts

This article explains each of the benefits above as well as the disadvantages.

Independent Systems Architecture

The Independent Systems Architecture (ISA) is a collection of best practices for microservices and self-contained systems, as well as their challenges. In a nutshell, it advocates two levels of architecture decisions:

Macro Architecture: decisions covering all services, e.g. communication, authentication, CI/CD
Micro Architecture: individual decisions at service level, e.g. runtime environments

Using ISA’s principles, organisations can get the best of both worlds:

Scale the knowledge and the talent pool
Increase the number of components and complexity of systems
Handle different use cases appropriately, e.g. web vs data science vs high-concurrency
Prevent scaling chaos and entropy

Lambda vs Kappa architectures

This is really the nitty-gritty of data store architecture.

A Lambda architecture handles both batch and streaming. When new data is ingested into a system, it goes through three layers:

The batch layer operates with append-only raw data and pre-computes batch views
The serving layer then indexes these batch views and opens them for querying
The speed layer works in parallel to answer queries directly for speed and real-time views
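The three layers above can be sketched in a few lines. This is a toy model, not a production design: the event shape, the page-view counting and the names `batch_view`, `SpeedLayer` and `query` are all assumptions chosen for the example.

```python
from collections import Counter
from typing import Iterable, Tuple

Event = Tuple[str, int]  # (page, views): a hypothetical event shape


def batch_view(master_dataset: Iterable[Event]) -> Counter:
    """Batch layer: recompute the view from the full append-only dataset."""
    view: Counter = Counter()
    for page, views in master_dataset:
        view[page] += views
    return view


class SpeedLayer:
    """Speed layer: incrementally tracks events the last batch run has not seen."""

    def __init__(self) -> None:
        self.realtime_view: Counter = Counter()

    def ingest(self, event: Event) -> None:
        page, views = event
        self.realtime_view[page] += views


def query(page: str, batch: Counter, speed: SpeedLayer) -> int:
    """Serving side: merge the pre-computed batch view with the real-time view."""
    return batch[page] + speed.realtime_view[page]


master = [("home", 3), ("checkout", 1), ("home", 2)]
batch = batch_view(master)      # slow, complete, pre-computed
speed = SpeedLayer()
speed.ingest(("home", 1))       # arrived after the last batch run
print(query("home", batch, speed))      # 6
print(query("checkout", batch, speed))  # 1
```

The cost of this design is visible even in the sketch: the counting logic exists twice, once in the batch path and once in the speed path.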

A Kappa architecture is a simplified version of Lambda, where data flows through a single combined batch and speed pipeline and then into a serving layer. It’s useful when data is more homogeneous, so the batch and stream data are fairly similar.
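A minimal sketch of the Kappa idea, under the assumption that all data lives in one append-only log and that a view is rebuilt by replaying that log through a single code path. The names `run_stream_job`, `count_views` and `count_events` are hypothetical.

```python
from typing import Callable, Dict, List, Tuple

Event = Tuple[str, int]  # (page, views): a hypothetical event shape
View = Dict[str, int]


def run_stream_job(log: List[Event], update: Callable[[View, Event], None]) -> View:
    """Build a serving view by replaying the log through one pipeline."""
    view: View = {}
    for event in log:
        update(view, event)
    return view


def count_views(view: View, event: Event) -> None:
    page, views = event
    view[page] = view.get(page, 0) + views


def count_events(view: View, event: Event) -> None:
    page, _ = event
    view[page] = view.get(page, 0) + 1


log: List[Event] = [("home", 3), ("checkout", 1), ("home", 2)]
print(run_stream_job(log, count_views))   # {'home': 5, 'checkout': 1}

# New data and new logic are both handled by replaying the same log.
log.append(("home", 1))
print(run_stream_job(log, count_events))  # {'home': 3, 'checkout': 1}
```

There is no separate batch code to keep in sync; reprocessing history means running the same job again from the start of the log.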

You can read more about it here and here.

What about the frontend?

Say you have a large and complicated checkout web page, which is changed by different functional teams, for example, Growth and Customer Satisfaction. Both teams want the same things:

Keep the page operational to ensure users can pay
Minimise the page latency to provide a great user experience
Deploy fast experiments with new features or variations

The biggest question in this scenario is Who owns what when things go bad?

GitHub found a way to do that in a monolith with file ownership.

Another way is to split the frontend as well. This could be done page by page or within a single page.

Each team owns components/fragments on the page, separated by responsibility
Teams share a common design language
Teams maintain a common UI library (very hard in practice, needs a coordinator)
Each component has its own CI/CD
Each component can be built with different tech stacks. We don’t all need to be React developers.

That’s the idea behind microfrontends. There are many ways to implement this. Each big tech company claims to do it better with its own open source framework. One size doesn’t fit all.

These are not decisions that the tech team should make in isolation from the business and product, just because they favour one or another.

The communication should be both ways:

The business can think about the likelihood of hitting gold quickly
Tech and product can think ahead about how the system will handle an extra order of magnitude

Decide whether to monolith or not based on the system’s size and complexity and team distribution
Think ahead for at least one order of magnitude, some companies hit gold quickly :)

Decide between control and automation of services.

Server-based vs serverless

Cloud providers removed the need to provision and maintain expensive physical servers upfront by offering Infrastructure-as-a-Service (IaaS) and Platform-as-a-Service (PaaS) at a fraction of the cost. Every business is nowadays a technology business.

Serverless and Function-as-a-Service (FaaS) go further by dynamically and transparently managing the allocation and provisioning of underlying servers. Application code runs in long-running stateless containers or event-triggered lambdas/functions. The cloud provider manages execution based on configurable runtime environments and dependencies. Examples: AWS Elastic Beanstalk, AWS Fargate, Google Cloud Run, GAE, Google Cloud Functions, AWS Lambda, Kubeless, Knative, Serverless etc.

Serverless increases scalability and removes maintenance overhead. The trade-off is less control over the underlying host machines.

Scale-to-Zero reduces costs even more. There’s no need to pay for long-running servers when nobody is using the application or a particular component. This helps early stage companies cheaply test their product propositions and established ones do A/B testing.
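The cost argument can be made concrete with a little arithmetic. Every number below is made up for illustration (hourly rate, per-second rate, request duration); substitute your provider's real prices before drawing conclusions.

```python
def always_on_cost(hourly_rate: float, hours: float) -> float:
    """Cost of a server that runs whether or not anyone uses it."""
    return hourly_rate * hours


def scale_to_zero_cost(requests: int, seconds_per_request: float,
                       rate_per_second: float) -> float:
    """Cost when you only pay for the compute time requests actually use."""
    return requests * seconds_per_request * rate_per_second


def break_even_requests(hourly_rate: float, hours: float,
                        seconds_per_request: float,
                        rate_per_second: float) -> float:
    """Request volume at which both pricing models cost the same."""
    return always_on_cost(hourly_rate, hours) / (seconds_per_request * rate_per_second)


# Hypothetical prices for one month (720 hours) of a small service.
HOURLY_RATE = 0.05          # assumed price of an always-on instance
RATE_PER_SECOND = 0.00002   # assumed price of one second of on-demand compute
SECONDS_PER_REQUEST = 0.2   # assumed average request duration

print(f"always on:     {always_on_cost(HOURLY_RATE, 720):.2f}")
print(f"scale to zero: {scale_to_zero_cost(100_000, SECONDS_PER_REQUEST, RATE_PER_SECOND):.2f}")
print(f"break-even at about "
      f"{break_even_requests(HOURLY_RATE, 720, SECONDS_PER_REQUEST, RATE_PER_SECOND):,.0f} requests")
```

Under these assumed prices, low-traffic services are far cheaper with scale-to-zero, while sustained high traffic eventually favours the always-on server.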

Serverless trade-offs

You lose the ability to control resource pricing. For example, if your business needs are predictable you could negotiate with the cloud provider to reserve instances up to three years in advance, giving you economies of scale. Or, if your system has flexible and fault-tolerant components, you can benefit from cheaper, interruptible instances like AWS Spot.
You lose the ability to provide a consistent multi-cloud developer experience. I met many ML/AI companies that are multi-cloud because of better price negotiation and discounts. In that case, if you want to provide ready-made images for your teams to start new components e.g. database + server + cache, you want to create your own container images and orchestration which are not cloud-specific. You can’t easily do that with serverless, although an infrastructure-as-code tool helps a bit.
Serverless is scalable but not limitless. If your system has very high CPU needs, e.g. for transcoding videos, you might hit serverless limitations. For example, AWS Lambda functions can only execute for 15 minutes at a time and you can’t exceed a concurrency quota (e.g. 3000 concurrent invocations), although some of these limits might be negotiable with the cloud provider.
Serverless has a lower learning curve. However, some types of systems, like real-time financial trading platforms, might need to go really low level, for example making sure the clocks are synchronised with NTP across the cloud provider hosts. This level of tuning is not possible with serverless, and running your own servers requires more DevOps expertise in the team.

Define the long running services
Define the short-lived services

Determine if services scale to zero.

This reduces costs for early stages of development.

Services support scale-to-zero
Scale-to-zero is not as important

Determine if services are polyglot.

Services are created by multiple teams with different coding languages
Services are created with the same technology stack

Determine how services communicate.

For example, inter-service and between services and data stores.

Services communicate asynchronously via messages (events)
Services communicate synchronously via API calls
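The difference between the two styles can be sketched with an in-process stand-in. Here a `queue.Queue` plays the role of a message broker and a direct method call plays the role of an API call; the service and function names are hypothetical.

```python
import queue
from typing import Dict, List


class InventoryService:
    """A downstream service that reserves stock for orders."""

    def __init__(self) -> None:
        self.reserved: List[str] = []

    def reserve(self, order_id: str) -> bool:
        self.reserved.append(order_id)
        return True


def place_order_sync(order_id: str, inventory: InventoryService) -> str:
    """Synchronous style: the caller blocks until the callee answers."""
    ok = inventory.reserve(order_id)
    return "confirmed" if ok else "rejected"


def place_order_async(order_id: str, events: "queue.Queue[Dict[str, str]]") -> str:
    """Asynchronous style: publish an event and return immediately."""
    events.put({"type": "order_placed", "order_id": order_id})
    return "accepted"


def drain(events: "queue.Queue[Dict[str, str]]", inventory: InventoryService) -> int:
    """Consumer side: handle whatever events are waiting, at its own pace."""
    handled = 0
    while True:
        try:
            event = events.get_nowait()
        except queue.Empty:
            return handled
        if event["type"] == "order_placed":
            inventory.reserve(event["order_id"])
        handled += 1


inventory = InventoryService()
events: "queue.Queue[Dict[str, str]]" = queue.Queue()
print(place_order_sync("order-1", inventory))   # confirmed
print(place_order_async("order-2", events))     # accepted
print(inventory.reserved)                       # ['order-1']
print(drain(events, inventory), inventory.reserved)
```

The synchronous caller knows the outcome straight away but is coupled to the callee's availability; the asynchronous caller only knows the event was accepted and the work happens later.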

Determine how to secure service communication.

Service calls are authenticated and authorised via a custom mechanism e.g. JWT tokens
Services have Identity and Access Management (IAM) roles and policies
Failures and retries are handled
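As a rough sketch of the first and last items, the block below signs a request payload with a shared secret and retries a flaky call with exponential backoff. The HMAC signature is a simplified stand-in for a real token scheme such as JWT, and the function names, secret and payload are assumptions for this example.

```python
import hashlib
import hmac
import time
from typing import Callable, List, TypeVar

T = TypeVar("T")


def sign(payload: bytes, secret: bytes) -> str:
    """Caller side: attach a signature so the callee can verify the sender."""
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def verify(payload: bytes, signature: str, secret: bytes) -> bool:
    """Callee side: constant-time comparison avoids leaking timing information."""
    return hmac.compare_digest(sign(payload, secret), signature)


def call_with_retries(call: Callable[[], T], attempts: int, base_delay_s: float,
                      sleep: Callable[[float], None] = time.sleep) -> T:
    """Retry on ConnectionError, doubling the delay after each failure."""
    for attempt in range(attempts):
        try:
            return call()
        except ConnectionError:
            if attempt == attempts - 1:
                raise  # out of attempts: surface the failure to the caller
            sleep(base_delay_s * 2 ** attempt)
    raise ValueError("attempts must be at least 1")


SECRET = b"demo-secret"  # hypothetical; load real secrets from a secret store
payload = b'{"order_id": "order-1"}'
signature = sign(payload, SECRET)
print(verify(payload, signature, SECRET))                     # True
print(verify(b'{"order_id": "order-2"}', signature, SECRET))  # False

failures = iter([True, True, False])  # fail twice, then succeed
delays: List[float] = []


def flaky_call() -> str:
    if next(failures):
        raise ConnectionError("service unavailable")
    return "ok"


print(call_with_retries(flaky_call, attempts=5, base_delay_s=0.1, sleep=delays.append))
print(delays)  # [0.1, 0.2]
```

Retries are only safe when the operation is idempotent, so it helps to decide that per endpoint rather than globally.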

Decide between control vs automation.

Services are fully managed e.g. Lambda, Elastic Beanstalk, ECS/EKS
Services have more access to host machines, e.g. EC2 instances

Enable as much service automation as possible.

Machine images are created and ready to spin new instances quickly
Instance disk drives have automatic snapshots enabled
Application code is packaged as containers or functions with all dependencies
The infrastructure is packaged as code and versioned properly

Determine the service storage volume types.

Services need highly-durable, long-term persistence like EBS
Services can do with cheaper ephemeral storage like Instance Store

Determine the kind of writes services are optimised for.

Services are IOPS-intensive and SSDs are more performant
Services perform sequential writes for analytics and HDDs are more cost-efficient

Determine how services are discoverable.

Through an API Gateway
Through a service mesh architecture
Through hard-coded URLs (maybe not)
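To show why a lookup step beats hard-coded URLs, here is a toy in-memory registry with round-robin resolution. It only illustrates the idea behind gateways and meshes; the class name, service name and addresses are all hypothetical.

```python
import itertools
from typing import Dict, Iterator, List


class ServiceRegistry:
    """Maps a logical service name to its currently registered instances."""

    def __init__(self) -> None:
        self._instances: Dict[str, List[str]] = {}
        self._cursors: Dict[str, Iterator[str]] = {}

    def register(self, name: str, url: str) -> None:
        self._instances.setdefault(name, []).append(url)
        # Restart the round-robin cycle so the new instance is included.
        self._cursors[name] = itertools.cycle(list(self._instances[name]))

    def resolve(self, name: str) -> str:
        """Return the next instance for `name`, rotating across instances."""
        if name not in self._cursors:
            raise LookupError(f"no instances registered for {name!r}")
        return next(self._cursors[name])


registry = ServiceRegistry()
registry.register("payments", "http://10.0.0.1:8080")  # hypothetical addresses
registry.register("payments", "http://10.0.0.2:8080")

# Callers ask for "payments" by name and never embed an address.
print([registry.resolve("payments") for _ in range(3)])
```

Instances can then be added, moved or replaced without redeploying every caller, which is what hard-coded URLs make painful.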

Determine the logging and monitoring metrics.

It’s best to define good metrics to track based on the system needs. For example, we might care more about

CPU load for text-based ML or a service that handles data compression
GPU for image-based ML or crypto
RAM for a search and index engine or a cache
IOPS for OLTP
Throughput for big data and log processing

Good metrics are defined
Alarms are set with reasonable thresholds and appropriate channels
Visible dashboards are configured and shared
Access to logs and alarms is properly secured
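A checklist like this can be turned into something testable. The sketch below models alarms as metric, threshold and channel, and evaluates them against one sample of metrics; the metric names, thresholds and channel names are assumptions for illustration, not recommendations.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Alarm:
    metric: str       # which metric to watch, e.g. "cpu_load"
    threshold: float  # fire when the sampled value exceeds this
    channel: str      # where the notification should go


def evaluate(alarms: List[Alarm], sample: Dict[str, float]) -> List[str]:
    """Return one notification per alarm whose threshold is exceeded."""
    notifications: List[str] = []
    for alarm in alarms:
        value = sample.get(alarm.metric)
        if value is not None and value > alarm.threshold:
            notifications.append(
                f"[{alarm.channel}] {alarm.metric}={value} exceeds {alarm.threshold}")
    return notifications


# Hypothetical alarms for a compression service and a cache.
alarms = [
    Alarm(metric="cpu_load", threshold=0.8, channel="#oncall"),
    Alarm(metric="ram_used_gb", threshold=14.0, channel="email"),
]
sample = {"cpu_load": 0.93, "ram_used_gb": 9.5}
print(evaluate(alarms, sample))  # ['[#oncall] cpu_load=0.93 exceeds 0.8']
```

Keeping alarm definitions as data like this also makes them easy to version alongside the infrastructure code.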