Random Notes on System Design

Introduction

Over the years I have picked up various insights and best practices related to system design. This post is a collection of random notes that I find useful when thinking about designing scalable and maintainable systems.

They’re not particularly organized, but they come up time and again in my mind when considering different aspects of system design. Working with teams can sometimes be difficult because different people have different mental models of how systems should be designed. I hope that by sharing these notes, I can help align some of those mental models.

One-Liners

  • Simple is generally the most resilient

The Priorities of a System

When designing a system, it’s important to consider the following priorities, in this specific order of importance:

  1. Security: The system must, beyond any other consideration, be secure. This includes protecting data, ensuring privacy, and preventing unauthorized access.
  2. Durability: The system should be designed to withstand failures and ensure data integrity over time. It is more important to ensure data is not lost than to ensure high availability or performance.
  3. Availability: The system should be available to users as much as possible, but not at the expense of security or durability.
  4. Latency: The system should perform well enough to satisfy users’ needs. It does not need to be perfect.
  5. Correctness: The system should function correctly and produce accurate results.
  6. Cost: The system should be cost-effective, but not at the expense of security, durability, availability, latency, or correctness.
  7. Self-Sustaining: The system should be able to operate with minimal human intervention. Any human intervention should be a priority for improvement of the system, not for maintenance of the system.

An important note is that these priorities should be invisible to the end users. If they notice, that’s a failure beyond the system’s user-facing design and functionality, and should be addressed as an existential priority.

The Point of a System

The point of any system, whether it is digital or physical, is to fulfil a user’s need. A system is not a piece of art, it is not a monument to human ingenuity, it is not a testament to the skill of its creators. A system exists solely to serve its users.

A system that does not serve its users is a failure, regardless of how well it is designed or how impressive its architecture may be.

Evolution as an Agile process

Many companies say they are “Agile”, but the reality is they often are not. Using daily stand-ups, two-week sprints, and other Agile rituals does not make a company Agile. Agile coaches or Scrum Masters are not required, and often appear to sustain process over evolution. Somehow Agile got co-opted into a set of rituals and ceremonies that have little to do with its original intent.

Agile is the embodiment of continuous evolution. A system should be designed to evolve over time, to adapt to changing requirements, and to improve based on user feedback.

In Darwin’s Theory of Evolution, the most important factor for survival is the ability to adapt. A system only needs to be “good enough” for its current environment, and to excel in that environment it needs to be “a little bit better than the competition”.

By contrast I often see companies trying to design systems that are “perfect” from the start, or that are “future-proof”. This is a flawed approach, as it often leads to over-engineering, increased complexity, and a lack of adaptability. Essentially this is Waterfall thinking implemented using Agile rituals to tell ourselves we are Agile.

A perfect process doesn’t exist (after all, there is no perfect animal or plant in nature), but Agile gives us an evolutionary cycle:

  • Build -> Measure -> Learn -> Adapt -> Repeat

Every iteration should end with a working system that we hope is better than the previous iteration. By learning from this system, we can adapt our design and improve it over time.

Internally we should be using Metrics and Observability to measure how well our system is performing, and to identify areas for improvement. This data should inform our decisions about how to adapt our system in the next iteration.

Externally we should be talking to users to gather feedback on how well our system is meeting their needs. This feedback should also inform our decisions about how to adapt our system in the next iteration.

From this we can plan the next iteration. Sometimes that will be to add or extend functionality, but often it should be to wind back functionality because it was an evolutionary misstep. Unfortunately that’s often viewed as failure, when in reality it’s just part of the evolutionary process.

To quote a line often attributed to Einstein:

“A man who has never failed has never tried”

Failure should be more common than success in our evolutionary process, yet we often treat it as a taboo subject. We should embrace failure as a learning opportunity, and use it to inform our future decisions.

Build for 3am, when it’s all gone to hell….

No matter what we do, something will always go wrong. Servers will crash, networks will go down, databases will become corrupted.

When designing a system, we should always be thinking about how it will behave in the worst-case scenario.

Computers don’t care about data. People do.

Many AWS resources, for example, can end up with randomly generated names. If your IaC creates an S3 bucket without specifying a name, the tooling (CloudFormation, for instance) will generate one for you, and it’s not pretty. That’s fine for the machines, because our IaC got that name and passed it to our Lambda functions as an env var. But what happens when we need to debug something at 3am because the system is down? We have no idea which bucket is which, and we have no idea which Lambda function is which. We are going to be in a world of pain. It’s not impossible to debug, but it’s going to take a lot longer than if we had just used sensible names in the first place.
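One way to make sensible names the default is to build them from a fixed convention rather than typing them by hand. The sketch below is only an illustration: the `<project>-<env>-<component>-<kind>` layout, the `resource_name` helper, and the example values are all assumptions of mine, and the validation is a deliberately conservative subset of typical naming limits, so check each service’s own rules before relying on it.

```python
import re

# Conservative, hypothetical rule: 3-63 characters, lowercase letters,
# digits and hyphens, starting and ending with a letter or digit.
_VALID_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def resource_name(project: str, env: str, component: str, kind: str) -> str:
    """Build a human-readable name such as 'orders-prod-invoice-pdfs-bucket'."""
    parts = [
        part.strip().lower().replace("_", "-")
        for part in (project, env, component, kind)
    ]
    name = "-".join(parts)
    # Reject empty parts and anything outside the conservative rule above.
    if not all(parts) or not _VALID_NAME.fullmatch(name):
        raise ValueError(f"not a usable resource name: {name!r}")
    return name


if __name__ == "__main__":
    print(resource_name("orders", "prod", "invoice_pdfs", "bucket"))
```

The value is not in the helper itself but in the fact that at 3am the name alone tells you which project, environment, and component a resource belongs to.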

When designing a system, we should always be thinking about how it will behave when things go wrong. We should be thinking about how we will debug it, how we will recover from failures, and how we will ensure that our users are not impacted.

Incidents

Analysis

  • Triage is fast, but shallow. You’re analyzing to pinpoint the problem so you can mitigate it quickly. You are not trying to find the root cause.
  • Post-mortems are slow, but deep. You’re analyzing to find the root cause so you can prevent it from happening again. You are not trying to mitigate the problem quickly.

Log Files

  • Log files are the API your operators use to debug an issue
  • Log files were not intended for this specific issue
  • OTel is making this better, but we still have a long way to go, and telemetry is often sampled to save money
  • Design your log files with the assumption that they will be used to debug issues in the future
  • Ensure your log files have sufficient context to understand what was happening at the time of the incident (OTel helps a lot here with tracing)
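As a rough illustration of “sufficient context”, here is a minimal sketch using only Python’s standard `logging` module. It writes one JSON object per line and carries a few context fields on every entry. The field names (`request_id`, `user_id`, `component`) and the `build_logger` helper are my own assumptions, not a standard; use whatever identifiers let an operator follow one request through your system.

```python
import json
import logging


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object on one line."""

    # Hypothetical context fields; choose ones that fit your system.
    CONTEXT_FIELDS = ("request_id", "user_id", "component")

    def format(self, record: logging.LogRecord) -> str:
        entry = {
            "ts": self.formatTime(record),
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Values passed via `extra=` become attributes on the record.
        for field in self.CONTEXT_FIELDS:
            value = getattr(record, field, None)
            if value is not None:
                entry[field] = value
        if record.exc_info:
            entry["exception"] = self.formatException(record.exc_info)
        return json.dumps(entry)


def build_logger(name: str, stream=None) -> logging.Logger:
    """Return a logger that emits JSON lines to `stream` (stderr by default)."""
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger = logging.getLogger(name)
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("billing")
    log.info(
        "invoice stored",
        extra={"request_id": "req-123", "component": "invoice-writer"},
    )
```

Because every line is structured and carries the same identifiers, an operator can filter on a single `request_id` instead of guessing which lines belong together.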

AI can be really good at finding patterns and sequences of events in large, incoherent log files

  • Consider using AI to help analyze log files during incidents
  • AI might not be the smartest operator, but it doesn’t get tired, and it doesn’t miss tiny changes between two nearly identical lines
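To show the kind of “tiny change” a tired operator misses, here is a small sketch. It is not AI; it uses Python’s standard `difflib` to pull out only the characters that differ between two lines. The `changed_spans` helper and both log lines are invented for illustration.

```python
import difflib


def changed_spans(before: str, after: str) -> list[tuple[str, str]]:
    """Return (old, new) pairs for every stretch where the two lines differ."""
    matcher = difflib.SequenceMatcher(a=before, b=after, autojunk=False)
    return [
        (before[i1:i2], after[j1:j2])
        for tag, i1, i2, j1, j2 in matcher.get_opcodes()
        if tag != "equal"
    ]


if __name__ == "__main__":
    # Two made-up log lines that differ in a single character.
    healthy = "03:00:01 GET /orders/1841 upstream=10.0.4.17 status=200"
    failing = "03:00:01 GET /orders/1841 upstream=10.0.4.17 status=500"
    for old, new in changed_spans(healthy, failing):
        print(f"{old!r} -> {new!r}")
```

A character-level diff like this only works when you already know which two lines to compare. Finding those lines in a large, noisy log is where AI assistance may earn its keep.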