Priorities in System Design

Introduction

In any system there are competing priorities that need to be balanced. However, not all priorities are created equal!

As an example, you may need to ensure you’re using brand colors in your UI, but that is clearly less important than ensuring you’re not leaking credit card data.

The Priorities of any System

Like Asimov’s Three Laws of Robotics, there are certain priorities that should guide the design of any system. Each priority builds upon the previous one, but earlier priorities always take precedence over later ones. In other words, you should never sacrifice a higher priority for the sake of a lower priority.

They are also sub-grouped into three categories: Non-Negotiable, Negotiable, and Evolutional.

Non-Negotiable Priorities

1. Security

Beyond anything else, a system must be secure. We need to protect our users’ data, but we also need to ensure nobody can get root access to our servers. This is Priority One for the simple reason that the most secure system is the one that is turned off. Presumably that’s not an option, so we need to work backwards from there and ensure we have the smallest possible attack surface, and that the surface we do have is as well designed for security as we are able to make it.

2. Durability

Durability is an interesting word. It’s often confused with Availability or Reliability, but it’s actually a different concept.

Durability is about ensuring that data is not lost, even when something goes wrong. If we still have the data, at least we can try to do the thing again. If we lose it, it’s game over.

This makes it an interesting priority because it directly influences the design of the system. It affects the order of operations, the way we handle errors, and the way we design our data storage.
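As a minimal sketch of how durability shapes the order of operations, the hypothetical handler below writes an incoming order to an append-only journal file and forces it to disk before acknowledging it. The names record_order and replay, and the JSON-lines journal format, are assumptions made for illustration and are not taken from any particular system.

```python
import json
import os


def record_order(journal_path: str, order: dict) -> str:
    """Persist the order durably, and only then acknowledge it."""
    line = json.dumps(order) + "\n"
    with open(journal_path, "a", encoding="utf-8") as journal:
        journal.write(line)
        journal.flush()             # push Python's buffer to the OS
        os.fsync(journal.fileno())  # ask the OS to write it to disk
    # Acknowledge only after the data has been stored
    return "accepted"


def replay(journal_path: str) -> list:
    """Read back every recorded order so failed work can be retried."""
    with open(journal_path, encoding="utf-8") as journal:
        return [json.loads(line) for line in journal if line.strip()]
```

If the process crashes after record_order returns, the order is still in the journal and replay can hand it back for another attempt. Acknowledging first and writing afterwards would reverse that guarantee.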

3. Correctness

It seems obvious that a system needs to be Correct, but it’s surprising how often we seem to accept “bugs”. For some reason we seem to have accepted that software is inherently buggy, and that we just need to live with it. This is a failure of our design, not an inevitability of software development.

Correctness is about ensuring that our system does what it is supposed to do, and that it does it correctly. It’s obvious that we don’t want our bank to make a mistake and lose our money, but we also don’t want our social media platform to accidentally post something we didn’t intend to post. We need to ensure our customers receive the products they purchased and that they are not overcharged or undercharged. We need to ensure our users can trust our system to do what it is supposed to do, and to do it correctly.

Negotiable Priorities

The following three priorities are negotiable. You can’t sacrifice the non-negotiable priorities for the sake of these, but you can make trade-offs between them. Your system sits at a point somewhere in the triangle between these three priorities, and you need to decide where that point is based on your users’ needs and your business goals.

An obvious example is that adding an extra 9 to your availability may worsen your latency and increase your costs. If you don’t have the money, or if your users don’t care about that extra 9, then you shouldn’t add it. If you do have the money, and your users do care about that extra 9, then you should add it. But you shouldn’t add it just because it sounds good, or because your boss wants it, or because your competitors have it. You should add it because it is the right thing to do for your users and your business.

4. Availability

At last, we get to Availability! This is often the first thing that gets thought about when designing a system - mostly because it’s a solved problem. Modern technology means we can achieve very high levels of availability with relative ease, and it’s commonplace to over-spec a system to ensure it is always available.

Using the SRE model, though, we shouldn’t just throw money at the problem (more on this in the Cost section); we should actively design our system to be “just available enough”. We shouldn’t be unreasonable and aim for 100% availability just because it feels right, or makes our boss feel warm and fuzzy inside. This is the first of the three negotiable priorities.
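To make “just available enough” concrete, here is a small sketch that converts an availability target into the downtime it permits per year. The function name allowed_downtime_minutes is made up for this example, and the calculation assumes a 365-day year.

```python
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Minutes per year a system may be down and still meet the target."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


for target in (99.0, 99.9, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    print(f"{target}% -> {minutes:.1f} minutes/year")
```

Each extra 9 cuts the allowed downtime by a factor of ten: roughly 3.65 days at 99%, under 9 hours at 99.9%, about 53 minutes at 99.99%, and a little over 5 minutes at 99.999%. Seeing the numbers side by side makes it easier to ask whether your users would actually notice the difference.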

5. Latency / Performance

I’ve given this priority two names because they’re kind of interchangeable. Essentially we need answers, and we need them ~now!~ as soon as is practically possible within the constraints of our system.

It’s technically a lower priority than availability because I’d rather a process complete late than not be able to start it in the first place.

6. Cost

Cost is the easiest to quantify, and the most likely to be lied about! Companies often say they want the best experience for their users, and hey, look, it’s priority 6 - it barely matters, right?

The reality is that cost is often the most important priority; we just don’t like to talk about it. Really, this is the one that underlies all of the rest:

We want the safest, most durable, most correct, most available, and best performing system that we can afford.

The interesting thing is, if 1-5 are done really well, costs tend to be lower than if they are done poorly. It also tends to be the case that if 1-5 are done really well, users are more likely to pay for the system, which means we have more money to spend on the system, which means we can do 1-5 even better. It’s a virtuous cycle.

Evolutional Priorities

These are priorities that aren’t necessary for the system to function, but they are important for the long-term success of the system. They are about ensuring the system can evolve over time, and that it can adapt to changing requirements and user needs.

7. Self-Sustaining

There are other evolutional priorities we could add here, but I think this one is the most important. A system should be designed to be self-sustaining, meaning it can operate with minimal human intervention. In the event of a problem, the system should be able to recover itself, or at least provide enough information for a human to intervene and fix the problem.

This used to be extremely hard, but with modern infrastructure and automation tools, it’s becoming increasingly possible to design systems that are self-sustaining.

Cloud systems emit events when all sorts of things happen (and we should emit our own custom events too). We can use those events to trigger corrective actions. In the worst case we can alert a human who can A) fix the problem, and then B) automate the fix for next time.
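The loop described above could be sketched roughly as follows. Everything here is hypothetical: the event names, the REMEDIATIONS registry, and the ALERTS list (a stand-in for a real paging system) are assumptions for illustration, not the API of any cloud provider.

```python
from typing import Callable, Dict, List

# Hypothetical registry mapping event types to automated fixes
REMEDIATIONS: Dict[str, Callable[[dict], None]] = {}
ALERTS: List[str] = []  # stand-in for a real paging system


def remediation(event_type: str):
    """Register a function as the automated fix for an event type."""
    def register(func: Callable[[dict], None]) -> Callable[[dict], None]:
        REMEDIATIONS[event_type] = func
        return func
    return register


def alert_human(event: dict, reason: str) -> None:
    """Worst case: hand the event to a person, with context."""
    ALERTS.append(f"{event['type']}: {reason}")


def handle_event(event: dict) -> str:
    """Try the automated fix first; fall back to alerting a human."""
    fix = REMEDIATIONS.get(event["type"])
    if fix is None:
        alert_human(event, "no automated fix registered")
        return "alerted"
    try:
        fix(event)
        return "remediated"
    except Exception as error:
        alert_human(event, f"automated fix failed: {error}")
        return "alerted"


@remediation("disk_nearly_full")
def purge_temp_files(event: dict) -> None:
    # A real fix would clean up the affected host; here we only mark the event
    event["purged"] = True
```

Once a human has fixed a new kind of problem by hand, step B is to register one more function with the remediation decorator so that the same event is handled automatically next time.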

Conclusion

When designing a system, it’s important to consider the priorities that should guide our design decisions.

I’ve spent a lot of time on-call, and whether the problem is tiny or huge, waking up at 3am and trying to deal with a broken system isn’t fun. If we design our systems with these priorities in mind, we can reduce the likelihood of waking up at 3am, and we can make it easier to deal with the problems when they do happen.