Developing and running software systems is a complicated balancing act between keeping the lights on, satisfying competing demands from the business and its customers, and staying ahead of the competition. Attempting to incorporate all of these needs while also working out the “right” thing to do next is nearly impossible.
The SRE approach prioritises the operational running of a system above all else, but even that is too general to help us. Instead, we need to define a language of work types and agree how we will split our time in a worst-case situation. As the situation improves and we clear our backlogs of higher-priority work, we can spread the remaining time into higher-level work types.
Classifying work with a hierarchical “type” and defining the priority of each work type helps us decide where to focus our energy, and helps justify why some work was chosen over other work.
If we classify the work in our backlogs according to its “type”, and prioritise work types by their importance to running stable, reliable, compliant systems, we free ourselves to do “higher-level” work without worrying about the behaviour of our products, because our foundations will be solid enough to support us without constant attention.
While we draw the hierarchy of importance as a triangle, with foundational work as the largest and lowest part, we intend to “invert” the triangle by ensuring the foundations are so well taken care of that we spend the least time on them. We can only get there by consistently prioritising the foundations above all other work.
Human Hierarchy of Needs
Maslow’s hierarchy of needs is a theory from psychology that gives a five-tier model of human needs, depicted as a pyramid. A segment’s location in the pyramid shows its priority: needs at lower levels have a higher priority and override the needs at the layers above them.
The idea is that base needs such as food and water are more important than higher-level needs such as education (self-improvement). A person with no food or water is unlikely to be concerned with whether they are going to pass an exam.
Product Hierarchy of Needs
Technical products are very similar to people. While they don’t need food and water, they do need basic working hardware to run on. If a system isn’t even able to stay online, its owners are unlikely to care about the outcome of the latest spike in which an engineer investigated a potential feature.
While the hierarchy of human needs is relatively simple to understand, it’s not immediately clear what the hierarchy should be for a product. Even so, it’s not particularly hard to find an obvious home for the majority of things product teams regularly work on.
For a human, self-actualisation refers to becoming better than one’s current self. For a product this translates fairly simply into discovery work or, in Agile terms, a Spike.
A Spike is used to try out some ideas. Spikes don’t make it to production themselves, but they do help us learn how an idea might work, what the constraints are, and what we might otherwise not have fully understood. In this sense a Spike is the betterment of the product and should sit at the top level. If there are other problems with our product, we don’t want to spend time on blue-sky thinking while customers are leaving and not paying our bills.
Esteem for a human is where, in the developed world at least, we spend a lot of our time: doing our jobs. For a product, this means making it better by implementing new features that we know our customers want.
It is enjoyable to add new features or to improve old ones, but again it’s not particularly important if our service won’t stay online. We can only afford to do this work if our lower-level needs are already fulfilled and we have spare capacity.
Product Love and Belonging
Our products don’t work in a vacuum. They have customers (whether a customer is a human or another system doesn’t matter) who we need to love our product. They need to be able to rely on our product to work as expected.
A prime candidate for this category is bug fixing. The product isn’t “broken” as such, but it may have unexpected behaviour in certain circumstances, or may perform poorly. These are the things that drive customers away towards our competitors, because customers get irritated by the problems the product causes them.
Fixing bugs is more important than adding new features because we need to keep current customers happy but, as with the other higher categories, it’s not as important as fixing a system that can’t even stay up to serve our customers in the first place.
Product Safety Needs
There are some things related to a product that draw a line between prototype systems that we wouldn’t trust, and professional systems that we would.
Product Safety covers things such as data security, authentication and authorisation. We need to be sure that when a user accesses our product they only see data to which they have been given access. We also need to be sure that only the authorised users are allowed access.
While bugs are annoying, it’s even worse to find out that someone else has access to our banking information, for example. The handling of Personally Identifiable Information is subject to legal requirements. If we aren’t sure it is being handled well, it’s almost guaranteed that all higher-level work will be stopped until this is resolved.
Product Physiological Needs
Products, like people, have physiological requirements. They need to run on infrastructure, and that infrastructure needs to be available for customers to access.
At this layer we need to include not just the infrastructure itself, but connectivity between components, the ability to detect failures and the ability to resolve problems before a customer notices. Here we include on-call rotations, run-books, monitoring, alerting, dashboards etc.
This is an area that, in an ideal world, we would rather not have to think about. A huge reason why we use AWS is that it provides an awful lot out of the box to help us with this layer.
How does this relate to Tech Products?
Agile prescribes working in short sprints, with each sprint delivering a working product.
What Agile doesn’t do, however, is tell us how to decide what to work on.
It’s common to spend time defining work, refining it, and estimating its complexity so that we can break it down into chunks we can deliver in a single sprint. We then load up a sprint with an amount of complexity that lets us deliver what we said we would.
In principle that’s fine, except for one small problem - who decides how important a chunk of work is?
The hierarchy of human needs, when drawn as the pyramid above, shows priority in terms of vertical location and importance as width. What it doesn’t show, however, is how much time is spent on each section.
A caveman might spend most of their time hunting for food and water. They will also spend a reasonable amount of time ensuring their own personal safety. They want to spend some time with friends and family, ensuring their place in their community. They spend a little time providing for their community in general to maintain their place in society and, if there is any time left over, they might practise their skills to become better.
In the developed world, however, the pyramid is turned on its head. I spend very little time on food and water - I turn on a tap and I can drink, and I open the fridge to get food. I own a home with locks on its doors, although I do spend a little time occasionally making sure I repair things, and I mow my garden. I have a family, which I spend as much of my spare time with as I can. I work every day in a job I enjoy and get paid for doing. It takes up a lot of my time, but I get respect for doing my job. I spend a good deal of time reading about systems engineering and AWS services, gaining certifications, and so on. Suddenly the base-layer things are small, with more and more time spent at the higher layers.
The hierarchy of needs of a product is identical in many respects. If we have tickets relating to the lowest layer, we should prioritise those over all else. The fundamental foundations of our product are the most important thing to work on, because without them everything else crumbles.
If there is not enough work at the bottom layer to fill our sprint, we can select tasks from the next layer up. As we work harder and harder to remove work at the bottom of the pyramid, we need to spend less and less time on it and can afford to spend our available time on progressively higher layers. Over time, lower layers become less cumbersome, and we can spend more and more of our time working on features and investigating ideas about how the future might look.
When a problem occurs at a lower layer, however, the hierarchy steps in and says it must be dealt with before anything else. That way we maintain a solid, healthy, reliable system that our users are happy with and continue paying for.
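The selection rule described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the layer identifiers, the `Ticket` type, the `plan_sprint` function and the example backlog are all hypothetical names invented here, and complexity is assumed to be a simple integer estimate.

```python
from dataclasses import dataclass

# Layers ordered from most fundamental (filled first) to least.
# The identifiers are illustrative, not an agreed naming scheme.
LAYERS = [
    "physiological",
    "safety",
    "love_belonging",
    "esteem",
    "self_actualisation",
]


@dataclass
class Ticket:
    title: str
    layer: str       # one of LAYERS
    complexity: int  # estimated complexity points


def plan_sprint(backlog, capacity):
    """Pick tickets from the lowest layer upwards until capacity runs out."""
    # sorted() is stable, so backlog order is preserved within a layer.
    ordered = sorted(backlog, key=lambda t: LAYERS.index(t.layer))
    selected, remaining = [], capacity
    for ticket in ordered:
        # Skip tickets that no longer fit; smaller ones later may still fit.
        if ticket.complexity <= remaining:
            selected.append(ticket)
            remaining -= ticket.complexity
    return selected


# Hypothetical backlog, deliberately listed in the "wrong" order.
backlog = [
    Ticket("Spike: evaluate a new search idea", "self_actualisation", 3),
    Ticket("Add export feature", "esteem", 5),
    Ticket("Fix intermittent timeout bug", "love_belonging", 3),
    Ticket("Tighten access permissions", "safety", 2),
    Ticket("Add disk usage alerting", "physiological", 5),
]

sprint = plan_sprint(backlog, capacity=10)
for ticket in sprint:
    print(f"{ticket.layer}: {ticket.title}")
```

With a capacity of 10 the three lowest layers use up the whole sprint, so the feature and the spike wait. One design choice worth noting: this greedy version lets a small higher-layer ticket slip in when a larger lower-layer one does not fit. A stricter team might prefer to stop at the first ticket that does not fit.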
Hierarchy of Needs and Incident Management
Given the hierarchy of needs, it becomes relatively clear how we should infer the “importance” of a type of work. This extends further, into the territory of incident management.
Incidents are measured as P1/2/3. P1 is the most important, and potentially the most damaging to the product. It is therefore reasonable to say that any incident in the Physiological layer should be a P1. Incidents in the Safety layer aren’t normally quite as damaging - an incorrect security configuration or a DB authentication problem isn’t a complete outage - but they definitely deserve to be a P2.
Bugs, however, are mostly just annoying. They annoy users, and they might cause some support work. We probably don’t need to get anyone out of bed at 3am, though. In that sense we can say P3 incidents live within the Love/Belonging layer.
Beyond P1/2/3 we don’t alert, but given that the layers above are squarely in the territory of product development and future investigation, that makes sense.
Any work that is needed in the lower three levels has the potential to become an incident if we don’t deal with it.
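As a rough sketch, the mapping from layer to incident priority described in this section could be written down as a lookup table. The layer identifiers and the `incident_priority` function are hypothetical names used for illustration only. `None` stands for “we don’t alert”.

```python
# Layer-to-priority mapping as described in the text above.
# None means the layer does not raise incidents at all.
INCIDENT_PRIORITY = {
    "physiological": "P1",
    "safety": "P2",
    "love_belonging": "P3",
    "esteem": None,
    "self_actualisation": None,
}


def incident_priority(layer):
    """Return "P1", "P2" or "P3" for a layer, or None if we don't alert."""
    if layer not in INCIDENT_PRIORITY:
        raise ValueError(f"unknown layer: {layer}")
    return INCIDENT_PRIORITY[layer]


for layer in INCIDENT_PRIORITY:
    print(layer, "->", incident_priority(layer) or "no alert")
```

Raising an error for an unknown layer, rather than quietly returning `None`, is a deliberate choice in this sketch: a misclassified incident should be noticed, not silently treated as one that needs no alert.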
HoN and OKRs
In [X Company] we use OKRs to steer our products over the medium to long term. OKRs tend to focus on the future - where we want to be in 3 or 6 months.
OKRs contain two specific types of work. The first is work we have thought about, refined, and planned far enough to put into a fairly specific OKR. This is where a squad gets to show off to the rest of the business and to its customers. This is the Esteem layer.
The second is work that the business or our customers have asked for but that we can’t necessarily do just yet. We need a spike to evaluate it or to define it more clearly. As we decided earlier, this sits in the Self-Actualisation layer.
We can’t stop developing features, our customers won’t be happy!
That is partially true. It is important to ensure our product is seen to be improving, but stability and reliability are improvements in themselves. There is a good argument that reliability is the only feature that all products have, and is therefore the only feature that truly matters.
Self Actualisation: 0%
This might seem a little extreme, but if 45% of the available task complexity sits at the lowest level, it should be clear that you should be prioritising those tasks and not be too concerned about future product development work. You might still need to do some feature work, but it’s decidedly less important than fixing bugs or ensuring data integrity.
Over time, assuming you’re doing the right work, the number of available tasks at the lower levels should decrease. This frees up more time further up the stack, allowing you to spend additional time on more interesting work. Eventually the stack turns upside down: almost no time is spent at the base layer, and almost all of your time can be spent on features and on really thinking about the future.
Sprints can be organised in many different ways, but consistently prioritising the more fundamental requirements of a product will set things up for success and free up more time for the interesting work we all came here to do. Ignoring them will lead to problems, take time away from delivering what customers want, and eventually lead to the failure of the product.
This is true even for a potentially great product.
Appendix A - specific work categorisation
All work should exist here. If a type of work doesn’t appear to be listed, please leave a comment to allow discussion on the best placement.
- OKR specified Work
Product Love and Belonging
- Bug fixes
Product Safety Needs
- Vulnerability Management
- Cost Awareness
Product Physiological Needs
- No reliance on other environments
- Leverage Managed Services
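For teams that tag tickets automatically, the categorisation above could be kept as a simple lookup. The sketch below is only an assumption about how that might look: the `WORK_TYPE_LAYER` table and the `classify` function are hypothetical, and OKR-specified work is placed in the Esteem layer following the OKR section of this document.

```python
# Work types from this appendix mapped to their layer (keys are lowercase).
WORK_TYPE_LAYER = {
    # Placed in Esteem following the OKR section of this document.
    "okr specified work": "esteem",
    "bug fixes": "love_belonging",
    "vulnerability management": "safety",
    "cost awareness": "safety",
    "no reliance on other environments": "physiological",
    "leverage managed services": "physiological",
}


def classify(work_type):
    """Return the layer for a work type, or None if it is not listed yet."""
    return WORK_TYPE_LAYER.get(work_type.strip().lower())


print(classify("Bug fixes"))
print(classify("Vulnerability Management"))
print(classify("Team offsite"))  # not listed: needs a discussion first
```

Returning `None` for an unlisted work type mirrors the rule above: anything that is not listed should prompt a comment and a discussion rather than a guess.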