Storage-First Design
What is Storage-First Design?
Storage-first is a method of building resilience and scalability into a system. It's a way to make sure a system doesn't lose data it has accepted, and can always recover from failure. It's also a way to let your system absorb bursts of incoming data, and give itself breathing room to scale up when needed.
It sounds like a pretty basic idea, but it's actually a really powerful way to design systems. By designing your system around storage, you can accept work faster than you can process it, and recover from failure without losing anything. It changes a few common approaches to the way a system works, but in such a small way that it's often overlooked. It can be the difference between a system that can handle a few hundred users and one that can handle millions.
Storage-First vs Process-First
Let's think about a theoretical system with and without storage-first.
The system is a payment step in an ecommerce system. At the point this service is invoked, the user has already added items to their cart, and is ready to check out. The system is responsible for taking the payment details, processing the payment, and then returning a response to the user.
Process First Design
In a process-first approach, we make one call to our back end and attempt to get a response to our user.
The browser makes an HTTP POST request to /process-payment. The service takes the payment details, processes the payment, and then returns a response to the user. If the payment is successful, it returns a success response. If the payment fails, it returns a failure response.
The problem is we don't own the actual card processing step. We have to call out to a third-party service to process the payment. If that service is down, or if there is a network issue, then our request fails. The user gets an error message and has to try again later. We might even lose the payment details if we don't have a way to store them.
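To make the failure mode concrete, here is a minimal sketch of a process-first handler in Python. The names `charge_card`, `ProcessorUnavailable` and `process_payment` are made up for illustration, the status codes are just one reasonable choice, and `charge_card` only simulates an outage rather than calling a real processor.

```python
from typing import Callable, Tuple


class ProcessorUnavailable(Exception):
    """Raised when the (hypothetical) third-party processor cannot be reached."""


def charge_card(details: dict) -> bool:
    # Hypothetical stand-in for the third-party call; it simulates an outage.
    raise ProcessorUnavailable("card processor is down")


def process_payment(
    details: dict, charge: Callable[[dict], bool] = charge_card
) -> Tuple[int, dict]:
    """Process-first handler for POST /process-payment.

    The charge happens inside the request, so the response depends on the
    third party being reachable at this exact moment.
    """
    try:
        approved = charge(details)
    except ProcessorUnavailable:
        # Nothing was stored, so the payment details are gone once we return.
        return 503, {"status": "error", "message": "try again later"}
    if approved:
        return 200, {"status": "success"}
    return 402, {"status": "failed"}
```

Calling `process_payment({"order_id": "o-1", "amount_cents": 1999})` with the simulated outage returns the 503 branch, and the handler has kept no record that the user ever tried to pay.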
Storage-First Design
With storage-first design, we do this in two steps. The browser makes an HTTP POST request to /process-payment. The service takes the payment details and stores them in a database. It then returns a 202 (Accepted) response with a unique ID for the payment. Ideally the unique ID is generated as some kind of hash of the transaction details. This makes idempotency much easier. If an additional request comes in with the same hash, we can just return the same response without having to worry about duplicate payments.
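A rough sketch of the accept step, assuming SQLite as the database and SHA-256 over canonical JSON as the hash. The table layout and the function names (`open_store`, `payment_id`, `accept_payment`) are assumptions for illustration, as is the idea that the details include something like an order ID so that two separate purchases don't hash to the same value.

```python
import hashlib
import json
import sqlite3
from typing import Tuple


def open_store(path: str = ":memory:") -> sqlite3.Connection:
    """Open the database and create the (assumed) payments table if needed."""
    conn = sqlite3.connect(path)
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "id TEXT PRIMARY KEY, details TEXT NOT NULL, status TEXT NOT NULL)"
    )
    return conn


def canonical_json(details: dict) -> str:
    # Sorted keys and fixed separators, so equal details serialize identically.
    return json.dumps(details, sort_keys=True, separators=(",", ":"))


def payment_id(details: dict) -> str:
    """Derive the payment ID as a hash of the transaction details."""
    return hashlib.sha256(canonical_json(details).encode("utf-8")).hexdigest()


def accept_payment(conn: sqlite3.Connection, details: dict) -> Tuple[int, dict]:
    """Storage-first handler for POST /process-payment: store, then answer 202."""
    pid = payment_id(details)
    with conn:  # commits on success, rolls back on error
        # A repeated request has the same ID, so the insert is silently skipped.
        conn.execute(
            "INSERT OR IGNORE INTO payments (id, details, status) "
            "VALUES (?, ?, 'pending')",
            (pid, canonical_json(details)),
        )
    return 202, {"id": pid}
```

Because the ID is the primary key, sending the same details twice leaves a single row behind and both calls get the same ID back.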
Then a separate process runs in the background, takes the payment details from the database, and processes the payment. If the payment is successful, it updates the database with the success status. If the payment fails, it updates the database with the failure status.
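The background step could look something like this. It's a sketch that assumes a SQLite `payments` table with `id`, `details` (JSON) and `status` columns, and `charge` is a hypothetical callable standing in for the real card processor. None of these names come from a particular library.

```python
import json
import sqlite3
from typing import Callable


def create_payments_table(conn: sqlite3.Connection) -> None:
    # Assumed schema: one row per accepted payment.
    conn.execute(
        "CREATE TABLE IF NOT EXISTS payments ("
        "id TEXT PRIMARY KEY, details TEXT NOT NULL, status TEXT NOT NULL)"
    )


def process_pending(
    conn: sqlite3.Connection, charge: Callable[[dict], bool]
) -> int:
    """Run one pass over pending payments and record each outcome.

    Returns the number of payments that were settled in this pass.
    """
    rows = conn.execute(
        "SELECT id, details FROM payments WHERE status = 'pending'"
    ).fetchall()
    for pid, details in rows:
        # If charge() raises, the row is left as 'pending' for a later pass.
        approved = charge(json.loads(details))
        new_status = "success" if approved else "failed"
        with conn:  # commit each result as soon as it is known
            conn.execute(
                "UPDATE payments SET status = ? WHERE id = ?", (new_status, pid)
            )
    return len(rows)
```

In practice this function would be called on a schedule or in a loop. It only ever looks at rows that are still pending, so running it again after everything is settled does nothing.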
The user can then check the status of their payment by making a GET request to /payment-status/{id}. We can have any number of statuses depending on the actual process, but the user will see essentially “Pending”, “Success” or “Failed”. If the user gets a “Pending” response, they can check back later to see if the payment was successful or not. If the user gets a “Failed” response, they can try again without having to worry about duplicate payments.
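Here is one way the status check and the polling loop might fit together. This is a sketch only: the dictionary `store` stands in for the database, `payment_status` stands in for the GET /payment-status/{id} handler, and `poll_payment` plays the part of the front end. The interval and attempt limit are arbitrary defaults.

```python
import time
from typing import Callable, Tuple


def payment_status(store: dict, pid: str) -> Tuple[int, dict]:
    """Handler sketch for GET /payment-status/{id}."""
    record = store.get(pid)
    if record is None:
        return 404, {"error": "unknown payment id"}
    return 200, {"id": pid, "status": record["status"]}


def poll_payment(
    fetch_status: Callable[[], str],
    interval_s: float = 1.0,
    max_attempts: int = 30,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Ask for the status until it is no longer "Pending" or attempts run out."""
    status = "Pending"
    for attempt in range(max_attempts):
        status = fetch_status()
        if status != "Pending":
            return status
        if attempt < max_attempts - 1:
            sleep(interval_s)  # wait before asking again
    return status  # still "Pending": the caller can check back later
```

A caller would wire the two together with something like `poll_payment(lambda: payment_status(store, pid)[1]["status"])`. The `sleep` parameter exists only so the wait can be swapped out in tests.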
On the back end we can have a scalable process that takes not-started payments from the database, and processes them. If the payment processor is down, or if there is a network issue, then the payment will just stay in the database until it can be processed. We can also have a retry mechanism in place to handle transient failures. If our whole system crashes, we can just restart it, and it will pick up where it left off without losing any data.
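The retry behaviour can be sketched as follows, assuming exponential backoff as the retry policy (one common choice, not the only one). `TransientError`, `backoff_seconds` and `run_once` are hypothetical names, and a plain dictionary stands in for the database so the example stays short.

```python
from typing import Callable


class TransientError(Exception):
    """Hypothetical error for outages or network issues that are worth retrying."""


def backoff_seconds(attempts: int, base: float = 2.0, cap: float = 300.0) -> float:
    # 2s, 4s, 8s, ... up to the cap; attempts counts failures so far (>= 1).
    return min(cap, base * 2 ** (attempts - 1))


def run_once(payments: dict, charge: Callable[[dict], bool], now: float) -> None:
    """Try every pending payment that is due; leave the rest untouched."""
    for record in payments.values():
        if record["status"] != "pending" or record["next_attempt_at"] > now:
            continue
        try:
            approved = charge(record["details"])
        except TransientError:
            # Processor down: the payment stays pending and is retried later.
            record["attempts"] += 1
            record["next_attempt_at"] = now + backoff_seconds(record["attempts"])
            continue
        record["status"] = "success" if approved else "failed"
```

Since all of the state lives in storage, a restarted worker just calls `run_once` again and carries on. One thing to watch is a crash between charging and recording the result. Passing the payment ID to the processor as an idempotency key, where the processor supports one, keeps a retry from charging twice.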
The only additional complexity is that we now have a separate process that runs in the background to process the payments, and the front end needs to poll for the status of the payment. Neither of these is complicated or novel, so no team should have concerns about implementing them.
Conclusion
Storage-first design is a powerful way to design systems that can absorb a large volume of requests and recover from failure without losing any data. By designing your system around storage, you can ensure that it recovers from failure, both ours and someone else's. It's easy to implement, and can be the difference between a system that can handle a few hundred users and one that can handle millions. It's a simple change to the way we design our systems, but it can have a huge impact on their resilience and scalability.