This week started off with news about the Amazon AWS DynamoDB outage and Skype’s network-related issues. For organizations that do not run mission-critical applications on Amazon or use Skype for communications, this may not be a big deal. But for companies like Netflix, Buffer (I use this to schedule some of my social media posts), Viber, and the rest who rely on AWS, this constitutes a Severity 1 issue that needs to be resolved quickly to resume normal business operations. And this isn’t the first time this has happened. In 2013, we were unlucky (or lucky enough to learn from the experience) to be on the receiving end of an outage when several of our customers were affected. We really couldn’t do much for them but “wait and see” until the service provider restored the service so we could check the infrastructure.
This is probably one of the main reasons why organizations are still hesitant to move to the cloud. The idea that you have no control over what is happening on the underlying infrastructure makes it hard to trust cloud providers. And when loss of infrastructure equates to loss of revenue, we become hesitant to even consider the cloud.
Now, this isn’t about whether or not the cloud is reliable. It’s about whether or not you have a proper high availability and disaster recovery (HA/DR) strategy. Whether your technology infrastructure relies fully on cloud providers, on a fully on-premises architecture, or on a combination of both, a proper HA/DR strategy can spell the difference between surviving a disaster and going bankrupt. In a previous blog post, I discussed a realistic approach to HA/DR that will not break the bank. I started off by defining business impact analysis and what the loss of a specific business process means in terms of profit and loss.
Netflix has been a popular case study for Amazon on how to leverage AWS for content delivery. As of 2014, Netflix’s annual gross sales were around US$ 5.5B. If we apply the same logic used to calculate losses for the Amazon outage back in 2013, every second that the service is down costs an average of US$ 158.73. Do the math to see how much sales opportunity is lost for every hour that their service is down.
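If you want to actually do the math, here is a minimal back-of-the-envelope sketch in Python. It simply extends the per-second figure above to an hourly figure; swap in the numbers from your own business impact analysis.

```python
# Back-of-the-envelope downtime cost, based on the per-second figure above.
# Replace loss_per_second with the figure from your own business impact analysis.

loss_per_second = 158.73                   # estimated US$ lost per second of downtime
loss_per_hour = loss_per_second * 60 * 60  # 3,600 seconds in an hour

print(f"Estimated loss per hour of downtime: US$ {loss_per_hour:,.2f}")
# Estimated loss per hour of downtime: US$ 571,428.00
```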
Now, you might be thinking, “we’re not as big as Netflix or Amazon.” And that’s exactly why you need to perform your own business impact analysis for your specific business processes. Once you know how much your potential losses will be for every hour of downtime, you can use that as a metric for designing and implementing a proper HA/DR strategy. For those running on Amazon AWS and other cloud providers, that means designing for the failure of a single data center or an entire geographical region. It also means educating your developers, systems engineers, and IT operations folks about designing and managing all aspects of the solution to survive failures. Of course, eliminating the possibility of failure altogether is like hunting for unicorns.
Technology disasters and outages as large as this rarely happen. The reality is, with proper design and implementation, you would have already gotten a return on your investment before something like this happens. But they do happen. We just need to plan and prepare for them properly.
We usually watch movies on Netflix to wrap up our Sundays. When my son turned on the TV and started the Netflix app, I didn’t notice anything out of the ordinary. Here’s how Netflix survived the outage this weekend: they implement their own form of active-active multi-regional resiliency. Their write-up is a good resource for understanding how they do things and what we can adopt in our own environments.
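Netflix’s tech blog post (linked below) goes into the details of how they do this at scale. As a much simpler illustration of the general idea, and definitely not Netflix’s actual implementation, here is a minimal sketch of a client that tries a list of regional endpoints in turn and fails over to the next region when one is unhealthy. The endpoints and path are hypothetical placeholders.

```python
import urllib.request
import urllib.error

# Illustrative sketch only. In an active-active multi-regional design, every
# region can serve traffic, so a client (or a routing layer such as DNS) can
# shift load to a healthy region when another region fails. The endpoints
# below are hypothetical placeholders.
REGIONS = [
    "https://api.us-east-1.example.com",
    "https://api.us-west-2.example.com",
    "https://api.eu-west-1.example.com",
]

def fetch_with_regional_failover(path: str, timeout: float = 2.0) -> bytes:
    """Try each regional endpoint in order and return the first successful response."""
    last_error = None
    for region in REGIONS:
        try:
            with urllib.request.urlopen(region + path, timeout=timeout) as resp:
                return resp.read()
        except (urllib.error.URLError, OSError) as err:
            last_error = err  # region unreachable or unhealthy; try the next one
    raise RuntimeError(f"All regions failed: {last_error}")

# Example usage:
# data = fetch_with_regional_failover("/v1/catalog")
```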
Additional Resources
- Amazon’s AWS DynamoDB experiences outage, affecting Netflix, Reddit, Medium, and more (Update: Fixed)
- The High Cost Of An Amazon Outage
- AWS outage: How Netflix weathered the storm by preparing for the worst
- The Netflix Tech Blog: Active-Active for Multi-Regional Resiliency