A squirrel chewing through wires outside Yahoo’s data center in 2010. A ship dropping anchor on communication lines running underwater. Some odd, hard-to-believe events have caused unplanned downtime. Yet the most common cause of data center outages remains human error. And the more public the use of the facility, the more scrutiny the outage receives.
What happens when some of the most popular sites on the internet, let alone thousands of organizations, rely on this facility? Well, it gets a little bit more attention. That is what happened on February 28, 2017. Amazon Web Services (AWS) experienced a complete outage of its US-EAST-1 S3 storage environment. An engineer doing maintenance work on the S3 storage system entered a command incorrectly and removed too many storage nodes during debugging, which led to a cascading failure that affected the entire US-EAST-1 environment. After just over 24 hours of post-mortem analysis, AWS identified that this human error had caused approximately four hours of downtime.
The Cloud Remains Reliable
It’s at times like this that cloud pundits come out and share their opinions. As we start hearing everything that is wrong with this or that cloud, or even with cloud in general, keep in mind that this could have happened in any data center, whether your own facility or any other provider’s. Nor is AWS’s commitment of 99.95 percent uptime for the year lost yet: 99.95 percent equates to about 4.38 hours of downtime a year. With this outage lasting just under four hours, AWS still has a little under half an hour left in the bank.
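The arithmetic behind that budget is easy to check. The following is a minimal sketch, assuming a 365-day year and rounding the outage to a flat four hours; the function name `downtime_budget_hours` is made up for illustration.

```python
HOURS_PER_YEAR = 365 * 24  # 8,760 hours, assuming a non-leap year


def downtime_budget_hours(uptime_percent, period_hours=HOURS_PER_YEAR):
    """Return the hours of downtime allowed by an uptime commitment."""
    return period_hours * (1 - uptime_percent / 100)


budget = downtime_budget_hours(99.95)
outage_hours = 4.0  # assumption: the outage rounded to four hours
remaining_minutes = (budget - outage_hours) * 60

print(f"Allowed downtime per year: {budget:.2f} hours")
print(f"Left after the outage: {remaining_minutes:.0f} minutes")
```

At a flat four hours the remainder comes to about 23 minutes; an outage a few minutes shorter leaves a correspondingly larger cushion.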
Murphy’s Law Is Still Applicable
We have operated under the adage of Murphy’s Law for so long in IT, so why do we think we can operate differently just because the physical servers aren’t ours anymore? Providers have built an architecture within each zone that has great redundancy and integrity, but they have built more than one of these zones for a reason.
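One way to see that reason is to compare the availability of a single zone with that of two zones running side by side. The sketch below assumes that failures are independent, which is optimistic (an outage that takes out a whole region affects every zone in it), and it uses 99.95 percent purely as an example figure. `combined_availability` is an illustrative helper, not a provider API.

```python
HOURS_PER_YEAR = 365 * 24  # assuming a non-leap year


def combined_availability(*availabilities):
    """Availability of redundant deployments where any one being up is enough.

    Assumes failures are independent, which is optimistic in practice.
    """
    all_down = 1.0
    for availability in availabilities:
        all_down *= 1 - availability  # chance that this deployment is down as well
    return 1 - all_down


single = 0.9995  # example figure only
dual = combined_availability(single, single)

print(f"One zone:  {(1 - single) * HOURS_PER_YEAR:.2f} hours down per year")
print(f"Two zones: {(1 - dual) * HOURS_PER_YEAR * 3600:.1f} seconds down per year")
```

Under the independence assumption, the expected downtime drops from hours to seconds. Correlated failures erode that gain, which is part of the case for spreading across regions or providers rather than zones alone.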
As you build your applications and identify internal SLAs, you need to build a strategy to meet and exceed these expectations. This may mean leveraging multiple regions within a provider or possibly multiple providers. Just as we didn’t simply build a single data center and house our applications there in the past, we can’t take for granted that the magic of cloud will own the responsibility of keeping our applications up all the time when we leverage only one availability zone. We must use everything we have learned in the past and continue to build better practices as we charge forward into the next generation of platforms for our environment. Keep this in mind as you think about what your strategy is from here.
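As a rough illustration of leveraging more than one region, the sketch below tries a primary region first and falls back to a secondary one when the call fails. The region names, the `fetch_from_region` callable, and the `RegionUnavailable` exception are all hypothetical placeholders rather than any provider's API. A real deployment would also need data replication, health checks, and traffic routing, none of which is shown here.

```python
class RegionUnavailable(Exception):
    """Hypothetical error raised when a region cannot serve a request."""


def fetch_with_failover(key, regions, fetch_from_region):
    """Try each region in order; return (region, value) from the first that answers.

    `fetch_from_region(region, key)` is a caller-supplied function (an assumption
    of this sketch) that returns the value or raises RegionUnavailable.
    """
    errors = []
    for region in regions:
        try:
            return region, fetch_from_region(region, key)
        except RegionUnavailable as exc:
            errors.append(f"{region}: {exc}")  # remember why this region failed
    raise RegionUnavailable("all regions failed: " + "; ".join(errors))


# Simulated backend: the primary region is down, the secondary is healthy.
def fake_fetch(region, key):
    if region == "primary-region":
        raise RegionUnavailable("storage service not responding")
    return f"contents of {key}"


region, value = fetch_with_failover(
    "report.csv", ["primary-region", "secondary-region"], fake_fetch
)
print(f"Served from {region}: {value}")
```

The ordering of `regions` encodes the preference, so the same function covers a second provider as easily as a second region, provided the data has been replicated there.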