Major hyperscale cloud providers such as Amazon Web Services (AWS) and Microsoft Azure advertise availability as a number of “nines” (e.g., 99.99 percent uptime per year). However, because they are complex systems, they cannot guarantee perfection. This fact was highlighted by an outage on Feb. 28, 2017, attributable to human error, which affected multiple availability zones in a single region and made the services of many companies relying on AWS in that region unavailable. Although hyperscale clouds are not perfect, hyperscale cloud levels of availability are simply not achievable without using them, and these outages are high-profile only because many highly visible companies can be affected simultaneously. While you can design for high availability (HA) between regions, failing over between regions costs more and is more complex than failing over between availability zones within a region. The hyperscale cloud remains the best place for your company’s workloads.
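To make the “nines” concrete, the short Python sketch below converts an availability percentage into the downtime it permits per year. The function name and the sample percentages are illustrative assumptions, not figures from any provider’s service-level agreement.

```python
# Convert an availability percentage into the downtime it allows per year.
MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes, ignoring leap years


def allowed_downtime_minutes(availability_percent: float) -> float:
    """Return the minutes of downtime per year permitted at this availability."""
    return MINUTES_PER_YEAR * (1 - availability_percent / 100)


if __name__ == "__main__":
    # Sample values chosen for illustration only.
    for pct in (99.9, 99.99, 99.999):
        minutes = allowed_downtime_minutes(pct)
        print(f"{pct}% -> {minutes:.1f} minutes per year")
```

Each additional nine cuts the permitted downtime by a factor of 10: three nines allows roughly 8.8 hours per year, while five nines allows only about 5.3 minutes.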
Cloud providers such as AWS and Azure go to great lengths to provide HA options for customers. Both organize consumable services into geographical regions that comprise multiple data centers, or “availability zones.” The infrastructure in each region is independent of the others, so an issue that cripples one or more availability zones within a region does not affect other regions. Each provider offers mechanisms for replicating storage within a region or between regions. AWS provides simple mechanisms for moving network addresses (IP addresses) and networks between availability zones, and Azure has options to accomplish similar functions, albeit with a bit more complexity. Both provide HA, disaster recovery (DR) and backup services beyond what the design of the infrastructure, zones and regions already offers. Maintenance is not performed simultaneously across regions, so that maintenance in one region does not affect another.
On Feb. 28, 2017, an engineer working in AWS followed established procedures to remove some redundantly connected (i.e., their removal from operation would not cause an outage) Simple Storage Service (S3) servers from production. Amazon S3 provides reliable, fast storage capacity and simple mechanisms for accessing data for a variety of Amazon-hosted and non-Amazon-hosted services.
Apparently, a typographical error in a command removed many more servers than planned. It appears the command had no limits or warnings about taking that many servers down at once, or perhaps the engineer ignored them. This led to a cascading effect that essentially took down the S3 service in the entire Northern Virginia (US-EAST-1) region. It affected anything that relied upon this storage, including content and images for many websites. Embarrassingly, AWS’s status page was also a victim and could not correctly report the status of S3 for a good portion of the outage.
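The kind of limit that appears to have been missing can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not AWS’s actual tooling: the function name, the exception class and both thresholds (at most 5 percent of the fleet per command, at least 80 percent left in service) are hypothetical.

```python
class CapacityRemovalError(Exception):
    """Raised when a removal request would take too much capacity offline."""


def check_removal(total_servers: int, in_service: int, to_remove: int,
                  max_fraction_per_step: float = 0.05,
                  min_in_service_fraction: float = 0.8) -> int:
    """Validate a removal request and return the servers left in service.

    Both thresholds are hypothetical values chosen for illustration.
    """
    if to_remove <= 0:
        raise ValueError("to_remove must be positive")
    if to_remove > in_service:
        raise CapacityRemovalError("cannot remove more servers than are in service")

    # Limit how much capacity a single command may remove.
    step_limit = total_servers * max_fraction_per_step
    if to_remove > step_limit:
        raise CapacityRemovalError(
            f"refusing to remove {to_remove} servers at once (limit {step_limit:.0f})")

    # Refuse any request that would drop the fleet below its safe minimum.
    remaining = in_service - to_remove
    floor = total_servers * min_in_service_fraction
    if remaining < floor:
        raise CapacityRemovalError(
            f"only {remaining} servers would remain (minimum {floor:.0f})")

    return remaining
```

With these assumed thresholds, a request to remove 20 of 1,000 servers passes, while a mistyped request for 500 is rejected before anything is taken offline.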
Amazon S3 Buckets SOURCE: Amazon
Balancing Cost, Complexity and Necessity
It is easy to throw stones at the human error that led to this outage and the missing protocols that could have prevented it, but Amazon will learn from this mistake and add to the already intelligent automation, enormous lists of procedures, and safeguards that will prevent this from recurring. Mistakes like this or worse occur routinely at the on-premises data centers of companies across the country, but the scope is much smaller and public exposure to the incidents is either zero or limited to a very small set of people or companies. Further, the architecture of smaller-scale systems would make it difficult to recover in a timely fashion from mistakes such as this, and would probably lead to data loss.
Maintaining a public-facing service with enough capacity, data centers, security, procedures and human resources to achieve very high levels of availability and reliability is beyond the means of all but the largest companies. Amazon’s S3 service touts 99.999999999 percent data durability (a measure of how unlikely it is that stored data will be lost or corrupted), which is an impressive achievement at the scale of S3. This is done with specialized software and commodity (generic) hardware to keep it affordable, meaning Amazon cannot be held hostage by vendor service and warranty plans that would lead to higher costs and, eventually, higher prices for the market.
Current Amazon Web Services SOURCE: Amazon
The outage was restricted to a single region within AWS. An observer could ask why the organizations affected by the outage didn’t design for multi-region, or even multi-cloud, availability. There are myriad reasons, but they all come down to a balance between cost, complexity and necessity.
Dealing with Regional Differences
Most easily configured services running on the public hyperscale cloud providers are based within regions. When you configure storage, networking or compute resources in the providers’ portals, the redundancy features you can select typically span the availability zones within a region. This covers the vast majority of hardware and configuration faults. Some services have built-in regional replication. However, applications have to be built to deal with regional differences. An example of this is the application architecture of Netflix. The company has spent a great deal of time and resources building an active-active architecture that allows it to run in multiple AWS regions at once. This requires difficult programming and design work to produce a stateless application that does not tie users to a single region, or to the services within its availability zones, if that region fails. Accomplishing this took considerable development effort and came at a significant cost.
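The core idea of a stateless, active-active design can be shown with a small Python sketch. It is a simplified illustration, not Netflix’s implementation: the Region class, the pick_region and handle_request functions and the health flag are all hypothetical, and the region names are used only as examples.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Region:
    name: str
    healthy: bool = True


def pick_region(regions: List[Region], preferred: str) -> Region:
    """Return the preferred region if it is healthy, else the first healthy one."""
    for region in regions:
        if region.name == preferred and region.healthy:
            return region
    for region in regions:
        if region.healthy:
            return region
    raise RuntimeError("no healthy region available")


def handle_request(regions: List[Region], preferred: str, session_token: str) -> str:
    """Serve a request from whichever region is available.

    The session state travels with the request (here, a token), so no
    region has to remember the user between requests.
    """
    region = pick_region(regions, preferred)
    return f"{region.name} served request for {session_token}"


if __name__ == "__main__":
    regions = [Region("us-east-1"), Region("us-west-2")]
    print(handle_request(regions, "us-east-1", "user-42"))
    regions[0].healthy = False  # simulate a regional outage
    print(handle_request(regions, "us-east-1", "user-42"))
```

Because nothing about the user is stored in the region itself, the second request succeeds from another region without the user noticing. The hard part in practice, which this sketch leaves out, is keeping the underlying data replicated and consistent across regions.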
There are, of course, less involved and cheaper alternatives to prepare companies for multi-region or even multi-cloud DR or HA. Microsoft’s Azure Site Recovery (ASR) can be configured to replicate Elastic Compute Cloud (EC2) workloads and data from AWS to Azure. While this is not an active-active configuration and is not suitable for something as massive as Netflix, it can provide a DR option for an extended AWS outage.
Additional costs are introduced when running in multiple regions simultaneously, or when maintaining HA across regions or clouds. Moving data within a region costs less than moving it between regions. Networking configurations can’t float easily between regions. More manual maintenance is required to maintain a presence in multiple regions. For these reasons alone, a company must weigh weathering a 3-hour outage every few years against the expense and complexity of multi-regional or multi-cloud HA. Over time, as the hyperscale clouds continue to grow, these barriers should fall, and multi-regional availability will become less expensive and more of a standard configuration option.
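That trade-off can be framed as simple arithmetic. The Python sketch below compares the expected yearly cost of riding out occasional outages with the yearly cost of running multi-region HA. Every number in the example is a hypothetical placeholder, and the function names are assumptions. A real assessment would also weigh reputational damage, contractual penalties and engineering effort.

```python
def expected_annual_outage_cost(outage_hours: float,
                                years_between_outages: float,
                                loss_per_hour: float) -> float:
    """Average yearly loss from outages of a given length and frequency."""
    return outage_hours / years_between_outages * loss_per_hour


def multi_region_pays_off(outage_hours: float,
                          years_between_outages: float,
                          loss_per_hour: float,
                          annual_multi_region_cost: float) -> bool:
    """Return True if expected outage losses exceed the cost of multi-region HA."""
    expected_loss = expected_annual_outage_cost(
        outage_hours, years_between_outages, loss_per_hour)
    return expected_loss > annual_multi_region_cost


if __name__ == "__main__":
    # Hypothetical inputs: a 3-hour outage every 3 years, $50,000 lost per hour,
    # and $120,000 per year in extra multi-region costs.
    loss = expected_annual_outage_cost(3, 3, 50_000)
    print(f"Expected annual outage cost: ${loss:,.0f}")
    print("Multi-region pays off:", multi_region_pays_off(3, 3, 50_000, 120_000))
```

With these placeholder figures, the expected loss is $50,000 per year, which is less than the assumed multi-region cost, so weathering the outage would be the cheaper choice. A company that loses far more per hour of downtime would reach the opposite conclusion.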
Netflix Multi-Region High Availability Architecture SOURCE: Netflix
Despite past and future outages of hyperscale cloud services, they remain the best place for your organization’s workloads. It is important to understand the details of the resiliency offerings of hyperscale clouds so your business can plan accordingly. Experts from IT service providers such as CDW can help you navigate the terminology, understand the strengths and weaknesses of each hyperscale cloud provider’s offerings, and maintain your company’s hyperscale cloud-dependent services.