Editor’s note: To accurately reflect the incident since Amazon published its postmortem statement to its customers, this blog post has been amended as of 6:00 p.m. ET on April 29, 2011.
Several days after an isolated network failure within Amazon’s Elastic Compute Cloud (EC2) cascaded into its most significant service outage to date, it’s clear that the event will raise a series of critical questions about cloud computing that will echo across conference rooms around the globe in the coming days, weeks and months.
Who should we blame?
It’s human nature to ask this question first, even though it is a completely illogical starting point for analysis. Still, most of the media coverage of the incident has focused on assigning blame. Justin Santa Barbara of FathomDB, for example, said that “the blame here lies squarely with [Amazon Web Services], who ‘guaranteed’ a contract they then broke.”
Others, like Alan Perkins from Cloud81, counter: “Just because systems are moved to the cloud doesn’t mitigate the responsibility to ensure mission critical outages are mitigated. If a business has a use-case that cannot tolerate down time then that business needs to architect their solution in a way that prevents downtime.” After all, the availability limitations of Amazon’s Elastic Block Store (EBS) have been well documented for at least a year.
The trouble is…they’re both right.
Clearly, Amazon has to assume responsibility for the failure of its service. Yet several SaaS providers that rely on the US East region were “not affected by the AWS issues” because they designed systems that explicitly account for the fallacies of distributed computing and Brewer’s CAP Theorem.
Yes, I know this is dense, technical content. But, as the era of cloud computing and big data takes hold, these are quickly becoming fundamental concepts.
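To make that a little less abstract, here is a minimal sketch of what "accounting for" an unreliable network can look like in code. It is not how any of the providers mentioned above actually built their systems; the function names, the two zone stand-ins, and the returned value are all hypothetical. The idea is simply that the caller assumes any single endpoint can fail and keeps a second one, in a separate failure domain, ready to answer.

```python
from typing import Callable, Sequence


class AllReplicasFailed(Exception):
    """Raised when no replica could serve the request."""


def read_with_failover(replicas: Sequence[Callable[[], str]]) -> str:
    """Try each replica in order and return the first successful answer."""
    errors = []
    for fetch in replicas:
        try:
            return fetch()
        except ConnectionError as exc:
            # First fallacy of distributed computing: "the network is reliable".
            errors.append(exc)
    raise AllReplicasFailed(f"{len(errors)} replica(s) failed")


# Hypothetical stand-ins for storage endpoints in two separate zones.
def zone_a() -> str:
    raise ConnectionError("zone A is unreachable")


def zone_b() -> str:
    return "order-42"


result = read_with_failover([zone_a, zone_b])
print(result)
```

Note the trade-off this sketch quietly makes: the second replica may hold slightly older data than the first, so the caller is choosing availability over strict consistency. That is exactly the kind of decision the CAP Theorem says you cannot avoid.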
The ability of these organizations to withstand “Amazonageddon” by applying engineering concepts (that were established between ten and twenty years ago, no less) perfectly illustrates Perkins’ point. The proper alignment of business and IT can lead to technical designs that adequately address the cost of downtime. This idea falls right in line with Ken Burns’ reflections, from last year’s Gartner Symposium & ITxpo, on the increasing pressure on CIOs to balance long-term risk with short-term ROI.
Okay, can we put away the pointy fingers now?
Is it over?
Yes and no. Amazon restored availability of its EBS service within a few hours. However, it took nearly two days for the backlogged workload to clear. By now, normal operations should be restored for nearly all affected businesses, albeit not as quickly as they would have liked.
What’s the opposite of eventual consistency: immediate loss? Unfortunately, as I write this, there are still a fair number of companies, like milesplit.com, that are working through data integrity issues created by the EC2 failure. That’s a serious problem. A large percentage of companies that experience a significant data loss go out of business, which is why other companies, like Hyland, offer data protection services.
Will the EC2 outage kill cloud computing?
I can’t predict the future, but I feel safe asserting that it shouldn’t.
Service failures aren’t a new dynamic in corporate IT. All networks, including corporate data centers, clouds, and the Internet, will experience failure when measured over a sufficient period of time. The question is not “if” a given data center will fail. The question is “what impact will the failure have on my business when it occurs?”
This is a “big B” Business question that largely revolves around when the failure occurs and how long it lasts. Technology can only be effectively applied to meet the business’s needs once these questions have been adequately addressed. If the need for high availability is understated, service interruptions are likely to result in unacceptable losses. If the need is overstated, valuable resources are likely to be wasted on systems that have been over-engineered to achieve levels of availability that are ultimately excessive.
A more useful question to ask is “How will the EC2 outage change the way that cloud computing is utilized?” My hope is that it encourages administrators, developers, and CIOs to see private data center and public cloud deployment as a spectrum, instead of an all-or-nothing decision point. Most corporate IT departments are too quick to buy into a never-ending stream of so-called silver bullets and best practices. A healthy dose of nuance and best thinking is badly needed.
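The understated-versus-overstated trade-off is easier to see with numbers. The sketch below is back-of-the-envelope arithmetic, not a sizing method: the availability figures and the hourly cost of downtime are invented for illustration, and the redundancy formula assumes the two deployments fail independently, which real outages do not always respect.

```python
HOURS_PER_YEAR = 24 * 365


def expected_downtime_hours(availability: float) -> float:
    """Expected hours of downtime per year at a given availability (0.0 to 1.0)."""
    return (1.0 - availability) * HOURS_PER_YEAR


def combined_availability(availability: float, copies: int) -> float:
    """Availability of `copies` redundant deployments.

    Assumption: failures are independent, so the service is down only
    when every copy is down at the same time.
    """
    return 1.0 - (1.0 - availability) ** copies


def annual_downtime_cost(availability: float, cost_per_hour: float) -> float:
    """Expected yearly loss, given a business-supplied cost per hour of downtime."""
    return expected_downtime_hours(availability) * cost_per_hour


# Hypothetical inputs: 99.9% availability, $5,000 lost per hour of downtime.
single = annual_downtime_cost(0.999, cost_per_hour=5_000)
redundant = annual_downtime_cost(combined_availability(0.999, 2), cost_per_hour=5_000)

print(f"single deployment:    ${single:,.0f} per year")
print(f"redundant deployment: ${redundant:,.0f} per year")
print(f"most you should pay for the second copy: ${single - redundant:,.0f} per year")
```

At 99.9% availability the expected downtime is about 8.76 hours a year. Whether a second deployment is worth it depends entirely on the cost-per-hour figure, and that number has to come from the business, not from IT.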
At least Amazon’s SLA will provide some financial relief for the organizations impacted by the outage, right?
Amaz-ingly (sorry, I just love a good pun), it appears that this outage did not violate any of EC2’s SLA provisions. In fact, the EBS service that was at the root of the failure doesn’t appear to be covered by any SLA whatsoever.
I am happy to report that Amazon has decided to voluntarily grant a 10-day credit to all customers who were using the affected hosting services. Still, as Phil Wainewright points out, there is an important lesson here. First, define the business side’s tolerance for availability failures; then work with your cloud provider to define an SLA structure that is aligned with those business needs. Service credits should be the result of a contractual agreement, not a discretionary choice.
And that’s really the point, right? This whole incident is just that – a lesson. Organizations should take away that while they’re dependent on the cloud provider for some things, no SLA will fully account for the true cost of downtime or data loss. After all, the reality is that the financial remedies will be limited to the hosting fees paid to your cloud provider. Additional mechanisms, like business insurance, may be needed to adequately mitigate this risk.
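A quick worked example shows how wide that gap can be. The 10-day credit comes from Amazon's announcement; everything else here (the monthly fee, the hourly loss, and the 48 hours of disruption) is a made-up figure for a hypothetical customer, and the function names are mine.

```python
def service_credit(monthly_fee: float, credit_days: int, days_in_month: int = 30) -> float:
    """Credit for `credit_days` of service, never more than the fee actually paid."""
    return min(monthly_fee, monthly_fee * credit_days / days_in_month)


def uncovered_loss(downtime_hours: float, loss_per_hour: float, credit: float) -> float:
    """Business loss left over after the provider's credit is applied."""
    return max(0.0, downtime_hours * loss_per_hour - credit)


# Hypothetical customer: $3,000 per month in hosting fees, $2,000 lost per hour,
# and 48 hours of disruption.
credit = service_credit(monthly_fee=3_000, credit_days=10)
gap = uncovered_loss(downtime_hours=48, loss_per_hour=2_000, credit=credit)

print(f"credit received: ${credit:,.0f}")
print(f"loss not covered: ${gap:,.0f}")
```

With these inputs the credit is $1,000 against a $96,000 loss, which leaves $95,000 uncovered. Change the numbers however you like; as long as the remedy is capped at hosting fees, it will rarely come close to the real cost of an outage.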
Stay tuned for my next post, where I’ll cover more lessons learned that organizations can apply from this situation.