In my last post, we explored some of the conditions that caused Amazon’s cloud to collapse. Today, I’d like to broaden the discussion with the intent of highlighting some important lessons that buyers can apply when evaluating cloud providers. I’ll also reflect a bit on how this experience has impacted Hyland’s SaaS platform, OnBase OnLine.
Can this IaaS, PaaS, SaaS stuff I keep hearing people talk about protect me from future service failures in the cloud?
Actually, yes, it just might.
As Ron McClellan has previously evangelized, the cloud is part of a larger spectrum of deployment methodologies and customers should be free to choose the deployment option that best aligns with their organizational needs. That means rejecting “all or nothing” thinking by considering all the available options. This even includes moving back and forth between the extremes or building hybrid systems that comprise both on-premises and hosted components.
Here’s some pragmatic advice. If the technology for a project has been selected, but the business processes and requirements are still largely undefined, you’re headed for trouble. Unfortunately, in my experience, this happens much too frequently in organizations of all sizes. No matter the deployment option, you should define your business processes first and then select the technologies – and the deployment option – that best enable those processes.
Within the cloud, deployment options can be further broken down into Infrastructure, Platform, and Software as a Service. My advice is to make sure you thoroughly understand the obligations each of these places on your organization. Here’s a brief primer.
Amazon’s EC2 belongs within the IaaS category. This service (and IaaS overall) places the largest burden on the end user – as the name implies, the hosting provider is only taking responsibility for delivering the “Infrastructure.” Performing backups, installing software, and patching operating systems are almost entirely the customer’s responsibility.
Furthermore, services like EC2, by definition, cannot offer an SLA that addresses the availability of a specific business function or application. Clearly, many of Amazon’s customers were either unaware of these responsibilities or failed to give them adequate attention. The benefit of the IaaS model is that it offers a high level of compatibility with legacy applications and gives the end user more control.
Here’s an example we can all relate to. If EC2 were a swimming pool contractor, they would dig a rectangular hole in your backyard and give you most of the tools needed to install the pool. But you’d be expected to design the pool, buy and pour the concrete, buy and install the liner, buy and install the pump, and then fill it with water.
PaaS clouds, like Microsoft Azure and Google App Engine, assume responsibility for both the infrastructure and a set of APIs that run on top of that infrastructure. Within the context of the recent EC2 failure, this represents only an incremental improvement over IaaS. Each customer is still responsible for designing, authoring, testing, deploying, administering, and monitoring applications that correctly utilize these APIs to meet their underlying business needs.
These providers still can’t offer SLAs that are defined in meaningful business terms because they are intentionally limited to delivering technology rather than the ability to complete a given business process. Perhaps the largest liability is that switching from one PaaS platform to another, or spreading load over multiple PaaS platforms, can be very expensive. The system is, by definition, tightly coupled to the platform. Conversely, the primary benefit of the PaaS model is that it forces customers to explicitly design for failure.
Back to the example. If Azure were a swimming pool contractor, they would dig an irregularly shaped hole in your backyard and pour the concrete. But you’d be expected to design a pool that fits into their concrete mold, which may not be an easy task. You’d also be expected to manufacture and install your own liner, manufacture and install your own pump, and then fill the pool with water. If you ever decide to relocate, you’ll have to buy a new pool and move all of the water yourself.
SaaS providers offer the greatest potential coverage regarding data protection, disaster recovery, and business continuity – the vendors assume responsibility for delivering specific business capabilities, not just technical possibilities. OnBase OnLine, for example, offers tiered SLAs that are tied to the ability to complete specific actions such as “retrieve documents” or “import documents.” Similarly, Hyland commits to providing an array of overlapping data protection services that include backups, document-level data validation and periodic disaster recovery planning tests.
Although our customers must still define their own business continuity strategies to account for compensating controls that are located outside of Hyland’s data centers, this is a significantly smaller burden that is consistent with the customer’s own core competencies.
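To make the idea of an SLA tied to business actions a little more concrete, here is a minimal sketch of how availability could be measured per action rather than per server. The action names come from the examples above; the probe results, the target percentages, and the function names are all invented for illustration and do not describe how OnBase OnLine actually measures or tiers its SLAs.

```python
from collections import defaultdict

# Hypothetical probe log: (business action, whether the attempt succeeded).
# The results are made up for this sketch.
PROBES = [
    ("retrieve documents", True),
    ("retrieve documents", True),
    ("retrieve documents", True),
    ("retrieve documents", False),
    ("import documents", True),
    ("import documents", True),
]

# Assumed per-action targets, not Hyland's actual SLA tiers.
TARGETS = {"retrieve documents": 0.99, "import documents": 0.95}


def availability_by_action(probes):
    """Return the fraction of successful probes for each business action."""
    totals = defaultdict(int)
    successes = defaultdict(int)
    for action, ok in probes:
        totals[action] += 1
        successes[action] += int(ok)
    return {action: successes[action] / totals[action] for action in totals}


def sla_report(probes, targets):
    """Map each measured action to (availability, whether its target was met)."""
    measured = availability_by_action(probes)
    return {
        action: (measured[action], measured[action] >= targets[action])
        for action in targets
        if action in measured
    }


report = sla_report(PROBES, TARGETS)
for action, (availability, met) in report.items():
    print(f"{action}: {availability:.0%} available, target met: {met}")
```

The point of the sketch is the unit of measurement: every number is attached to something a user is trying to do, which is a commitment an infrastructure-only provider is not in a position to make.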
If OnBase OnLine were a swimming pool contractor, we would ask you to select from several hundred pool designs, allowing you to mix and match until you had a custom solution that fit your needs. We’d take care of almost the entire installation. Your only responsibility would be to fill the pool with water (AKA documents, metadata and customizations). Oh yeah…and if you later changed your mind, we’d help you move the entire pool to whatever location you preferred without spilling a single drop of the water.
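For readers who prefer a table to a swimming pool, the primer above can be sketched as a small lookup of who carries each duty under each model. This is a simplification under stated assumptions: the duty names and the coarse "provider"/"customer" labels are mine, and the post does not say who handles data protection under PaaS, so that entry is a guess.

```python
# Who carries each duty under each service model, following the primer above.
# Duty names and labels are illustrative, not an industry standard.
RESPONSIBILITIES = {
    "IaaS": {
        "infrastructure": "provider",
        "platform software and APIs": "customer",
        "application": "customer",
        "data protection": "customer",
        "business continuity strategy": "customer",
    },
    "PaaS": {
        "infrastructure": "provider",
        "platform software and APIs": "provider",
        "application": "customer",
        # Assumption: not stated in the post; treated as part of
        # administering the application.
        "data protection": "customer",
        "business continuity strategy": "customer",
    },
    "SaaS": {
        "infrastructure": "provider",
        "platform software and APIs": "provider",
        "application": "provider",
        "data protection": "provider",
        "business continuity strategy": "customer",
    },
}


def customer_duties(model):
    """Return the duties the customer still owns under the given model, sorted."""
    duties = RESPONSIBILITIES[model]
    return sorted(duty for duty, owner in duties.items() if owner == "customer")


for model in RESPONSIBILITIES:
    print(model, "->", ", ".join(customer_duties(model)))
```

Notice that the business continuity strategy stays with the customer in every row. The list gets shorter as you move toward SaaS, but it never becomes empty.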
From a SaaS perspective, what has this outage taught us?
This is a great question.
From my perspective, this service failure has reinforced the need to continue to educate our employees, partners and end users so that they understand this overarching necessity for any cloud solution:
Disaster recovery and business continuity planning is a business function requiring cooperation across organizational boundaries.
The Amazon situation also underscores the need for us to double down on our commitment to transparency. We already have several active initiatives that will extend our leadership in these areas. I look forward to discussing them within this forum at the appropriate time.
Finally, I consider the specific nature of how the EBS service failed to be a validation of many difficult decisions we have made within the past 11 years. Those decisions were strongly influenced by the recognition that platform, temporal and spatial coupling are top-level concerns in a shared hosting environment.
Will OnBase OnLine ever experience this type of service failure?
I’m not going to spend my time convincing you that failure is an unavoidable aspect of any distributed system, only to claim that this could never happen to our own SaaS cloud. I’m also wary of using this forum to advertise Hyland’s services. But it’s a valid question that deserves an answer.
The most direct answer is “we have.” In 2004, four years after Hyland first began offering SaaS, our primary datacenter suffered a direct hit from a tornado.
This was a disaster in every sense of the word. It ultimately led to the demolition of a building that housed roughly half of our provider’s data center floor space. But as a result of some excellent business and technical decisions, as well as a healthy dose of good luck, our customers never experienced any downtime related to this incident.
Having reflected on the Amazon incident in these blog posts, and in a Computerworld article (“Amazon cloud outage was triggered by configuration error”), I see a slew of other questions I could answer on the topics of business continuity and hosting services. To whet your appetite, here is a list of questions I will attempt to answer in future posts:
- Can data replication guarantee data integrity?
- Can an ECM system be “highly available” if each release isn’t backward compatible?
- Does multi-tenancy actually provide business value?
- Does an elastic cloud service make managing a corporate budget easier?
- What is the difference between high availability, disaster recovery and business continuity?