CIOs Moving to the Cloud: The buck still stops with you
Amazon Web Services has been going through a much-publicised outage that has, by all appearances, lasted more than 12 hours. A range of services, including Hootsuite, Reddit, Heroku, Foursquare and Quora, have faced major disruptions.
What is interesting is how those services have positioned the outage: many have said EC2 is great, but that they are having a bit of a problem at the moment. These providers seem to be taking the view that “whew, glad we outsourced our stuff so it is clear this is not OUR fault, and we can point to other vendors to prove it wasn’t us – just imagine if we had done this on our own servers and this happened; we would have looked much worse!”
Moving systems to the cloud does not remove your responsibility to mitigate mission-critical outages. If a business has a use-case that cannot tolerate downtime, then that business needs to architect its solution in a way that prevents downtime. Cost tradeoffs are always an issue, but if something goes wrong and the cost of that failure is too high, then perhaps the service isn’t really feasible.
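To make that tradeoff concrete, here is a minimal sketch of the downtime arithmetic behind common availability targets. The specific targets are illustrative; only the 12-hour outage figure comes from the post.

```python
# Hours of downtime per year implied by an availability target.
# The targets below are illustrative examples, not anyone's actual SLA.

HOURS_PER_YEAR = 365 * 24  # 8760

def allowed_downtime_hours(availability: float) -> float:
    """Annual downtime budget (in hours) for an availability target in [0, 1]."""
    return HOURS_PER_YEAR * (1 - availability)

for target in (0.99, 0.999, 0.9999):
    print(f"{target:.2%} availability -> "
          f"{allowed_downtime_hours(target):.2f} hours/year of downtime")
```

A single 12-hour outage already blows the budget of a “three nines” (99.9%) target, which allows only about 8.76 hours per year – which is why a promise like that demands redundancy, not just a hosting choice.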
Imagine an airline providing a service where they cut costs on safety in order to offer a cheap service… Doesn’t bear thinking about. Imagine that the airline outsourced their safety inspections to a third party and then washed their hands of responsibility in the event of a “downtime”. No-one would buy that.
The whole point of the cloud is that it frees your thinking from dependence on a single provider. Even if you stick with an Amazon-only solution, or a Microsoft, Google, Salesforce, Rackspace or whatever solution, you still need to architect things in a way that allows you to accept the consequences of any flaws, no matter how they are caused.
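One minimal sketch of that kind of provider-agnostic thinking: keep an ordered list of endpoints and fall through to the next when one fails. The endpoint URLs and the fetch callable here are hypothetical placeholders, not any real provider’s API.

```python
# A sketch of simple failover across providers, assuming a hypothetical
# two-provider setup. Try each endpoint in order; serve from the first
# one that answers.

ENDPOINTS = [
    "https://app.primary-cloud.example/api",    # preferred provider
    "https://app.secondary-cloud.example/api",  # fallback provider
]

def fetch_with_failover(endpoints, fetch):
    """Call fetch(url) on each endpoint in order and return the first result.

    Raises RuntimeError, carrying the last error, only if every endpoint fails.
    """
    last_error = None
    for url in endpoints:
        try:
            return fetch(url)
        except Exception as err:
            last_error = err  # record the failure and try the next endpoint
    raise RuntimeError(f"all endpoints failed: {last_error}")

# Demo: the primary is down (as with EC2 here), so the fallback answers.
def demo_fetch(url):
    if "primary" in url:
        raise ConnectionError("provider outage")
    return "200 OK"

print(fetch_with_failover(ENDPOINTS, demo_fetch))  # -> 200 OK
```

Real deployments would add health checks, timeouts and DNS-level switching (which is roughly what management layers like RightScale automate), but the architectural point is the same: the failover decision lives in your design, not in any one provider’s hands.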
After all, you are the service provider to your customer base – how you decide to deliver that is up to you.
A lot of people are learning a very hard lesson at the moment – there are good ways and bad ways of doing things. For some, a 12 hour outage is hardly a problem, but for others it can ruin lives.
This is where something like http://rightscale.com would come in very handy.
Alan, Yet another insightful blog!
For mine, I think the one thing that is missing is the risk management around moving to the cloud – blindly accepting that the provider will never have an outage is wrong.
It is the responsibility of any CIO/CTO/business that adopts any solution to ensure the provider (whether cloud or locally hosted) has the processes and controls in place for when an outage happens.
But it doesn’t stop there – as you have rightly pointed out, we all need to ensure we have a BCP/DRP in place should this happen. It does beg the question (as service providers): I wonder what BCP/DRP the likes of Heroku and Quora had in place, whether their business reputation has suffered because of the outage, and how they will now address it.
Having worked at a cloud provider (netdocuments) for a few years, I understand how sensitive our customers are to uptime and reliability. We manage our own data centers, which gives us a public/private hybrid cloud.
It’s amazing how the need for competition rears its head when something goes down… Good thing EC2 isn’t the only one providing cloud services, or the whole internet could really go down…
From my understanding of the situation, the companies that were affected by the Amazon outage do not offer their customers a guaranteed level of service.
For example, neither Quora nor Reddit offer any uptime guarantees whatsoever.
So their ‘customers’, while understandably annoyed at not being able to use those websites, can do nothing at all about the outage.
In that situation, it’s not entirely imprudent for Reddit management to simply blame their outsourcer for the outsourcer’s failure.
I’d like to hear more from a company that was affected by Amazon’s failure and _did_ have a guaranteed SLA with their customer.