Theoretical Disaster Recovery doesn’t cut it.

by AlanSBPerkins

I have mixed feelings about Amazon’s latest outage, which was caused by a loss of power. The outage was reported quickly and transparently. The information provided after the fault showed a beautifully designed system that would deal with any power loss eventuality.

In theory.

After reviewing the information provided I am left a little bewildered, wondering why such a beautifully designed system wasn’t put to the ultimate test. I mean, how hard can it be to rig a real production test that cuts the main power supply?

If you believe in your systems, and you must believe in your systems when you are providing Infrastructure as a Service, you should be prepared to run a real live test that exercises every aspect of the stack. In the case of a power failure, anything short of actually cutting the power in multiple stages, so that each line of defense is tested in turn, is not a real test.
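To make "multiple stages" concrete, here is a minimal sketch of the bookkeeping behind such a drill. It is a toy model, not anyone's real tooling: the `PowerSource` class, the source names and the kilowatt figures are all hypothetical, and in a real drill the line that sets `online = False` would be somebody physically opening a switch.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    """One line of defense. Names and numbers are hypothetical."""
    name: str
    capacity_kw: float  # load above which this source cannot hold
    online: bool = True

    def can_carry(self, load_kw: float) -> bool:
        return self.online and load_kw <= self.capacity_kw


def active_source(chain, load_kw):
    """Return the first source, in priority order, that can carry the load."""
    for source in chain:
        if source.can_carry(load_kw):
            return source
    return None


def staged_drill(chain, load_kw):
    """Cut each source in priority order and record what picks up the load."""
    results = []
    for source in chain:
        source.online = False  # stands in for actually cutting the power
        survivor = active_source(chain, load_kw)
        results.append((source.name, survivor.name if survivor else None))
    for source in chain:
        source.online = True  # put everything back after the drill
    return results


# Hypothetical site: the last line of defense is rated below the real load.
chain = [
    PowerSource("utility", capacity_kw=1000),
    PowerSource("generator", capacity_kw=1000),
    PowerSource("secondary", capacity_kw=400),
]
report = staged_drill(chain, load_kw=600)
for cut, survivor in report:
    print(f"cut {cut}: load carried by {survivor or 'NOTHING'}")
```

In this made-up configuration the first stage passes and the second one does not, which is exactly the kind of finding a paper review of the design would miss.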

The lesson applies to all IT, indeed to all aspects of business really – that’s what market research is for. But back to IT. If a business isn’t doing real failover and disaster recovery testing that goes beyond ticking the boxes to actually carrying out conceivable scenarios, who are they trying to kid?

Many years ago I set up a Novell network for a small business client and implemented a backup regime. One drive, let’s say E:, held programs and the other, F:, carried data. The system took a backup of the F: drive every day and ignored the E: drive. After all, there was no need to back up the programs, and disk space was expensive at the time.

After a year I arranged to go to the site and do a backup audit, and discovered that the person in charge of IT had swapped the drive letters around because he thought it made more sense. We had a year of backups of the program directories, and no data backups at all.
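An audit like that does not have to wait a year. As a rough sketch, assuming the latest backup can be restored into a scratch directory, a few lines of Python can compare the restored files against the live data. The function names and the example paths are mine, purely for illustration.

```python
import hashlib
from pathlib import Path


def file_digests(root):
    """Map each file's path (relative to root) to its SHA-256 digest."""
    root = Path(root)
    return {
        p.relative_to(root).as_posix(): hashlib.sha256(p.read_bytes()).hexdigest()
        for p in sorted(root.rglob("*"))
        if p.is_file()
    }


def audit_restore(source_dir, restored_dir):
    """Compare live data against a test restore of the latest backup."""
    source = file_digests(source_dir)
    restored = file_digests(restored_dir)
    # Files that exist in the live data but never made it into the backup.
    missing = sorted(set(source) - set(restored))
    # Files present in both places whose contents differ.
    changed = sorted(
        name for name in source.keys() & restored.keys()
        if source[name] != restored[name]
    )
    return {"missing": missing, "changed": changed}
```

Something like `audit_restore("/srv/data", "/tmp/restore-test")` (both paths hypothetical) would have listed every data file as missing on the first day after the drive letters were swapped. Files that changed since the backup ran will show up under `changed`, so that list needs reading with the backup schedule in mind.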

Here is the text from Amazon’s outage report:

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

Nice system in theory. I love what Amazon is doing, and I am impressed with how they handle these situations.

They say that what doesn’t kill you makes you stronger – here’s hoping we all learn something from this.



June 18, 2012
