Theoretical Disaster Recovery doesn’t cut it.
I have mixed feelings about Amazon’s latest outage, which was caused by a cut in power. The outage was reported quickly and transparently. The information provided after the fault showed a beautifully designed system that would deal with any power loss inevitability.
After reviewing the information provided I am left a little bewildered, wondering how such a beautifully designed system wasn’t put to the ultimate test? I mean, how hard can it be to rig a real production test that cuts the main power supply?
If you believe in your systems, and you must believe in your systems when you are providing Infrastructure As A Service, you should be prepared to run a real live test that tests every aspect of the stack. In the case of a power failure test, anything short of actually cutting the power in multiple stages that tests each line of defense is not a real test.
The lesson applies to all IT, indeed to all aspects of business really – that’s what market research is for. But back to IT. If a business isn’t doing real failover and disaster recovery testing that goes beyond ticking the boxes to actually carrying out conceivable scenarios, who are they trying to kid?
Many years ago I had set up a Novell network for a small business client and implemented a backup regime. One drive, let’s say E: had programs and the other, F:, carried data. The system took a back up of F: drive every day and ignored the E drive. After all, there was no need to back up the programs and disk space was expensive at the time.
After a year I arranged to go to the site and do a back up audit and discovered that the person in charge of IT had swapped the drive letter around because he thought it made more sense. We had a year of backups of the program directories, and no data backups at all.
Here is the text from Amazon’s outage report:
At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.
Nice system in theory. I love what Amazon is doing, and I am impressed with how they handle these situations.
They say that what doesn’t kill you makes you stronger – here’s hoping we all learn something from this.