Theoretical Disaster Recovery doesn’t cut it.
I have mixed feelings about Amazon’s latest outage, which was caused by a cut in power. The outage was reported quickly and transparently. The information provided after the fault showed a beautifully designed system that would deal with any power loss eventuality.
After reviewing the information provided I am left a little bewildered, wondering why such a beautifully designed system wasn’t put to the ultimate test. I mean, how hard can it be to rig a real production test that cuts the main power supply?
If you believe in your systems, and you must believe in your systems when you are providing Infrastructure As A Service, you should be prepared to run a real live test that tests every aspect of the stack. In the case of a power failure test, anything short of actually cutting the power in multiple stages that tests each line of defense is not a real test.
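To make the idea of a staged test concrete, here is a minimal sketch of a drill that exercises every line of defense in turn rather than stopping at the first one that holds. The class name, the layer names and the kilowatt figures are all made up for illustration; a real drill is of course done with real switchgear, not Python.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    name: str
    # Load at which this layer's breaker opens (hypothetical figure).
    trip_threshold_kw: float


def staged_drill(layers, load_kw):
    """Cut layers one at a time, in order, and record how each fallback copes.

    layers[0] is the primary supply. Cutting it moves the load to layers[1],
    cutting that moves the load to layers[2], and so on. Every fallback is
    exercised, even if the ones before it held.
    Returns a list of (layer_name, held) tuples.
    """
    results = []
    for fallback in layers[1:]:
        held = load_kw <= fallback.trip_threshold_kw
        results.append((fallback.name, held))
    return results


layers = [
    PowerSource("utility", 1000.0),
    PowerSource("generator", 1000.0),
    PowerSource("secondary circuit", 600.0),  # breaker set too low
]
print(staged_drill(layers, load_kw=800.0))
# [('generator', True), ('secondary circuit', False)]
```

The point of the loop is that the last layer only ever sees load when everything before it has failed, so an ordinary outage that the generator handles will never reveal a problem further down the chain.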
The lesson applies to all IT, indeed to all aspects of business really – that’s what market research is for. But back to IT. If a business isn’t doing real failover and disaster recovery testing that goes beyond ticking the boxes to actually carrying out conceivable scenarios, who are they trying to kid?
Many years ago I set up a Novell network for a small business client and implemented a backup regime. One drive, let’s say E:, had programs and the other, F:, carried data. The system took a backup of the F: drive every day and ignored the E: drive. After all, there was no need to back up the programs and disk space was expensive at the time.
After a year I arranged to go to the site and do a backup audit and discovered that the person in charge of IT had swapped the drive letters around because he thought it made more sense. We had a year of backups of the program directories, and no data backups at all.
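A backup audit of this kind can be partly automated. The sketch below assumes you can get a listing of the paths in a backup and that you know the names of a few files that must be in it; the function name and the file names are hypothetical. It checks for the files themselves rather than trusting a drive letter, which is exactly what would have caught the swap.

```python
from pathlib import PureWindowsPath


def audit_backup(backup_listing, sentinel_files):
    """Return the sentinel data files that are missing from a backup listing.

    Matching is by file name only (case-insensitive), so the check still
    works if someone has moved the data to a different drive letter.
    """
    names_in_backup = {PureWindowsPath(p).name.lower() for p in backup_listing}
    return [s for s in sentinel_files if s.lower() not in names_in_backup]


# A backup that contains only program files (hypothetical paths).
listing = [r"F:\APPS\WP\WP.EXE", r"F:\APPS\DB\DBASE.EXE"]
missing = audit_backup(listing, ["LEDGER.DBF", "CUSTOMERS.DBF"])
print(missing)  # ['LEDGER.DBF', 'CUSTOMERS.DBF']
```

A listing check is still only a partial test. The real proof is restoring the files somewhere and opening them.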
Here is the text from Amazon’s outage report:
At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.
Nice system in theory. I love what Amazon is doing, and I am impressed with how they handle these situations.
They say that what doesn’t kill you makes you stronger – here’s hoping we all learn something from this.
CIOs Moving to the Cloud: The buck still stops with you
Amazon Web Services has been going through a much publicised outage, which has lasted by all appearances more than 12 hours. A range of services including Hootsuite, Reddit, Heroku, Foursquare, Quora and others have all faced major disruptions.
What is interesting is how they have positioned these outages: many have said EC2 is great, but they are having a bit of a problem at the moment. It appears these providers are taking the view that “whew, glad we outsourced our stuff so that it is clear that it is not OUR fault that something like this has happened, and we can point to other vendors to prove the case that it wasn’t us – just imagine if we had done this on our own servers and this happened, we would have been much more at fault!”
Moving systems to the cloud doesn’t remove the responsibility to guard against mission-critical outages. If a business has a use case that cannot tolerate downtime then that business needs to architect its solution in a way that prevents downtime. Cost tradeoffs are always an issue, but if something goes wrong, and the cost of that problem is too high, then perhaps the service isn’t really feasible.
Imagine an airline providing a service where they cut costs on safety in order to offer a cheap service… Doesn’t bear thinking about. Imagine that the airline outsourced their safety inspections to a third party and then washed their hands of responsibility in the event of a “downtime”. No-one would buy that.
The whole point about the cloud is that it frees you from thinking in terms of a single provider. Even if you stick with an Amazon-only solution, or a Microsoft or Google or Salesforce or Rackspace or whatever solution, you still need to architect things in a way that allows you to accept the consequences of any flaws, no matter how they are caused.
After all, you are the service provider to your customer base – how you decide to deliver that is up to you.
A lot of people are learning a very hard lesson at the moment – there are good ways and bad ways of doing things. For some, a 12 hour outage is hardly a problem, but for others it can ruin lives.
Amazon Web Services Part 2 – Scaling Services Provided
Here is the second of three posts on Amazon’s Web Services. The first post provided a look at the key foundational services. The next post will talk about how Altium is leveraging Amazon’s offerings. Meanwhile, in this post I promised to write about some of the ways Amazon takes elasticity to the extreme through a range of services aimed at addressing scalability as an issue.
At its most fundamental level, Amazon’s Web Services are aimed at elasticity – this is reflected in the technology as well as the pricing. The pricing is a pay as you go model – by the hour, by the storage used and/or by the bandwidth consumed. Many of the services reflect this elasticity in their name.
So what tools does Amazon provide you to scale up and down? Remember that scalability isn’t just about being able to scale up – it is about being able to scale in either direction: start small and go big, start big and go small, or start big, grow, then shrink again. Or vice versa.
The Elastic Compute Cloud (EC2) enables you to create a machine image that you can turn on or off whenever you need it. This can be done via an API or via a management console. So a program can turn on a computer if it needs it.
EC2 also allows you to define parameters on a computer so that it effectively clones itself or kills off clones based on demand. For example, if a threshold is hit for the number of requests or for the time taken to respond to requests, a new machine instance can start up automatically. And once it is no longer needed, it can just shut down again.
Imagine a business that provides sports statistics to a subscriber base of sporting tragics. Normally four servers are enough to meet the demand from the company’s subscriber base. But once every four years, come the Olympics, everybody wants to be a sporting expert and so perhaps 400 servers are required for a period of about six weeks. Amazon’s Auto Scaling feature allows for this.
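The decision behind this kind of scaling can be sketched in a few lines. This is not Amazon’s Auto Scaling API, just an illustration of the threshold rule described above; the function name and the response-time thresholds are assumptions, while the floor of 4 and ceiling of 400 servers come from the sports statistics example.

```python
def next_instance_count(current, avg_response_ms,
                        scale_up_ms=500, scale_down_ms=100,
                        minimum=4, maximum=400):
    """Decide how many instances to run after one monitoring interval.

    Adds one instance when responses are slower than scale_up_ms, removes
    one when they are faster than scale_down_ms, and otherwise leaves the
    count alone. The result is clamped to the range [minimum, maximum].
    """
    if avg_response_ms > scale_up_ms:
        current += 1
    elif avg_response_ms < scale_down_ms:
        current -= 1
    return max(minimum, min(maximum, current))


print(next_instance_count(4, avg_response_ms=800))   # 5: too slow, add a clone
print(next_instance_count(10, avg_response_ms=50))   # 9: idle, kill a clone
print(next_instance_count(4, avg_response_ms=50))    # 4: never below the floor
```

Keeping a gap between the two thresholds stops the system from starting and stopping machines on every small fluctuation in load.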
This is just the simple stuff – there are sophisticated models provided including Elastic Load Balancing, Elastic Map Reduce, Simple Queue Service and Simple Notification Service. Let’s take a look at these:
- Elastic Load Balancing allows you to automatically distribute incoming traffic across multiple EC2 instances. It automatically detects failed machines and bypasses them so that no requests are being sent to oblivion. The load balancer can handle things such as automatically assuring specific user sessions remain on one instance.
- Elastic Map Reduce allows for large compute assignments to be split up into small units so the work can be shared across multiple EC2 instances, offering potentially massive parallelism. If there is a task to handle huge amounts of data for analysis, simulation or artificial intelligence, Elastic Map Reduce can manage the splitting and the pulling back together of the project components. Any failed tasks are rerun, and failed instances are automatically shut down.
- Simple Queue Service (SQS) provides a means for hosting messages travelling between computers. This is an important part of any significant scalable architecture. Messages are posted by one computer without thought for, or knowledge of, what machine is going to pick up the information and process it. Once received, the message is locked to prevent any other computer from reading it while it is being processed. Messages can remain in an unread state between machines for up to 14 days. Queues can be shared or kept private and they can be restricted by IP or time. This means that systems can be designed with separate components where each component is loosely coupled to the others – even different vendors can be responsible for each component. Different systems can work together in a safe and reliable way.
- Simple Notification Service (SNS). Whereas SQS is asynchronous, i.e. each post to the queue can be done without thought about when it will be picked up by the other end, SNS is designed to be handled at the other end immediately. Messages are pushed out to be handled using HTTP requests, emails or other protocols, which means it can be used to build instantaneous feedback communities using a range of different architectures. So long as they support standard web integration architectures, the systems will talk to each other. SNS can be used to handle automatic notifications by SMS or email or API calls when some event has taken place, or if written correctly, when some expected event has not taken place.
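The message locking described for SQS above is easier to follow with a toy model. The class below is an in-memory sketch, not the SQS API: the class and method names are made up, and time is passed in explicitly so the behaviour is easy to follow. It assumes the lock is a timeout, so a message that is received but never deleted becomes visible again and can be retried by another machine.

```python
import itertools


class VisibilityQueue:
    """Toy in-memory queue with a visibility timeout (illustration only)."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}      # message id -> body
        self._hidden_until = {}  # message id -> time it becomes visible again
        self._ids = itertools.count(1)

    def send(self, body):
        """Post a message without knowing who will process it."""
        msg_id = next(self._ids)
        self._messages[msg_id] = body
        self._hidden_until[msg_id] = 0.0
        return msg_id

    def receive(self, now):
        """Return (id, body) for the oldest visible message and lock it, or None."""
        for msg_id, body in self._messages.items():
            if self._hidden_until[msg_id] <= now:
                self._hidden_until[msg_id] = now + self.visibility_timeout
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the consumer once processing has succeeded."""
        self._messages.pop(msg_id, None)
        self._hidden_until.pop(msg_id, None)


q = VisibilityQueue(visibility_timeout=30.0)
q.send("resize image 42")
print(q.receive(now=0.0))    # (1, 'resize image 42'): worker A takes it
print(q.receive(now=10.0))   # None: locked, worker B sees nothing
print(q.receive(now=31.0))   # (1, 'resize image 42'): A never deleted it, so it is retried
q.delete(1)
print(q.receive(now=100.0))  # None: processed and gone
```

Because a message can come back after the timeout, consumers in a design like this are usually written so that processing the same message twice does no harm.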
I will endeavour to provide example applications over time for each of these scenarios, but for now I hope this has provided a sense of understanding of how highly scalable systems can be built using Amazon’s Web Services.
Amazon Web Services is looking the goods
I have to admit I am a fan of the work that Amazon has done putting together what is now a compelling collection of infrastructure services. Together these facilities provide a fantastic vehicle for hosting highly scalable and reliable systems. And the pricing model, where you pay for what you use and prices are constantly being reviewed, is very enticing. Elasticity is taken to an entirely new level – machines can be purchased by the hour for as little as eleven cents. Some of the services charge in fractions of a cent – more on that later. This is the first of three posts examining Amazon Web Services. This post introduces the key concepts, the next post will talk about some of the scaling techniques provided, and the final post will focus on how Altium is currently leveraging Amazon offerings.
The key services Amazon offer include storage, compute power, and a range of auxiliary services designed to enhance these. Here is a very brief overview:
- Amazon S3 (Simple Storage Service) provides storage facilities. You can store files of all kinds including video, audio, software and anything else. You pay for the storage and the bandwidth to access them. Costs are very low. Altium uses S3 to store many things including training videos and software builds. When Altium releases a new build, tens of thousands of customers need to be able to get the 1.8GB file very quickly and this works well for that. Storage reliability comes in two levels – the highest provides a 99.999999999% (11 nines) probability of not being lost.
- Amazon CloudFront provides an edge caching facility – files stored in S3 are distributed to local nodes around the world so that people can get access to the files quickly. There is also an option to deliver files via streaming.
- Amazon EC2 (Elastic Compute Cloud) provides access to virtual computers you can buy by the hour. The computers come in a number of hardware configurations ranging from low-end single processor machines through to big boxes with lots of processors and 68GB RAM. Machines come in Linux and Windows flavours. You can also get the machines preconfigured with certain software packages, and the price of those (if they are chargeable) is built into the rental. For example you can get a machine with MS SQL Server in different flavours ranging from free to the Enterprise level. Machines can be imaged so that you can take them offline or replicate them very quickly. EC2 instances can be tied to Elastic Block Store (EBS) volumes, which provide persistent block-level storage that outlives the instance.
- Amazon RDS is a special case of EC2 that comes with an embedded MySQL database and a range of value added facilities, including automatic backups and the ability to achieve failover replication into an alternative hardware partition in case the main server goes down.
- Amazon SimpleDB is a lightning fast schema-less string-based database consisting of items with many named attribute value pairs. You can, for instance, store a customer record with values for Name, Address, Phone etc, but you can store anything you like. If you want to store Favorite Color for one customer, you can. You can even store multiple values for the one attribute.
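The SimpleDB data model described in the last item can be pictured with a small in-memory sketch. This is not the SimpleDB API; the class name, method names and sample values are made up to show items that hold named attributes, each of which can carry several string values, with no schema shared between items.

```python
from collections import defaultdict


class ItemStore:
    """Toy schema-less store: item name -> attribute name -> set of string values."""

    def __init__(self):
        self._items = defaultdict(lambda: defaultdict(set))

    def put(self, item, attribute, value):
        """Add a value to an attribute; values are stored as strings."""
        self._items[item][attribute].add(str(value))

    def get(self, item):
        """Return the item's attributes, each with a sorted list of values."""
        attributes = self._items.get(item, {})
        return {name: sorted(values) for name, values in attributes.items()}


store = ItemStore()
store.put("customer-1", "Name", "Alice")
store.put("customer-1", "Phone", "555-0100")
store.put("customer-1", "Phone", "555-0199")       # second value, same attribute
store.put("customer-2", "Name", "Bob")
store.put("customer-2", "Favorite Color", "green")  # attribute only this item has

print(store.get("customer-1"))
# {'Name': ['Alice'], 'Phone': ['555-0100', '555-0199']}
```

Nothing has to be declared up front: the second customer gains a Favorite Color attribute simply by storing one.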
There is a range of other facilities, but I will discuss these in a later post where I talk about facilities to help you achieve elastic scaling (up and down).