
Posts from the ‘Amazon’ Category


Theoretical Disaster Recovery doesn’t cut it.

I have mixed feelings about Amazon’s latest outage, which was caused by a cut in power. The outage was reported quickly and transparently. The information provided after the fault showed a beautifully designed system that would deal with any power loss eventuality.

In theory.

After reviewing the information provided I am left a little bewildered, wondering how such a beautifully designed system wasn’t put to the ultimate test? I mean, how hard can it be to rig a real production test that cuts the main power supply?

If you believe in your systems, and you must believe in your systems when you are providing Infrastructure As A Service, you should be prepared to run a real live test that tests every aspect of the stack. In the case of a power failure test, anything short of actually cutting the power in multiple stages that tests each line of defense is not a real test.
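
A staged drill of this kind can be sketched as a toy model. This is only an illustration: `PowerSource`, `staged_power_drill`, and all the kilowatt figures are hypothetical. Each stage cuts one line of defence and checks whether the next one can actually carry the load.

```python
from dataclasses import dataclass


@dataclass
class PowerSource:
    name: str
    trip_kw: int  # load above which this source trips (hypothetical figure)


def staged_power_drill(sources, load_kw):
    """Cut each source in turn and record whether its fallback holds the load.

    Returns a list of (source_cut, fallback, fallback_holds) tuples.
    """
    results = []
    for cut, fallback in zip(sources, sources[1:]):
        holds = fallback.trip_kw >= load_kw
        results.append((cut.name, fallback.name, holds))
    return results


# Hypothetical chain: the last breaker is configured below the real load.
chain = [
    PowerSource("utility", 1000),
    PowerSource("generator", 1000),
    PowerSource("secondary circuit", 600),
]
drill = staged_power_drill(chain, load_kw=800)
```

With these made-up numbers the first stage passes and the second fails, which is the kind of finding a paper review tends to miss and a real cut tends to expose.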

The lesson applies to all IT, indeed to all aspects of business really – that’s what market research is for. But back to IT. If a business isn’t doing real failover and disaster recovery testing that goes beyond ticking the boxes to actually carrying out conceivable scenarios, who are they trying to kid?

Many years ago I set up a Novell network for a small business client and implemented a backup regime. One drive, let’s say E:, held programs and the other, F:, carried data. The system took a backup of the F: drive every day and ignored the E: drive. After all, there was no need to back up the programs, and disk space was expensive at the time.

After a year I arranged to go to the site and do a backup audit, and discovered that the person in charge of IT had swapped the drive letters around because he thought it made more sense. We had a year of backups of the program directories, and no data backups at all.

Here is the text from Amazon’s outage report:

At approximately 8:44PM PDT, there was a cable fault in the high voltage Utility power distribution system. Two Utility substations that feed the impacted Availability Zone went offline, causing the entire Availability Zone to fail over to generator power. All EC2 instances and EBS volumes successfully transferred to back-up generator power. At 8:53PM PDT, one of the generators overheated and powered off because of a defective cooling fan. At this point, the EC2 instances and EBS volumes supported by this generator failed over to their secondary back-up power (which is provided by a completely separate power distribution circuit complete with additional generator capacity). Unfortunately, one of the breakers on this particular back-up power distribution circuit was incorrectly configured to open at too low a power threshold and opened when the load transferred to this circuit. After this circuit breaker opened at 8:57PM PDT, the affected instances and volumes were left without primary, back-up, or secondary back-up power. Those customers with affected instances or volumes that were running in multi-Availability Zone configurations avoided meaningful disruption to their applications; however, those affected who were only running in this Availability Zone, had to wait until the power was restored to be fully functional.

Nice system in theory. I love what Amazon is doing, and I am impressed with how they handle these situations.

They say that what doesn’t kill you makes you stronger – here’s hoping we all learn something from this.


Transitioning to the Cloud

Today I am presenting for you the same talk I gave at the CeBIT Cloud 2012 conference in Sydney. Entitled, “Transitioning to the Cloud”, the presentation covers three areas:

  1. If you want to transition your business to the cloud you need to Think Cloud – Cloud, as I see it, is as much a state of mind and you need to embrace this thinking to really make full use of its potential;
  2. Some examples from my personal experience of using some of the large Cloud providers’ offerings, and why they are more than what they superficially appear to be; and
  3. Some tips on adopting Cloud (previously covered in an earlier post)
One of the key messages is that when 19th Century industrialists were offered utility power from a central provider for the first time, all they could think of was that their lathes would keep turning – they didn’t see electricity as a solution to all sorts of unimagined potentialities (such as lighting). The parallel is to ask the question, what are we missing out on today, what hidden potentialities exist in Cloud that we haven’t yet figured out?


My Top 7 Tips for Going to the Cloud

A lot of people ask me for my advice on what are the most important things to consider when moving the business into the Cloud. So here are some of the things that I think business people need to consider when thinking about going to the Cloud:

1. Make sure you know how to get your data out again

Often people think about how they are going to put their data into the Cloud – if they are using Software as a Service, like Salesforce or Netsuite or Intacct or Clarizen, or Google Apps for that matter, they will be thinking about how to get their data into a shape that can go into the system. The documentation for these systems makes clear reference to how to prepare and then import the customer’s data, and there are usually consultants who can assist with this process. Typically this process is well planned, but often little thought is given to how exactly you go about extracting the data again in a way that is of value to you going forward. Often, lip service is paid to the issue by asking questions like “can I get a backup of my data?”, and a reassuring yes is provided to the now comforted prospective customer. It is one thing to be told it can be done, but you need to check that the data is actually in a format that is useful to you. And if the system is mission critical, it needs to not just be useful, it needs to be readily convertible for immediate use.

Some of the things I have done to ensure that my data is safe include writing programs that automatically read the data for updates every fifteen minutes and write them into a relational database hosted separately, and even replicated both in house and in the Cloud. All customisations are programmatically managed so that the relational database copy always reflects the structure in the live system. For example, I did this from Salesforce, where there were more than 300 custom objects created. Another example is to write a program that knows how to extract all the data from a system, such as an accounting system, using the API provided. Until you have tangibly proven that you can get your data into a format you can actually use, having access to a copy of it is meaningless.
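
As a minimal sketch of that kind of periodic mirror, assuming a hypothetical `fetch_changed_since` stand-in for whatever the vendor's API offers and a single made-up `customer` table, the core loop might look like this. It uses SQLite only to keep the example self-contained.

```python
import sqlite3


def fetch_changed_since(source, since):
    """Stand-in for a vendor API call: rows modified after the high-water mark."""
    return [r for r in source if since is None or r["modified"] > since]


def sync_updates(conn, records, since=None):
    """Upsert records into the local mirror and return the new high-water mark.

    Each record is a dict with 'id', 'name' and an ISO-8601 'modified' string.
    """
    conn.execute(
        "CREATE TABLE IF NOT EXISTS customer ("
        "id TEXT PRIMARY KEY, name TEXT, modified TEXT)"
    )
    latest = since
    for rec in records:
        conn.execute(
            "INSERT OR REPLACE INTO customer (id, name, modified) VALUES (?, ?, ?)",
            (rec["id"], rec["name"], rec["modified"]),
        )
        if latest is None or rec["modified"] > latest:
            latest = rec["modified"]
    conn.commit()
    return latest
```

A scheduler would call `sync_updates(conn, fetch_changed_since(source, mark), mark)` every fifteen minutes and keep the returned mark for the next run. Mirroring structural changes such as new custom objects is left out of this sketch.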

Even without programming, many systems provide some access to your data in a way you can extract it. For example, Salesforce provides a once-per-week CSV file you can download. If you don’t have an alternative means, it is worth setting up a routine, with someone responsible, to take this data and copy it.

Online databases such as Amazon RDS or SimpleDB can be accessed easily enough through OLEDB connections or similar, or copies of the backups can be stored locally in a format that can be opened by alternative data stores.

No matter how you do it, the principle is important here: you should have a fully tested means of accessing your data offline. The more mission critical the data, the more real-time the recoverability needs to be.
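
One small piece of "fully tested" is checking that an export actually opens and contains what you expect. The sketch below assumes a CSV export. The function name `check_export`, the required column names, and the expected row count are all illustrative values you would supply yourself.

```python
import csv
import io


def check_export(csv_text, required_columns, expected_rows):
    """Return a list of problems found in an exported CSV; empty means usable."""
    reader = csv.DictReader(io.StringIO(csv_text))
    problems = []
    # fieldnames is None for an empty file, so fall back to an empty list.
    missing = [c for c in required_columns if c not in (reader.fieldnames or [])]
    if missing:
        problems.append(f"missing columns: {missing}")
    rows = list(reader)
    if len(rows) != expected_rows:
        problems.append(f"expected {expected_rows} rows, found {len(rows)}")
    return problems
```

Running a check like this after every export, and alerting someone when the list is not empty, turns "we have a backup" into something you have actually verified.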

2. Think Differently

Steve Jobs’ passing reminded everyone of the Apple Think Different campaign, but seriously, you need to think differently when it comes to the Cloud in order to leverage it successfully. It truly is different to anything we have seen, and if you are only seeing it as a cost mitigator or a means of outsourcing infrastructure, you are missing a lot of (pardon the pun) blue sky behind the Cloud. Social networking, crowdsourcing, ubiquity of device and location, Metcalfe’s law in general, scalability, the ability to fail fast, and loosely coupled web services are all factors of the Cloud that lend themselves to being different.

One example is the way that Salesforce enables you to leverage the power of Twitter and Facebook by recording people’s Twitter and Facebook details against their record and if they tweet or post something with a given hashtag, the system is watching and can automatically create a case for them, assign it to a support officer who can find a solution, link the solution and automatically have the system tweet them with a response and a link.

Another example is the way captchas are being used to get the masses to perform optical character recognition on historical documents that are too poor for a machine to read. The system uses a known control word to determine whether you are human or not and poses a second one that is not known. The results are compared against the results entered by others who have received the same word – a high correlation between results from different users indicates what the text is likely to be.
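
The agreement step can be sketched roughly as follows. This is an illustration of the idea rather than how any real captcha service works: the function name `consensus_word`, the vote threshold, and the agreement share are assumptions.

```python
from collections import Counter


def consensus_word(submissions, control_answer, min_votes=3, min_share=0.75):
    """Pick the likely reading of an unknown word from user submissions.

    submissions is a list of (control_guess, unknown_guess) pairs. Only users
    who typed the known control word correctly are trusted; the unknown word
    is accepted once enough trusted users agree on it.
    """
    trusted = [
        unknown.strip().lower()
        for control, unknown in submissions
        if control.strip().lower() == control_answer.lower()
    ]
    if len(trusted) < min_votes:
        return None  # not enough evidence yet
    word, votes = Counter(trusted).most_common(1)[0]
    return word if votes / len(trusted) >= min_share else None
```

For example, if four trusted users answer "parish", "parish", "Parish" and "parlsh", three of four agree, which meets the 75% share assumed here, so "parish" is accepted.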

A third example comes from my own testing of the Amazon EC2 platform to test some ideas concerning a new database design that enabled end users to change the structure of the database without programming, kind of like the way Salesforce allows end users to do custom objects. The test was in two parts – the first, which was easy to test, was could it handle more than a billion records. The second, a little more difficult, was, can it handle one thousand simultaneous users on cheap virtual hardware. For this test I needed a simulation that ran across eleven machines. Traditionally I would need to acquire these eleven machines and set them up – an expensive and time consuming exercise. Using Amazon EC2, I was able to set up the machines from scratch in thirty minutes, run my tests in three hours, and then analyse the results. Total cost? Less than five dollars.
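
The arithmetic is easy to check. The sketch below assumes the eleven cents an hour figure quoted in another post on this page and billing per started hour. The instance type actually used in the test is not stated, so treat the rate as an assumption.

```python
import math


def rental_cost_cents(machines, hours_used, cents_per_hour):
    """Cost of renting machines billed per started hour, in whole cents."""
    billed_hours = math.ceil(hours_used)
    return machines * billed_hours * cents_per_hour


# 30 minutes of setup plus 3 hours of testing on eleven machines,
# at an assumed 11 cents per machine-hour.
cost = rental_cost_cents(machines=11, hours_used=3.5, cents_per_hour=11)
```

Under those assumptions the bill is 11 × 4 × 11 = 484 cents, or $4.84, which is consistent with "less than five dollars".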

There are plenty of ways the Cloud can transform how you do business if you allow it. Get your sales team to focus on harder sells while the Cloud is engineered around a Marketing Automation experience that drives their behaviour for all the low hanging fruit. The Cloud itself, if you configure it correctly, will tell you where the low hanging fruit are.

3. Make sure your systems interactions are atomic

One of the issues with having Cloud-based systems is that you can build compelling processes out of tools from a number of vendors’ systems working together: linking your CRM to your financials, or your website to marketing automation and analytics, for example. While these may seem obvious examples, the point being made here is that when multiple systems are involved we need to think about how to prevent a situation where only part of a process succeeds. This is a much more common problem when different types of systems are talking together. So make sure you are not telling the customer that his request for information has been placed in a queue unless you know for sure that the request has been placed in a queue.
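
In code, the rule is simply to build the confirmation message from what actually happened, not from what was intended. The sketch below uses Python's in-process `queue.Queue` as a stand-in for whatever queueing service sits between the systems. `submit_request` and the message wording are hypothetical.

```python
import queue


def submit_request(request, work_queue):
    """Enqueue a request and report back only what actually happened."""
    try:
        work_queue.put_nowait(request)
    except queue.Full:
        # Enqueue failed: do not claim the request was queued.
        return "Sorry, we could not accept your request. Please try again."
    return "Your request has been placed in a queue."
```

With a remote queue the same shape applies: wait for the service to acknowledge the message, and treat a timeout or error as "not queued" when deciding what to tell the customer.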

4. Start with Upside, not Downside

When I first started looking at Cloud concepts about six years ago I was looking with the eyes of a sceptic and I was asking the question “What can’t I do if I adopt this approach?” By taking this kind of view I found there were plenty of things I didn’t think I could do, and this thinking led me to see restrictions and obstacles. Once I started to ask myself the rather contrary question “What can I do if I adopt this approach?”, I started to see all sorts of opportunities emerge. I understand from Salesforce I was possibly the first person in the world to see their CRM product as a business platform rather than a CRM product. This led to building all sorts of systems within Salesforce including purchase requisitioning, customer software licensing, electronic production management systems with automated QA built in and tested on the finished manufactured products (with the results of the tests stored against each product and displayed to the end user when he or she finally purchased the product and plugged it into a computer). Other systems included Human Resources systems with annual leave management systems, individual development plans and hierarchical cost management for each line manager, who could also see things like who had the most leave accrued in the team.

Thinking of what is possible also leads to being able to try things experimentally with a “fail-fast” attitude. The example provided above about the eleven computers is an example of this. But being able to put ideas into practice quickly makes all sorts of innovative approaches viable that may be otherwise ignored or side stepped as pipe dreams.

In traditional approaches, a startup may need to think of architecting a business for the first generation of clients. As the numbers grow, a different architecture may be required, or investment may be required in infrastructure just in case growth may occur. One of the risks of any business that grows too quickly is that of running out of liquid cash. All this can be very limiting in an entrepreneur’s thinking, with a real chance that the fear of succeeding too quickly may cause them to underperform. Often the Cloud allows an architecture to scale far further than traditional approaches do, with the ability to consume infrastructure and related services as required, scaling rapidly up, and then if necessary, scaling rapidly back down again. Traditional models require risky investments; Cloud models are far more flexible. And this allows for more optimistic thinking.

5. Check what API options are available

Most mainstream cloud vendors, whether they be offering Software as a Service, Infrastructure as a Service or a Platform as a Service, will have some sort of API that enables you to read and write data, change metadata, set permissions etc. This is important if you want to truly leverage the power that is available to you. For example, you can use Amazon’s Simple Notification Service and Simple Queue Service to provide asynchronous connections between systems and plan to notify managers when a VIP customer representative has mentioned your company in a tweet. Having a rich API in your bag of tricks enables you to innovate with freedom, seeing the Cloud as one Cloud rather than a set of disparate products offered by a host of different people.

6. Seek to understand the inner workings of the vendor’s various risk mitigation strategies

This is something I was guilty of in the early days. I used to say “these guys know better so you can trust them to make sure your data is safe”. Recent events have made me a little more open eyed about the inner workings. If you are not sure how your data is being backed up, ask. Imagine you are having to satisfy your auditor about the safety of your data. Imagine you are having to satisfy your customer that their data is safe, secure and reliably stored. If you don’t know yourself what steps are being taken to guarantee the preservation of the data, you won’t be able to tell them, and you will come across poorly.

I have written an earlier post about an Australian ISP that collapsed after an attack that took out the server with all of their clients’ websites. They had no offsite backup. Recently, Salesforce, one of the most respected companies, had two outages on sandboxes that caused the loss of the customer data on those sandboxes, and the data was unavailable for several days. Amazon had a well publicised outage earlier in the year that brought into question the way their system handled mass failure. Separate zones, designed to remain up when others failed, went down simply due to the overload caused by the failure of one. These failures, or at least the Salesforce and Amazon ones cited, have resulted in those companies making some changes, but an astute customer robustly challenging the methods may well have picked them up before a major problem occurred.

7. Remember, it’s your data, and the buck still stops with you

I wrote a post at the time of the major Amazon outage that was picked up by the CIO Magazine. Several companies hosting their data on Amazon Web Services were posting during the outage as if they were innocent bystanders observing the fallout. The reality is that if your services are down it is your responsibility no matter how you host them. Imagine an airline losing an aircraft saying “oops, luckily we outsourced the maintenance on that plane or else it would have looked really bad for us LOL!”. I don’t think so.

Remember, it is your data and you are entitled to it, and you are responsible for its availability and its security.


CIOs Moving to the Cloud: The buck still stops with you

Amazon Web Services has been going through a much publicised outage, which has lasted by all appearances more than 12 hours. A range of services including Hootsuite, Reddit, Heroku, Foursquare, Quora and others have all faced major disruptions.

What is interesting is how they have positioned these outages: many have said EC2 is great, but they are having a bit of a problem at the moment. It appears these providers are taking the view that “whew, glad we outsourced our stuff so that it is clear that it is not OUR fault that something like this has happened, and we can point to other vendors to prove the case that it wasn’t us – just imagine if we had done this on our own servers and this happened, we would have been much more at fault!”


Moving systems to the cloud doesn’t remove the responsibility to guard against mission critical outages. If a business has a use-case that cannot tolerate downtime then that business needs to architect its solution in a way that prevents downtime. Cost tradeoffs are always an issue, but if something goes wrong, and the cost of that problem is too high, then perhaps the service isn’t really feasible.

Imagine an airline providing a service where they cut costs on safety in order to offer a cheap service… Doesn’t bear thinking about. Imagine that the airline outsourced their safety inspections to a third party and then wiped their hands of responsibility in the event of a “downtime”. No-one would buy that.

The whole point about the cloud is that it enables you to free your thinking about one provider. Even if you stick with an Amazon only solution, or a Microsoft or Google or Salesforce or Rackspace or whatever solution, you still need to architect things in a way that allows you to accept the consequences of any flaws, no matter how they are caused.

After all, you are the service provider to your customer base – how you decide to deliver that is up to you.

A lot of people are learning a very hard lesson at the moment – there are good ways and bad ways of doing things. For some, a 12 hour outage is hardly a problem, but for others it can ruin lives.


Amazon Web Services Part 2 – Scaling Services Provided

Here is the second of three posts on Amazon’s Web Services. The first post provided a look at the key foundational services. The next post will talk about how Altium is leveraging Amazon’s offerings. Meanwhile, in this post I promised to write about some of the ways Amazon takes elasticity to the extreme through a range of services aimed at addressing scalability as an issue.

At its most fundamental level, Amazon’s Web Services are aimed at elasticity – this is reflected in the technology as well as the pricing. The pricing is a pay as you go model – by the hour, by the storage used and/or by the bandwidth consumed. Many of the services reflect this elasticity in their name.

So what tools does Amazon provide you to scale up and down? (Remember scalability isn’t just about being able to scale up – it is about being able to scale in either direction – start small and go big, start big and go small, or start big, grow then shrink again. Or vice versa.)

The Elastic Compute Cloud (EC2) enables you to create a machine image that you can turn on or off whenever you need it. This can be done via an API or via a management console. So a program can turn on a computer if it needs it.

EC2 also allows you to define parameters on a computer so that it effectively clones itself or kills off clones based on demand. For example, if a threshold is hit for the number of requests or for the time taken to respond to them, a new machine instance can start up automatically. And once it is no longer needed, it can just shut down again.
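
The decision behind such a rule can be sketched as a plain function. This is not Amazon's API or algorithm, just an illustration of a threshold rule. The function name, the thresholds, and the one-instance step size are all assumptions.

```python
def desired_instances(current, requests_per_instance, avg_response_ms,
                      max_requests=1000, max_response_ms=500,
                      min_instances=1, max_instances=400):
    """Return the instance count a simple threshold rule would ask for."""
    if requests_per_instance > max_requests or avg_response_ms > max_response_ms:
        target = current + 1   # under pressure: start one more clone
    elif (requests_per_instance < max_requests / 2
          and avg_response_ms < max_response_ms / 2):
        target = current - 1   # comfortably idle: shut one down
    else:
        target = current       # inside the comfortable band: no change
    # Never go below the floor or above the ceiling.
    return max(min_instances, min(max_instances, target))
```

Leaving a gap between the scale-up and scale-down thresholds, as this sketch does, keeps the fleet from flapping when load hovers around a single cut-off.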

Imagine a business that provides sports statistics to a subscriber base of sporting tragics. Normally four servers are enough to meet the demand from the company’s subscriber base. But once every four years, come the Olympics, everybody wants to be a sporting expert, and so perhaps 400 servers are required for a period of about six weeks. Amazon’s Auto Scaling feature allows for this.
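
To put rough numbers on that scenario, the sketch below compares provisioning for the peak all the time with paying only for what is running. It counts server-weeks rather than dollars and approximates the four-year cycle as 208 weeks. Both simplifications are assumptions made for illustration.

```python
WEEKS_PER_CYCLE = 4 * 52  # one four-year Olympic cycle, approximated


def server_weeks(baseline, peak, peak_weeks, cycle_weeks=WEEKS_PER_CYCLE):
    """Return (fixed, elastic) server-weeks consumed over one cycle."""
    fixed = peak * cycle_weeks  # provision for the peak all the time
    elastic = peak * peak_weeks + baseline * (cycle_weeks - peak_weeks)
    return fixed, elastic


fixed, elastic = server_weeks(baseline=4, peak=400, peak_weeks=6)
```

With these figures the fixed approach consumes 83,200 server-weeks against 3,208 for the elastic one, roughly 26 times as much capacity for the same service.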

This is just the simple stuff – there are sophisticated services provided, including Elastic Load Balancing, Elastic Map Reduce, Simple Queue Service and Simple Notification Service. Let’s take a look at these:

  • Elastic Load Balancing allows you to automatically distribute incoming traffic across multiple EC2 instances. It automatically detects failed machines and bypasses them so that no requests are being sent to oblivion. The load balancer can handle things such as automatically assuring specific user sessions remain on one instance.
  • Elastic Map Reduce allows for large compute assignments to be split up into small units so the work can be shared across multiple EC2 instances, offering potentially massive parallelism. If there is a task to handle huge amounts of data for analysis, simulation or artificial intelligence, Elastic Map Reduce can manage the splitting and the pulling back together of the project components. Any failed tasks are rerun, failed instances are automatically shut down.
  • Simple Queue Service (SQS) provides a means for hosting messages travelling between computers. This is an important part of any significant scalable architecture. Messages are posted by one computer without thought for, or knowledge of, what machine is going to pick up the information and process it. Once received, the message is locked for a period to stop other computers from reading it while it is being processed, so that in normal operation only one computer handles it. Messages can remain in an unread state between machines for up to 14 days. Queues can be shared or kept private and they can be restricted by IP or time. This means that systems can be designed with separate components where each component is loosely coupled to every other component – even different vendors can be responsible for each component. Different systems can work together in a safe and reliable way.
  • Simple Notification Service (SNS). Whereas SQS is asynchronous, i.e. each post to the queue can be done without thought about when it will be picked up by the other end, SNS is designed to be handled at the other end immediately. Messages are pushed out to be handled using HTTP requests, emails or other protocols, which means it can be used to build instantaneous feedback communities using a range of different architectures. So long as they support standard web integration architectures, the systems will talk to each other. SNS can be used to handle automatic notifications by SMS or email or API calls when some event has taken place, or if written correctly, when some expected event has not taken place.
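
The locking behaviour described for SQS can be illustrated with a toy in-memory queue. This is not the SQS API. The class `VisibilityQueue`, its method names, and the 30-second timeout are assumptions, and time is passed in explicitly so the behaviour is easy to follow.

```python
import itertools


class VisibilityQueue:
    """In-memory queue where a received message is hidden until a timeout."""

    def __init__(self, visibility_timeout=30.0):
        self.visibility_timeout = visibility_timeout
        self._messages = {}            # id -> (body, visible_at)
        self._ids = itertools.count(1)

    def send(self, body, now):
        """Post a message; the sender needs no knowledge of the receiver."""
        msg_id = next(self._ids)
        self._messages[msg_id] = (body, now)
        return msg_id

    def receive(self, now):
        """Return (id, body) of the first visible message and hide it, else None."""
        for msg_id, (body, visible_at) in self._messages.items():
            if visible_at <= now:
                # Lock the message so other consumers skip it for a while.
                self._messages[msg_id] = (body, now + self.visibility_timeout)
                return msg_id, body
        return None

    def delete(self, msg_id):
        """Called by the consumer once the message has been processed."""
        self._messages.pop(msg_id, None)
```

If a consumer crashes before calling `delete`, the message becomes visible again after the timeout and another machine can pick it up. That is why consumers in this style of design are usually written to tolerate seeing the same message twice.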

I will endeavour to provide example applications over time for each of these scenarios, but for now I hope this has provided a sense of understanding of how highly scalable systems can be built using Amazon’s Web Services.


Amazon Web Services is looking the goods

I have to admit I am a fan of the work that Amazon has done putting together what is now a compelling collection of infrastructure services. Together these facilities provide a fantastic vehicle for hosting highly scalable and reliable systems. And the pricing model, where you pay for what you use, with prices constantly being reviewed, is very enticing. Elasticity is taken to an entirely new level – machines can be purchased by the hour for as little as eleven cents. Some of the services charge in micro cents – more on that later. This is the first of three posts examining Amazon Web Services. This post introduces the key concepts, the next post will talk about some of the scaling techniques provided, and the final post will focus on how Altium is currently leveraging Amazon offerings.

The key services Amazon offers include storage, compute power, and a range of auxiliary services designed to enhance these. Here is a very brief overview:

  • Amazon S3 (Simple Storage Service) provides storage facilities. You can store files of all kinds including video, audio, software and anything else. You pay for the storage and the bandwidth to access them. Costs are very low. Altium uses S3 to store many things including training videos and software builds. When Altium releases a new build, tens of thousands of customers need to be able to get the 1.8GB file very quickly and this works well for that. Storage reliability comes in two levels – the highest provides a 99.999999999% (11 nines) probability of an object not being lost.
  • Amazon Cloudfront provides a perimeter caching facility – files stored in S3 are distributed to local nodes around the world so that people can get access to the files quickly. There is an option to provide files as delivered via streaming.
  • Amazon EC2 (Elastic Compute Cloud) provides access to virtual computers you can buy by the hour. The computers come in a number of hardware configurations ranging from low-end single processor machines through to big boxes with lots of processors and 68GB RAM. Machines come in Linux and Windows flavours. You can also get the machines preconfigured with certain software packages, and the price of those (if they are chargeable) is built into the rental. For example you can get a machine with MS SQL Server in different flavours ranging from free to the Enterprise level. Machines can be imaged so that you can take them off line or replicate them very quickly. EC2 instances can be tied to Elastic Block Store (EBS) volumes, which provide persistent block storage and can be snapshotted to S3.
  • Amazon RDS is a special case of EC2 that comes with an embedded MySQL database and a range of value added facilities, including automatic backups and the ability to achieve failover replication into an alternative hardware partition in case the main server goes down.
  • Amazon SimpleDB is a lightning fast schema-less string-based database consisting of items with many named attribute value pairs. You can, for instance, store a customer record with values for Name, Address, Phone etc, but you can store anything you like. If you want to store Favorite Color for one customer,  you can.  You can even store multiple values for the one attribute.
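
The SimpleDB data model described in the last bullet can be pictured with a small in-memory sketch. This is not the SimpleDB API. The class `SchemalessStore` and its methods are hypothetical, and it only mirrors the idea of items holding named attributes with one or more string values.

```python
from collections import defaultdict


class SchemalessStore:
    """Items keyed by name; each attribute holds a set of string values."""

    def __init__(self):
        self._items = defaultdict(lambda: defaultdict(set))

    def put(self, item_name, attribute, value):
        """Add a value to an attribute; any attribute name is allowed."""
        self._items[item_name][attribute].add(str(value))

    def get(self, item_name):
        """Return the item's attributes as {name: sorted list of values}."""
        attrs = self._items.get(item_name, {})
        return {attr: sorted(vals) for attr, vals in attrs.items()}

    def select(self, attribute, value):
        """Names of items that have this value for this attribute."""
        return sorted(
            name for name, attrs in self._items.items()
            if str(value) in attrs.get(attribute, ())
        )
```

One customer can carry a `Favorite Color` attribute while another has none, and a `Phone` attribute can hold two numbers at once, without any schema change.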

There are a range of other facilities, but I will discuss these in a later post where I talk about facilities to help you achieve elastic scaling (up and down).
