Preparing for the Big Data Revolution
It is no accident that we have recently seen a surge in the amount of interest in big data. Businesses are faced with unprecedented opportunities to understand their customers, achieve efficiencies and predict future trends thanks to the convergence of a number of technologies.
Businesses need to take every opportunity to store everything they can. Lost data represents lost opportunities to understand customer behaviour and interests, drivers for efficiency and industry trends.
A perfect storm
Data storage costs have fallen dramatically. For instance, in 1956 IBM released the first hard disk drive, the RAMAC 305. It allowed the user to store five megabytes of data at a cost of $50,000 – that’s around $435,000 in today’s dollars. In comparison, a four-terabyte drive today can fit in your hand and costs around $180. If you were to build the four-terabyte drive using 1956 technology, it would cost $350 billion and would take up a floor area of 1600 km2 – 2.5 times the area of Singapore. Also, 10-megabyte personal hard drives were advertised circa 1981 for $3398 – that’s $11,000 today, or $4.4 billion for four terabytes.
Gordon Moore’s prediction in 1965 that processing capacity doubles approximately every two years has proved astoundingly accurate. Yet the amount of data we can generate has far outstripped even this exponential growth rate. Data capture has evolved from requiring specialised engineers, then specialised clerical staff, to the point where the interactive web allowed people to capture their own data. While this was a revolutionary step forward in the amount of data we had at our disposal, it pales before the most recent step: the ‘Internet of Things’, which has opened the door for machines to automatically capture huge amounts of data, resulting in a veritable explosion of data, way outstripping Moore’s Law. The result: the data load became too much for our computers, so we simply threw a lot away or stopped looking for new data to store.
With the price of storage decreasing sharply, the economies of storage have meant we can afford to capture more data: it has become increasingly important to find new ways to process all the data being stored at the petabyte scale. A number of technologies have emerged to do this.
Pets versus cattle
Traditionally computer servers were all-important – they were treated like pets. Each server was named and maintained with great attention to ensure that everything was performing as expected. After all, when a server failed, bad things would happen. Under the new model, servers are more like cattle: they are expendable, easily replaced. Parallel processing technologies have superseded monolithic approaches and allow us to take advantage of using many low-cost machines rather than increasingly more powerful central servers.
Hadoop is one project that has emerged to handle very large data sets using the cattle approach. Hadoop uses a ‘divide and conquer’ approach, which enables extremely large workloads to be distributed across multiple computers, with the results brought back together for aggregation once each intermediate step has been performed. To illustrate Hadoop: imagine having a deck of cards and someone asks you to locate the Jack of Diamonds. Under a traditional approach you have to search through the cards until you locate the card. With Hadoop, you can effectively give one card each to 52 people, or four cards each to 13 people, and ask who has the Jack of Diamonds. Much faster and much simpler when complex processes can be broken into manageable steps.
NoSQL, which was intended to mean “not only SQL”, is a collection of database technologies designed to handle large volumes of data – typically with less structure required than in a typical relational database like SQL Server or MySQL. Databases like this are designed to scale out to multiple machines, whereas traditional relational databases are more suited to scaling up on single bigger servers. NoSQL databases can handle semi-structured data; for example, if you need to capture multiple values of one type or obscure values for one person. In a traditional database, the structure of the database is typically more rigid. NoSQL databases are great for handling large workloads but they are typically not designed to handle atomic transactions: relational SQL databases are better designed for workloads where you have to guarantee that all changes are made to the database at the same time, or no changes are made.
Network science studies the way relationships between nodes develop and behave in complex networks. Network concepts apply in many scenarios; examples include computer networks, telecommunications networks, airports or social networks. Given a randomly growing network, some nodes emerge as the most significant and, like gravity, continue to attract additional connections from new nodes. For example, some airports develop into significant hubs while others are left behind. As an airport grows, with more connections and flights, there are increasingly compelling reasons why new airlines will decide to fly to that airport. Likewise, in social networks, some people are far more influential either due to the number of associations they develop or because of the effectiveness of their communication skills or powers of persuasion.
Big data can help us to identify the important nodes in any contextual network. Games console companies have identified the most popular children in the playground and given them a free console on the basis that they will have a lot of influence over their friends. Epidemiologists can identify significant factors in the spread of diseases by looking at the significant nodes and then take steps to prevent further contamination or plan for contagion. Similarly, marketers can use the same approaches to figure out what is more likely to ‘go viral’.
Big data assists businesses to gain a better understanding of customers, treating each customer as an individual – the so-called marketing segment of one. Understanding what moves customers can build strong brand loyalty and evoke an emotional response that can be very powerful. Imagine an airline that recognises that a particular passenger travels from A to B every Monday to Thursday. However, if that passenger plans to stay in B for two weeks, imagine how much loyalty could be generated by offering them a free flight over the weekend to C, a discounted flight for their spouse from A to C, and a discounted hire car and room for the weekend away together.
Digital body language and buying habits can lead online retailers to be able to make astute decisions about what product to offer customers. Target was able to identify pregnant customers very early by their shopping patterns: customers buying certain combinations of cosmetics, magazines, clothes would go on to buy certain maternity products months later.
Big data can be used to drive efficiencies in a business. The freight company UPS, for example, was able to save almost 32 million litres of fuel and shave 147 million km off the distance its trucks travelled in 2011 by placing sensors throughout the trucks. As a side benefit, they learned that the short battery life of their trucks was due to the drivers leaving the headlights on.
By analysing customer relationships, T-Mobile was able to mitigate the risk of a domino effect when one customer decided to leave its service. It did this by identifying the customers who were most closely related digitally to the person churning and making a very attractive offer to those people, preventing the churn from spreading. Further, by analysing people’s billing, call dropout rates and public comments, they were able to act in advance to reduce churn by 50% in a quarter.
CERN conducts physics experiments at the Large Hadron Collider involving sending 3.5 trillion electron volts in each direction around an underground ring, resulting in particle collisions that provide an understanding of the basic building blocks of matter. The Higgs-Boson was proven by analysing the data that was generated in smashing the particles together. 15,000 servers are used to analyse the one petabyte of data that is generated per second and 20 gigabytes is actually stored. This is orchestrated using cloud techniques built on OpenStack and designed and supported by Rackspace.
We have reached a point where it is now better to start storing everything today so that we have a business case for analytical tools tomorrow. Once we start getting used to the idea that everything is available to us, we will find new ways to think about how we leverage our information. The businesses that succeed in the future will be those that constantly look for ways to mine the information they have gleaned.
[This article has been slightly modified from an article I wrote that was previously published in Technology Decisions magazine.]