Data Generation is Growing – Start Storing Everything
The world is generating more data than ever before.
In 2013 IBM reported that 90% of the world’s data had been generated in the last two years, and this trend is continuing. So what is causing this explosion of data?
Several major factors have contributed. Chief among them is the changing way we interact with computers. In the first generation of data capture, rigorously prepared data was wired physically into the machine by engineers. In the second, professionally trained computer operators fed data into the machines on our behalf, and software was designed to enforce constraints so the machine knew how to process it. The third generation, Web 2.0 and the interactive Web, saw everyone entering their own data: people were free to capture whatever they wanted, and the amount of data captured exploded. Now we are in a fourth generation of data capture: the Internet of Things and machine-to-machine communication. Gartner has projected that there will be 26 billion connected devices by 2020, and the number of sensors will be measured in the trillions.
The price of storing all the data we generate has fallen dramatically. The first hard disk drive, IBM's RAMAC 305, was launched in 1956 at a cost of $50,000, or $435,000 in today's dollars, and it stored 5 megabytes. To put that in context, it would cost $350 billion today to store 4 terabytes of data using that technology, and it would take up a floor area 2.5 times that of Singapore. Today, of course, a 4-terabyte drive can be purchased for little more than a hundred dollars and fits in the palm of a hand.
With storage costs largely resolved, the biggest challenge has been data processing and analysis. There are two issues here. Firstly, traditional database systems are designed to run on one machine: a larger database means scaling up to a bigger machine, but the amount of data now available has outstripped our capacity to process it using traditional methods on a single computer; the computers are simply not powerful enough. Secondly, data has become more structurally complex, and traditional database designs, which rely on the structure of the data being predefined during the design phase, can no longer cope with the flexibility required when people and systems evolve to use data in unpredictable ways.
Until the rise of next-generation database systems like MongoDB, these limitations meant a lot of data was simply thrown away: what is the point of storing something if you cannot make sense of it? MongoDB has helped change that. It is inherently designed to work across many computers, enabling it to handle vastly larger amounts of data. Furthermore, the structure of the data does not have to be defined in advance: MongoDB allows for the storage of anything, and patterns and sense can still be gleaned regardless of the structure.
For example, in a database of customers, we can store information about each customer that is highly specific to them as an individual: their pets, their hobbies and special interests, places they have visited, books they have read, companies where they have worked. We may not yet know what insight we can glean from it, but with MongoDB we can store it today and make sense of it in the future.
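To make that concrete, here is a minimal sketch using Python and the PyMongo driver; the connection string, database, collection and field names are all hypothetical choices for illustration. Two customers with completely different attributes sit side by side in the same collection, with no schema defined in advance.

```python
from pymongo import MongoClient

# Hypothetical connection and names; adjust for your own deployment.
client = MongoClient("mongodb://localhost:27017")
customers = client["crm"]["customers"]

# Two documents with different shapes stored in the same collection:
# no table design or schema migration is needed beforehand.
customers.insert_many([
    {
        "name": "Alice",
        "pets": ["border collie"],
        "hobbies": ["rock climbing", "chess"],
        "places_visited": ["Singapore", "Reykjavik"],
    },
    {
        "name": "Bob",
        "books_read": ["Moby-Dick"],
        "employers": [{"company": "Acme", "years": 4}],
    },
])

# We can still query across whatever structure happens to be there.
print(customers.count_documents({"hobbies": "chess"}))  # 1
```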
Given that tools now exist to handle less structured and very large datasets, and that storage costs have fallen so dramatically, it stands to reason to start storing everything. Businesses of the past were valued on their brand awareness; businesses of the future will increasingly be valued on how well they can use the data at their disposal to understand each customer, improve their efficiency and responsiveness, and make the best decisions.
Storing data today for use tomorrow makes a great deal of business sense. Once data has been collected, questions about how to use it naturally follow. Without the data, people will not even think of the questions they could be asking, such as:
- What would be the optimum price for this product?
- How soon should we follow up a customer after they have purchased item X?
- What products act as good loss leaders?
- What are the signs that indicate a customer will churn?
- When a customer churns, who are the people most at risk of following them?
- What impact does the weather have on purchasing patterns?
MongoDB is an excellent solution for storing this data. It has the flexibility to cope with both unstructured and structured data, it can scale to many petabytes with full replication, and it has a great deal of support for analytics. Even if there is no current interest in analysing data, a business case for analytical tools will be much easier to make in the future if there is a large reservoir of data to draw on, rather than just an idea to grow from scratch.
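As a small illustration of that analytics support, here is a sketch of a MongoDB aggregation pipeline, written with PyMongo, that tackles the last question in the list above: how weather affects purchasing. The connection string, collection and field names (orders, weather, total) are hypothetical and would need to match your own data.

```python
from pymongo import MongoClient

# Hypothetical connection, database and collection names.
client = MongoClient("mongodb://localhost:27017")
orders = client["shop"]["orders"]

# Aggregation pipeline: average order value grouped by the weather
# recorded at the time of purchase, largest first.
pipeline = [
    {"$group": {
        "_id": "$weather",                      # e.g. "rain", "sun", "snow"
        "avg_order_value": {"$avg": "$total"},
        "orders": {"$sum": 1},
    }},
    {"$sort": {"avg_order_value": -1}},
]
for row in orders.aggregate(pipeline):
    print(row)
```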
The businesses that win in the future will be those that know how to harvest all the data available to them. The sooner they start storing it and practising how to glean the most from it, the sooner they can learn to pre-empt their customers' needs, pre-empt the marketplace, cope with the veritable deluge in a highly responsive manner and leave the less-informed competition behind.
Preparing for the Big Data Revolution
It is no accident that we have recently seen a surge of interest in big data. Thanks to the convergence of a number of technologies, businesses are faced with unprecedented opportunities to understand their customers, achieve efficiencies and predict future trends.
Businesses need to take every opportunity to store everything they can. Lost data represents lost opportunities to understand customer behaviour and interests, drivers for efficiency and industry trends.
A perfect storm
Data storage costs have fallen dramatically. For instance, in 1956 IBM released the first hard disk drive, the RAMAC 305. It allowed the user to store five megabytes of data at a cost of $50,000, around $435,000 in today's dollars. In comparison, a four-terabyte drive today can fit in your hand and costs around $180. If you were to build the four-terabyte drive using 1956 technology, it would cost $350 billion and would take up a floor area of 1,600 km², 2.5 times the area of Singapore. Also, 10-megabyte personal hard drives were advertised circa 1981 for $3,398: about $11,000 today, or $4.4 billion for four terabytes.
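The comparison rests on simple arithmetic, and it can be useful to see it written out. Here is a minimal sketch, assuming 1 TB = 1,000,000 MB and using the inflation-adjusted prices quoted above.

```python
# Inflation-adjusted cost per megabyte, from the figures quoted above.
cost_per_mb = {
    "1956 (RAMAC 305)": 435_000 / 5,        # $87,000 per MB
    "1981 (10 MB PC drive)": 11_000 / 10,   # $1,100 per MB
    "today (4 TB drive)": 180 / 4_000_000,  # about $0.000045 per MB
}

FOUR_TB_IN_MB = 4_000_000  # treating 1 TB as 1,000,000 MB

for era, per_mb in cost_per_mb.items():
    print(f"4 TB at {era} prices: ${per_mb * FOUR_TB_IN_MB:,.0f}")
# Prints roughly $348 billion for 1956, $4.4 billion for 1981 and $180 today.
```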
Gordon Moore's 1965 prediction that processing capacity doubles approximately every two years has proved astoundingly accurate. Yet the amount of data we can generate has far outstripped even this exponential growth rate. Data capture has evolved from requiring specialised engineers, then specialised clerical staff, to the point where the interactive web allowed people to capture their own data. That was a revolutionary step forward in the amount of data at our disposal, but it pales before the most recent step: the 'Internet of Things', which has opened the door for machines to capture huge amounts of data automatically, resulting in a veritable explosion of data that far outstrips Moore's Law. The result: the data load became too much for our computers, so we simply threw a lot away or stopped looking for new data to store.
With the price of storage falling sharply, we can afford to capture much more data, so it has become increasingly important to find new ways to process what is being stored at the petabyte scale. A number of technologies have emerged to do this.
Pets versus cattle
Traditionally, computer servers were all-important: they were treated like pets. Each server was named and maintained with great attention to ensure that everything was performing as expected. After all, when a server failed, bad things happened. Under the new model, servers are more like cattle: they are expendable and easily replaced. Parallel processing technologies have superseded monolithic approaches, allowing us to take advantage of many low-cost machines rather than ever more powerful central servers.
Hadoop is one project that has emerged to handle very large data sets using the cattle approach. Hadoop uses a 'divide and conquer' approach that enables extremely large workloads to be distributed across multiple computers, with the results brought back together for aggregation once each intermediate step has been performed. To illustrate: imagine you have a deck of cards and someone asks you to locate the Jack of Diamonds. Under a traditional approach, you search through the cards until you find it. With Hadoop, you can effectively give one card each to 52 people, or four cards each to 13 people, and ask who has the Jack of Diamonds. This is much faster and much simpler when a complex process can be broken into manageable steps.
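The card analogy can be sketched in a few lines of Python using the standard multiprocessing module. This is only a toy illustration of the divide-and-conquer idea, not Hadoop itself, and the 13-worker split mirrors the example above.

```python
from multiprocessing import Pool

# A toy "deck": 52 cards as strings.
RANKS = ["A", "2", "3", "4", "5", "6", "7", "8", "9", "10", "J", "Q", "K"]
SUITS = ["Clubs", "Diamonds", "Hearts", "Spades"]
DECK = [f"{rank} of {suit}" for suit in SUITS for rank in RANKS]

def search_chunk(chunk):
    """'Map' step: each worker scans only its own handful of cards."""
    return [card for card in chunk if card == "J of Diamonds"]

if __name__ == "__main__":
    # Divide: deal four cards each to 13 workers.
    chunks = [DECK[i:i + 4] for i in range(0, len(DECK), 4)]
    with Pool(processes=13) as pool:
        partial_results = pool.map(search_chunk, chunks)
    # Bring the partial answers back together and aggregate.
    found = [card for result in partial_results for card in result]
    print(found)  # ['J of Diamonds']
```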
NoSQL, a term intended to mean "not only SQL", describes a collection of database technologies designed to handle large volumes of data, typically with less predefined structure than a relational database such as SQL Server or MySQL requires. Databases like this are designed to scale out across multiple machines, whereas traditional relational databases are more suited to scaling up on single, bigger servers. NoSQL databases can handle semi-structured data, for example when you need to capture multiple values of one type, or unusual values for a single person; in a traditional database, the structure is typically far more rigid. NoSQL databases are great for handling large workloads, but they are typically not designed around atomic transactions: relational SQL databases are better suited to workloads where you must guarantee that either all changes are made to the database at the same time or none are.
Network science
Network science studies the way relationships between nodes develop and behave in complex networks. Network concepts apply in many scenarios, including computer networks, telecommunications networks, airport networks and social networks. In a randomly growing network, some nodes emerge as the most significant and, like gravity, continue to attract additional connections from new nodes. For example, some airports develop into significant hubs while others are left behind: as an airport grows, with more connections and flights, there are increasingly compelling reasons for new airlines to fly there. Likewise, in social networks, some people are far more influential, either because of the number of associations they develop or because of the effectiveness of their communication skills or powers of persuasion.
Big data can help us identify the important nodes in any contextual network. Games console companies have identified the most popular children in the playground and given them a free console on the basis that they will have a lot of influence over their friends. Epidemiologists can identify the key nodes in the spread of a disease and then take steps to prevent further contamination or plan for contagion. Similarly, marketers can use the same approaches to work out what is most likely to 'go viral'.
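As a concrete illustration of finding the important nodes, here is a minimal sketch in Python using the networkx library (my choice for the example; the techniques above do not prescribe any particular tool). It builds a small, made-up airport network and ranks the airports by degree centrality, i.e. how connected each hub is.

```python
import networkx as nx

# A small, hypothetical airport network: each edge is a direct route.
routes = [
    ("SIN", "SYD"), ("SIN", "LHR"), ("SIN", "HKG"), ("SIN", "NRT"),
    ("SYD", "AKL"), ("LHR", "JFK"), ("HKG", "NRT"), ("SIN", "JFK"),
]
graph = nx.Graph(routes)

# Degree centrality scores each airport by its share of connections.
centrality = nx.degree_centrality(graph)
for airport, score in sorted(centrality.items(), key=lambda kv: -kv[1]):
    print(f"{airport}: {score:.2f}")
# SIN emerges as the dominant hub: the node most likely to keep
# attracting new connections as the network grows.
```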
Benefits
Big data helps businesses gain a better understanding of customers, treating each customer as an individual: the so-called marketing segment of one. Understanding what moves customers can build strong brand loyalty and evoke an emotional response that can be very powerful. Imagine an airline that recognises that a particular passenger travels from A to B every Monday to Thursday. If that passenger instead plans to stay in B for two weeks, imagine how much loyalty could be generated by offering them a free flight over the weekend to C, a discounted flight for their spouse from A to C, and a discounted hire car and room for the weekend away together.
Digital body language and buying habits can help online retailers make astute decisions about which products to offer customers. Target was able to identify pregnant customers very early from their shopping patterns: customers buying certain combinations of cosmetics, magazines and clothes would go on to buy certain maternity products months later.
Big data can be used to drive efficiencies in a business. The freight company UPS, for example, was able to save almost 32 million litres of fuel and shave 147 million km off the distance its trucks travelled in 2011 by placing sensors throughout the trucks. As a side benefit, they learned that the short battery life of their trucks was due to the drivers leaving the headlights on.
By analysing customer relationships, T-Mobile was able to mitigate the risk of a domino effect when one customer decided to leave its service. It did this by identifying the customers most closely related digitally to the person churning and making a very attractive offer to those people, preventing the churn from spreading. Further, by analysing people's billing, call dropout rates and public comments, it was able to act in advance and reduce churn by 50% in a quarter.
CERN conducts physics experiments at the Large Hadron Collider, accelerating particle beams to 3.5 trillion electron volts in each direction around an underground ring; the resulting particle collisions provide an understanding of the basic building blocks of matter. The existence of the Higgs boson was confirmed by analysing the data generated by smashing these particles together. Some 15,000 servers are used to analyse the one petabyte of data generated per second, of which around 20 gigabytes is actually stored. This is orchestrated using cloud techniques built on OpenStack, designed and supported by Rackspace.
Conclusion
We have reached the point where it is better to start storing everything today so that we have a business case for analytical tools tomorrow. Once we get used to the idea that everything is available to us, we will find new ways to think about how we leverage our information. The businesses that succeed in the future will be those that constantly look for ways to mine the information they have gleaned.
[This article has been slightly modified from an article I wrote that was previously published in Technology Decisions magazine.]