Big Data Is Not All the Data

Feb 26, 2014

By Ashish Thusoo

Why are we interested in data? Data is collected, stored, and processed to remove uncertainty. Such a data pipeline generates information, that is, reduced and contextualized data, to induce a change in a system. In its earliest form this was likely the observation of the seasons, passed on across generations for hunting and later farming. Modern incarnations of data pipelines, using computers, still address weather forecasting as well as new problems like advertisement bidding or financial transactions.

The popular “The Hitchhiker's Guide to the Galaxy” outlined, in an entertaining fashion, the core question of big data, to find the “answer to the Ultimate Question of Life, The Universe, and Everything.” The solution in the book, without giving away too much if you haven’t read it, is to build a planet-size machine to compute the answer over millions of years.

Today, big data describes our development along this path to answer all the questions by collecting and interpreting all the data. Mankind has been collecting data and processing it since we started recording it, first in orally communicated stories and then by collecting facts with symbols, which we started to process with machines only recently in our history.

The Essence of Big Data

Processing data with computers has been around for decades, which is an eternity in modern technological development cycles. Big data is therefore sometimes likened to an old story with new marketing. However, it denotes a change that has become apparent only recently and is not easily resolved.

Before the rapid expansion of knowledge, in the age of the polymath, all the data collected, stored, and processed was small enough to be mastered by a single person. The introduction of machines like the printing press accelerated and widened the data pipelines. The pipelines always worked as a funnel, with the data ingress being larger than the information egress. Importantly, though, technology has scaled along the funnel in a non-uniform fashion, adding a technical gap on top of the natural difference between the start and the end of the funnel. This widening gap is the essence of big data.

Three Technological Developments Accelerating Big Data

The technological developments accelerating big data are threefold. The well-known Moore’s Law describes the observation of exponential growth in the number of transistors in semiconductor processing units, which relates to the processing step of the data pipeline. Less known is Kryder’s Law, which describes a growth of non-volatile storage capacity that is faster than the growth in processing power. Lastly, the collection of data is and will be growing at a speed that is hard to quantify today. The Internet of Things, a world in which countless artificial physical objects collect data, is gradually becoming reality. Ubiquitous connectivity, multiple electronic devices per person with numerous sensors, and the emergence of inexpensive, independent combinations of electronic sensors and chips foretell a future of abundant data.

Big data is the consequence of this development, where we can progressively collect more data than we can store, and store more data than we can process. Technology will advance and we may address some of the challenges; however, the basic dynamics will remain. If we could somehow in the future measure the details of every atom of our planet to the limit of the uncertainty principle, where would we store that data? Additionally, we would want to add time as a dimension. Imagine collecting a planet-size time series data set. How and where could it be stored and analyzed?

The answer is that not all data is equal and not all answers are worth computing. There is a diminishing return with increasing detail of the data collected. For example, for weather forecasting, the improvements achievable from collecting data for every square meter rather than every square kilometer, and for every second rather than every day, are tremendous. Conversely, the improvements from increasing the resolution further to every square millimeter and every millisecond are probably minuscule.
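To put rough numbers on the weather example, the short Python sketch below counts how many measurements per day a single square kilometer would produce at each of the three resolutions mentioned. The function name and the idea of counting measurements per square kilometer are illustrative assumptions; only the unit conversions are fixed facts.

```python
# Illustrative arithmetic only: measurements per square kilometer per day
# at the three resolutions from the weather example.

SQUARE_METERS_PER_KM2 = 10**6        # 1 km = 1,000 m
SQUARE_MILLIMETERS_PER_KM2 = 10**12  # 1 km = 1,000,000 mm
SECONDS_PER_DAY = 24 * 60 * 60       # 86,400
MILLISECONDS_PER_DAY = SECONDS_PER_DAY * 1000


def measurements_per_km2_per_day(cells_per_km2, samples_per_day):
    """Hypothetical helper: one sample per grid cell per time step."""
    return cells_per_km2 * samples_per_day


coarse = measurements_per_km2_per_day(1, 1)
fine = measurements_per_km2_per_day(SQUARE_METERS_PER_KM2, SECONDS_PER_DAY)
ultra = measurements_per_km2_per_day(SQUARE_MILLIMETERS_PER_KM2, MILLISECONDS_PER_DAY)

print(f"km2 / day         : {coarse:,}")
print(f"m2 / second       : {fine:,}")
print(f"mm2 / millisecond : {ultra:,}")
print(f"second step multiplies the volume by {ultra // fine:,}")
```

The second refinement multiplies the data volume by another factor of a billion, while, as argued above, the gain in forecast quality is probably small.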

Solving the Big Data Challenge Starts by Changing Our Mind-Set

The lesson for big data is that the volume of the data should not blind us. Solving the big data challenge starts by changing our mind-set: collecting only potentially valuable data, and increasingly dropping, aggregating, or filtering data at the earliest possible point of the pipeline, i.e., at the sensor, data mining, or collection point. This narrows the storage and processing requirements dramatically, hopefully to a point where advances in fields such as cloud computing can keep up with the relentless increase in data.
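As a minimal sketch of what dropping, filtering, and aggregating at the collection point could look like, the Python generator below discards implausible sensor readings and forwards only one summary per window of valid readings. The function name, the window size, and the plausible temperature range are all assumptions chosen for illustration, not part of any particular system.

```python
from statistics import mean


def aggregate_at_source(readings, window=60, valid_range=(-90.0, 60.0)):
    """Yield one (mean, min, max) summary per `window` valid readings.

    Hypothetical example: `valid_range` is an assumed plausible range
    for air temperature in degrees Celsius.
    """
    low, high = valid_range
    buffer = []
    for value in readings:
        if not (low <= value <= high):
            continue  # filter: drop implausible readings right away
        buffer.append(value)
        if len(buffer) == window:
            # aggregate: forward a single summary instead of raw values
            yield (mean(buffer), min(buffer), max(buffer))
            buffer.clear()
    if buffer:
        # flush the last, partially filled window
        yield (mean(buffer), min(buffer), max(buffer))


raw = [20.0, 22.0, 999.0, 24.0, 26.0, 21.0]  # 999.0 simulates a sensor glitch
summaries = list(aggregate_at_source(raw, window=2))
print(summaries)
```

With a window of 60, every 60 valid readings leave the sensor as a single three-value summary, so the downstream storage and processing stages see only a small fraction of what was measured.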

A side effect of big data is the consolidation of organizations around data and processing ability. This can take the form of open data, when data is freely available and separate data collection and storage do not provide a lasting competitive advantage and could instead lead to a cost disadvantage. This may threaten search engine business models such as those of Google Search, Microsoft Bing, and Yahoo. They set themselves apart with technological scale and data science, both of which are becoming increasingly accessible and affordable to a wider range of organizations.

Facebook is the other example, in which the data is generated and locked away within one platform and is unavailable to competitors, creating a lasting advantage. This is a difficult problem for competitors to address, especially since data has gravity, i.e., data attracts more data. It is much more interesting for data to be shared and created where it can be contextualized with a maximum network effect and return for users. The surprising stumbling block to giving the “answer to the Ultimate Question of Life, The Universe, and Everything” in the end may turn out to be data monopolies.

About the author:

Ashish Thusoo is the CEO and co-founder of Qubole, a pioneering big data startup. He is also the co-creator of Apache Hive and served as the project's founding Vice President at the Apache Software Foundation. Before Qubole, Thusoo ran the Data Infrastructure team at Facebook, leading the team in the creation of one of the largest data processing and analytics platforms in the world. Thusoo has a Bachelor's degree in CS from IIT-Delhi and a Master's degree in CS from the University of Wisconsin-Madison.

Image courtesy of Shutterstock