The Fundamentals of Big Data Analytics

It’s widely accepted today that the phrase “big data” implies more than just storing more data. It also means doing more with data. There are arguably too many terms for the techniques of “doing more,” although big data analytics and data science probably come closest. We can sort the various techniques into three broad groups:

  1. Predictive analytics, the class of algorithms that use data from the past to predict the future
  2. Collective intelligence, which uses the inputs from large groups to create seemingly intelligent behavior
  3. Machine learning, in which programs “learn from experience” and refine their algorithms based on new information

These are clearly intersecting techniques: collective intelligence often is predictive, while predictive and collective techniques both involve machine learning.

Predictive algorithms take many forms, but a large proportion build on fundamental mathematical concepts taught in high school. Creating a “line of best fit” between two variables involves a fairly simple computation known as linear regression. Once created, the regression formula can be used to predict the value of one variable based on the other. Regression analysis can be extended to more than two variables (multiple regression), curves (nonlinear regression), and categorical predictions (logistic regression), and related techniques can account for seasonal variation (time series analysis).
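
As a rough illustration of the “line of best fit” idea, the sketch below fits a least-squares line to a handful of points and then uses it to predict a new value. The function names (fit_line, predict) and the spend/sales figures are invented for this example; they are not taken from any particular product or dataset.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Covariance-like and variance-like sums around the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


def predict(slope, intercept, x):
    """Use the fitted line to estimate y for a new x."""
    return slope * x + intercept


# Hypothetical past data: advertising spend versus sales.
spend = [1.0, 2.0, 3.0, 4.0, 5.0]
sales = [3.0, 5.0, 7.0, 9.0, 11.0]

slope, intercept = fit_line(spend, sales)
forecast = predict(slope, intercept, 6.0)  # estimate sales for a spend of 6.0
print(slope, intercept, forecast)
```

Real data is rarely this tidy, so in practice the fitted line only approximates the points and the prediction carries some error.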

Collective intelligence sounds like a complex academic pursuit, but it’s actually something we encounter every day. When Google or another search engine corrects or predicts your searches, it is using the data collected from the billions of other people’s searches that came before yours.
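
A toy version of this idea can be sketched in a few lines. It is an illustration only, not how any real search engine works: it assumes a small, made-up log of past queries and suggests the most popular past query that closely resembles what the user typed. The suggest function and the sample log are hypothetical; difflib.get_close_matches comes from the Python standard library.

```python
from collections import Counter
from difflib import get_close_matches


def suggest(query, query_counts, cutoff=0.8):
    """Suggest the most frequently seen past query that resembles `query`."""
    candidates = get_close_matches(query, list(query_counts), n=5, cutoff=cutoff)
    if not candidates:
        return query  # nothing similar in the log, so leave the query alone
    # Let the crowd decide: prefer the candidate typed most often before.
    return max(candidates, key=lambda q: query_counts[q])


# Hypothetical log of earlier searches by other users.
past_queries = ["weather"] * 5 + ["whether"] * 2 + ["leather"]
query_counts = Counter(past_queries)

print(suggest("wether", query_counts))
```

The "intelligence" here comes entirely from the crowd: the code knows nothing about spelling, only which similar queries other people typed most often.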

Machine learning as a general technique includes most of the algorithms employed by predictive and collective solutions. Whenever a system can adjust its behavior based on new input data, it can be said to have learned.

A supervised machine learning algorithm is one that requires labeled examples in order to build a model. For instance, in the case of spam classification algorithms, human beings are generally required to provide examples of spam and non-spam emails. The spam detector uses these examples, called the training set, to build a model that can distinguish spam from non-spam. The final test of the model is to provide it with some fresh data, a validation set (often called a test set or hold-out set), to see how well it does.
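
The training-then-validation workflow can be sketched with a very small naive Bayes classifier. This is a minimal illustration under stated assumptions, not a production spam filter: the messages are invented, the labels "spam" and "ham" are conventions chosen for the example, and the train and classify functions are hypothetical names.

```python
import math
from collections import Counter

LABELS = ("spam", "ham")


def train(examples):
    """Count words per label from (text, label) training examples."""
    word_counts = {label: Counter() for label in LABELS}
    label_counts = Counter()
    for text, label in examples:
        label_counts[label] += 1
        word_counts[label].update(text.lower().split())
    return word_counts, label_counts


def classify(model, text):
    """Return the label with the highest naive Bayes log-probability."""
    word_counts, label_counts = model
    vocabulary = set().union(*word_counts.values())
    total_docs = sum(label_counts.values())
    best_label, best_score = None, -math.inf
    for label in LABELS:
        total_words = sum(word_counts[label].values())
        score = math.log(label_counts[label] / total_docs)  # prior
        for word in text.lower().split():
            # Add-one (Laplace) smoothing so unseen words do not zero out the score.
            score += math.log(
                (word_counts[label][word] + 1) / (total_words + len(vocabulary))
            )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


# Hypothetical human-labeled training set.
training_set = [
    ("win free money now", "spam"),
    ("free prize claim now", "spam"),
    ("cheap money offer", "spam"),
    ("meeting agenda for monday", "ham"),
    ("lunch on monday", "ham"),
    ("project status report", "ham"),
]

# Fresh, labeled messages the model has not seen, used only to check it.
validation_set = [
    ("claim free money", "spam"),
    ("monday project meeting", "ham"),
]

model = train(training_set)
correct = sum(classify(model, text) == label for text, label in validation_set)
accuracy = correct / len(validation_set)
print(accuracy)
```

With so few examples the accuracy figure means little; the point is the separation between the data used to build the model and the data used to judge it.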

Unsupervised machine learning requires no labeled training set. Clustering algorithms fall into this category, as does the closely related association analysis behind the familiar basket analysis example: if you order three of the four ingredients in a Waldorf salad from Walmart online, the missing ingredient likely will be recommended to you. This is not because Walmart is comparing your order to a recipe book, but because an algorithm has noticed that these four items usually appear together in past orders.
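
The basket example can be approximated by simply counting which items appear together in past orders. The sketch below is a simplified co-occurrence count, not the method any retailer is known to use; the order history, item names, and function names are all assumptions made for illustration.

```python
from collections import Counter
from itertools import combinations


def build_cooccurrence(baskets):
    """Count how often each pair of items appears in the same basket."""
    pair_counts = Counter()
    for basket in baskets:
        for a, b in combinations(sorted(set(basket)), 2):
            pair_counts[(a, b)] += 1
    return pair_counts


def recommend(pair_counts, cart):
    """Return the item that most often accompanies the items already in the cart."""
    cart = set(cart)
    scores = Counter()
    for (a, b), count in pair_counts.items():
        if a in cart and b not in cart:
            scores[b] += count
        elif b in cart and a not in cart:
            scores[a] += count
    return scores.most_common(1)[0][0] if scores else None


# Hypothetical order history; no recipe book or labels are involved.
past_orders = [
    ["apples", "celery", "walnuts", "grapes"],
    ["apples", "celery", "walnuts", "grapes", "milk"],
    ["apples", "celery", "grapes", "walnuts"],
    ["milk", "bread"],
    ["apples", "bread"],
]

pair_counts = build_cooccurrence(past_orders)
print(recommend(pair_counts, ["apples", "celery", "walnuts"]))
```

Nobody told the program what a Waldorf salad is; the recommendation falls out of the patterns in the orders themselves, which is what makes the approach unsupervised.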

Under the hood, there are dozens of algorithms that can be used to perform machine learning. Classification includes techniques such as logistic regression, naive Bayes, decision trees, k-nearest neighbors, and support vector machines. Clustering algorithms include k-means and hierarchical clustering. Because of the very large number of complicated algorithms (and those that just sound complicated), it is hard for even the most experienced data scientist to pick the correct technique for the data at hand. For that reason, it is common to run multiple algorithms on the data and either select the resulting model with the best outcomes or, with ensemble techniques, combine the predictions of several models.
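
To make one of these names less intimidating, here is a minimal sketch of k-means clustering on two-dimensional points. It assumes the caller supplies the starting centroids (real implementations usually choose them randomly or with a smarter seeding step), and the points and function name are invented for the example.

```python
def kmeans(points, centroids, max_iterations=100):
    """Group 2-D points around centroids by repeated assign-and-average steps."""
    clusters = [[] for _ in centroids]
    for _ in range(max_iterations):
        # Assignment step: attach each point to its nearest centroid.
        clusters = [[] for _ in centroids]
        for p in points:
            distances = [(p[0] - c[0]) ** 2 + (p[1] - c[1]) ** 2 for c in centroids]
            clusters[distances.index(min(distances))].append(p)
        # Update step: move each centroid to the mean of its cluster.
        new_centroids = []
        for cluster, old in zip(clusters, centroids):
            if cluster:
                mean_x = sum(x for x, _ in cluster) / len(cluster)
                mean_y = sum(y for _, y in cluster) / len(cluster)
                new_centroids.append((mean_x, mean_y))
            else:
                new_centroids.append(old)  # keep a centroid that attracted no points
        if new_centroids == centroids:
            break  # assignments have stopped changing
        centroids = new_centroids
    return centroids, clusters


# Hypothetical data with two visibly separate groups.
points = [(1.0, 1.0), (1.0, 2.0), (2.0, 1.0), (8.0, 8.0), (9.0, 8.0), (8.0, 9.0)]
initial_centroids = [(1.0, 1.0), (8.0, 8.0)]  # assumed starting guesses

centroids, clusters = kmeans(points, initial_centroids)
print(centroids)
print(clusters)
```

The result depends on the starting centroids, which is one reason practitioners often run an algorithm like this several times and compare the outcomes.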

Big data analytics is indeed a complex field, but if you understand the basic concepts outlined above—such as the difference between supervised and unsupervised learning—you are sure to be ahead of the person who wants to talk data science at your next cocktail party!