Apache Spark offers a solid foundation for machine learning. Other tools and packages can help you dive into deep learning, but Spark offers a consistent approach to data access and therefore makes machine learning on Spark easier, because you need less plumbing.
Let's take a quick trip through the ages, borrowing imagery from the fictional worlds of Dungeons & Dragons, Star Trek, and the Terminator series to tell the story of the early days of clusters and Hadoop through to the future of machine learning. However, so that we don't get stuck in a philosophical world, we will also write a machine learning application using Java.
Sparking with Apache Spark
We'll begin with a tale to help put the evolution of machine learning in context. Once upon a time, there were mages who wanted to process a lot of data. They thought traditional databases would not be appropriate. No, the conventional database management systems (those famous RDBMSs) were not suitable: they did not understand the mages' files and required structured data. Even more advanced databases, such as Informix with its support for objects and virtual tables, did not help our wizards.
The mages of Yahoo!-land needed something more powerful, but the solution had to work on conventional servers, not the octo-CPU hydras that were so rare and costly. The mages needed many potions, not a few powerful ones. This is what these magicians with amazing powers called "scale-out."
Eventually, the magi could store and process impressive quantities of grimoires, books, recipes, diagrams, and other witchcraft using the wizardry called "Hadoop," whose name came from an ancient children's artifact (but we are going too deep into the legend). After Hadoop appeared in 2006, it enabled the sorting of 1.8 TB of data on 188 nodes in less than two days.
But that was not enough. Unfortunately, Hadoop's diabolical powers are limited. Hadoop is essentially confined to its file system, HDFS, and a single processing model, MapReduce.
Finally, the great scholars of the University of California, Berkeley, had the idea of leveraging memory and opening the gates to other algorithms. Spark was born.
Make the Machines Work
Since the dawn of time, man has tried to make other men, or machines, work for him. This created greed, jealousy, slavery, and wars. Ultimately, the encounter with the Vulcans made possible the creation of the United Federation of Planets, which now ensures peace in the galaxy (well, there are always one or two Klingons to enliven our daily routine).
But, prior to achieving this, we had to go through several stages of artificial intelligence. The first steps were based on programming languages like Prolog and Lisp. Obviously, this dates to way before the last World War of 2053. At the beginning of the 21st century, processing capacity continued to grow per Moore's law, but the growth of data became exponential with the arrival of the first smartphones (the ancestors of our communicators) and the Internet, which now allows teleportation, and so on.
So, on one hand we have linear growth of computing power and, on the other hand, exponential growth of data. At the time, the stopgap solution was to build computing clusters (a fancy name for groups of servers). Spark built on this technology by offering a unification of its APIs (application programming interfaces). Unifying this level of access allowed for many advancements, even in deep learning. Queries could be written in SQL (Structured Query Language), or rather a dialect of SQL specific to Spark; graphs could be managed via GraphX; and stream processing could be done via Spark Streaming. By early 2017, the Spark layers looked like this:
Spark itself was written in the Scala programming language. Its APIs are available in Java, Python, Scala, and, as of version 2.1.0, R; one must love all this work for data scientists.
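Before we leave the story, here is a minimal sketch of what that unified entry point looks like from Java. It assumes Spark 2.x is on the classpath and uses a hypothetical books.csv file with title and year columns; the single SparkSession gives access to DataFrames and to Spark's SQL dialect.

import org.apache.spark.sql.Dataset;
import org.apache.spark.sql.Row;
import org.apache.spark.sql.SparkSession;

public class FirstSparkApp {
  public static void main(String[] args) {
    // One entry point to the whole unified API
    SparkSession spark = SparkSession.builder()
        .appName("First Spark app")
        .master("local[*]") // run locally on all cores; point to a cluster URL in production
        .getOrCreate();

    // Load a hypothetical CSV file of books into a DataFrame
    Dataset<Row> books = spark.read()
        .option("header", "true")
        .option("inferSchema", "true")
        .csv("data/books.csv");

    // Query the same data with Spark's dialect of SQL
    books.createOrReplaceTempView("books");
    Dataset<Row> oldest = spark.sql(
        "SELECT title, year FROM books ORDER BY year ASC LIMIT 10");

    oldest.show();
    spark.stop();
  }
}

The same session object would also be the starting point for GraphX or Spark Streaming work; that is the whole point of the unified layers.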
Machine, Work! Earn Your Living!
It was through deep learning that we could create the intelligence that powered the first androids. Unfortunately, they became the T-800s that threaten us today.