Deep Learning at Big Data Scale

Many people are unsure of the differences between deep learning, machine learning, and artificial intelligence. Generally speaking, and with minimal debate, artificial intelligence is best categorized as that which we have not yet figured out how to solve, while machine learning is the practical application of known techniques to problems such as anomaly detection.

Deep learning is a subset of machine learning with a focus on neural networks and has been popularized by successes in image recognition. Certain algorithms from both categories can leverage a graphics processing unit (GPU) to solve problems in a massively parallel fashion, improving performance on these complex workloads over that of a traditional CPU-only approach.

Machine learning and deep learning are not specific to a language or a technology stack. The only real requirement they have is access to data from which to learn, and then more data to which those learnings may be applied. Since data access is the only real requirement, we should concern ourselves first and foremost with how these tools access a file system.

Even though Hadoop provides a file system (HDFS), deep learning has failed to come to fruition on HDFS because HDFS is not a standard file system. While big data at scale is closely associated with Hadoop, it has become rather obvious over the more than 10 years Hadoop has been around that it wasn't built to solve most business problems. It is important to understand the technologies available to solve problems, but it is critical to your success to understand their limitations in order to deliver business value.

HDFS tends to be leveraged most often within the Java ecosystem. While Java is excellent for enterprise systems, it is not close to being considered the greatest language for machine learning. Python, while less than optimal for enterprise applications, has become the dominant language in the machine learning space, and it does not have native bindings to HDFS. Spark, conversely, has had some success because it can depend upon the HDFS API, but Spark is only one toolset, with a currently limited scope in deep learning. The most prominent Spark library for deep learning comes via DeepLearning4J.

Moving beyond Spark is where NFS and POSIX come into the equation. These standards have been around for 30-plus years, and every programming language supports them out of the box. This means that every language, whether it is popular for machine learning or not, can leverage a standard file system.
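This is the practical payoff of POSIX compliance: reading training data from a local disk or an NFS mount requires nothing beyond a language's standard library, whereas HDFS requires a dedicated client or API bindings. A minimal sketch in Python, using only the standard library (the directory here is a throwaway stand-in for what would be an NFS mount in practice):

```python
import os
import tempfile

def load_training_files(data_dir):
    """Read every file under data_dir -- any POSIX path works the same way,
    whether it is local disk or an NFS mount."""
    samples = []
    for name in sorted(os.listdir(data_dir)):
        path = os.path.join(data_dir, name)
        if os.path.isfile(path):
            with open(path, "rb") as f:
                samples.append(f.read())
    return samples

# Demonstration with a temporary directory standing in for a mounted share.
with tempfile.TemporaryDirectory() as mount:
    for i in range(3):
        with open(os.path.join(mount, f"sample_{i}.bin"), "wb") as f:
            f.write(bytes([i]) * 4)
    data = load_training_files(mount)
    print(len(data))  # 3
```

The same pattern works unchanged in Java, C, or any other language, which is precisely the portability argument the standards give you.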

Hadoop lives in one world while machine learning lives in another. Some of the most prominent technologies with a focus on deep learning are TensorFlow, Caffe, Theano, CNTK, MXNet, Torch, and Paddle. Use case specialization is becoming the biggest consideration in the adoption of these tools. None of them is the dominant winner in the space, but they all have one important detail in common: They depend upon a standard file system; they do not work with HDFS.

What is most interesting when talking about deep learning is the dependence upon both a CPU and a GPU. Data meant to be used by a GPU usually requires a transformation to occur first, and moving data between CPU servers and GPU servers is a major step in this process. It is critical to ensure the final data needed by the GPU is collocated on the server with the GPU; otherwise, a separate data movement process will need to be put in place, adding significant latency and overhead to the learning process.
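The transform-then-collocate step described above can be sketched as a small staging routine: transform records on the CPU side, then write the results to a directory local to the GPU server so training never waits on the network. The function names and the trivial `transform` below are hypothetical placeholders, not part of any real pipeline:

```python
import os

def transform(record):
    # Stand-in for the real CPU-side work: feature extraction,
    # decoding, tensor encoding, etc.
    return record.upper()

def stage_for_gpu(records, gpu_local_dir):
    """Transform records and write them to a directory collocated
    with the GPU, so the training process reads from local disk."""
    os.makedirs(gpu_local_dir, exist_ok=True)
    paths = []
    for i, record in enumerate(records):
        path = os.path.join(gpu_local_dir, f"batch_{i}.txt")
        with open(path, "w") as f:
            f.write(transform(record))
        paths.append(path)
    return paths
```

The design point is that staging is an explicit step in the pipeline rather than an implicit network read at training time, which is exactly where the latency and overhead the article warns about would otherwise appear.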

Resource management of these tools is also radically different from the Hadoop world. Where Hadoop focuses on YARN for its workloads, the rest of the enterprise, including these deep learning tools, relies upon tools such as Kubernetes and Mesos for service orchestration and resource utilization.

Machine learning, and even more so deep learning, hungers for large-scale streaming datasets and is not really at home on Hadoop, thanks to its lack of standards support. That's OK, though: knowing the limitations of these technologies is what matters, and it will save you a tremendous amount of time when solving problems within your business.

