Data Operations Problems Created by Deep Learning

Thanks to the dramatic uptick in GPU capabilities, gone are the days when data scientists created and ran models in a one-off manual process. This is good news because the one-off model was typically not optimized to get the best results. The bad news is that with an increase in the total number of models created—including iterations over time—the amount of data used as inputs and generated by the models quickly spirals out of control. The additional bad news is that there are a variety of complexities associated with model, data, job, and workflow management.

The typical starting point for any deep learning application is the data sources, and as the number of sources grows, so does the complexity of the data management problem. Over time, each data source may be expanded with new data, or enriched with metadata or additional data sources. To be clear, data cannot be versioned the way software or deep learning models can with a version control system such as Git. Yet the data must be versioned in lockstep with both the software and the models. It is imperative that the data be versioned over time in order to reproduce past results and to have an explanation of what was done at a given point in time.
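One minimal way to keep data "in lockstep" with code, even without true data versioning, is to fingerprint the dataset and record that fingerprint next to the Git commit that used it. The sketch below is an illustration, not a production tool; the function names and manifest format are assumptions for this example.

```python
import hashlib
import json
from pathlib import Path

def fingerprint_dataset(data_dir: str) -> str:
    """Hash every file's relative path and contents to get a stable dataset version ID."""
    digest = hashlib.sha256()
    for path in sorted(Path(data_dir).rglob("*")):
        if path.is_file():
            digest.update(str(path.relative_to(data_dir)).encode())
            digest.update(path.read_bytes())
    return digest.hexdigest()

def write_manifest(data_dir: str, git_commit: str, out_path: str) -> dict:
    """Record the dataset fingerprint alongside the code version (e.g. `git rev-parse HEAD`)."""
    manifest = {
        "data_version": fingerprint_dataset(data_dir),
        "git_commit": git_commit,
    }
    Path(out_path).write_text(json.dumps(manifest, indent=2))
    return manifest
```

Committing the manifest file to the same repository as the model code ties each result to an exact code version and an exact data state, which is the core of reproducibility.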

If the data is not stored where the compute workload will occur, data movement becomes a problem. Deep learning frameworks and GPU-based workloads do not support HDFS (the Hadoop Distributed File System) because it is not a standard file system. They do support storage area network (SAN) and network-attached storage (NAS) devices, but the distributed workloads still require the data to be copied back into HDFS. Deep learning software needs a POSIX (Portable Operating System Interface) file system, while distributed analytics workloads need a file system that supports the HDFS API. Without a system that supports both POSIX and HDFS APIs, data must be copied out of HDFS to a POSIX file system for use with a GPU. Upon completion of the job, the data must be copied back into HDFS to perform distributed analytics.
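The copy cycle described above can be sketched concretely. This example only composes the standard `hdfs dfs -get`/`-put` shell commands that bracket a GPU job; the paths are hypothetical, and a real pipeline would execute each command (e.g. via `subprocess`) and reapply security controls after every copy.

```python
import shlex

def copy_cycle_commands(hdfs_path: str, posix_path: str) -> list:
    """Compose the copy-out / copy-back commands the HDFS-to-POSIX cycle requires.

    Illustration only: in practice, the first command runs before the GPU
    training job and the second runs after it finishes.
    """
    return [
        # 1. Copy training data out of HDFS to a local POSIX file system.
        f"hdfs dfs -get {shlex.quote(hdfs_path)} {shlex.quote(posix_path)}",
        # 2. (The GPU job reads and writes the POSIX copy here.)
        # 3. Copy results back into HDFS for distributed analytics.
        f"hdfs dfs -put -f {shlex.quote(posix_path)} {shlex.quote(hdfs_path)}",
    ]
```

Every run of this cycle duplicates the data twice, which is exactly the overhead a dual POSIX/HDFS platform eliminates.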

If this sounds crazy, it’s because it is. It also requires a lot of work to manage, since every time data is copied into or out of HDFS, steps such as applying security controls must be repeated. Versioning this data is critical, and it cannot be accomplished in place within HDFS. The crux of the matter is consolidating data storage and management into a single context that serves deep learning and all other workloads simultaneously.

The code used to create the working models is another pain point. Languages such as Python are often used to create the models, as are notebooks such as Jupyter or Apache Zeppelin. Versioning code is pretty straightforward with great tools such as Git. However, when the generated models and parameters must be coupled with the data, the problem becomes complex, and there is no “easy button” for it. Using point-in-time consistent snapshots within distributed storage that sits alongside the GPUs is a great option, as it can version the data in place with no copies required.
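Coupling a model with its code and data can be as simple as writing one record per training run that names all three. A minimal sketch, assuming the commit hash is obtained separately (e.g. from `git rev-parse HEAD`) and the snapshot name comes from whatever snapshot mechanism the storage platform provides; the record format here is invented for illustration.

```python
import json
from datetime import datetime, timezone
from pathlib import Path

def record_experiment(git_commit: str, data_snapshot: str,
                      params: dict, out_path: str) -> dict:
    """Write one JSON record linking the code version, the point-in-time
    data snapshot, and the model's hyperparameters, so the exact state
    behind any result can be reconstructed later."""
    record = {
        "git_commit": git_commit,        # code version
        "data_snapshot": data_snapshot,  # storage snapshot name/ID
        "params": params,                # model hyperparameters
        "created_at": datetime.now(timezone.utc).isoformat(),
    }
    Path(out_path).write_text(json.dumps(record, indent=2))
    return record
```

Because the snapshot is consistent and in place, the record points at real, recoverable state rather than at a copy that may have drifted.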

After the models are created, they must be performance tested. Data must be run through all the models and analyzed in aggregate to find the best-performing model to take forward into production. Development and testing work are predominantly batch-style workloads. Because of this, new issues arise when preparing for production, such as handling real-time workloads and testing and upgrading models while they are running. Iterating over and testing changes to existing models requires a deployment model such as the rendezvous architecture. This brings with it the concepts of a decoy and a canary as ways to test and validate models before going all in on a production environment: the decoy archives incoming requests for later replay, while the canary is a known model whose scores provide a baseline for comparison.
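The decoy/canary pattern can be sketched in a few lines. This is a simplified illustration of the idea rather than a real rendezvous implementation, which would run the models asynchronously behind a message stream and return the first acceptable score within a deadline; the names and callables here are assumptions for the example.

```python
def rendezvous_score(request: dict, models: dict, log: list) -> float:
    """Score one request against all deployed models, rendezvous-style.

    `models` maps names to plain callables and must include a "primary"
    and a "canary"; `log` stands in for the archive the decoy would feed.
    """
    # The decoy does not score: it archives the raw request for later replay.
    log.append(dict(request))
    # Every live model scores every request.
    scores = {name: model(request) for name, model in models.items()}
    # The canary's score is kept for baseline comparison, not returned.
    log.append({"canary_delta": abs(scores["primary"] - scores["canary"])})
    # Only the primary model's score goes back to the caller.
    return scores["primary"]
```

A candidate model can be added to `models` and observed under full production traffic, with the canary delta flagging drift, before it is ever promoted to primary.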

Consolidating data preparation workloads (distributed processing), GPU workloads, data analysis workloads (distributed processing), and even real-time learning and scoring onto a single platform means the abundance of problems discussed here are no longer serious issues. It is critical to use a platform that supports both HDFS and POSIX. If that platform also supports event streaming, then double the bonus, because real-time scoring is the end goal of most deep learning applications. A single platform removes the need to move or copy the data multiple times and addresses the data operations problems associated with deep learning.
