Accelerating Data Science With RAPIDS

Data scientists are often said to spend as much as 90% of their time on data logistics: cleaning, loading, and transforming data. Only after that work is done, at least to some degree, can they finally ask the questions pertaining to the problem that needs to be solved. And 90% is a huge share, especially when multiplied across a team of data scientists. Anything a data scientist can do to reduce it is a good use of their time and a benefit to the organization as a whole.

With the advent of big data, frameworks such as Hadoop MapReduce and Apache Spark have been heavily relied upon to implement data manipulation and ETL pipelines. Their distributed nature has made them well-regarded choices for data processing at scale. But most tool choices involve trade-offs, and these stacks are arguably among the most complex to manage. As a result, adoption has been hampered, especially by the difficulty of finding people experienced enough to build and operate such systems. Couple that with the complexity and learning curve of writing the software itself, and it becomes a precariously steep hill to climb.

The utopian vision of software engineers and data scientists is to obtain performance benefits and scale-out capabilities with the least possible impact, perhaps by changing as little as a single line of code. Changing a couple of lines is acceptable, but minimal change is the key. While this may sound like a pipe dream, it is already a reality for many. Python users working with libraries such as XGBoost, Scikit-learn, or Pandas have the unique opportunity to skip setting up, configuring, and managing a Hadoop or Spark cluster entirely: one minor library import change in their Python code lets them benefit from GPU (graphics processing unit) acceleration with RAPIDS.
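As a minimal sketch of that one-line change: RAPIDS' cuDF library mirrors much of the pandas API, so code like the following typically runs unchanged after swapping the import (the commented swap assumes a RAPIDS installation on a CUDA-capable GPU; the example itself runs on plain pandas).

```python
import pandas as pd  # swap to: import cudf as pd  (GPU-accelerated with RAPIDS)

# A small ETL-style transformation; the code below is identical under cuDF,
# because cuDF deliberately mirrors the pandas API.
df = pd.DataFrame({"store": ["a", "a", "b"], "sales": [10, 20, 30]})
totals = df.groupby("store")["sales"].sum()
print(totals.to_dict())  # {'a': 30, 'b': 30}
```

The point is not the computation itself but that the data scientist's existing pandas knowledge, and existing code, carries over directly.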

The RAPIDS Framework

RAPIDS is a data science framework offering support for executing an end-to-end data science pipeline entirely on the GPU. Instead of creating new APIs to work with the GPU, RAPIDS has taken the approach of adopting and leveraging other popular APIs and libraries, doing the hard work under the covers for the developer. RAPIDS hides the "how to" of interacting with the GPU. Data scientists keep the APIs they already know and simply swap the library's original CPU-based implementation for the new GPU-accelerated version.

A key underpinning of RAPIDS is its reliance on Dask to handle distributing Python jobs. Dask is arguably simpler to deploy and manage than other distributed schedulers and resource managers. It has supported running within Kubernetes for nearly 2 years, and it integrates even more broadly in the high-performance computing arena through job schedulers such as SLURM (Simple Linux Utility for Resource Management) and LSF (Load Sharing Facility). This breadth matters: the wider Dask's reach, the more likely we are to see long-term success.
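To illustrate how Dask distributes Python work, here is a small sketch using `dask.delayed`, which builds a task graph and then executes it in parallel (this assumes the `dask` package is installed; in a RAPIDS deployment the same pattern runs across GPU workers rather than local threads).

```python
from dask import delayed

@delayed
def clean(part):
    # A stand-in for a per-partition cleaning step.
    return [x for x in part if x is not None]

@delayed
def total(parts):
    # Combine the cleaned partitions into a single count.
    return sum(len(p) for p in parts)

raw = ([1, None, 2], [3, 4], [None])
parts = [clean(p) for p in raw]      # nothing runs yet; Dask records a task graph
result = total(parts).compute()      # the scheduler now executes the graph
print(result)  # 4
```

The same code can target a Kubernetes, SLURM, or LSF cluster simply by pointing Dask at a different scheduler, which is what makes it a natural backbone for RAPIDS.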


Ultimately, the benefit of RAPIDS is to streamline the entire data science pipeline by allowing developers to do more work, in less time, within the resources available, and with the tools they are accustomed to using. This GPU-accelerated approach takes scaling beyond where the Hadoop ecosystem left off, enabling scale-out over hardware that delivers substantially greater performance and better compute density within the data center.

This leads to better utilization of hardware and puts a lot of CPU capacity back into the processing pool to be reallocated. That is welcome news for those writing the checks for infrastructure, whether on-premises or in the cloud. It provides a cost benefit to businesses that have previously only known CPU scale-out, allowing them to continue scaling in a much more affordable way. It is common to see workloads that once required hundreds of CPU servers tackled by a handful of GPU-equipped machines in a fraction of the time.

With RAPIDS, it is very reasonable to see performance improvements ranging from one to three orders of magnitude. When you add up the time spent on what are effectively mundane activities, which can be as much as 90%, anything that cuts out wasted time is worth considering.
