Accelerated ETL With Spark and RAPIDS

Extract, transform, and load (ETL). Those are three words that, placed side by side in nearly any order, strike fear into people at every level of a business. ETL is perhaps one of the most frustrating topics in existence because, without it, downstream data processing such as analytics and machine learning cannot really function. It involves extracting data from a source, transforming its format in some logical way to benefit the downstream process, and then loading it into the next storage location for later use.
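To make the three steps concrete, here is a minimal sketch in plain Python. The record fields and data values are invented for illustration; an in-memory list stands in for the real source and destination systems.

```python
def extract(rows):
    # Extract: pull raw records from a source (here, an in-memory
    # list standing in for a CSV file or a database query).
    return list(rows)

def transform(records):
    # Transform: standardize formatting for the downstream process --
    # trim whitespace, normalize case, and cast amounts to floats.
    cleaned = []
    for rec in records:
        cleaned.append({
            "customer": rec["customer"].strip().title(),
            "amount": float(rec["amount"]),
        })
    return cleaned

def load(records, destination):
    # Load: write the cleaned records to the next storage location
    # (here, another in-memory list standing in for a table or file).
    destination.extend(records)

raw = [
    {"customer": "  alice smith ", "amount": "19.99"},
    {"customer": "BOB JONES", "amount": "5"},
]
warehouse = []
load(transform(extract(raw)), warehouse)
```

The point of separating the three functions is that each stage can be swapped out independently, which is exactly what frameworks such as Spark do at scale.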

There is now a slightly broader category beyond ETL: data preparation, which includes more standard concepts such as data standardization and cleansing. These concepts are often lumped together with the "transform" part of ETL.

As data sizes have grown over the last decade, so has the amount of time it takes to run ETL processes to support the myriad downstream workloads. A decade ago, most people were only thinking about making their KPI dashboards faster. As time rolled forward, they started to think about getting more intelligent analytics out of their data, and the data sizes quickly grew from gigabytes to terabytes.

Apache Hadoop came onto the scene to streamline at-scale ETL workloads and analytics processing. Fast-forward to the realization that Hadoop wasn’t fast enough, and suddenly Apache Spark made its way onto the scene. Spark keeps data in memory instead of writing it to storage between every step, improving processing performance by up to 100x over Hadoop.

Spark is scalable; supports Scala, Java, and Python; and handles ETL workloads well. Its Python support makes it convenient to quickly write and test code and validate a use case. One downside is that, to get the best performance when scaling Spark in any environment, jobs must be written in Scala or Java. This means moving the work from a data scientist to a data engineer.

Spark delivers respectable scaling and performance, and some might have left well enough alone. But the big data community doesn’t just walk away from opportunities to improve scale-out and cost, or to make things faster and more resilient. This is where GPU acceleration is now taking things to the next level. It provides a distinct competitive advantage primarily because more data can be used to achieve better accuracy in a shorter period of time.

Spark already supports running GPU-accelerated XGBoost models. For those unaware, XGBoost is perhaps the most popular machine learning library in existence today. The open source RAPIDS project provides the GPU acceleration behind the RAPIDS Spark XGBoost library, and using it with Apache Spark yields approximately a 34x speedup at a 6x cost savings.

Spark 3.0 brings with it GPU-aware scheduling as well as plugins to provide support for GPU-accelerated SQL and DataFrames. GPU acceleration will be a first-class citizen within Spark 3.0. What this means for end users is that they will be able to realize performance improvements without any actual code changes in their Spark jobs. Combining these enhancements with the pre-existing XGBoost support will drive performance improvements and cost savings. This will allow SQL queries to load very large volumes of data into the GPU to be processed via DataFrames, and then that data can be utilized by the XGBoost library for creating machine learning models or for inferencing.
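Because the acceleration is delivered as a plugin, enabling it is a matter of configuration rather than code. A sketch of what a spark-submit invocation might look like is shown below; the jar file names, resource amounts, and job script are placeholders, and exact settings depend on the RAPIDS release and your cluster.

```shell
# Hypothetical invocation -- adjust jars and resource amounts
# for your cluster and RAPIDS Accelerator release.
spark-submit \
  --conf spark.plugins=com.nvidia.spark.SQLPlugin \
  --conf spark.rapids.sql.enabled=true \
  --conf spark.executor.resource.gpu.amount=1 \
  --conf spark.task.resource.gpu.amount=0.25 \
  --jars rapids-4-spark.jar,cudf.jar \
  my_etl_job.py
```

The `spark.executor.resource.gpu.amount` and `spark.task.resource.gpu.amount` settings are part of Spark 3.0’s GPU-aware scheduling, while the plugin settings hand SQL and DataFrame operations to the GPU; the job script itself is unchanged.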

All of this GPU-accelerated goodness arrives with the official release of Spark 3.0, expected in late spring.

As with any open source project, firm release dates are hard to come by, as they depend on the community and lots of testing. Cloud users will likely have an accelerated path to Spark 3.0, whereas those running Spark on-premises will likely take a bit longer due to IT processes and the hardware availability needed to test the new version.



