High-Performance Data Science—Laptops to Supercomputers


When talking about data science, most people fall into one of two camps as far as data size is concerned. The first camp works with really small data—hundreds of megabytes to a few gigabytes. The second works with gigabytes to terabytes. Notice I didn’t say “big data,” nor did I say “petabytes.” The source datasets may start at petabyte-scale, but keep in mind that such data is often very raw, and most of it is ignored. That is true even in the typical data analytics (warehousing) workloads that perform large-scale aggregations over these datasets. The vast majority of data science workloads use 10 terabytes or less of that data; in truth, more than 95% of these problems are smaller than 100 gigabytes. While a great deal of work certainly goes into cleaning, aggregating, and reducing the data down to relevant datasets for each use case, the typical working set for data science workloads is not petabyte-scale.

Two Camps of Users

Breaking data science users into those two camps matters because of the approaches used to speed up data science processes. In the small data camp, the usual argument is that the data is too small to leverage GPU (graphics processing unit)-accelerated technology. On the bigger data side, people often say that they don’t have enough data to justify GPU-powered, supercomputer-class machines built for massive data scale. It is an interesting point of complexity in the ecosystem: everyone wants everything to run faster, yet most do not realize that GPU acceleration can be applied to both situations.

Peeling back the onion, the first layer of technology that most data scientists leverage is ANSI SQL. An important note is that some technologies claim only to support a SQL variant, which means they lack full language support or do not follow the standard. The nice thing about ANSI SQL these days is that a number of query engines provide ANSI SQL support without the need for a relational database.

Query Engines

Two of the more popular query engines available are Presto and BlazingSQL. While both are widely used, BlazingSQL is GPU-accelerated: it leverages the power of GPUs to run everything faster. It makes very efficient use of the memory available on the GPU and provides advanced features such as column pruning, predicate pushdown, and, most importantly, out-of-core processing.
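
To make this concrete, here is a minimal sketch of what a BlazingSQL query looks like in Python. The table name, file path, and columns are placeholders, and details may vary by version:

from blazingsql import BlazingContext

# Create a SQL context backed by the local GPU
bc = BlazingContext()

# Register a Parquet file (hypothetical path) as a queryable table
bc.create_table('transactions', '/data/transactions.parquet')

# Run a standard ANSI SQL query; the result comes back as a GPU (cuDF) DataFrame
result = bc.sql("""
    SELECT customer_id, SUM(amount) AS total_spend
    FROM transactions
    GROUP BY customer_id
""")
print(result.head())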

Out-of-core processing means that, regardless of data size, SQL queries will run even with very limited GPU memory. For example, with a single GPU holding 32 gigabytes of memory, BlazingSQL can efficiently process 10 terabytes of data. Even on smaller laptops with one GPU and a smaller memory footprint, BlazingSQL can process tens of gigabytes in seconds. Significantly, the more GPUs it has, the faster BlazingSQL runs. It supports a scale-out architecture, so if you need more horsepower, just add horses—or, in this case, GPUs.
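
Scaling out follows the same pattern. As a rough sketch, assuming the Dask and dask-cuda libraries, a BlazingContext can be attached to a Dask cluster so that every available GPU participates in the same queries:

from dask.distributed import Client
from dask_cuda import LocalCUDACluster
from blazingsql import BlazingContext

# Start one Dask worker per local GPU; point Client at a remote scheduler to span multiple machines
cluster = LocalCUDACluster()
client = Client(cluster)

# A context created with a Dask client distributes query execution across all workers and GPUs
bc = BlazingContext(dask_client=client, network_interface='lo')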

As we discussed, most data science problems are not petabyte-scale, and GPU acceleration is quite easy to introduce through the ANSI SQL-compliant BlazingSQL. Since both camps of data scientists leverage SQL as a primary tool in their data science tool chain, both have an “easy button” for getting started.

From Laptop to Production

Another important point for both of these data science groups is that BlazingSQL is built upon the open source RAPIDS project. RAPIDS provides a plethora of components for the data scientist’s toolbox. Recently, RAPIDS added support for MLflow to boost data science development. MLflow provides both experiment tracking and packaging for reproducing experiments.
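
As an illustration, here is a rough sketch of tracking a GPU-accelerated experiment with MLflow; the dataset, column names, and hyperparameters are hypothetical:

import cudf
import mlflow
from cuml.ensemble import RandomForestClassifier
from cuml.metrics import accuracy_score

# Load training data straight into GPU memory (hypothetical file and columns)
df = cudf.read_csv('train.csv')
X = df.drop(columns=['label'])
y = df['label'].astype('int32')  # cuML classifiers expect integer labels

with mlflow.start_run():
    params = {'n_estimators': 100, 'max_depth': 8}
    mlflow.log_params(params)

    model = RandomForestClassifier(**params)
    model.fit(X, y)

    # Record training accuracy so the run can be compared and reproduced later
    mlflow.log_metric('train_accuracy', float(accuracy_score(y, model.predict(X))))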

With such capabilities, a data scientist can work on and solve reasonably sized problems (e.g., millions of rows and thousands of columns) on GPU-equipped laptops. Those same problems can then be run in a distributed manner across a number of servers without changing any code, meaning it is trivial to scale to considerably larger datasets. This also provides an easy path for handing data science solutions over to production, where they typically run in scalable, shared environments.
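
As a simple sketch of that idea, using the Dask-based pieces of RAPIDS (the path and cluster setup are placeholders), the analysis code stays the same whether the client points at a laptop’s GPU or at a remote cluster:

import dask_cudf
from dask.distributed import Client
from dask_cuda import LocalCUDACluster

# On a laptop: one worker per local GPU; in production, connect to a shared scheduler instead,
# e.g., Client('scheduler-host:8786'), without touching the analysis below
client = Client(LocalCUDACluster())

# The analysis itself is identical in both environments
ddf = dask_cudf.read_parquet('/data/events/*.parquet')
summary = ddf.groupby('user_id')['amount'].mean().compute()
print(summary.head())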
