
Accelerating the Data Science Ecosystem


Data Science Acceleration

If a single reason could be identified as the most prominent cause of Python rising to the top of the hill in data science, it would arguably be Pandas. Pandas cannot, of course, be viewed in isolation within the broader Python ecosystem, but it has spurred a revolution of sorts. Its name derives from "panel data," and with it Pandas delivered the DataFrame API. A DataFrame is a two-dimensional labeled data structure that resembles a table, or a spreadsheet with rows and columns. This layout matters because tabular data is by far the most common type of data used in data science. Pandas is built on top of NumPy, which delivers an n-dimensional array and is a foundational component of many other data science libraries. This is where the acceleration begins. The RAPIDS project uses the Pandas DataFrame as its API foundation: cuDF implements the DataFrame on the GPU, CuPy does the same for NumPy arrays, and together they regularly deliver 10x to 100x performance improvements.
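
As a minimal sketch of what the DataFrame API looks like in practice (the column names and data here are purely illustrative), the following builds a small Pandas DataFrame on top of a NumPy array and runs a grouped aggregation:

```python
import numpy as np
import pandas as pd

# A DataFrame is a labeled, two-dimensional table; under the hood,
# its columns are backed by NumPy arrays.
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(1_000, 2)), columns=["price", "volume"])

# Label-based selection and a grouped aggregation.
df["sign"] = np.where(df["price"] > 0, "up", "down")
print(df.groupby("sign")["volume"].mean())
```

cuDF mirrors this API closely, so code in this style can run on a GPU with little more than a change of import.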

The next library to note is scikit-learn, a simple and efficient machine learning toolset. It is popular for its ease of use across both supervised and unsupervised learning, covering classification, regression, clustering, dimensionality reduction, and model selection. The accelerated version of scikit-learn, called cuML, provides 10x to 50x performance improvements.
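
Here is a small sketch of the estimator pattern scikit-learn is known for, using k-means clustering on synthetic data (the dataset and parameters are illustrative); cuML exposes the same construct-fit-predict flow:

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic data: 10,000 points drawn from 3 clusters.
X, _ = make_blobs(n_samples=10_000, centers=3, random_state=42)

# scikit-learn's estimator API: construct, then fit/predict.
model = KMeans(n_clusters=3, random_state=42)
labels = model.fit_predict(X)
print(labels[:10])

# The GPU version follows the same pattern, e.g.:
# from cuml.cluster import KMeans
```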

Graph analytics has driven a large number of data science use cases over the last 20 years. The best known, in internet terms, is PageRank, which determines a page's importance by counting the number and quality of links pointing to it; it has been a primary driver of search engine results. One of the more popular graph APIs is NetworkX, a Python package that supports the creation, manipulation, and analysis of the structure, dynamics, and functions of complex networks. As the volume of data has grown over time, so has the complexity of graphs, and with it the time required to perform analysis. Fortunately, the accelerated version, cuGraph, provides up to 250x performance improvements.
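
For illustration, here is PageRank on a tiny hand-built directed graph in NetworkX (the edges are invented); cuGraph offers an analogous pagerank call for GPU-resident graphs:

```python
import networkx as nx

# A small directed graph: each edge is a "link" from one page to another.
G = nx.DiGraph([("a", "b"), ("b", "c"), ("c", "a"), ("d", "c")])

# PageRank scores pages by the number and quality of inbound links.
scores = nx.pagerank(G, alpha=0.85)
print(sorted(scores.items(), key=lambda kv: -kv[1]))
```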


There are plenty of data science use cases for spatial analysis, especially as the IoT movement has taken off. GeoPandas, which aids in distance calculations, point-in-polygon and point-to-polygon operations, shape intersections and unions, and the most common spatial searches, has become critical in data science. Not only has this work become simpler from a development standpoint, but the performance demands have grown far beyond those of the past because the volume of data involved is massive. The accelerated version, cuSpatial, provides between 10x and 10,000x performance boosts over traditional approaches.
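
As a small, hedged example of the point-in-polygon operation mentioned above (the coordinates are invented), GeoPandas makes the test a one-liner:

```python
import geopandas as gpd
from shapely.geometry import Point, Polygon

# Hypothetical data: two points and one unit-square zone.
points = gpd.GeoDataFrame(geometry=[Point(0.5, 0.5), Point(2.0, 2.0)])
zone = Polygon([(0, 0), (1, 0), (1, 1), (0, 1)])

# Point-in-polygon: which points fall inside the zone?
points["inside"] = points.geometry.within(zone)
print(points)
```

cuSpatial accelerates this same class of operations on the GPU.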

While Python sometimes gets a bad rap when it comes to performance, there is a library that can help speed things up in some cases. Numba translates NumPy-based Python functions into optimized machine code at runtime. The code does not need to be rewritten or separately compiled; a decorator is simply applied to the function, and Numba handles the rest very elegantly. Even more impressive, Numba not only speeds up CPU code but can also compile code to run on a GPU, delivering 100x performance benefits, and no GPU development expertise is required to take advantage of this capability.
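
A minimal sketch of the decorator pattern described above (the function itself is illustrative): applying @jit compiles the NumPy-based loop to machine code the first time it is called.

```python
import numpy as np
from numba import jit

@jit(nopython=True)  # compiled to machine code on first call
def moving_sum(arr, window):
    # A plain Python loop over a NumPy array: slow in CPython,
    # fast once Numba has compiled it.
    out = np.empty(arr.size - window + 1)
    for i in range(out.size):
        out[i] = arr[i:i + window].sum()
    return out

data = np.random.rand(1_000_000)
print(moving_sum(data, 10)[:5])
```

Targeting a GPU uses a different decorator (numba.cuda.jit) and kernel-style functions, but the source remains ordinary Python.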

Running Python workloads beyond a single computer may seem a practical impossibility, but it is not. Over the last few years, a project called Dask has emerged that distributes Python workloads beyond a single machine. It is easy to use and has a minimal learning curve. Compared to frameworks such as YARN, distributing a workload with Dask for the first time takes almost no effort; realistically, a new user who wants to distribute a workload is unlikely to conclude that YARN is the friendlier option, and Dask's simplicity makes it fairly compelling.
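
A hedged sketch of what that low effort looks like in practice (the file pattern and column names are hypothetical): Dask's DataFrame mirrors Pandas while partitioning the work across cores or machines.

```python
import dask.dataframe as dd

# Read a (hypothetical) set of CSV files as one partitioned DataFrame.
df = dd.read_csv("events-*.csv")

# The same Pandas-style API, evaluated lazily across partitions;
# .compute() triggers the actual execution.
result = df.groupby("user_id")["amount"].sum().compute()
print(result.head())
```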

Data scientists can adopt RAPIDS by changing as little as one line of code in an existing codebase, which makes it a popular choice for Python users who need to get the job done faster. RAPIDS turns workloads that would otherwise be nearly impossible to complete in a reasonable time into something easy to handle. Time-to-completion is unequivocally the single most important reason for data scientists to accelerate their workloads, because more iterations mean better results.
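
That one-line change typically amounts to swapping an import, as in this hedged sketch (the file and column names are illustrative, and not every Pandas feature is covered by cuDF):

```python
# Before (CPU):
# import pandas as pd

# After (GPU): cuDF mirrors the Pandas API.
import cudf as pd

df = pd.read_csv("transactions.csv")          # hypothetical input file
print(df.groupby("account")["amount"].sum())  # runs on the GPU
```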

What’s Ahead

This isn’t to say that the tools of the last 10 years are now being abandoned. So many workloads are already built on them that it would be foolish to expect a shift overnight. Spark still does a good job within the environments it has won over, and not everyone is a fan of every technology. The real beauty, however, is that those who are not fans of the Spark framework now have options that go beyond what most people considered possible.

Data science is here, performing data science tasks has gotten easier, and it has quickly become the most pervasive way to extract value from data at scale. Although new tools have improved the day-to-day life of the data scientist, they still have a way to go before the mundane, repetitive tasks are truly removed from daily processes. From a performance standpoint, RAPIDS has made it possible for data scientists to use the most popular Python libraries not only to outperform Moore’s Law but to make GPU acceleration ubiquitous.

Big data is dead, long live big data—for data science.


