
Python’s Pervasive Portfolio


There are two complementary approaches to accelerating Python workloads. The first performance problem is being limited by the compute capability of a single CPU. The foundational PyData libraries were all built for the CPU because the CPU is the common unit of computation on all computers. But just because it is common doesn't mean computation should stop there, and that is why the community has moved toward providing GPU-accelerated libraries.

CuPy provides GPU acceleration for much of the NumPy API by leveraging CUDA Python. While programming the GPU directly is normally the province of more experienced software developers, CUDA Python provides bindings to the GPU so that higher-level libraries can use it as easily as possible. Additionally, there is RAPIDS, which provides GPU-accelerated libraries following the APIs of pandas and scikit-learn.
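
To make this concrete, here is a minimal sketch of CuPy's NumPy-compatible API. It assumes a CUDA-capable GPU and the cupy package are available; the array size and operations are illustrative only.

    import numpy as np
    import cupy as cp

    # Ordinary NumPy array in host (CPU) memory.
    x_cpu = np.random.rand(1_000_000).astype(np.float32)

    # Copy the data to GPU memory; from here the API mirrors NumPy.
    x_gpu = cp.asarray(x_cpu)
    result_gpu = cp.sqrt(x_gpu) * 2.0

    # Copy the result back to the host as a NumPy array.
    result_cpu = cp.asnumpy(result_gpu)

The point is that the computational lines look exactly like NumPy; only the explicit transfers between host and device memory are new.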

The impact of GPU acceleration is profound, to say the least. Be warned that going down this path has side effects on productivity and emotional well-being: the first time users run their code and it finishes 100x faster, they may not stop smiling for a while.

The second performance problem comes by way of scaling. Scaling can refer to using multiple processors (CPUs or GPUs) in the same computer, spanning multiple computers, or both. This is perhaps the single most important aspect of compute solutions for the future. As data sizes continue to grow and problems become more complex, harnessing the computational power of a cluster of computers becomes the key.

To address this, there are a couple of options. The first is Dask, which was mentioned earlier. It is a solid solution with a decent level of maturity, and it already integrates with RAPIDS to provide scalable GPU acceleration across machines.
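
A brief sketch of what that looks like in practice, assuming a dataset split across many CSV files (the path and column names here are hypothetical). Swapping dask.dataframe for the RAPIDS dask_cudf package runs the same pandas-style computation on GPUs.

    import dask.dataframe as dd

    # Lazily read many CSV files as one logical, partitioned DataFrame.
    df = dd.read_csv("data/part-*.csv")

    # Operations build a task graph; compute() executes it in parallel
    # across local cores or a distributed cluster of workers.
    result = df.groupby("key")["value"].mean().compute()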

The next is Legate. Legate provides a solution for scaling out over CPUs and GPUs by utilizing an ephemeral cluster. This is appealing for a variety of reasons, the foremost being that, once implemented, users need not change any code: they can use the CPU on their desktop and then, when they are ready, run the exact same code on a cluster of CPUs or GPUs. There is value in simplicity. Legate is not yet very mature, but it is built on a very mature open source research project called Legion, a comprehensive, scalable compute solution. Libraries must be plugged into Legate to take advantage of these facilities, but the benefit is that, once a library is plugged in, the user can easily move from as little as one CPU or GPU to a cluster of one, the other, or both. NumPy is the first library being plugged into Legate, for all the obvious reasons already shared about the importance of NumPy to the PyData ecosystem.
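
The following sketch illustrates the idea of unchanged code. It assumes Legate's NumPy-compatible library (published as legate.numpy) is installed and supports the operations shown; everything after the import is ordinary NumPy code.

    # Assumption: Legate's NumPy drop-in replacement is available.
    import legate.numpy as np

    a = np.random.rand(2_000, 2_000)
    b = np.random.rand(2_000, 2_000)

    # The same matrix multiply can run on one CPU, one GPU, or a whole
    # cluster; that choice is made at launch time (e.g., through the
    # Legate driver), not in the code.
    c = a @ b
    print(c.sum())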

Legate will continue to add more libraries, all of which will be able to work together on a scale-out framework that does not require the end user to possess special knowledge of how to scale a workload.

What’s Ahead in Data Science

As Moore’s law has effectively run its course, it is pretty easy to see that, with the combination of the PyData community and commercial software offerings (especially those accelerating Python’s foundational components), as well as all the startups building solutions, there is a lot of momentum propelling Python forward.

Not sure if Python can provide solutions to the problems you encounter? Jump into the community. We all love to help solve problems.
