Scaling Python, and How!

Distributed computing is the simultaneous use of more than one computer to solve a problem. It is often used for problems that are so big that no individual computer can handle them. This method of scaling compute is referred to as horizontal scaling, which means adding more computers rather than vertical scaling, which refers to creating a bigger single computer with more resources. While distributed computing is great for scaling and providing greater reliability, it does introduce complexity when it comes to design, development, and testing.

Python has become the default language of solving complex data problems due to its ease of use, plethora of domain-specific software libraries, and stellar community and ecosystem. All of these things have led to the emergence of even more new and easier-to-use frameworks that enable users to scale their Python code.

The first framework that integrated Python for distributed computing was Apache Spark via the PySpark API. However, Python was not the default language for this analytics engine, and, therefore, while users could leverage the Python API for executing Spark code, it was not very “Pythonic.” Pythonic refers to code that follows the general standards in Python.  This is because PySpark is a Python-based wrapper over the core Spark framework, which is written primarily in Scala.

New Python-Based Frameworks

As the popularity of the Python language has grown, there has been hesitancy to adopt the syntax associated with PySpark due to distinctive differences in how it approaches scaling compute. The Python community wanted a more Pythonic way to scale their code while also reducing the complexity of shifting code from a single machine to distributed environments. This demand has seen the emergence of new frameworks that are Python-based and allow for simplified horizontal and vertical scaling to solve big data challenges.

The most popular framework to provide users with an approach to scaling (vertical and horizontal) for their Python workload is Dask. It is an open source library for parallel and distributed computing in Python. It provides integration to many of the most popular libraries in the PyData ecosystem, including NumPy, Pandas, and Scikit-Learn, to allow both horizontal and vertical scaling. In addition to scaling compute, Dask also manages the data that cannot fit into memory to run inside a distributed cluster. Dask provides its own version of array and DataFrame data structures, whose APIs mimic those found in NumPy and Pandas. This allows users to run their existing Python code with minimal code changes. It does have a learning curve, but it does allow the user to run code both on a local machine or a cluster, improving the ease of use. This leads to simplified scaling of analytics and machine learning code written in Python, allowing users to solve larger and more complex data problems.

Ray is a fast and simple distributed computing framework for Python that is also open source. Ray sets up and manages clusters of computers so that you can run distributed tasks on them. These tasks extend beyond distributing data to include machine learning algorithms and reinforcement learning agents. Data scientists can efficiently parallelize Python programs on laptops, and then run the code tested locally on a cluster practically without any changes. Ray is part of an ecosystem of packages that allow you to scale out workloads based on hyperparameter optimization, reinforcement learning, and deep learning. While it has its place in the data science workflow, its primary goal is to parallelize Python code.

Dask and Ray are very similar tools but do have their differences. Each has its own methods for distribution via schedulers, data objects, and data serialization. The Dask/Ray comparison depends on the use case. Importantly, both provide facilities to take advantage of different machine sizes and accelerators such as GPUs. However, the maturity level and rate of adoption of Dask is considerably higher.

Solving Serious Data Problems

Finally, there is one other framework worth noting. Legate and the library built on top of it called cuNumeric provide scaling to thousands of CPUs or GPUs. cuNumeric implements the NumPy API directly and, as such, is enabling users to leverage it as a drop-in replacement for NumPy, effectively with zero code change required. This open source project has made great strides in the last year, is approaching 50% API coverage of NumPy, and can automatically scale out Python code for the supported APIs without the user needing to understand how to scale or even how it works. However, its limitation is that of the API it supports.

There are some great options in the Python ecosystem to scale workloads to solve serious data problems. They are maturing quickly and can prevent users from having to rewrite the projects in alternative languages, which is a big win for productivity.


Subscribe to Big Data Quarterly E-Edition