
Python’s Pervasive Portfolio

Matplotlib is a 2D plotting library. Easy-to-use visualization is important for understanding relationships in math and science applications. This library is downloaded about 1 million times per day.

scikit-learn provides a substantial number of algorithms covering uses such as classification, regression, clustering, and dimensionality reduction. If you hear of someone doing machine learning, the odds are good that they are using, or have used, this library. It is downloaded nearly 1 million times per day.

pandas is a flexible and powerful data manipulation library. Loading data from a source is important, but being able to sift through that data for what is needed to solve a problem is more important. This library popularized the concept of a DataFrame, and it is downloaded about 3 million times per day.
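To make the idea of sifting through data concrete, here is a minimal sketch of filtering and summarizing a DataFrame. The sensor readings, column names, and threshold are invented for illustration and do not come from any real dataset.

```python
import pandas as pd

# Hypothetical sensor readings, invented purely for illustration.
readings = pd.DataFrame(
    {
        "sensor": ["a", "a", "b", "b", "c"],
        "temperature_c": [21.5, 22.1, 35.0, 36.2, 19.8],
    }
)

# Sift: keep only the rows whose temperature exceeds an assumed threshold.
hot = readings[readings["temperature_c"] > 30.0]

# Summarize: average temperature for each sensor.
mean_by_sensor = readings.groupby("sensor")["temperature_c"].mean()

print(hot)
print(mean_by_sensor)
```

The boolean mask and the groupby are two of the most common ways to narrow a table down to the part that matters for a given question.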

NumPy is the most foundational library in this ecosystem. Nearly every Python library that solves a science-related problem builds on top of NumPy. It provides the base data structures and computing tools on which developers build. This library is downloaded about 3 million times per day.
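As a rough sketch of what those base data structures look like in practice, the snippet below builds NumPy arrays and applies arithmetic to whole arrays at once. The values are arbitrary and chosen only to keep the example small.

```python
import numpy as np

# A one-dimensional array of arbitrary sample values.
samples = np.array([1.0, 2.0, 3.0, 4.0])

# Arithmetic applies element by element, with no explicit Python loop.
scaled = samples * 2.0 + 1.0

# A two-dimensional array: 2 rows by 3 columns holding 0..5.
grid = np.arange(6).reshape(2, 3)

# Reduce along axis 0 to get one sum per column.
column_sums = grid.sum(axis=0)

print(scaled)
print(column_sums)
```

Libraries higher in the stack typically accept and return arrays like these, which is part of why so much of the ecosystem can interoperate.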

Finally, Jupyter provides a notebook-style interface that simplifies the iterative practice of writing code, running it to see the output, repeating until the result is satisfactory, and then proceeding to the next logic block.

Development Environments

Arguably, the most innovative tool for developing software is the integrated development environment (IDE). An IDE gives the user quick access to a plethora of information, including methods, parameters, and documentation describing how to use the code, as well as references to other libraries. IDEs reduce the time it takes to write good code.

Jupyter filled the void for those in the scientific computing community. It is lightweight, can run in the browser, and supports a highly iterative development process, making it accessible for the masses to develop and collaborate. One important feature of these notebooks is that the developer can add markdown documentation as they go. This is the ultimate in documentation, as it gives the author an easy way to explain why the code was written, or even why the results turn out a certain way.

Jupyter notebooks have become such an integral part of these folks’ development processes that a number of projects have built support for these notebooks. They can be directly integrated into production workflows with frameworks such as Prefect and Apache Airflow.

Additionally, to really drive the point home, the most popular IDE, Microsoft's VS Code, supports Jupyter notebooks directly within the IDE. This not only shows the strength of the use case but also reinforces the importance of the notebook model.

Even the best projects have drawbacks, but, again, with such a community these are opportunities to drive improvements. While Jupyter notebooks support markdown within a notebook, it is not the easiest to use, and the notebook itself is stored as a JSON-formatted file. From a source-control perspective, it is quite complex to diff two versions of a notebook. There is now a new notebook format called MyST Markdown Notebooks. This new notebook type is supported by Jupyter and uses a markdown file to store code and documentation. What is interesting about this approach is that it puts documentation and revision history first. There is great power in being able to trace through the history of a notebook.
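The diff problem can be illustrated with a small sketch that uses only the Python standard library. It assumes a simplified, one-cell notebook shaped like the JSON format, plus a plain-text rendering loosely modeled on a markdown notebook. The helper names and the text layout are hypothetical, not the exact format that any tool writes.

```python
import difflib
import json

FENCE = "`" * 3  # built at runtime so this example does not nest code fences


def make_json_notebook(source_line, output_text, execution_count):
    """Serialize a minimal dict shaped like a one-cell JSON notebook."""
    notebook = {
        "cells": [
            {
                "cell_type": "code",
                "execution_count": execution_count,
                "metadata": {},
                "outputs": [
                    {"name": "stdout", "output_type": "stream", "text": [output_text]}
                ],
                "source": [source_line],
            }
        ],
        "metadata": {},
        "nbformat": 4,
        "nbformat_minor": 5,
    }
    return json.dumps(notebook, indent=1, sort_keys=True)


def make_text_notebook(source_line):
    """Hypothetical markdown-style rendering: prose plus one fenced code cell."""
    return "\n".join(["# Analysis", "", FENCE + "{code-cell}", source_line, FENCE])


def changed_lines(before, after):
    """Return only the added and removed lines of a unified diff."""
    diff = difflib.unified_diff(
        before.splitlines(), after.splitlines(), lineterm="", n=0
    )
    return [
        line
        for line in diff
        if line.startswith(("+", "-")) and not line.startswith(("+++", "---"))
    ]


# The same one-character code edit, recorded in both representations.
json_changes = changed_lines(
    make_json_notebook("print(1 + 1)", "2\n", 1),
    make_json_notebook("print(1 + 2)", "3\n", 2),
)
text_changes = changed_lines(
    make_text_notebook("print(1 + 1)"),
    make_text_notebook("print(1 + 2)"),
)

print(len(json_changes), "changed lines in the JSON form")
print(len(text_changes), "changed lines in the text form")
```

In this sketch, the JSON form also records the stored output and the execution counter, so a single code edit touches several unrelated lines, while the text form changes only the line that was edited.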

A Strong Foundation

Building a skyscraper requires a resilient foundation. Software being built to solve the mysteries of the universe is no exception to this concept.

NumPy is one such foundational component, underlying nearly every scientific computing endeavor. But it doesn’t stop there, oh no. NumPy is used in products ranging from security solutions to all those machine learning libraries everyone is raving about.

As mentioned earlier, a strength of Python is its ability to expose native C++ libraries to users, who gain the underlying performance benefits of native code. This is the approach currently being used to provide acceleration to the entire PyData ecosystem. Because everything depends on a small set of libraries, accelerating those libraries benefits everything built on them. Without this strong foundation, acceleration would be very fragmented.
