
Accelerating the Data Science Ecosystem


Data Science Tools

Because data science is iterative, it is important to use tools that provide a broad range of capabilities. Tools such as Git have not only come to dominate mainstream software engineering but have also become integral to the data science lifecycle. Git serves as a foundational component for capabilities such as model management and model versioning, letting users track how their models change over time. It isn't the only tool for the job, but it is a very popular one.
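To make the pattern concrete, here is a minimal sketch of model versioning with Git, driven from Python via subprocess. It assumes the working directory is already a Git repository; the file name "model.pkl" and the helper name are illustrative, not part of any particular product.

# Illustrative sketch: versioning a trained model artifact with Git.
# Assumes the current directory is already a Git repository and that
# "model.pkl" is the serialized model; both names are hypothetical.
import pickle
import subprocess

def save_and_version(model, version_tag):
    # Serialize the model to disk.
    with open("model.pkl", "wb") as f:
        pickle.dump(model, f)
    # Commit the artifact and tag it so this exact version can be
    # checked out (or rolled back to) later.
    subprocess.run(["git", "add", "model.pkl"], check=True)
    subprocess.run(["git", "commit", "-m", f"Model {version_tag}"], check=True)
    subprocess.run(["git", "tag", version_tag], check=True)

# save_and_version(trained_model, "v1.2.0")

In practice, large binary artifacts are often handled with extensions such as Git LFS, but the commit-and-tag pattern for tracking model versions stays the same.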

AutoML

AutoML (automated machine learning) tools offer a far more confusing set of capabilities. The confusion stems largely from the lack of any standardized way to categorize what an AutoML tool actually does: data preparation, feature engineering, model selection, hyperparameter optimization, results analysis, and even automated visualization all fall within its scope. Most product offerings, however, provide only one or two of these capabilities, which makes it difficult to view the category through a single lens.
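One of those capabilities, hyperparameter optimization, can be shown in miniature with scikit-learn's GridSearchCV. This is only a sketch of the idea; the dataset, model, and parameter grid below are illustrative choices, not recommendations.

# One AutoML capability in miniature: automated hyperparameter search.
# The dataset and parameter grid are illustrative assumptions.
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

X, y = load_iris(return_X_y=True)

# Candidate hyperparameter values to evaluate via cross-validation.
param_grid = {"n_estimators": [50, 100, 200], "max_depth": [3, 5, None]}

search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=5)
search.fit(X, y)

print(search.best_params_)  # the best combination found
print(search.best_score_)   # its mean cross-validated accuracy

Full AutoML products automate far more than this one search, but the principle is the same: the tool explores options so the data scientist does not have to try them by hand.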

Workflow Management

At its simplest, a data science workflow might seem to be nothing more than running some code and looking at the results. In reality, data comes from a variety of sources, and each dataset passes through its own mechanisms for cleaning, standardizing, and normalizing. A data scientist must also work out which features matter, some of which are combinations of different fields of data. Viewed as a whole, the picture becomes complicated very quickly, which is why workflow management is a major functional capability for data scientists. It is the difference between running 50 steps in a workflow by hand and having them run on their own; it is all too easy to miss a step and have to start over, so these tools are critical to long-term success. Tools such as Kubeflow and Airflow dominate the open source ecosystem, in part because of their integration with orchestration platforms such as Kubernetes.
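As a rough illustration of what this buys you, here is a minimal Airflow pipeline sketch (using Airflow 2.x import paths). The DAG name, task names, and function bodies are placeholders standing in for real pipeline steps.

# Minimal Airflow pipeline sketch: three dependent steps that would
# otherwise be run by hand. Names and bodies are placeholders.
from datetime import datetime
from airflow import DAG
from airflow.operators.python import PythonOperator  # Airflow 2.x path

def extract():
    ...  # pull raw data from its sources

def clean():
    ...  # standardize and normalize the dataset

def train():
    ...  # engineer features and fit the model

with DAG(dag_id="model_pipeline",
         start_date=datetime(2020, 1, 1),
         schedule_interval=None) as dag:
    t1 = PythonOperator(task_id="extract", python_callable=extract)
    t2 = PythonOperator(task_id="clean", python_callable=clean)
    t3 = PythonOperator(task_id="train", python_callable=train)

    # Declare the ordering once; the scheduler runs the steps,
    # retries failures, and never skips a step.
    t1 >> t2 >> t3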


Deployment

Putting software into a production environment can be complicated, which is why tools that simplify model deployment have become so popular. Seldon can be used on its own, though many other products have chosen to integrate it into a larger offering. The better tools let users deploy a model, or multiple versions of it at the same time, to support split tests that compare performance, and they often support rollback as well. Rollback is more important than most people realize: it provides a fast, easy way to put a previous version of a model back into production.
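As a sketch of how deployment tooling keeps this simple, Seldon Core's Python wrapper convention wraps a model in a class whose predict() method is exposed as a service once packaged and deployed. The class name and model file name below are hypothetical, and the details of packaging are left to Seldon's tooling.

# Sketch of Seldon Core's Python model-wrapper convention: a class
# whose predict() method the framework exposes over the network once
# the model is packaged and deployed. Names here are hypothetical.
import pickle

class ChurnModel:
    def __init__(self):
        # Load the serialized model once, at container start-up.
        with open("model.pkl", "rb") as f:
            self._model = pickle.load(f)

    def predict(self, X, features_names=None):
        # Called for each request; X is the incoming feature matrix.
        return self._model.predict(X)

Because each version of a model is just another deployment of such a wrapper, routing traffic between two versions supports split tests, and shifting traffic back to the previous version is the rollback.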

Data science development and deployment tools worth considering often provide several of the capabilities described here and support a mix of different languages. Tools worth evaluating include Dessa, ElementAI, Domino Data Lab, H2O.ai, Dataiku, Algorithmia, and iguazio, especially its nuclio offering.


