Page 1 of 3 next >>

Accelerating the Data Science Ecosystem

By Jim Scott

Dec 16, 2019

The big data ecosystem has changed. It is no longer true that people just want to store massive quantities of data. Action is not only needed but must be taken to sustain the viability of an organization. While some say big data is dead, the concept isn’t going anywhere. Instead, it is the notion of inaction on big data that is dead. In addition, the technologies that were built to store and process big data are not loved by the industry; they are merely tolerated and are often maligned. They have been difficult to put into production, maintain, and manage, and it has been difficult even to find people with the skills to do all the work.

The old forms of big data, with technologies dependent upon or bound to Hadoop, are actively disappearing. In their place is a flourishing data science ecosystem built on much more standard and industry-accepted components. This ecosystem is built on the foundational tenet that specialty skill sets should not be required to apply the concepts of data science to problems. A software engineer, data scientist, systems administrator, or anyone with the basic need to solve the problem as it is presented should be able to do so without having to undergo hundreds of hours of additional education to understand all the nuances, as was the case with the older big data tools.

The Languages of Data Science

Java did a great job delivering solid results for enterprise solutions for a variety of reasons. It enabled many of these big data technologies to be created. It has not, however, been a popular option for data scientists.

Scala attempted to bridge the divide from the verbose Java to something that is still fast but is more flexible and data scientist-friendly. In reality, though, Scala only found a low level of popularity, and mostly within the Spark ecosystem, with minimal use beyond that.

R found some popularity within the data science community and was seen as being competitive with SAS. It has had a fair bit of success in handling good-sized datasets but is not really heavily used at scale.

Python has been around for more than 30 years, but it didn’t really become popular until it had been around for nearly 20 years. The single most dominant reason for its rise is likely its gentle learning curve, which comes from its easy-to-understand syntax. Simply put, a low barrier to entry leads to broader adoption, which leads to a larger community. Python is the real champion of the data science ecosystem.

The Data Science Lifecycle

Before we go too much further on how this ecosystem is evolving, we should ensure we are working from the same foundation.

Data science starts with the exploration phase. The user identifies and acquires source datasets to begin exploration. This is where notebooks, such as Jupyter, come in: they give the user a place to write code to explore the data and perform feature engineering.
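As a rough illustration of what an exploration cell in a notebook might contain, the sketch below summarizes one field of a small in-memory dataset and derives a new feature from two existing ones. The records, the field names, and the `price_per_sqft` feature are invented for this example, and it uses only the Python standard library; a real notebook would more likely lean on a dataframe library.

```python
import statistics

# Hypothetical sample records; in practice these would come from a source dataset.
rows = [
    {"sqft": 850, "price": 170000},
    {"sqft": 1200, "price": 264000},
    {"sqft": 1500, "price": 300000},
    {"sqft": 2000, "price": 380000},
]

def summarize(records, field):
    """Return basic descriptive statistics for one numeric field."""
    values = [r[field] for r in records]
    return {
        "min": min(values),
        "max": max(values),
        "mean": statistics.mean(values),
        "stdev": statistics.stdev(values),
    }

def add_price_per_sqft(records):
    """Feature engineering: derive a new column from two existing ones."""
    return [{**r, "price_per_sqft": r["price"] / r["sqft"]} for r in records]

summary = summarize(rows, "sqft")
enriched = add_price_per_sqft(rows)
print(summary)
print(enriched[0])
```

The original records are left untouched and a new list is returned, which makes it easy to rerun a cell while experimenting with different features.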

The training phase follows and is where larger datasets are used to build models. The total amount of compute power is very important here, as this is one of the most time-consuming and repeated steps in the lifecycle. The output of this phase is a model.
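To make "the output of this phase is a model" concrete, here is a minimal sketch that fits a one-variable linear model by ordinary least squares and serializes the learned parameters. The `LinearModel` class, the `train` function, and the toy data are assumptions made for illustration; real training would involve far larger datasets and, typically, a dedicated library.

```python
import json
from dataclasses import asdict, dataclass

@dataclass(frozen=True)
class LinearModel:
    """The artifact produced by training: just the learned parameters."""
    slope: float
    intercept: float

    def predict(self, x: float) -> float:
        return self.slope * x + self.intercept

def train(xs, ys):
    """Fit y = slope * x + intercept by ordinary least squares."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance
    return LinearModel(slope=slope, intercept=mean_y - slope * mean_x)

# Toy training data that follows y = 2x + 1 exactly.
model = train([1.0, 2.0, 3.0, 4.0], [3.0, 5.0, 7.0, 9.0])

# Serialize the parameters so the model can be handed to the deployment phase.
artifact = json.dumps(asdict(model))
print(artifact)
```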

Upon successfully building a model, it is time for the deployment phase. The model should be paired with a framework that allows the scoring of data to occur. This can be done in batch or be event-driven for real-time uses. The output of this phase will look similar to a traditional microservice.
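The two scoring styles can be sketched side by side. In the example below, `model` is a hypothetical stand-in for a trained model, and the feature names `amount` and `num_items` are invented; the point is only that the same model can sit behind a batch function or behind a per-message handler that resembles a microservice endpoint.

```python
import json

def model(features):
    """Stand-in for a trained model: a hypothetical fixed linear scorer."""
    return 0.5 * features["amount"] + 2.0 * features["num_items"]

def score_batch(records):
    """Batch mode: score a whole collection of records in one pass."""
    return [model(r) for r in records]

def handle_event(raw_event: str) -> str:
    """Event-driven mode: score one JSON message as it arrives,
    much like a microservice request handler would."""
    features = json.loads(raw_event)
    return json.dumps({"score": model(features)})

batch_scores = score_batch([
    {"amount": 10.0, "num_items": 1},
    {"amount": 40.0, "num_items": 3},
])
event_response = handle_event('{"amount": 20.0, "num_items": 2}')
print(batch_scores)
print(event_response)
```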

Finally, we move on to the production phase. In this phase, we focus on monitoring the models and being able to update them without interrupting production. This requires continuous integration/continuous delivery (CI/CD) pipeline capabilities.
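One small piece of this, swapping in a new model while requests keep flowing, can be sketched in a single process. The `ModelServer` class below is a hypothetical illustration, not a CI/CD pipeline: it guards the live model with a lock so a request always sees a consistent model and version, and it keeps a simple request counter as a stand-in for monitoring.

```python
import threading

class ModelServer:
    """Holds the live model and lets it be replaced while requests continue."""

    def __init__(self, model, version):
        self._lock = threading.Lock()
        self._model = model
        self._version = version
        self.request_count = 0  # simple monitoring counter

    def predict(self, x):
        # Take a consistent snapshot of the model and its version under the lock.
        with self._lock:
            model, version = self._model, self._version
            self.request_count += 1
        return {"version": version, "prediction": model(x)}

    def swap(self, new_model, new_version):
        # Replace model and version together; callers never see a mismatched pair.
        with self._lock:
            self._model = new_model
            self._version = new_version

server = ModelServer(lambda x: x * 2, version="v1")
before = server.predict(3)
server.swap(lambda x: x * 3, new_version="v2")
after = server.predict(3)
print(before, after, server.request_count)
```

Returning the version with each prediction is a design choice that makes it possible to trace any score back to the model that produced it.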

The Data Science Workflow

The lifecycle just discussed is fairly broad. It sets us up to dig into the core of the data scientist’s job. The simple view of the workflow is to identify the tools and data sources for the job, write some code, train a model, test the model, and analyze the outputs. Then, repeat those steps until we are sufficiently happy with the results, swap out tools if necessary, and go to production.
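The loop described above can be sketched as a function that trains and tests each candidate tool in turn and stops at the first one whose error is acceptable. Everything here is a simplifying assumption: the two "tools" are a mean predictor and a least-squares line, the data is a toy set, and `max_error` is an arbitrary threshold for being happy with the results.

```python
def train_mean(xs, ys):
    """Baseline tool: always predict the average target."""
    mean_y = sum(ys) / len(ys)
    return lambda x: mean_y

def train_linear(xs, ys):
    """Alternative tool: least-squares line through the training data."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    slope = (sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
             / sum((x - mean_x) ** 2 for x in xs))
    return lambda x: slope * (x - mean_x) + mean_y

def mean_abs_error(model, xs, ys):
    """Test step: average absolute error on held-out data."""
    return sum(abs(model(x) - y) for x, y in zip(xs, ys)) / len(xs)

def run_workflow(tools, train_set, test_set, max_error):
    """Train and test each candidate in turn; stop at the first acceptable one."""
    history = []
    for name, train in tools:
        model = train(*train_set)
        error = mean_abs_error(model, *test_set)
        history.append((name, error))
        if error <= max_error:
            return name, model, history
    return None, None, history

train_set = ([1.0, 2.0, 3.0, 4.0], [2.0, 4.0, 6.0, 8.0])
test_set = ([5.0, 6.0], [10.0, 12.0])
chosen, model, history = run_workflow(
    [("mean", train_mean), ("linear", train_linear)],
    train_set, test_set, max_error=0.5,
)
print(chosen, history)
```

The `history` list records each attempt and its error, which is the kind of output a data scientist would analyze before deciding whether to swap tools again.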

To elaborate a bit more, data scientists will spend the bulk of their time on the logistics of data handling. This includes gathering the data, cleaning, standardizing, normalizing, and feature engineering. These tasks often get lumped together under the more common term extract, transform, and load, or ETL, but it is actually more accurate to group them under the term “data preparation.” While there are many viable options to help with data preparation, Datalogue provides a nice set of capabilities.
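A minimal sketch of three of these data preparation steps follows. It reads "standardizing" as z-score scaling and "normalizing" as min-max scaling, which is one common interpretation rather than something the article specifies, and the function names and sample values are invented for illustration.

```python
import statistics

def clean(values):
    """Cleaning: drop missing entries and coerce the rest to float."""
    return [float(v) for v in values if v is not None and v != ""]

def standardize(values):
    """Standardizing: rescale to zero mean and unit (population) standard deviation."""
    mean = statistics.fmean(values)
    stdev = statistics.pstdev(values)
    return [(v - mean) / stdev for v in values]

def normalize(values):
    """Normalizing: min-max rescale into the range [0, 1]."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]

# Hypothetical raw column with missing and empty entries.
raw = ["2", None, "4", "", "6", "8"]
cleaned = clean(raw)
standardized = standardize(cleaned)
normalized = normalize(cleaned)
print(cleaned)
print(normalized)
```

A column where every value is identical would cause a division by zero in both rescaling functions, so a real pipeline would need to guard against that case.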

The next step is to perform training, based on the data that was prepared. Once training is done, users will normally perform some form of visualization to get a better understanding of the results. A couple of excellent data visualization tools that handle large datasets are Graphistry and OmniSci.

Where this workflow gets complicated is that those data preparation steps will often be performed multiple times because, inevitably, the user will discover something wrong with the data or even a missing feature and will need to start over.

The final step in the workflow is to move the models into a production environment to perform inferencing and then start the entire process over to have continuous training and deployment occur.
