Before big data and cloud approaches to handling large volumes of data arrived, “data science” wasn’t a term, and every use case that fell into the category we now call data science was a heavy lift. We have since moved beyond the basics of big data. Hadoop is minimally relevant now because its cost and complexity were too high, and even the primary vendors have distanced themselves from the technology.
Data science as a discipline now comfortably encompasses a variety of features and functionality. Generally speaking, it starts with loading data from a source such as files or event streams, followed by some type of data preparation for downstream use. The downstream uses vary, but they are generally analytics or learning (machine learning or deep learning). Within the category of learning, there are supporting capabilities such as feature engineering, model training, and data visualization, which are critically important to the discipline. And when a solution is ready for production, the conversation turns to model management and inferencing, which round out the overall data science pipeline.
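As a rough illustration of the stages described above (load, prepare, train, infer), here is a minimal Python sketch using only the standard library. The sample data, the column names `visits` and `purchases`, and the choice of a simple least-squares line as the "model" are all assumptions made for illustration; a real pipeline would typically use dedicated data and machine learning libraries.

```python
import csv
import io
from statistics import mean

# Hypothetical sample data standing in for a file or an event stream.
RAW = """visits,purchases
10,1
20,2
,3
40,4
"""

def load(source):
    """Load rows from a CSV-formatted text source."""
    return list(csv.DictReader(io.StringIO(source)))

def prepare(rows):
    """Data preparation: drop incomplete rows and convert values to floats."""
    return [
        (float(r["visits"]), float(r["purchases"]))
        for r in rows
        if r["visits"] and r["purchases"]
    ]

def train(pairs):
    """Model training: fit y = slope * x + intercept by least squares."""
    xs, ys = zip(*pairs)
    x_bar, y_bar = mean(xs), mean(ys)
    slope = sum((x - x_bar) * (y - y_bar) for x, y in pairs) / sum(
        (x - x_bar) ** 2 for x in xs
    )
    return slope, y_bar - slope * x_bar

def predict(model, x):
    """Inferencing: apply the trained model to a new input."""
    slope, intercept = model
    return slope * x + intercept

clean = prepare(load(RAW))
model = train(clean)
prediction = predict(model, 30.0)
```

Keeping each stage as a separate function mirrors the way the pipeline is usually discussed: any one stage can be swapped out, for example replacing the toy model with a trained neural network, without touching the others.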
This is a lot to consider, and data science as an overarching concept can seem daunting. However, there is hope in the maturity the industry is showing around these technologies. We are now in an era of practical application of data, regardless of scale. Data volumes are still growing at unfathomable rates, which raises the question: How are companies processing the data? For the most part, they are not; most of the data sits unprocessed and unused. Companies that are successfully processing large volumes of data are leveraging GPU-acceleration technologies. The reason is that Moore’s law hit a wall a few years back, and GPUs have sustained the growth in compute capacity since then, positioning them to lead the future of data processing in support of data science.
There are two important lenses to consider for the future of data science. The first is that of data scientists, data engineers, analysts, and other data professionals who work to apply data to a particular use case. The second is that of the line-of-business owners, or project sponsors, who are investing in a solution for a known problem in which data science promises to deliver new value.
Popular Use Cases
It used to be that the business owner couldn’t get much further than taking transactional data, loading it into a data warehouse, generating reports, and performing basic analytics. Most of these efforts went into dashboards that looked at data in the rearview mirror, and a lot of time was spent achieving only minimal insights. Now, though, as we look across industries as diverse as retail, financial services, and healthcare, we are seeing a handful of very popular use cases dominate the landscape. These include analysis of past data signals to help predict future outcomes; recommendation systems that analyze signals to present compelling offers and content; and fraud, anomaly, and pattern detection, which analyzes real-time data to identify points outside the norm and trigger follow-up actions.
What makes all these use cases so prevalent is their ability to provide a significant return on investment and drive further use cases.
Data Sources
In the past, the data required for these use cases was locked up in log files, and moving and processing those files was costly and time-consuming. Batch processing was the only way of handling this type of data, and that approach takes time. We have come a long way in the last decade, not only in reducing processing time and cost, but also in the delivery mechanisms that move this data. While batch processing is still heavily relied upon within the enterprise, the shift to stream-based analytics and processing is now prevalent.
Event stream processing is what has enabled enterprises to push the envelope, and it has opened the door to supporting all the aforementioned use cases in near-real time. As events occur, they can be processed to determine what should be done next, whether that is detecting a security event or recommending the next show a user should consider viewing. Most important, the use and applicability of event streams are not slowing down anytime soon. This part of the workflow is the fastest growing and will dominate every use case.
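To make the idea of processing events as they occur more concrete, the following Python sketch flags values that fall far outside the recent norm in a stream. It is only an illustration under stated assumptions: the function name `detect_anomalies`, the window size, the threshold, and the sample stream are hypothetical, and a rolling mean and standard deviation is just one simple way to define "outside the norm."

```python
from collections import deque
from statistics import mean, pstdev

def detect_anomalies(events, window=5, threshold=3.0):
    """Yield (index, value) for events far outside the recent norm.

    Each event is compared against the mean and standard deviation
    of the previous `window` events, then added to that window.
    """
    recent = deque(maxlen=window)
    for i, value in enumerate(events):
        if len(recent) == window:
            mu, sigma = mean(recent), pstdev(recent)
            if sigma > 0 and abs(value - mu) > threshold * sigma:
                yield i, value  # a follow-up action would be triggered here
        recent.append(value)

# Hypothetical stream of readings; in practice this could be any iterator
# that produces events as they arrive.
stream = [10, 11, 10, 12, 11, 10, 95, 11, 10, 12]
flagged = list(detect_anomalies(stream))
print(flagged)  # [(6, 95)]
```

Because the function is a generator that consumes events one at a time, it never needs the whole data set in memory, which is the essential difference from the batch approach described earlier.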