Page 2 of 2

New Approaches Advance Data Science


Apache Kafka has led the event-streaming segment for the last few years, but it now faces strong competition from Apache Pulsar. Much as Apache Spark dethroned Hadoop MapReduce, Pulsar was architected to solve many of the most painful problems experienced by Kafka users. It scales up and down with ease and supports dynamic topic creation, making it easy to apply to a broad range of use cases. Like Kafka, Pulsar supports processing batches of records at a time, but it also supports processing events one at a time. The big benefit is a simplified architecture in which one technology excels at both processing models, letting developers choose what is best for their use case without having to pick up a new technology stack.
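The difference between the two processing models can be sketched in plain Python. This is an illustrative sketch only, not the Pulsar API; the function and variable names (`handle_event`, `handle_batch`, `consume_one_at_a_time`, `consume_in_batches`) are hypothetical:

```python
from itertools import islice

def handle_event(event):
    # Per-event processing: act on each record as it arrives (lowest latency).
    return event * 2

def handle_batch(batch):
    # Batch processing: amortize overhead across many records (higher throughput).
    return [event * 2 for event in batch]

def consume_one_at_a_time(events):
    # Streaming model: a result is produced as each event is handled.
    return [handle_event(e) for e in events]

def consume_in_batches(events, batch_size=3):
    # Batching model: events are grouped, then each group is handled together.
    it = iter(events)
    results = []
    while batch := list(islice(it, batch_size)):
        results.extend(handle_batch(batch))
    return results
```

Both paths compute the same results here; the trade-off is latency versus throughput, and Pulsar's appeal is exposing both models behind a single system.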

Workflows

Users across organizations are collaborating in more concerted ways, thanks to a number of approaches that have become available in recent years. Tools such as Dataiku and Domino Data Lab are helping to remove the complexities of working in larger teams. Although the two products are often lumped into the same broad category, they do not offer identical feature sets; prominent shared capabilities, such as creating workflows, sharing datasets, and deploying into different environments, are among the reasons people are moving to these types of solutions. Ultimately, they let data scientists work on their local desktops and then easily push their code into shared environments, creating a seamless experience from the desktop to the data center, whether on-premises or in the cloud.

Data Processing

We have seen a shift toward separating compute from storage, most prevalently in data warehouses and the concept of the data lake. Separating the two is one thing, but the compute capabilities must still be exposed through a mechanism people understand, and for data warehouse users that mechanism has always been SQL. SQL has therefore maintained its position: standalone SQL query engines became popular nearly a decade ago and are most certainly mainstream now. Presto is widely considered the most popular CPU-based SQL query engine and continues to grow in popularity, while BlazingSQL has quickly become the dominant GPU-based SQL query engine. BlazingSQL exemplifies software that can scale from the desktop to the data center; it is being used on the Summit supercomputer at Oak Ridge National Laboratory.
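The pattern of a query engine running SQL over data that lives in a separate storage layer can be sketched with the standard library. Here an in-memory SQLite database stands in for a standalone engine such as Presto, and an inline CSV string stands in for a file in a data lake; both are stand-ins chosen so the sketch is self-contained, not how those engines are actually deployed:

```python
import csv
import io
import sqlite3

# Hypothetical "storage layer": a CSV file as it might sit in a data lake.
raw_csv = "region,sales\neast,100\nwest,250\neast,50\n"

# The "compute layer": an engine that loads the data and answers SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE sales (region TEXT, amount INTEGER)")
rows = [(r["region"], int(r["sales"])) for r in csv.DictReader(io.StringIO(raw_csv))]
conn.executemany("INSERT INTO sales VALUES (?, ?)", rows)

# The same SQL users wrote against a warehouse now runs against lake data.
totals = conn.execute(
    "SELECT region, SUM(amount) FROM sales GROUP BY region ORDER BY region"
).fetchall()
# totals == [('east', 150), ('west', 250)]
```

The point of the separation is that the storage format and the query mechanism evolve independently, while the user-facing language stays SQL.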

Cloud providers are continuing to invest heavily in serverless options that let users run workloads without spinning up and managing infrastructure. It will be interesting to see what the future holds for Apache Spark. While Spark still maintains a commanding position in the data center for distributed compute workloads, companies have not abandoned their search for other viable, easily scalable technologies such as Dask and Ray, as well as the plethora of available cloud services.

NVIDIA has led development of the open source RAPIDS project, which provides GPU-accelerated data science capabilities across the PyData ecosystem. It accelerates Python and Dask workloads for both scale-up and scale-out, meeting most data scientists where they do their development work today. RAPIDS is also the enabling technology behind GPU-accelerated Apache Spark, improving the TCO of Spark ETL and analytics workloads.

Databricks, a leading provider of Spark, started making some interesting moves in the Spark ecosystem in 2020. It created a new engine called Delta Engine, now a core part of its product offering, which replaces the Spark execution engine while maintaining Spark API compatibility.

Tools and Technologies

The maturity of the data science space, especially the tools and technologies now available, is making it far easier to apply creative problem-solving techniques. Feature engineering is one of the most critical steps in creating useful models, and training and testing those models takes an immense amount of time. When engineered features are not producing the desired results, data scientists keep searching for ways to enhance the features available to them to solve a problem.
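What feature engineering looks like in practice can be shown with a minimal sketch. The raw record and the derived features below (a spend ratio, hour of day, a weekend flag) are hypothetical examples of common derivations, not a prescribed set:

```python
from datetime import datetime

def engineer_features(txn):
    """Derive model-ready features from a raw transaction record.

    `txn` is a hypothetical raw record; the derived features are
    typical examples of engineered inputs, not a fixed recipe.
    """
    ts = datetime.fromisoformat(txn["timestamp"])
    return {
        # Ratio features often carry more signal than raw magnitudes.
        "amount_to_avg": txn["amount"] / txn["customer_avg_amount"],
        # Time-of-day and weekend features expose behavioral patterns.
        "hour": ts.hour,
        "is_weekend": ts.weekday() >= 5,
    }

features = engineer_features({
    "amount": 300.0,
    "customer_avg_amount": 100.0,
    "timestamp": "2020-11-07T23:15:00",  # a Saturday night
})
# features == {"amount_to_avg": 3.0, "hour": 23, "is_weekend": True}
```

A late-night weekend purchase at triple the customer's average is a far more informative input to a model than the raw amount and timestamp alone.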

Deep learning and machine learning use cases have tended to stay in their own swim lanes, but this is beginning to change in a big way. Nearly everyone in this area understands that more and better data leads to better results in both types of learning. We are now seeing mixed workflows in which deep learning is used to create better, more informative features for machine learning models. These workflows require the same data-processing characteristics but are becoming intertwined: tabular data is enriched by the outputs of deep-learning models, which in turn drive better fraud and anomaly detection and improved product recommendations.
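The enrichment step in such a mixed workflow reduces to concatenating deep-learning outputs onto tabular rows. In this sketch the embedding is simply a list of floats supplied by the caller, standing in for the output of a trained deep-learning model; `enrich_row` and all values are hypothetical:

```python
def enrich_row(tabular_features, embedding):
    """Append a deep-learning embedding to a row of tabular features.

    `embedding` stands in for the output of a trained deep-learning
    model (e.g., an encoder over a transaction description); here it
    is just a list of floats provided by the caller.
    """
    return list(tabular_features) + list(embedding)

# Hypothetical values: two tabular features plus a 3-dimensional embedding.
row = enrich_row([42.0, 1.0], [0.12, -0.5, 0.33])
# The enriched row then feeds a downstream ML model such as a
# gradient-boosted tree for fraud or anomaly detection.
```

The design point is that the downstream model never needs to know which columns came from raw data and which came from a neural network; to it, everything is just a feature.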

What’s Ahead

The technologies that picked up steam over the last decade and set the trends have now reached the maturity needed for mass enterprise adoption. But the community refuses to call any incumbent technology good enough to drive the future. New technologies, especially those built to run on the next generation of GPUs, continue to pave the way to new approaches to data science.

Most of these technologies, much like RAPIDS, make it easy to do more with less and continue to improve the day-to-day life of data scientists. So long as they adopt familiar APIs and integrate cleanly into existing workflows, we will continue to see quick progress into the enterprise. Personally, I think tectonic technology shifts like the advent of Hadoop will not be seen again for quite a long time, and we should be able to settle into a more pragmatic approach to introducing new technologies. With that said, don't get too comfortable: the development of new tools and technologies to support the data science space isn't slowing down any time soon.
