Making the Most of Spark’s Capabilities

There is excitement in the technical and business communities around the potential of Spark, an open source in-memory application framework for distributed data processing and iterative analysis on massive data volumes. However, it is still early days, and despite the promise Spark holds, companies are just now working through how best to use it, how to integrate it within their existing environments, and how to make the most of its libraries and other built-in capabilities.

To provide more information and thought leadership on the framework, IBM, which recently announced a major commitment to Apache Spark, sponsored a DBTA webcast. The webcast featured presenters Trent Gray-Donald, distinguished engineer, IBM Analytics – Cloud Data Services, and Luis Arellano, program director, IBM.

According to IBM, Spark is potentially the most important new open source project in a decade that is being defined by data, because it dramatically improves the performance of data-dependent apps and radically simplifies the process of developing intelligent apps, which are fueled by data.

A recent survey showed that 72% of users were aware of Spark and that 66% had either implemented Spark or were evaluating it for implementation. Spark includes a core set of libraries that enable numerous analytical methods. The four core libraries of Spark are Spark SQL, Spark Streaming, MLlib, and GraphX.

The reasons for Spark’s growing popularity are the productivity and financial benefits it provides. It offers a concise, expressive syntax and integrates with common programming languages. Spark is also unusual in that it competes with Hadoop but works alongside it as well.

In terms of deployment options, Spark can run side by side with MapReduce or on YARN, and Spark jobs can also be launched from within MapReduce.

Spark’s flexibility lets users perform a wide variety of operations with it. “It is essentially a new distributed computing platform,” stated Arellano.

“One of the things that Spark is excellent at is providing a common platform. It allows for users to do everything from data prep, data ingestion, model building, and model deployment. It really enables you to transform your business,” Arellano noted.

The core concept for Spark is resilient distributed datasets (RDDs). “RDDs are the collections of elements that Spark works on in elements. They are spread out through the whole cluster,” explained Gray-Donald, who provided a technical breakdown of Spark, including issues such as running a Spark application and streaming.
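
To make the idea of a collection spread across a cluster more concrete, here is a minimal sketch in plain Python. It is not Spark's API and was not shown in the webcast. The helper names `partition`, `map_partitions`, and `reduce_partitions` are hypothetical, threads stand in for cluster nodes, and the fault-tolerance that makes RDDs "resilient" is not modeled at all.

```python
from concurrent.futures import ThreadPoolExecutor
from functools import reduce


def partition(elements, num_partitions):
    """Split a list into slices, one per hypothetical worker."""
    return [elements[i::num_partitions] for i in range(num_partitions)]


def map_partitions(partitions, fn):
    """Apply fn to every element, handling each partition independently."""
    def run(part):
        return [fn(x) for x in part]

    # Threads are an assumption standing in for separate machines.
    with ThreadPoolExecutor(max_workers=len(partitions)) as pool:
        return list(pool.map(run, partitions))


def reduce_partitions(partitions, fn):
    """Combine each non-empty partition locally, then merge the partial results."""
    partials = [reduce(fn, part) for part in partitions if part]
    return reduce(fn, partials)


numbers = list(range(1, 11))
parts = partition(numbers, 3)                       # data spread over 3 "nodes"
squared = map_partitions(parts, lambda x: x * x)    # transformation per partition
total = reduce_partitions(squared, lambda a, b: a + b)
print(total)  # 385
```

The point of the sketch is only the shape of the computation: the data is divided up front, each piece is transformed where it lives, and only small partial results are brought together at the end.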

As part of its commitment to Spark, IBM plans to embed Spark into its industry-leading analytics and commerce platforms, and to offer Spark as a service on IBM Cloud. IBM will also put more than 3,500 IBM researchers and developers to work on Spark-related projects at more than a dozen labs worldwide; donate its breakthrough IBM SystemML machine learning technology to the Spark open source ecosystem; and educate more than one million data scientists and data engineers on Spark. 

To view a replay of this webinar, go here.