Pentaho Adds New Support for Apache Spark

Pentaho users will now be able to use Apache Spark within Pentaho thanks to a new native integration that enables the orchestration of all Spark jobs. The integration with Pentaho Data Integration (PDI), an effort initiated by Pentaho Labs, is intended to help customers increase productivity, reduce maintenance costs, and significantly lower the level of skill required as Spark is incorporated into big data projects. PDI will support Spark and Spark SQL in addition to Spark job orchestration.

Spark is an open source processing engine, built for speed, ease of use, and machine learning. Spark stores, blends, and governs data at entirely new levels of speed, scale, and simplicity, according to Pentaho.

PDI builds on previous projects from Pentaho Labs, furthering the efforts that led to support for YARN and the Adaptive Big Data Layer.

“Over the last 3 to 4 years, we’ve really had a transformation into a big data integration and analytics company,” said Donna Prlich, vice president of product solutions and marketing at Pentaho. “The growth in our business has really been around the big data market and embedded analytics.”

By using a big data blueprint, Pentaho experimented with possible use cases and saw how customers benefited from simplified, real-time analytic capabilities. Through that work, Pentaho saw how Spark would help both the company and its customers.

“As these different deployments started to scale, emerging technology, like YARN and Spark, started coming into the picture so we experimented with these different technologies and eventually got to the point where we decided to build them into the product and support them,” Prlich said.

For businesses, PDI means increased productivity, reduced maintenance costs, and a lower level of skill required to use Spark for big data projects, according to Pentaho, which serves major enterprise customers such as Nasdaq and Sears.

“We feel that Spark has a lot of promise and we want to support some early use cases that we see,” Prlich said.

Any organization dealing with large volumes and varieties of data that it has to blend will benefit from PDI, according to Prlich. “It’s still very early but we are honing in on our existing Hadoop customers as well as looking at new opportunities where Spark would be a good fit,” Prlich said.

Pentaho hopes to offer more real-time capabilities such as Spark Streaming and is looking into providing machine learning through Spark in the future. “As these big data use cases start to get built out and emerge, there are lots of potential opportunities for technology to get them faster deployed and more successful,” Prlich said.