Pentaho Open Sources Big Data Capabilities to Fuel Adoption

Pentaho Corporation is making the big data capabilities of its new Pentaho Kettle 4.3 release freely available under open source, and has moved the entire Pentaho Kettle project to the Apache License Version 2.0.  Pentaho says that because Apache is the license under which Hadoop and several of the leading NoSQL databases are published, this move will help to further accelerate the adoption of Pentaho Kettle by developers, analysts and data scientists as a tool for big data management and analytics.

Big data capabilities available under open source Pentaho Kettle 4.3 include the ability to input, output, manipulate and report on data using Hadoop and NoSQL stores, including Apache Cassandra, Hadoop HDFS, Hadoop MapReduce, Apache Hive, Apache HBase, MongoDB and Hadapt's Adaptive Analytical Platform.

Pentaho Kettle also makes available job orchestration steps for Hadoop, Amazon Elastic MapReduce, Pentaho MapReduce, HDFS File Operations, and Pig scripts. In addition, Pentaho Kettle can execute ETL transforms either outside the Hadoop cluster or within the nodes of the cluster, taking advantage of Hadoop's distributed processing and reliability.

Pentaho Kettle's Hadoop capabilities work with major Hadoop distributions, such as Amazon Elastic MapReduce, Apache Hadoop, Cloudera's Distribution including Apache Hadoop (CDH), Cloudera Enterprise, Greenplum HD, HortonWorks Data Platform powered by Apache Hadoop, and MapR's M3 Free and M5 Edition.

Pentaho Data Integration, or Kettle, had been freely available under the LGPL, but the LGPL is not fully compatible with the Apache license, Doug Moran, product manager, Big Data, and vice president of Community at Pentaho, tells 5 Minute Briefing. The incompatibility is in the area of derivative works. Because Pentaho is pushing heavily into the big data space and is tightly integrated with Hadoop and Cassandra, and because much of that technology falls under the Apache license, "we decided it would just be easier to move the entire framework to Apache," says Moran. Moran also notes that the Apache website, which explains what is permissible for Apache projects, does not allow LGPL-licensed code as a third-party library.

"We would like Kettle to become pervasive and possibly be used in some of the other Apache projects," he says. Customers won't see any difference, and for the most part even community members may not see much difference. What matters, he emphasizes, is the perception that there are no issues with embedding Kettle alongside Hadoop or any of those other technologies.

Pentaho Kettle for big data delivers the following benefits to developers, analysts and data scientists, according to the company:

  • Delivers at least a 10x boost in productivity for developers through visual tools that eliminate the need to write code such as Hadoop MapReduce Java programs, Pig scripts, Hive queries, or NoSQL database queries and scripts;
  • Makes big data platforms usable for a wide range of developers, whereas previously big data platforms were usable only by those with deep developer skills such as the ability to write Java MapReduce jobs and Pig scripts;
  • Enables easy visual orchestration of big data tasks such as Hadoop MapReduce jobs, Pentaho MapReduce jobs, Pig scripts, Hive queries, HBase queries, as well as traditional IT tasks such as data mart/warehouse loads and operational data extract-transform-load jobs;
  • Leverages the full capabilities of each big data platform through Pentaho Kettle's native integration with each one, while enabling easy co-existence and migration between big data platforms and traditional relational databases.

Pentaho Kettle for Big Data is available for download from Pentaho, along with how-to guides, videos and additional resources.