Pivotal Makes Major Improvements to Big Data Suite

Pivotal has updated its big data suite with upgrades to the Pivotal HD enterprise-grade Apache Hadoop distribution and performance improvements for Pivotal Greenplum Database.

Pivotal HD, now based on the Open Data Platform core, delivers an updated Hadoop stack with better stability, management, security, monitoring, and data processing, while Pivotal Greenplum Database gains major performance improvements with the addition of the Pivotal Query Optimizer.

The improvements are targeted at helping customers manage datasets that are expanding dramatically due to mobile, cloud, social media, and the Internet of Things, and tackle complex queries with speed and flexibility.

The enhanced Pivotal Query Optimizer, a cost-based query optimizer for big data that provides up to 100x performance improvements, is now available for the Greenplum Database in addition to Pivotal HAWQ (Hadoop With Query).

“We have redesigned the query optimization functionality within the database to really determine effectively the cost of processing a query across a number of machines and processors in a cluster. We have optimized the query process to leverage a massively parallel processing database in order to get the best query performance. It is not just optimized for parallel processing across a distributed database but also very extensible. You can configure it in great detail, which has boosted the performance in general,” said Nithin Rao, director of data product management, Pivotal.
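As a brief illustration of what a cost-based optimizer looks like in practice, Greenplum exposes an `optimizer` server configuration parameter for switching the cost-based optimizer on per session, and `EXPLAIN` shows the distributed plan it chooses; the table and query below are hypothetical examples, not from Pivotal:

```sql
-- Enable the cost-based query optimizer for the current session
-- (the "optimizer" setting is a Greenplum configuration parameter).
SET optimizer = on;

-- EXPLAIN prints the plan the optimizer selected, including the motion
-- (data redistribution) steps across the cluster's segment hosts.
-- The orders/customers schema here is invented for illustration.
EXPLAIN
SELECT c.region, SUM(o.amount)
FROM orders o
JOIN customers c ON o.customer_id = c.id
GROUP BY c.region;
```

The cost-based approach Rao describes means the optimizer weighs alternative join orders and data-movement strategies by their estimated cost across the cluster, rather than applying fixed rules.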

With this release, Pivotal HD includes major updates to Apache Hadoop components, including Apache Spark. It provides customers with better stability, management, security, monitoring, and data processing capabilities in the Hadoop stack. This allows enterprises to off-load more business-critical workloads to Hadoop, storing and processing large volumes of data at lower cost, in a way that is compliant with policies and regulations.

“We have made a significant update to the PHD distribution and added a significant amount of functionality,” said Rao. Now based on a standardized Open Data Platform (ODP) core consisting of Apache Hadoop 2.6 and Apache Ambari, the new release updates the existing Hadoop components for scripting and query (Pig and Hive), the non-relational database (HBase), and coordination and workflow orchestration (ZooKeeper and Oozie); adds the Apache Spark stack, including Spark SQL, Spark Streaming, MLlib, and GraphX; and adds further Hadoop components for improved security (Ranger, Knox), monitoring (Nagios and Ganglia, in addition to Ambari), and data processing (Tez).

The total number of distributions built on the Open Data Platform is now four, said Rao. In addition to Pivotal HD, there are the Hortonworks Data Platform and the IBM and Infosys distributions based on the core. “We are seeing the ODP get major traction in the market,” said Rao.

Standardization of Hadoop distributions on the Open Data Platform will ultimately benefit vendors as well as customers, said Rao. “Hadoop is rapidly gaining popularity, but there are a number of different distributions and we are starting to see fragmentation,” he said. “It is hard for a company to choose which direction to go in even though all the components of these Hadoop distributions are essentially open source. It is tough for customers to decide which to deploy and tough for the ecosystem of Hadoop vendors that are building tools on top of Hadoop to really determine which distribution to certify against. The idea is to take away that fragmentation.”

The idea is akin to the development of the Linux kernel, explained Sundeep Madra, vice president, Data Product Group, Pivotal.

“Think about the UNIX ecosystem with different vendors and when Linux came around,” he said. There is now a common kernel on which various distributions such as RHEL, CentOS, and Oracle Linux are based, he said. “As a software vendor you can be comfortable that if your software runs on one of those, it will run on any of the others. Now, by having four distributions with a common core, you can be sure that if you certify on any one it will run on any of the others. We think that is important for the growth of the ecosystem.”

More information is at