Paxata Self-Service Data Preparation to Improve Governance and Ease of Use

Paxata, provider of a data preparation platform for the enterprise, has announced the availability of its Winter ’15 product release. 

The new release allows administrators to deploy Paxata in heterogeneous environments, including the Hortonworks Data Platform on YARN and with multiple versions of Apache Spark. The latest release also improves the way business analysts find, access, and apply data by delivering additional ease-of-use capabilities supported by machine learning innovations, and it provides enterprise-grade security and a multi-tenant governance model.

Closing the Gap Between Data Gathering and Information Creation

There is a gap between data and information, said Nenshad Bardoliwalla, co-founder and chief product officer at Paxata. Although organizations collect and store huge quantities of data, they still have difficulty turning it into usable information for decision making. “We coined the term ‘self-service data preparation,’” said Bardoliwalla, to describe the process of letting people in the enterprise who have limited technical expertise, and who may only have the skill set to work with Microsoft Excel, access data from a variety of big data and traditional sources and create information that is clean and relevant.

Many IT organizations are still using data preparation technologies that go back 20 to 30 years and involve ETL and MDM processes that take months to transform data, said Bardoliwalla, observing that those timeframes are no longer feasible. Paxata, he said, seeks to do for data preparation what Tableau has done for data visualization and “is targeted at a much broader swath of people than those in IT departments.”

Paxata customers today include the top three banks in the U.S., a large semiconductor company in the world, a trusted audit firm, a leading networking company, and large government agencies, he noted. “We are very proud of the customers we have been able to amass,” said Bardoliwalla. “These companies will not tolerate a system that does not provide governance. We knew from day one that if we were going to bring self-service to the data preparation space, we would have to combine the freedom and flexibility that end users expect, but also the governance, scale, and security that IT teams expect.” The software provides versioning and step tracking, enabling the system to reproduce with true fidelity how users have worked with the data to transform it into usable information, he noted.

The Winter '15 release includes three key areas of enhancement, said Bardoliwalla.

  1. Additional Machine Learning to Allow End Users to Work More Quickly: With this release, Paxata continues to increase the intelligence of the system to find relationships and commonality in data using additional machine learning and other advanced algorithms to improve accuracy and performance for recommendations. It also includes expanded coverage for multiple dataset use cases, and enhancements to facilitate new business scenarios with complex textual data such as product catalog descriptions.
  2. Increased Governance in the Enterprise: Enhancing security, the new release adds multi-tenant, Lightweight Directory Access Protocol (LDAP)-compliant and SAML 2.x integration to the system. Paxata syncs directly with existing authentication and authorization providers to keep status up to date as users join, switch groups, and leave the organization, avoiding the common sync challenges found in static systems and ensuring unified user management.
  3. Increased Support of Multiple Platforms Across the Big Data Landscape: The Winter release provides full support for the Hortonworks Data Platform, in addition to the Cloudera Distribution for Hadoop. Paxata can connect to multiple versions of Cloudera and Hortonworks clusters simultaneously. Just as customers in the past had multiple RDBMS versions, today many are adopting multiple Hadoop distributions, said Bardoliwalla.

Paxata’s release can also be run on Apache Spark deployed on YARN for optimized cluster usage. “We have brought Spark to many organizations because we bet on that technology back in 2013,” said Bardoliwalla.