Moving Data Between Hadoop and RDBMSs

Apache Hadoop has been a great innovation for storing large amounts of unstructured data, but users still need to reference data from existing RDBMS-based systems. This was the topic of “From Oracle to Hadoop: Unlocking Hadoop for Your RDBMS with Apache Sqoop and Other Tools,” a session at the Strata + Hadoop World conference. Guy Harrison, executive director of Research and Development at Dell Software, David Robson, principal technologist at Dell Software, and Kathleen Ting, a technical account manager at Cloudera and a co-author of O’Reilly’s Apache Sqoop Cookbook, looked at how to transfer data efficiently between Hadoop and an RDBMS using Sqoop.

Observing that Dell has been the largest independent vendor of database tools since its acquisition of Quest Software a few years ago, Harrison noted that relational technology has reigned supreme for over two decades. However, he said, it is unlikely that we will return to a one-size-fits-all approach to data management and, going forward, Hadoop and RDBMSs will have to work together. Apache Sqoop, he said, is designed to solve the problem of moving data in bulk between Hadoop and SQL databases.

Covering the key attributes of Apache Sqoop, the presenters said that Sqoop 1.4.5 now provides a direct-mode option for Oracle (the --direct parameter), with a 5x-20x performance improvement on Oracle table imports.
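
For readers who have not used the direct option, the following is a minimal sketch of how an Oracle table import with --direct might be assembled. It was not shown in the session; the host, service name, user, table, and paths are hypothetical placeholders, and the helper build_oracle_import is introduced here only for illustration. The flags themselves (--connect, --username, --password-file, --table, --target-dir, --num-mappers, --direct) are standard Sqoop 1 import arguments.

```python
import shlex


def build_oracle_import(jdbc_url, username, password_file, table,
                        target_dir, mappers=4, direct=True):
    """Return the argument list for a Sqoop 1 Oracle table import.

    Illustrative helper only; every value passed in is supplied by
    the caller and nothing here contacts a database or a cluster.
    """
    cmd = [
        "sqoop", "import",
        "--connect", jdbc_url,             # JDBC URL of the Oracle source
        "--username", username,
        "--password-file", password_file,  # file holding the password
        "--table", table,                  # source table to import
        "--target-dir", target_dir,        # HDFS directory for the output
        "--num-mappers", str(mappers),     # number of parallel map tasks
    ]
    if direct:
        # Ask Sqoop to use the database-specific direct connector
        # instead of the generic JDBC import path.
        cmd.append("--direct")
    return cmd


if __name__ == "__main__":
    # All values below are hypothetical placeholders.
    cmd = build_oracle_import(
        jdbc_url="jdbc:oracle:thin:@//db.example.com:1521/ORCL",
        username="SCOTT",
        password_file="/user/scott/.oracle_password",
        table="SCOTT.EMP",
        target_dir="/data/scott/emp",
    )
    # Print a shell-safe command line for review before running it.
    print(" ".join(shlex.quote(arg) for arg in cmd))
```

Building the command as a list rather than a single string keeps each argument separate, so the same list could be handed to subprocess.run on a machine where Sqoop is installed. Whether the speedup quoted by the presenters is reached will depend on the table and the cluster.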

Going forward, the design goals of Sqoop 2.0 are ease of use through a REST API and a Java API, ease of security through separation of responsibilities, and ease of extensibility through a Connector SDK, along with a focus on pluggability, said Ting. Sqoop 2.0, she added, is going to take Sqoop to “a whole new level” in terms of security, ease of use, and extensibility.