Dell, Cloudera and Syncsort Streamline ETL Processes for New Hadoop Users

Dell has introduced a new solution for Hadoop, co-designed with Cloudera and Syncsort, to ease the planning, design, construction and deployment of extract, transform and load (ETL) processes: transforming data into a state ready for analysis, then loading it for business reporting or querying.

According to Armando Acosta, Dell’s Hadoop product and planning manager, the new solution is meant to address a problem Dell has seen with many customers. They want to adopt big data and analytics technologies to gain operational efficiency, but are unsure how to get started and find that Hadoop expertise can be difficult to locate. The “Dell | Cloudera | Syncsort Data Warehouse Optimization – ETL Offload Reference Architecture” aims to help them achieve faster insights and drive operational efficiency by defining how to integrate Hadoop into their environment. With this specific reference architecture, Dell recommends that customers use Hadoop to augment their data warehouse by offloading data transformation jobs, which tend to consume a lot of warehouse capacity, and shifting those workloads into Hadoop.

The end-to-end solution also integrates Syncsort technology. Integration with Syncsort allows Dell to put an ETL engine on top of its Hadoop reference architecture, explained Acosta. “The beauty of this is that if you look at somebody with an enterprise data warehouse and you look at their skillset, they are used to writing SQL scripts, and that is how they get their data transformation jobs.” Roadblocks to Hadoop use include a lack of Java and MapReduce expertise as well as the need to rewrite existing SQL code into Pig or Hive. To address them, Syncsort provides a tool called SILQ that allows organizations to take a SQL script, translate it into a MapReduce job, and then run the data transformation job within the Dell reference architecture.

According to Syncsort, one of the biggest barriers to offloading from the data warehouse into Hadoop has been a legacy of thousands of scripts built and extended over time. SILQ removes this roadblock, eliminating the complexity and risk. “The beauty is that customers don’t have to retool, and they don’t have to worry about that skills gap because we make the use case very easy from end to end,” said Acosta.

Organizations are always looking for ways to control costs, and by moving data transformation jobs and datasets into Hadoop they can also lower the costs associated with their enterprise data warehouse, Acosta added.

The reference architecture has been fully validated and tested, said Acosta. “We built it hand in hand with Syncsort and Cloudera to make sure that it is a validated, certified solution that is going to work the moment you plug it in.”

To put this to the test, Dell says it had an entry-level technician and an expert-level senior engineer run the same workload on a Hadoop cluster of four Dell PowerEdge R730xd servers and two Dell PowerEdge R730 servers, powered by the Intel Xeon processor E5-2600 v3 product family. The technician created ETL jobs with the Dell | Cloudera | Syncsort solution 60% faster than the expert-level senior engineer running the same scenario with do-it-yourself, open-source ETL solutions. Additionally, the entry-level technician was able to streamline ETL design by 53%, gaining back the equivalent of four days.

The solution is specifically optimized for Cloudera, although this is not to say that an organization could not take this reference architecture and run it with Hortonworks, said Acosta. “It is totally doable but we are promoting this solution specifically with Cloudera. Cloudera has been our number-one partner in Hadoop.”

For more information, visit