Learning to Swim in the Data Lake

The data lake approach is increasingly being championed as a way to realize the promise of big data. It allows organizations to avoid transforming and loading data into a purpose-built data store and instead keep data in its raw form in a data lake until they need it. The goal is to eliminate data silos and increase analytical agility by making data available to all users in a central, enterprise-wide reservoir.

While data warehouses have been the principal data storage repository for companies since the 1970s, companies have begun to look to the horizon for what comes next in data storage. To provide information about the key technologies, features, best practices and pitfalls to consider when evaluating a data lake approach, Database Trends and Applications recently hosted a special roundtable webcast presented by Rich Reimer, VP of marketing and product management, Splice Machine; Rodan Zadeh, director of product marketing, Attunity; and George Corugedo, CTO and co-founder, RedPoint Global Inc.

“Data lakes are becoming more interesting because companies are beginning to face a big squeeze. In many cases, what they are seeing is that every year their data is growing 30%-40%, but unfortunately the IT budgets are only growing 3%-4%,” explained Reimer. Data growth is clearly outpacing IT budgets, and database storage systems are overwhelmed. This has even caused some companies to throw out some of their data.

According to Reimer, once a company has an operational data lake, it can offload data from its data warehouses, use the data lake as part of the ETL process, and bring in data from across its source systems. Once that data has been brought together, the company can build applications on it and access data across those different applications.

In today’s market, though, companies still face challenges in adopting Hadoop. These include moving legacy enterprise data to Hadoop, keeping Hadoop data lakes in sync with data warehouses, and arming employees with the necessary skills to work with this new technology. “Companies are not just using this new concept of data lakes, but are using data warehouses in concert with emerging Hadoop technology,” stated Zadeh. Attunity Replicate is designed to upload data from a variety of databases. For example, data can be moved from a data lake to the operational data store, where the data has numerous uses.

Data lakes provide value for master data management (MDM) as well, added Corugedo. “The beauty of a data lake is the ability to keep all of the atomic, original, and granular data. That is beneficial for lots of reasons,” he pointed out. If there is any confusion about data origin or trustworthiness between IT staff and business users, organizations can go right back to the original data and restructure it in the correct form. Also, statisticians within a company want the data held in a data lake because they can only work with data in its most granular form.

View the webcast here

Related Articles

Splice Machine today announced the general availability of its Hadoop RDBMS, a platform for building real-time, scalable applications that incorporates new features that emerged from charter customers using the beta offering. With the additional new features and the validation from beta customers, Splice Machine 1.0 can support enterprises struggling with their existing databases and seeking to scale out affordably, said Monte Zweben, co-founder and CEO, Splice Machine.

Posted November 19, 2014