Teradata has introduced a new "design pattern" approach for data lake deployment. The company says its concept of a data lake pattern leverages IP from its client engagements, along with services and technology, to help organizations reach successful data lake deployments more quickly and securely.
“What we are doing is capturing and recycling IP to de-risk and accelerate the data lake journey,” said Chad Meley, vice president of products and services at Teradata. “We have had several years of quality experience working with the leaders of data-driven cultures and we have seen what works and doesn’t work.”
According to Teradata, organizations are exploring data lake functionality to create insight and opportunity from large data volumes, but many IT teams are facing serious problems due to a lack of best practices, the shortage of data scientists, and even confusion regarding the definition of a data lake. While data lakes are often assumed to be synonymous with Hadoop, Teradata asserts, organizations need to understand that a data lake can be built on multiple technologies such as Hadoop, NoSQL, Amazon Simple Storage Service, a relational database, or various combinations of those technologies.
The Design Pattern Is a Plan
With the concept of a data lake design pattern, Teradata says, it is providing the plan that is necessary for an effective enterprise data lake.
The design pattern approach that Teradata is offering consists of intellectual property based on enterprise-class best practices acquired from real-world implementations combined with products.
With the 2014 acquisition of the big data consultancy Think Big, Teradata says, it has amassed IP and a set of best practices that are tried and true and tested in the field to help organizations build data lakes.
Teradata’s Data Lake Design Pattern services from Think Big include a Data Lake Foundation, for teams just getting started with a data lake or seeking best practices consulting; Data Lake Architecture, designed for organizations that are looking for recommendations for data lake best practices and technology choices; and Data Lake Analytics, which supports data preparation for execution of analytics cycles.
Teradata is also offering products and technologies for use with data lake environments, such as Teradata Listener, which it says simplifies streaming big data into the data lake with an intelligent, self-service software solution; Teradata Appliance for Hadoop for storing data; Presto, which provides a SQL-on-Hadoop architecture; and data lake accelerators built from IP, referred to as Pipeline Controller and Buffer Server, which combine to orchestrate data movement from local servers into Hadoop.
Teradata views a data lake as a collection of long-term data containers that capture, refine, and explore any form of raw data at scale, enabled by low cost technologies, from which multiple downstream facilities may draw.
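This separation of the data lake concept from any particular storage technology can be sketched in code. The following minimal Python example is purely illustrative and not a Teradata product: it models a raw zone that captures arbitrary records into date-partitioned containers on a local filesystem, where a real deployment might instead back the same pattern with Hadoop, Amazon S3, or a NoSQL store. All function and path names here are hypothetical.

```python
import json
import tempfile
from datetime import date
from pathlib import Path

def capture(lake_root: Path, source: str, record: dict) -> Path:
    """Append a raw record to a date-partitioned container for one source.

    The backing store here is a local filesystem; a real data lake could
    swap in HDFS, Amazon S3, or a NoSQL store without changing the pattern.
    """
    partition = lake_root / "raw" / source / date.today().isoformat()
    partition.mkdir(parents=True, exist_ok=True)
    target = partition / "records.jsonl"
    with target.open("a") as f:
        f.write(json.dumps(record) + "\n")
    return target

def explore(lake_root: Path, source: str):
    """Yield every raw record captured for a source, across all partitions."""
    for path in sorted((lake_root / "raw" / source).rglob("records.jsonl")):
        with path.open() as f:
            for line in f:
                yield json.loads(line)

# Multiple downstream facilities can draw from the same raw zone.
lake = Path(tempfile.mkdtemp())
capture(lake, "clickstream", {"user": "u1", "page": "/home"})
capture(lake, "clickstream", {"user": "u2", "page": "/pricing"})
print(len(list(explore(lake, "clickstream"))))  # prints 2
```

The point of the sketch is that "capture raw data as-is, refine and explore later" is a design pattern; which storage technology sits beneath `capture` is an implementation choice.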
A data lake design pattern, in contrast, is an architecture and set of corresponding requirements with best practices for implementations, although how a pattern is implemented can vary from workload to workload, and from organization to organization.
Today, companies are finding that there are numerous problems that can arise on the road to a data lake deployment, said Meley. One of the problems is that the barrier to entry on the cost side is so low that various departments within an organization end up with not one but multiple data lake repositories. “The reason that is a problem is that you suddenly have multiple versions of data, different protocols for how it is deployed and secured,” he said.
“The other thing we see is tension between people deploying the data lake and the business users around the competing objectives,” he noted. On the one hand, he said, what makes a data lake interesting is that it is not a data warehouse, meaning there is not a lot of rigor upfront, but the downside is that if the data lake, for example, contains PII, it can be a governance nightmare. “A lot of people are struggling to find the right balance,” he said. While a small startup might get away with that, “when you start dealing with the kinds of clients that we are dealing with, that is just unacceptable,” Meley noted.
There is also a lack of appreciation upfront of the skills required for users to engage with and get value from the data in a data lake, he noted. Moreover, many large organizations are realizing that they are not going to gain competitive advantage by putting the work into having the best ingestion framework for their data lake, and instead want to buy those services and technologies so they can focus on capabilities that will allow them to compete and differentiate themselves.
More Than Hadoop
“A lot of people associate Teradata with data warehousing technology, but it is more than the technology that helps clients manage their data with a data warehouse,” said Meley, adding that Teradata “led the charge” years ago in defining what a data warehouse is in a non-technical way, with concepts such as a single version of the truth, integrating heterogeneous data, star schemas, and other ideas that had nothing to do with the underlying technologies.
According to Meley, what is now critical to organizations' ability to reap the benefits of the data lake is understanding the data warehouse as separate from the technology underlying it. In the same way that a data warehouse is a design pattern and an MPP relational database is a technology that supports it, the same differentiation has to happen between data lakes and their underlying technologies in order for people to begin to realize the data lake promise, he said. “That is the essence of what we are trying to bring to market.”
For more information, visit the Teradata website.