Enable data discovery. Having all the data accessible in one place makes data discovery much easier. You can also explore the various data types directly, because Hadoop applies schema on read rather than requiring you to spend time fitting data to existing schemas. Iterate with data blending. Think about what types of data you could join or derive for further analytics. Data lakes make great discovery environments because Hadoop supports kinds of processing that traditional databases don’t handle as well.
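To make schema on read concrete, here is a minimal sketch in plain Python rather than any specific Hadoop tool. The records, field names, and the PURCHASE_SCHEMA mapping are hypothetical and exist only for illustration. The raw data is stored untouched, and each reader applies the schema it needs when it reads.

```python
import io
import json

# Hypothetical raw records, landed as-is (one JSON object per line).
# Nothing was reshaped or rejected at write time.
RAW_EVENTS = """\
{"user": "a17", "amount": "19.99", "ts": "2015-03-01"}
{"user": "b42", "amount": "5", "channel": "web"}
{"user": "c03", "ts": "2015-03-02", "note": "no amount recorded"}
"""

# The schema lives with the reader, not with the storage:
# field name -> (converter, default used when the field is absent).
PURCHASE_SCHEMA = {
    "user": (str, None),
    "amount": (float, 0.0),
}


def read_with_schema(lines, schema):
    """Project each raw record onto the given schema at read time."""
    for line in lines:
        line = line.strip()
        if not line:
            continue
        record = json.loads(line)
        yield {
            field: convert(record[field]) if field in record else default
            for field, (convert, default) in schema.items()
        }


rows = list(read_with_schema(io.StringIO(RAW_EVENTS), PURCHASE_SCHEMA))
total = sum(row["amount"] for row in rows)
```

A second analysis could read the same raw lines with a different schema (for example, one that keeps "channel" and "ts") without reloading or restructuring the stored data.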
Enable data science in the data lake. Data scientists want raw data, historical data from internal sources, and they want to mix it up with new sources. Their requirements typically extend beyond the capability of SQL, so they need flexible programming environments as well as access to machine-learning libraries and scripting libraries, which should also be included in the data lake.
Enable enterprise business intelligence. BI is not separate or going away. As a component in the architecture, the data lake feeds the enterprise data warehouse, and as such, it is a great prototyping environment for enterprise BI. When you have a new data source or project, you need to be able to profile the data, test business rules, and wrangle the data. It is often effective to use data virtualization as an abstraction to mask any complexity, and yet be able to quickly and easily test and show users what they need to know. In doing so, you get the benefits of discovery when you’re trying to do analysis.
Additionally, IT can build to specific requirements, but the business doesn’t always know what it doesn’t know. Agile BI was the first step, but discovery lets the business flesh out definitions and build better EDWs (including the data, tools, and governance) because you know what you’re going to be working with.
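The prototyping step described above, profiling a new source and testing business rules against it, might look something like the following sketch. The sample rows, column names, and rules are all hypothetical, and a real project would more likely rely on its existing profiling or data-quality tooling.

```python
# Hypothetical sample pulled from a new data source in the lake.
SAMPLE = [
    {"order_id": 1, "country": "US", "total": 120.0},
    {"order_id": 2, "country": "DE", "total": -5.0},
    {"order_id": 3, "country": None, "total": 40.0},
    {"order_id": 4, "country": "US", "total": None},
]


def profile(rows):
    """For each column, count nulls and distinct non-null values."""
    columns = {key for row in rows for key in row}
    report = {}
    for col in sorted(columns):
        values = [row.get(col) for row in rows]
        non_null = [v for v in values if v is not None]
        report[col] = {
            "nulls": len(values) - len(non_null),
            "distinct": len(set(non_null)),
        }
    return report


# Hypothetical business rules: rule name -> predicate over one row.
RULES = {
    "total_is_non_negative": lambda r: r["total"] is None or r["total"] >= 0,
    "country_is_present": lambda r: r["country"] is not None,
}


def failed_rules(rows, rules):
    """Return, for each rule, the order_ids of the rows that violate it."""
    return {
        name: [r["order_id"] for r in rows if not check(r)]
        for name, check in rules.items()
    }


report = profile(SAMPLE)
violations = failed_rules(SAMPLE, RULES)
```

Running checks like these against raw data in the lake shows which rules the source can actually satisfy before anything is modeled into the warehouse.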
3. Shift your mindset.
Our survey revealed lessons learned from data lake pioneers. Significantly, in adopting a data lake strategy and architecture, you need to be prepared to change the way you think about data. You need to move from a “current project” perspective to long-term thinking.
Think about reusability. As you’re pulling data into the lake, consider whether there’s future value to be gained from the data even though you might not currently need it. Also, consider who else may benefit from the data—either now or in the future. Look at the data in the lake as an investment; the more you can consolidate data in one place, the better the resource it becomes.
From a technology efficiency standpoint, as you add more and more data, you gain more parallelism and more concurrency in the data lake. If you combine all the clusters into a single data lake cluster that serves multiple applications (a platform or foundation for all applications) and bring all that data into one place so it is reusable, all the apps benefit from the improved performance. Having multiple clusters works against you, so don’t fracture datasets even when it’s possible. The more you can compound your investment, the better it will work. This kind of critical mass allows resource pooling that yields benefits in the long term. It’s similar to the benefits of server virtualization, in which multiple applications share a single physical server, but without the containers: these are the benefits of centralization.
Think about establishing governance first. Governance is the leading challenge for data lakes, as identified by 71% of survey respondents. It is essential to establish governance at the outset of your data lake adoption; postponing it makes it much more difficult to move, map, and assign privileges. When it comes to governance in discovery mode, think about embracing sharing and collaborative peer reviews, thus loosening governance rules but increasing monitoring roles. Even when the data will be put in the warehouse for use with standard tools by users, an approach to governance must be set at the start of the data lake to avoid data swamps/data dumps.
Think about tackling security up front. Sixty-seven percent of survey respondents reported that security is a serious challenge. You need to determine what kind of security will be necessary for incoming data, as well as establish barriers, access points, and networks. Also, what security factors will surface as Hadoop meets enterprise data management? Hadoop now functions as part of the enterprise architecture, not just on a project basis as in the early days, so think big picture and plan ahead for what will be necessary down the road. Consider: What are your standards? How do your tools work? How do they work with enterprise security? As discussed earlier, it might be effective to set up security zones or an access cluster.
A Key Component of an Organizational Data Strategy
As the barriers of technology, price/performance, and maturity give way, we are able to advance our data management principles and techniques. As such, data lakes, or whatever label you choose, are an inevitable component of an organizational data strategy. Forward-thinking adopters see data lakes as a way to make infrastructure ubiquitous and to tackle data reusability and multiple workloads without data duplication, thus enabling agility, discovery, and data science capabilities. As early adopters have taught us, addressing data organization, governance, and security at the outset will help ensure that we are not creating tomorrow’s data swamps on our journey to a data lake.