
Don’t Get Washed Out by the Overflowing Data Lake: 5 Key Considerations


A powerful analytics platform can store and access all of your data. Organizations spread big data across Hadoop, databases, and the cloud because each option trades cost against performance differently. A multitiered approach offers quick access to the hot data that drives daily business and low-cost storage for data whose importance and timeliness vary.

An analytics platform that supports tiered storage helps you manage these tiers cost-effectively. You should be able to run advanced SQL queries on bulk data stored in HDFS, and to access the ORC, Parquet, text, and JSON files that exist in the tiers without moving the data. Move data to a faster tier only when your organization needs more performance for in-depth analytics.
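For instance, a platform built on Spark can register files from each tier as SQL views and query them in place. The sketch below assumes a running Spark cluster with HDFS access; the file paths, view names, and columns are hypothetical.

```python
# Minimal sketch (assumes a Spark cluster with HDFS access; paths and columns are hypothetical).
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("tiered-storage-query").getOrCreate()

# Register files in each tier as temporary views -- no copies, no format conversion.
spark.read.parquet("hdfs:///warm/sales_2022.parquet").createOrReplaceTempView("sales_2022")
spark.read.orc("hdfs:///cold/sales_archive.orc").createOrReplaceTempView("sales_archive")
spark.read.json("hdfs:///landing/clickstream.json").createOrReplaceTempView("clickstream")

# Standard SQL across tiers and formats, queried in place.
result = spark.sql("""
    SELECT s.region, COUNT(*) AS orders
    FROM sales_2022 s
    JOIN clickstream c ON s.customer_id = c.customer_id
    GROUP BY s.region
""")
result.show()
```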

5-Consider Open Standards, Not Just Open Source

The open source community has developed some remarkable data management technologies. Kafka and Spark are common in today's landscape and provide useful capabilities such as data ingestion and operational analytics. However, when open source tools can't handle all of your unique needs, your commercial tools need to play nicely with them and exhibit the same "openness" toward the open source ecosystem.
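As a concrete illustration of that kind of interoperability, Spark Structured Streaming can ingest events from a Kafka topic and land them in an open file format for downstream analytics. This is only a sketch: the broker address, topic name, and output paths are hypothetical, and the job assumes the Spark Kafka connector package is on the classpath.

```python
# Minimal sketch of Kafka ingestion with Spark Structured Streaming
# (broker address, topic name, and output paths are hypothetical).
from pyspark.sql import SparkSession
from pyspark.sql.functions import col

spark = SparkSession.builder.appName("kafka-ingest").getOrCreate()

# Subscribe to a Kafka topic and keep the message payload as a string.
events = (
    spark.readStream
         .format("kafka")
         .option("kafka.bootstrap.servers", "broker:9092")
         .option("subscribe", "orders")
         .load()
         .select(col("value").cast("string").alias("payload"))
)

# Land the raw events in an open file format for later analysis.
query = (
    events.writeStream
          .format("parquet")
          .option("path", "hdfs:///landing/orders")
          .option("checkpointLocation", "hdfs:///checkpoints/orders")
          .start()
)
query.awaitTermination()
```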


This is why your software must support open standards. With open standards, your company can pick and choose among competing vendors without being locked in to any one platform or technology. Many people assume that open source software offers the same advantages, but it doesn't. Open source simply means that the underlying code is available for inspection and modification. If you want to access your Parquet data, it shouldn't matter whether the solution's source code is open or not. It only matters that you can access that file format directly, without copying or moving the data between solutions and without converting it to a proprietary format.
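To make the point concrete, here is a minimal sketch of reading a Parquet file directly with the open source Arrow library; any other tool that understands the open format can read the same file with no export or conversion step. The file path is hypothetical.

```python
# Minimal sketch: any tool that speaks the open Parquet format can read the file
# in place -- no export, no proprietary conversion (the path is hypothetical).
import pyarrow.parquet as pq

table = pq.read_table("sales_2022.parquet")  # could equally be an HDFS or cloud path
print(table.schema)            # column names and types, straight from the file
print(table.to_pandas().head())  # hand the same data to pandas without copying it elsewhere
```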

The value of standards extends beyond file formats as well. Can you use standard SQL to perform analytics? Can the extract-transform-load tool talk to the database to load data? Can you use a standard visualization tool to create compelling data stories? Can users who prefer Python work with the data in your analytical systems while still benefiting from the optimizations those systems offer? Sticking with open standards helps everyone using the data, even when migrating systems.
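As one illustration of what open standards buy you, the sketch below runs a standard SQL query through a standard Python database interface (SQLAlchemy plus pandas). The connection string, table, and columns are hypothetical; the same query should work against any engine that honors standard SQL.

```python
# Minimal sketch: standard SQL over a standard Python interface
# (the connection string, table, and columns are hypothetical).
import pandas as pd
from sqlalchemy import create_engine

engine = create_engine("postgresql://analyst@warehouse:5432/sales")

# The same query works against any engine that honors standard SQL.
df = pd.read_sql(
    "SELECT region, SUM(amount) AS revenue FROM orders GROUP BY region",
    engine,
)
print(df)  # ready for a standard visualization tool or further analysis
```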

What’s Ahead

Our business colleagues see fictional portrayals of data analytics in the movies, and it's no surprise that they crave the same capabilities. They see the Uber or GrubHub app showing how far away their driver or food is, and they find it useful. They see how Google Maps can predict where they want to go and how long it will take to get there, and they want that information. Naturally, they expect more from you in managing your data architecture and delivering analytics that address business challenges.

Realizing those fictional portrayals of data analytics is within reach. Only by avoiding the pitfalls of available technologies can we start to achieve success with the data lake. As technology advances, fiction will drive our reality. Be ready for it.
