As an industry, we’ve been talking about the promise of data lakes for more than a decade. It’s a fantastic concept: put an end to data silos with a single repository for big data analytics, one place to house all your data to support product-led growth and business insight. Sadly, the data lake idea went cold for a while because early attempts were built on on-prem, Hadoop-based repositories that lacked resources and scalability. We ended up with a “Hadoop hangover.”
Data lakes of the past were known for management challenges and slow time-to-value. But the accelerated adoption of cloud object storage, along with the exponential growth of data, has made them attractive again.
In fact, we need data lakes to support data analytics now more than ever. While cloud object storage first became popular as a cost-effective way to temporarily store or archive data, it has caught on because it is inexpensive, secure, durable, and elastic, and because it is easy to stream data in. These qualities make the cloud a natural place to build a data lake, with one notable, and solvable, exception.
Data Lake or Data Swamp?
The economics, built-in security, and scalability of cloud object storage encourage organizations to store more and more data, creating a massive data lake with limitless potential for data analytics. Businesses understand that having more data (not less) can be a strategic advantage. Unfortunately, many data lake initiatives in recent history failed because the data lake became a data swamp: full of cold data that could not be easily accessed or used. Many organizations found that sending data to the cloud is easy, but making it accessible to the people across the organization who can analyze it and act on the resulting insights is hard. These data lakes became a dumping ground for multi-structured datasets, accumulating and collecting digital dust without a glimmer of the promised strategic advantage.
Simply put, cloud object storage wasn’t built for general-purpose analytics, just as Hadoop wasn’t. To gain insights, data must be transformed and moved out of the lake into an analytics platform or database such as Splunk, MySQL, or Oracle, depending on the use case. This process is complex, slow, and costly, and it is made harder by an industry-wide shortage of the data engineers required to cleanse and transform data and build the pipelines that feed these analytical systems.
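To make the transform-and-move step concrete, here is a minimal sketch of a traditional ETL job. SQLite stands in for the analytical database, and the record fields are illustrative assumptions, not from any specific system; a real pipeline would also need scheduling, schema management, and error handling, which is where the engineering cost lives.

```python
import json
import sqlite3

# Raw, multi-structured records as they might land in a data lake.
# Field names here are hypothetical, for illustration only.
raw_records = [
    '{"ts": "2023-01-05T10:00:00Z", "user": "alice", "status": 200}',
    '{"ts": "2023-01-05T10:00:01Z", "user": "bob", "status": 500}',
    'not valid json',  # real lakes accumulate malformed records too
]

def transform(line):
    """Cleanse one raw record; return a row tuple or None if unusable."""
    try:
        rec = json.loads(line)
        return (rec["ts"], rec["user"], int(rec["status"]))
    except (ValueError, KeyError):
        return None  # dropped during cleansing

# Load the cleansed rows into the analytical store (SQLite stands in).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE events (ts TEXT, user TEXT, status INTEGER)")
rows = [r for r in (transform(l) for l in raw_records) if r is not None]
conn.executemany("INSERT INTO events VALUES (?, ?, ?)", rows)

# Only now, after transformation and movement, can the data be queried.
errors = conn.execute(
    "SELECT COUNT(*) FROM events WHERE status >= 500").fetchone()[0]
print(errors)  # -> 1
```

Every record pays the full cost of parsing, cleansing, and loading before anyone can ask a single question of it, which is why this approach struggles at tens of terabytes per day.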
Gartner found that more than half of enterprises plan to invest in a data lake within the next two years despite these well-known challenges. Use cases for the data lake abound, from investigating cyber breaches through security logs to researching and improving the customer experience. It’s no wonder businesses are still holding onto the promise of the data lake. So how can we clean up the swamp and make sure these efforts don’t fail? And critically, how do we unlock and provide access to data stored in the cloud, the most significant barrier of all?
Turning Up the Heat on Cold Cloud Storage
It’s possible (and preferable) to make cloud object storage hot for data analytics, but it requires rethinking the architecture. The storage needs the look and feel of a database; in essence, cloud object storage becomes a high-performance analytics database or warehouse. “Hot data” means fast, easy access in minutes, not weeks or months, even when processing tens of terabytes per day. That kind of performance demands a different approach to data pipelines, one that avoids transformation and movement: compress, index, and publish data to tools such as Kibana or Looker via well-known APIs, so you store data once and move and process less.
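As a sketch of the store-once, move-less idea, the following compresses records at ingest, builds a simple inverted index from terms to record positions, and answers a search by touching only the matching records. The index structure and record fields are assumptions for illustration, not any specific product’s design; a production system would shard the index and keep the blobs in object storage.

```python
import gzip
import json
from collections import defaultdict

# Records as they arrive; in practice these would stream into object storage.
records = [
    {"host": "web-1", "msg": "login failed for admin"},
    {"host": "web-2", "msg": "login succeeded for alice"},
    {"host": "web-1", "msg": "disk usage at 90 percent"},
]

# Compress each record once at ingest and keep only the compressed blobs.
blobs = [gzip.compress(json.dumps(r).encode()) for r in records]

# Build an inverted index: term -> positions of records containing it.
index = defaultdict(set)
for pos, rec in enumerate(records):
    for term in rec["msg"].split():
        index[term].add(pos)

def search(term):
    """Decompress and return only the records that match the term."""
    return [json.loads(gzip.decompress(blobs[p]))
            for p in sorted(index.get(term, ()))]

hits = search("login")
print([h["host"] for h in hits])  # -> ['web-1', 'web-2']
```

The data never leaves its compressed, at-rest form except for the handful of records a query actually needs, which is the property that makes cloud object storage behave like a hot analytics store.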
One of the most important ways to turn up the heat on data analytics is by facilitating search. Search is the ultimate democratizer of data, allowing self-service data stream selection and publishing without IT admins or database engineers. All data should be fully searchable and available for analysis using existing data tools. Imagine giving users the ability to search and query at will, asking questions and analyzing data with ease. Most of the better-known data warehouse and data lakehouse platforms don’t provide this critical functionality.
But some forward-leaning enterprises have found a way. Take, for example, BAI Communications, whose data lake strategy embraces this type of architecture. In major commuter cities, BAI provides state-of-the-art communications infrastructure (cellular, Wi-Fi, broadcast, radio, and IP networks). BAI streams its data to a centralized data lake built on Amazon S3 cloud object storage, where it is secure and compliant with numerous government regulations. With that data lake activated for analytics through a multi-API data lake platform, BAI can find, access, and analyze its data faster, more easily, and in a more cost-controlled manner than ever before. The company is using insights generated from its global networks over multiple years to help rail operators maintain the flow of traffic and optimize routes, turning data insights into business value. This approach proved especially valuable when the pandemic hit: BAI could deeply understand how COVID-19 impacted public transit networks regionally, all around the world, so it could continue providing critical connectivity to citizens.