The Rising Challenge of Muddy Data Lakes

Jul 11, 2022

By Dong Li, VP of Growth & Marketing, Kyligence

As digital transformation becomes essential for enterprises, more and more businesses are claiming ownership of their data assets. To support the analytics and applications that drive business growth and operations, they are widely adopting data lakes as their data infrastructure because of the lakes' scalability and flexibility.

Thanks to their "schema-on-read" structure, data lakes are good at storing petabytes of data and delivering it to production quickly. But every coin has two sides. Because data lakes are semantically flexible stores that often bypass governance efforts, they have come to be seen as muddy swamps that are inefficient for data management.

Data analysts constantly seek new insights from cross-domain data sources. The traditional way to accomplish this is to develop an end-to-end pipeline that generates insights for one specific scenario. Each business group reuses almost nothing from the others because people are cautious about using data that is outside their control.

Because business groups rarely collaborate to reuse metrics and tables, more and more tables appear in data lakes. For example, one Internet company's 5,700 source tables exploded into almost 1 million wide tables and aggregated tables in its data lake. As a result, the data team had to govern an excessive number of tables. It faced the challenge not only of managing data quality and consistency but also of containing the ever-growing cost of the data explosion.

Data lakes can hold millions of tables, which require large amounts of computation and storage resources as well as human effort for development and maintenance. But consider the return: How much is each table actually used? And how much does a single query cost?

To meet these challenges, many data lake practitioners are bringing data warehouse methodology to data lakes and building lakehouse architectures. Data warehouse methodologies impose strict criteria for data quality and standards, which can help governance efforts catch up.

Multidimensional database technology has been maturing in the data analysis industry since the 1960s.[1] A multidimensional database is a variation of the relational model that uses multidimensional structures to organize data and express the relationships between data.[2] Multidimensional databases are widely recognized today as online analytical processing (OLAP) technology and are positioned at the data mart layer of a data warehouse.

The Benefits of Multidimensional Databases

Unlike relational databases, which use tables as their key entities, multidimensional databases take another approach. They define data models based on the relationships between tables. These models consist of business dimensions and measures rather than columns and rows, and they form a unified semantic layer that serves as a single source of business metrics. As a result, business users are aligned on standardized semantic definitions, which relieves the pain of not trusting one another's data.
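To make the idea of a semantic layer concrete, here is a minimal Python sketch. It is an illustration only, not how any particular product defines its models: the class names (Measure, SemanticModel), the fact table "sales", and its columns are all hypothetical. It shows a metric being defined once, in business terms, and then reused by every consumer instead of being re-implemented in each pipeline.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Measure:
    name: str         # business-facing metric name (hypothetical)
    column: str       # source column in the fact table (hypothetical)
    aggregation: str  # aggregate function, e.g. "sum" or "count"


@dataclass(frozen=True)
class SemanticModel:
    fact_table: str
    dimensions: tuple
    measures: tuple

    def to_sql(self, measure_name, by=()):
        """Render one shared metric definition as SQL grouped by `by`."""
        unknown = [d for d in by if d not in self.dimensions]
        if unknown:
            raise ValueError(f"unknown dimensions: {unknown}")
        matches = [m for m in self.measures if m.name == measure_name]
        if not matches:
            raise KeyError(f"unknown measure: {measure_name}")
        m = matches[0]
        columns = list(by) + [f"{m.aggregation.upper()}({m.column}) AS {m.name}"]
        sql = f"SELECT {', '.join(columns)} FROM {self.fact_table}"
        if by:
            sql += f" GROUP BY {', '.join(by)}"
        return sql


# Illustrative model: every team asks for "revenue" and gets the same definition.
sales_model = SemanticModel(
    fact_table="sales",
    dimensions=("region", "year"),
    measures=(Measure("revenue", "amount", "sum"),),
)
print(sales_model.to_sql("revenue", by=["region"]))
```

Because "revenue" is defined in one place, two business groups that slice it by different dimensions still share the same aggregation logic, which is the alignment the semantic layer is meant to provide.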

From the storage perspective, multidimensional databases persist data in the form of OLAP cubes, which store the aggregated results for multiple combinations of dimensions to accelerate multidimensional analytics. Data engineers just need to prepare fact tables and dimension tables in a star schema or snowflake schema and then ingest them into cubes, rather than creating separate wide tables for each report and application. As a result, storage is well governed and the muddy lake becomes clearer.
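The following toy Python sketch illustrates what storing aggregated results for combinations of dimensions means. It is a simplified assumption-laden example, not the storage format of any real OLAP engine: the function names (build_cube, lookup) and the sample rows are hypothetical, and it materializes every combination of dimensions, whereas production systems typically prune that set.

```python
from collections import defaultdict
from itertools import combinations


def build_cube(rows, dimensions, measure):
    """Pre-aggregate the sum of `measure` for every combination of dimensions."""
    dimensions = sorted(dimensions)  # fixed order so lookups are unambiguous
    cube = {}
    for size in range(len(dimensions) + 1):
        for dims in combinations(dimensions, size):
            cuboid = defaultdict(float)
            for row in rows:
                key = tuple(row[d] for d in dims)
                cuboid[key] += row[measure]
            cube[dims] = dict(cuboid)
    return cube


def lookup(cube, **filters):
    """Answer an aggregate query from stored results, without rescanning rows."""
    dims = tuple(sorted(filters))
    key = tuple(filters[d] for d in dims)
    return cube[dims].get(key, 0.0)


# Hypothetical fact rows, as they might arrive from a star schema join.
rows = [
    {"region": "east", "year": 2021, "amount": 100.0},
    {"region": "east", "year": 2022, "amount": 150.0},
    {"region": "west", "year": 2022, "amount": 70.0},
]
cube = build_cube(rows, ["region", "year"], "amount")

print(lookup(cube, region="east"))             # total for one region
print(lookup(cube, year=2022))                 # total for one year
print(lookup(cube, region="east", year=2022))  # one cell of the cube
print(lookup(cube))                            # grand total
```

With two dimensions the cube holds four groupings (none, region, year, and region plus year), and every query above is a dictionary lookup. The same cube serves all four reports, where a wide-table approach would tend to create a separate table for each.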

With enterprises moving to the cloud, IT teams are seeking more effective ways to serve larger data volumes and more business requirements at a lower cost. From an architectural perspective, storage resources are usually cheaper than computation and network resources, so multidimensional databases adopt pre-computation, using more storage in exchange for less computation. Especially for repetitive analytic workloads such as BI dashboards or data services, multidimensional databases will deliver a better return on investment (ROI) than relational MPP databases.

Multidimensional databases store semantic information such as dimensions and measures, which are easy for business users to understand. Data analysts and business users can find the data they want through this semantic layer. Meanwhile, data engineers can govern the data lake comfortably and easily identify the hot data models that represent the most valuable data.

Multidimensional databases store data in multidimensional structures, bring back semantic standards, and can serve as pain relievers for the governance of muddy data lakes. Moreover, they accelerate data analytics, reduce the time from data to insight, and enable citizen analytics.

References
