The Data Lakehouse: Data Warehousing and More

Dipankar Mazumdar, Jason Hughes, JB Onofre
2023-10-13
Abstract: Relational Database Management Systems designed for Online Analytical Processing (RDBMS-OLAP) have been foundational to democratizing data and enabling analytical use cases such as business intelligence and reporting for many years. However, RDBMS-OLAP systems present some well-known challenges. They are primarily optimized only for relational workloads, lead to proliferation of data copies which can become unmanageable, and since the data is stored in proprietary formats, it can lead to vendor lock-in, restricting access to engines, tools, and capabilities beyond what the vendor offers. As the demand for data-driven decision making surges, the need for a more robust data architecture to address these challenges becomes ever more critical. Cloud data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but they present their own set of challenges. More recently, organizations have often followed a two-tier architectural approach to take advantage of both these platforms, leveraging both cloud data lakes and RDBMS-OLAP systems. However, this approach brings additional challenges, complexities, and overhead. This paper discusses how a data lakehouse, a new architectural approach, achieves the same benefits of an RDBMS-OLAP and cloud data lake combined, while also providing additional advantages. We take today's data warehousing and break it down into implementation independent components, capabilities, and practices. We then take these aspects and show how a lakehouse architecture satisfies them. Then, we go a step further and discuss what additional capabilities and benefits a lakehouse architecture provides over an RDBMS-OLAP.
Databases
What problem does this paper attempt to address?
The paper addresses the challenges that existing data architectures face in large-scale data analysis. Specifically:

1. **Limitations of Relational Database Management Systems (RDBMS-OLAP)**:
   - **Only Optimized for Relational Workloads**: RDBMS-OLAP systems are mainly optimized for structured data and have difficulty handling semi-structured or unstructured data, which limits their use in advanced analysis tasks such as machine learning.
   - **Proliferation of Data Copies**: To meet different analysis requirements, enterprises usually create multiple copies of the data, which increases management complexity and can lead to data inconsistency.
   - **Vendor Lock-in**: Because data is stored in a proprietary format, it is difficult for enterprises to switch to other data processing tools or engines.
2. **Limitations of Cloud Data Lakes**:
   - **Lack of Transaction Support**: Traditional data lakes lack the transaction support (ACID properties) that RDBMS-OLAP systems provide, so they cannot guarantee data consistency and reliability.
   - **Data Governance and Quality**: Data in a data lake usually lacks effective governance and quality control mechanisms, which makes it unreliable.
3. **Complexity of the Two-Tier Architecture**:
   - **Additional Complexity and Overhead**: To combine the advantages of RDBMS-OLAP systems and data lakes, many enterprises have adopted a two-tier architecture, but this brings additional complexity and management overhead, such as maintaining ETL pipelines and dealing with data copies.

To solve these problems, the paper proposes a new data architecture, the data lakehouse, which aims to combine the advantages of RDBMS-OLAP systems and data lakes while overcoming their disadvantages. Specific goals include:

- **Provide Transaction Support**: The data lakehouse supports ACID transactions to ensure data consistency and reliability.
- **Open Data Format**: Data is stored in an open format (such as Apache Parquet or ORC), which allows different engines to access the same data set and avoids vendor lock-in.
- **Reduce Data Copies**: Engines access the source data directly, which reduces unnecessary data copies.
- **Data Governance and Quality**: Mature data governance and quality control practices are adopted to ensure data security and compliance.
- **Flexible Schema Management**: Schema evolution is supported without rewriting the entire table.
- **High Scalability**: Storage and compute are separated, so low-cost cloud object storage and independently scalable compute resources can support a variety of analysis workloads.

Through these improvements, the data lakehouse aims to provide a more efficient, more flexible, and more reliable data architecture that meets the needs of modern enterprises for data-driven decision-making.
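Two of the goals above, transactional commits and schema evolution without rewriting the table, can be illustrated with a toy sketch. It is not the paper's implementation and does not model any real table format: the `ToyTable` class, its file naming, and the use of JSON instead of Parquet are all assumptions made for illustration. The idea shown is only that data files stay immutable while a small numbered snapshot file records the current schema and file list, so a schema change is a metadata-only commit and a stale writer is rejected.

```python
import json
import os
import tempfile
import uuid


class ToyTable:
    """Hypothetical table: immutable JSON data files plus numbered snapshot files."""

    def __init__(self, root):
        self.root = root
        os.makedirs(root, exist_ok=True)

    def _snapshot_path(self, version):
        return os.path.join(self.root, f"v{version:05d}.snapshot.json")

    def current_version(self):
        names = [n for n in os.listdir(self.root) if n.endswith(".snapshot.json")]
        return max((int(n[1:6]) for n in names), default=0)

    def load(self, version=None):
        version = self.current_version() if version is None else version
        if version == 0:
            return {"schema": [], "files": []}  # empty table before any commit
        with open(self._snapshot_path(version)) as f:
            return json.load(f)

    def commit(self, base_version, snapshot):
        # Exclusive create ("x") stands in for the atomic swap a real catalog
        # would perform: if another writer already published this version,
        # FileExistsError is raised and nothing is overwritten.
        with open(self._snapshot_path(base_version + 1), "x") as f:
            json.dump(snapshot, f)
        return base_version + 1

    def append(self, rows):
        base = self.current_version()
        snapshot = self.load(base)
        name = f"data-{uuid.uuid4().hex}.json"  # new immutable data file
        with open(os.path.join(self.root, name), "w") as f:
            json.dump(rows, f)
        snapshot["files"].append(name)
        return self.commit(base, snapshot)

    def add_column(self, column):
        # Metadata-only change: no existing data file is touched.
        base = self.current_version()
        snapshot = self.load(base)
        snapshot["schema"].append(column)
        return self.commit(base, snapshot)

    def scan(self, version=None):
        snapshot = self.load(version)
        rows = []
        for name in snapshot["files"]:
            with open(os.path.join(self.root, name)) as f:
                for row in json.load(f):
                    # Project onto the snapshot's schema; columns missing from
                    # older files read as None.
                    rows.append({c: row.get(c) for c in snapshot["schema"]})
        return rows


table = ToyTable(tempfile.mkdtemp())
table.add_column("id")
table.append([{"id": 1}, {"id": 2}])
table.add_column("region")
table.append([{"id": 3, "region": "eu"}])
rows = table.scan()
```

After these four commits, `rows` contains the two older records with `region` set to `None` followed by the newer record, and `table.scan(version=2)` still returns the table as it looked before the column was added. A writer that tries to commit on top of version 3 after version 4 exists gets a `FileExistsError`, which is the toy equivalent of an optimistic-concurrency conflict. A production system would also need atomic publication of the snapshot contents, conflict retry, and cleanup of orphaned data files, none of which is attempted here.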