Abstract: Relational Database Management Systems designed for Online Analytical Processing (RDBMS-OLAP) have been foundational to democratizing data and enabling analytical use cases such as business intelligence and reporting for many years. However, RDBMS-OLAP systems present some well-known challenges. They are optimized primarily for relational workloads, they lead to a proliferation of data copies that can become unmanageable, and because the data is stored in proprietary formats, they can cause vendor lock-in, restricting access to engines, tools, and capabilities beyond what the vendor offers. As the demand for data-driven decision making surges, the need for a more robust data architecture to address these challenges becomes ever more critical. Cloud data lakes have addressed some of the shortcomings of RDBMS-OLAP systems, but they present their own set of challenges. More recently, organizations have often followed a two-tier architectural approach, leveraging both cloud data lakes and RDBMS-OLAP systems to take advantage of both platforms. However, this approach brings additional challenges, complexities, and overhead. This paper discusses how a data lakehouse, a new architectural approach, achieves the same benefits as an RDBMS-OLAP and a cloud data lake combined, while also providing additional advantages. We take today's data warehousing and break it down into implementation-independent components, capabilities, and practices. We then take these aspects and show how a lakehouse architecture satisfies them. Then, we go a step further and discuss what additional capabilities and benefits a lakehouse architecture provides over an RDBMS-OLAP.

What problem does this paper attempt to address?

This paper addresses the challenges that existing data architectures face in large-scale data analysis. Specifically:

1. **Limitations of Relational Database Management Systems (RDBMS-OLAP)**:
   - **Only Optimized for Relational Workloads**: RDBMS-OLAP systems are mainly optimized for structured data and have difficulty handling semi-structured or unstructured data, which limits their use in advanced analysis tasks such as machine learning.
   - **Proliferation of Data Copies**: To meet different analysis requirements, enterprises usually need to create multiple copies of the data, which increases management complexity and may lead to data inconsistency.
   - **Vendor Lock-in**: Because data is stored in a proprietary format, it is difficult for enterprises to switch to other data processing tools or engines.
2. **Limitations of Cloud Data Lakes**:
   - **Lack of Transaction Support**: Traditional data lakes lack the transaction support (ACID properties) that RDBMS-OLAP systems provide, so they cannot guarantee data consistency and reliability.
   - **Data Governance and Quality**: Data in the data lake usually lacks effective governance and quality control mechanisms, resulting in unreliable data.
3. **Complexity of the Two-Tier Architecture**:
   - **Additional Complexity and Overhead**: To combine the advantages of RDBMS-OLAP and data lakes, many enterprises have adopted a two-tier architecture, but this brings additional complexity and management overhead, such as the need to maintain ETL pipelines and to manage the resulting data copies.

To solve these problems, the paper proposes a new data architecture, the Data Lakehouse. The Data Lakehouse aims to combine the advantages of RDBMS-OLAP and data lakes while overcoming their disadvantages. Specific goals include:

- **Provide Transaction Support**: The Data Lakehouse supports ACID transactions to ensure data consistency and reliability.
- **Open Data Format**: Data is stored in an open format (such as Apache Parquet or ORC), allowing different engines to access the same data set and avoiding vendor lock-in.
- **Reduce Data Copies**: Engines access the source data directly, which reduces unnecessary data copies.
- **Data Governance and Quality**: Mature data governance and quality control practices are adopted to ensure data security and compliance.
- **Flexible Schema Management**: Schema evolution is supported without having to rewrite the entire table.
- **High Scalability**: Storage and compute are separated, using low-cost cloud object storage and independently scalable computing resources to support various analysis workloads.

Through these improvements, the Data Lakehouse aims to provide a more efficient, more flexible, and more reliable data architecture to meet the needs of modern enterprises for data-driven decision-making.
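Two of the goals above, transaction support and schema evolution without rewriting the table, can be illustrated with a toy model. The sketch below is not the paper's implementation and does not use any real table-format API. It assumes a design loosely modeled on log-based table formats: data files are immutable, and a table version is just a metadata snapshot listing a schema and a set of files. All names (`ToyTable`, `Snapshot`, `CommitConflict`, the in-memory `files` dict standing in for object storage) are hypothetical.

```python
from dataclasses import dataclass, field


class CommitConflict(Exception):
    """Raised when another writer has committed since the caller's base snapshot."""


@dataclass(frozen=True)
class Snapshot:
    version: int
    schema: tuple      # column names, in order
    data_files: tuple  # names of the immutable data files in this version


@dataclass
class ToyTable:
    # Hypothetical in-memory stand-ins for object storage and a metadata log.
    files: dict = field(default_factory=dict)  # file name -> list of row dicts
    log: list = field(default_factory=list)    # committed snapshots, oldest first

    def __post_init__(self):
        if not self.log:
            self.log.append(Snapshot(0, ("id",), ()))

    def current(self) -> Snapshot:
        return self.log[-1]

    def _commit(self, base, schema, data_files) -> Snapshot:
        # Optimistic concurrency: the commit succeeds only if `base` is still current.
        if base.version != self.current().version:
            raise CommitConflict(f"base version {base.version} is stale")
        snap = Snapshot(base.version + 1, tuple(schema), tuple(data_files))
        self.log.append(snap)  # the single append is the atomic step
        return snap

    def append_rows(self, base, rows) -> Snapshot:
        name = f"data-{len(self.files):05d}.rows"
        self.files[name] = list(rows)  # written first; invisible until committed
        return self._commit(base, base.schema, base.data_files + (name,))

    def add_column(self, base, column) -> Snapshot:
        # Metadata-only change: no existing data file is touched or rewritten.
        return self._commit(base, base.schema + (column,), base.data_files)

    def scan(self, snapshot=None):
        snap = self.current() if snapshot is None else snapshot
        for name in snap.data_files:
            for row in self.files[name]:
                # Older files lack newer columns; readers fill them with None.
                yield {col: row.get(col) for col in snap.schema}


table = ToyTable()
v1 = table.append_rows(table.current(), [{"id": 1}, {"id": 2}])
v2 = table.add_column(v1, "region")
v3 = table.append_rows(v2, [{"id": 3, "region": "EU"}])
rows_now = list(table.scan())     # all rows, read with the evolved schema
rows_at_v1 = list(table.scan(v1)) # the table as it looked at version 1
```

In this sketch a writer that starts from a stale snapshot gets a `CommitConflict` and its data file is simply never referenced, so readers see either the whole write or none of it. Because any reader that understands the file and snapshot layout can scan the table, the same design choice also hints at why open formats let multiple engines share one copy of the data.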

The Data Lakehouse: Data Warehousing and More
