HTAP Databases: A Survey

Chao Zhang, Guoliang Li, Jintao Zhang, Xinning Zhang, Jianhua Feng
DOI: https://doi.org/10.1109/TKDE.2024.3389693
2024-04-24
Abstract: Since Gartner coined the term, Hybrid Transactional and Analytical Processing (HTAP), numerous HTAP databases have been proposed to combine transactions with analytics in order to enable real-time data analytics for various data-intensive applications. HTAP databases typically process the mixed workloads of transactions and analytical queries in a unified system by leveraging both a row store and a column store. As there are different storage architectures and processing techniques to satisfy various requirements of diverse applications, it is critical to summarize the pros and cons of these key techniques. This paper offers a comprehensive survey of HTAP databases. We mainly classify state-of-the-art HTAP databases according to four storage architectures: (a) Primary Row Store and In-Memory Column Store; (b) Distributed Row Store and Column Store Replica; (c) Primary Row Store and Distributed In-Memory Column Store; and (d) Primary Column Store and Delta Row Store. We then review the key techniques in HTAP databases, including hybrid workload processing, data organization, data synchronization, query optimization, and resource scheduling. We also discuss existing HTAP benchmarks. Finally, we provide the research challenges and opportunities for HTAP techniques.
Databases
What problem does this paper attempt to address?
This paper is a comprehensive survey of Hybrid Transactional/Analytical Processing (HTAP) databases. HTAP combines Online Transaction Processing (OLTP) and Online Analytical Processing (OLAP) in a single system, so that analytics can run on fresh transactional data without a separate ETL process. As data-intensive applications in areas such as finance, e-commerce, and fraud detection grow, HTAP is becoming increasingly important.

HTAP databases typically pair a row store with a column store to serve the two kinds of workload. Individual systems, however, make different storage and processing choices depending on how they prioritize availability, scalability, performance, and data freshness. The paper classifies existing HTAP databases into four storage architectures and reviews five key techniques: hybrid workload processing, data organization, data synchronization, query optimization, and resource scheduling. It also discusses existing HTAP benchmarks.

Overall, the paper addresses how to understand and compare the architectures and techniques of HTAP databases, including their advantages and disadvantages for different application requirements. It closes with research challenges and opportunities aimed at improving data processing efficiency and real-time analytical capabilities.
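
As a rough illustration of the row store plus column store idea, the following Python sketch keeps a primary row store for transactional writes and a column-oriented copy for analytical scans, with an explicit synchronization step between them. It is not taken from the paper: the class `ToyHTAPTable` and its methods `upsert`, `sync`, and `column_sum` are hypothetical names chosen for this example, and a real HTAP database would add concurrency control, durability, compression, and a query optimizer.

```python
from dataclasses import dataclass, field


@dataclass
class ToyHTAPTable:
    """Toy table (illustrative only): row store + column store fed by a delta log."""
    columns: tuple
    rows: dict = field(default_factory=dict)      # row store: key -> row tuple
    delta: list = field(default_factory=list)     # changes not yet merged
    col_keys: list = field(default_factory=list)  # column store: key per position
    col_data: dict = field(default_factory=dict)  # column store: name -> values

    def __post_init__(self):
        # Create one empty value list per column.
        for name in self.columns:
            self.col_data.setdefault(name, [])

    def upsert(self, key, row):
        """Transactional write: update the row store and log the change."""
        self.rows[key] = tuple(row)
        self.delta.append((key, tuple(row)))

    def sync(self):
        """Data synchronization: merge logged changes into the column store."""
        position = {k: i for i, k in enumerate(self.col_keys)}
        for key, row in self.delta:
            if key in position:
                # Existing key: overwrite the value in each column.
                i = position[key]
                for name, value in zip(self.columns, row):
                    self.col_data[name][i] = value
            else:
                # New key: append one value to each column.
                position[key] = len(self.col_keys)
                self.col_keys.append(key)
                for name, value in zip(self.columns, row):
                    self.col_data[name].append(value)
        self.delta.clear()

    def column_sum(self, name, fresh=False):
        """Analytical query: scan a single column; fresh=True syncs first."""
        if fresh:
            self.sync()
        return sum(self.col_data[name])


table = ToyHTAPTable(columns=("account", "amount"))
table.upsert(1, ("alice", 100))
table.upsert(2, ("bob", 50))
table.sync()
table.upsert(1, ("alice", 70))                       # not yet merged
stale = table.column_sum("amount")                   # reads the older column copy
fresh = table.column_sum("amount", fresh=True)       # merges the delta, then scans
```

The `fresh` flag mirrors the trade-off the survey highlights: scanning the column store without synchronizing is cheaper but may return stale results, while synchronizing first gives fresh results at extra cost.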