Taming the Big Data Monster: Managing Petabytes of Data with Multi-Model Databases.
Yang Chen, Feng Zhang, Yinhao Hong, Yunpeng Chai, Wei Lu, Hong Chen, Xiaoyong Du, Peipei Wang, Le Mi, Jintao Li, Xilin Tang, Yanliang Zhou, Wei Zhou, Peng Zhang, Fengyi Chen, Pengfei Li, Yu Li
DOI: https://doi.org/10.1109/sbac-pad55451.2022.00039
2022-01-01
Abstract: With the development of big data technology, the amount of business data that Internet companies need to handle has reached the petabyte level, which puts great pressure on system processing capacity. For example, the peak order volume of Alibaba's Global Shopping Festival in 2020 reached 583,000 orders per second. Even worse, real business workloads involve multi-model data. The inability to perform high-throughput, low-latency transaction processing can result in a poor user experience, which in turn can lead to serious financial losses due to customer churn. Although numerous optimizations have been proposed, they can fail, or become significantly less effective, in the face of petabytes of data. In this paper, we propose a novel and practical multi-model big data system that can manage petabytes of data. In particular, we present three designs for processing petabyte-scale data. First, we partition the data to reduce the amount of unnecessary data scanned. Second, to improve system efficiency, we adaptively adopt row storage for big tables that are frequently updated and column storage for tables that are frequently queried. Third, we apply compression to speed up I/O access. We analyze two of Alibaba's real PB-level business scenarios, Double 11 and Zhixingtong, and generate workloads and a benchmark accordingly to verify our system. Experiments show that our system can efficiently manage petabyte-scale data in real scenarios, provides high-performance querying of terabyte-scale datasets, and is suitable for various workloads.
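
The first two designs named in the abstract can be illustrated with a minimal sketch. This is not the paper's implementation: the Partition class, the prune and choose_layout functions, and the update-to-query ratio threshold are all hypothetical names and heuristics chosen for illustration. The sketch assumes range partitioning on an integer key and a simple workload ratio for picking row or column storage.

```python
from dataclasses import dataclass


@dataclass
class Partition:
    """Hypothetical range partition covering keys in [lo, hi)."""
    lo: int
    hi: int
    rows: int


def prune(partitions, q_lo, q_hi):
    """Keep only partitions whose key range overlaps the query range [q_lo, q_hi).

    Partitions that cannot contain matching keys are skipped, so they
    are never scanned.
    """
    return [p for p in partitions if p.lo < q_hi and q_lo < p.hi]


def choose_layout(updates, queries, threshold=1.0):
    """Assumed heuristic: pick row storage for update-heavy tables and
    column storage for query-heavy tables, based on the update/query ratio.
    """
    if queries == 0:
        return "row"
    return "row" if updates / queries > threshold else "column"


partitions = [
    Partition(0, 100, 10),
    Partition(100, 200, 10),
    Partition(200, 300, 10),
]
selected = prune(partitions, 150, 160)
rows_scanned = sum(p.rows for p in selected)
```

With the three partitions above, a query on keys 150 to 160 touches only the middle partition, so a third of the rows are scanned. A real system would derive the layout decision from observed workload statistics rather than a fixed threshold.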