Separation is for Better Reunion: Data Lake Storage at Huawei
Xin Tang, Chengliang Chai, Dawei Zhao, Haohai Ma, Yong Zheng, Zhenyong Fan, Xin Wu, Jiaquan Zhang, Rui Zhang, Duanshun Li, Yi He, Keji Huang, Guangbin Meng, Yidong Wang, Yuefeng Zhou, Tao, Lirong Jian, Jiwu Shu, Yuping Wang, Ye Yuan, Guoren Wang, Guoliang Li
DOI: https://doi.org/10.1109/icde60146.2024.00386
2024-01-01
Abstract: Huawei collaborates with several large Chinese enterprises to store and process exabytes of nationwide operational data in data lake storage to provide business insights. Specifically, our customers ask to store and process massive log message data to support their real-time and decision-making applications. Thus, the analytic platform needs computation and storage components that process and store these data cost-efficiently. To meet these requirements, we have designed StreamLake, a data lake storage system that introduces a novel design to serve log message streaming and batch data processing in distributed storage with high scalability, efficiency, and reliability at low cost. Specifically, we introduce a stream (storage) object as a storage abstraction for message streaming data to achieve a storage-disaggregated architecture with high scalability and reliability. Moreover, we use erasure coding and tiered storage to reduce storage cost; furthermore, a stream object can be automatically converted to a table object so that cost-effective stream and batch data processing can be achieved. For tabular data, we implement lakehouse functionality to support ACID transactions via the table object, with metadata acceleration to improve the efficiency of data access between the compute and storage engines. We also design LakeBrain, a storage-side optimizer, to optimize query performance and resource utilization under the storage-disaggregated architecture. Finally, we have deployed StreamLake at China Mobile, the world's largest mobile network operator, to serve over 20 PB of production data, and the results demonstrate improvements of 30% to 4x in query performance and over 37% in cost savings.