Feisu: Fast Query Execution Over Heterogeneous Data Sources On Large-Scale Clusters

An Qin, Yuan Yuan, Dai Tan, Pengyu Sun, Xiang Zhang, Hao Cao, Rubao Lee, Xiaodong Zhang
DOI: https://doi.org/10.1109/ICDE.2017.162
2017-01-01
Abstract: Fast data analytics at an increasingly large scale has become a critical task for any Internet service company. For example, at Baidu, the major search engine company in China, PB-scale volumes of Web and business data are continuously acquired and analyzed in a timely manner for purposes such as evaluating product revenue, tracking product demand in the market, predicting user behavior, upgrading product rankings, and diagnosing spam cases. The response time of analytics queries not only affects user experience but also has a serious impact on the productivity of business operations. In this paper, to meet the challenge of fast data analytics, we present Feisu (meaning "fast" in Chinese), a data integration system over heterogeneous storage systems, which has been widely used in Baidu's critical daily business analytics applications following our R&D efforts. Feisu is designed and implemented to work with several heterogeneous storage systems and to exploit the query similarity embedded in complex query workloads. Our experiments using real-world workloads show that Feisu can significantly improve query performance at Baidu. Feisu has been in production use at Baidu for two years, effectively managing dozens of petabytes of data for various applications.