A Study of SQL-on-Hadoop Systems.

Yueguo Chen, Xiongpai Qin, Haoqiong Bian, Jun Chen, Zhaoan Dong, Xiaoyong Du, Yanjie Gao, Dehai Liu, Jiaheng Lu, Huijie Zhang
DOI: https://doi.org/10.1007/978-3-319-13021-7_12
2014-01-01
Abstract: Hadoop is now the de facto standard for storing and processing big data, not only unstructured data but also some structured data. As a result, providing SQL analysis functionality over the big data residing in HDFS is becoming increasingly important. Hive is a pioneering system that supports SQL-like analysis of the data in HDFS. However, the performance of Hive is not satisfactory for many applications. This has led to the rapid emergence of dozens of SQL-on-Hadoop systems that try to support interactive SQL query processing over the data stored in HDFS. This paper first gives a brief technical review of recent efforts on SQL-on-Hadoop systems. We then test and compare the performance of five representative SQL-on-Hadoop systems, based on queries selected or derived from the TPC-DS benchmark. Based on the results, we show that such systems could benefit further from applying many of the parallel query processing techniques that have been widely studied in traditional MPP analytical databases.