KCGS-Store: A Columnar Storage Based on Group Sorting of Key Columns

Tao Xu, Dongsheng Wang
DOI: https://doi.org/10.1109/cloud.2016.0041
2016-01-01
Abstract: For reasons of capacity and cost, disks are currently the main storage medium for massive data. However, disk I/O bandwidth lags far behind the growth rate of data and has thus become the performance bottleneck of big data management systems. Optimizing the storage structure to improve read and write efficiency is therefore an important challenge in the age of big data. In this paper, we present KCGS-Store, a columnar storage structure based on group sorting of the key columns. In KCGS-Store, each key column is divided into several groups using a partitioning function. According to the groups of all key columns, the table is split into a number of sub-tables, in each of which the key columns of all records fall within the same ranges of values. We design a data structure named pool to hold a sub-table and store its data by columns. For each key column, all pools belonging to the same group are combined and arranged in the order of their value ranges. In this way, irrelevant column values can be effectively filtered out when executing SQL commands, which reduces the amount of data read and consequently improves query performance. Meanwhile, using the pool matrix, we can reorganize the records with little overhead in time and storage space. The evaluation results show that, compared with ORCFile and Parquet, KCGS-Store is superior in many aspects, including storage space, data loading, and SQL querying.
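The grouping idea in the abstract can be illustrated with a minimal in-memory sketch. It is not the authors' implementation: the abstract does not specify the partitioning function, the on-disk layout, or how the pool matrix is built, so the sketch assumes simple range partitioning with hand-picked boundaries and keeps the pools in a dictionary keyed by the tuple of group ids. The names `group_of`, `build_pools`, and `scan`, and the sample table, are hypothetical.

```python
from bisect import bisect_right
from collections import defaultdict


def group_of(value, boundaries):
    """Assumed partitioning function: index of the value range containing `value`."""
    return bisect_right(boundaries, value)


def build_pools(rows, key_columns, boundaries):
    """Split rows into pools, one per combination of key-column groups.

    Each pool holds its sub-table column by column (column name -> list of values).
    """
    pools = defaultdict(lambda: defaultdict(list))
    for row in rows:
        pool_id = tuple(group_of(row[k], boundaries[k]) for k in key_columns)
        for name, value in row.items():
            pools[pool_id][name].append(value)
    return pools


def scan(pools, key_columns, boundaries, column, key, low, high):
    """Return `column` values of rows with low <= row[key] <= high.

    Pools whose group on `key` cannot overlap [low, high] are skipped unread.
    """
    axis = key_columns.index(key)
    first = group_of(low, boundaries[key])
    last = group_of(high, boundaries[key])
    out = []
    for pool_id in sorted(pools):  # visit pools in order of their value ranges
        if not first <= pool_id[axis] <= last:
            continue  # irrelevant pool: filtered out without touching its columns
        pool = pools[pool_id]
        out.extend(v for k, v in zip(pool[key], pool[column]) if low <= k <= high)
    return out


# Hypothetical sample table with two key columns.
rows = [
    {"user": 3, "day": 12, "amount": 10},
    {"user": 15, "day": 4, "amount": 20},
    {"user": 8, "day": 25, "amount": 30},
    {"user": 22, "day": 18, "amount": 40},
]
key_columns = ["user", "day"]
boundaries = {"user": [10, 20], "day": [10, 20]}  # three groups per key column

pools = build_pools(rows, key_columns, boundaries)
low_users = scan(pools, key_columns, boundaries, "amount", "user", 0, 9)
mid_days = scan(pools, key_columns, boundaries, "amount", "day", 10, 19)
print(low_users, mid_days)
```

In this sketch a range predicate on either key column touches only the pools whose group overlaps the requested range, which is the filtering effect the abstract describes. A real store would additionally persist each pool's columns contiguously on disk.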