A multi-table connection strategy based on Hadoop

Jian XU, Qun CHEN, Zhuo WANG, Zhan-huai LI
DOI: https://doi.org/10.3969/j.issn.1004-373X.2014.06.029
2014-01-01
Abstract: When Hadoop is used for multi-table connection (join) operations, a large number of intermediate results are written to local disks, which makes the system very inefficient. To solve this problem, a "Replace-Query" method is proposed. Indexes are built for the joined tables, and the tuples that would otherwise be output are replaced by index information, which is written to the intermediate results instead. This lowers the I/O cost of the intermediate results considerably. To improve system performance, the method uses a buffer pool, secondary sort, and multithreading to optimize index management. The indexes participate in the entire multi-table join process, and the full records can be recovered quickly and completely by querying the indexes. An experiment on the TPC-H data set was designed to compare the method with the original Hadoop. The results show that the method reduces space consumption by 35.5% and improves running efficiency by 12.9%.
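The replace-then-query idea in the abstract can be illustrated with a toy, in-memory sketch. This is not the authors' Hadoop implementation: the function names (`build_index`, `join_as_refs`, `recover`), the use of row positions as index entries, and the sample tables are all assumptions made for illustration, and the buffer pool, secondary sort, and multithreading are left out.

```python
from collections import defaultdict


def build_index(table):
    """Hypothetical index: map each row position to its full tuple."""
    return {row_id: row for row_id, row in enumerate(table)}


def join_as_refs(left, right, left_key, right_key):
    """Equi-join two tables, emitting only (left_id, right_id) pairs.

    The pairs stand in for the compact index information that replaces
    full tuples in the intermediate results.
    """
    right_ids_by_key = defaultdict(list)
    for right_id, row in enumerate(right):
        right_ids_by_key[row[right_key]].append(right_id)

    refs = []
    for left_id, row in enumerate(left):
        for right_id in right_ids_by_key.get(row[left_key], []):
            refs.append((left_id, right_id))
    return refs


def recover(refs, left_index, right_index):
    """Query the indexes to rebuild the full joined records."""
    return [left_index[l] + right_index[r] for l, r in refs]


# Made-up sample data, joined on the first column of each table.
orders = [(1, "alice"), (2, "bob")]
items = [(1, "pen"), (1, "ink"), (2, "cup")]

refs = join_as_refs(orders, items, left_key=0, right_key=0)
records = recover(refs, build_index(orders), build_index(items))
```

In this sketch the intermediate result `refs` holds only pairs of small integers, while the full records are materialized once at the end by looking them up in the indexes.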