MapReduce model-based optimization of range queries
Hui Zhao, Shuqiang Yang, Zhikun Chen, Songchang Jin, Hong Yin, Long Li
DOI: https://doi.org/10.1109/FSKD.2012.6234050
2012-01-01
Abstract: In recent years, the MapReduce parallel computing model has gained considerable attention from industry and academia. At Google, Yahoo, Facebook, and other companies it has proven very effective, greatly simplifying the design of large-scale data-intensive applications. MapReduce-based systems were originally used to manage massive unstructured and semi-structured data, for example to generate inverted indexes, calculate web page rank, and analyze logs. As a result, current MapReduce systems give little consideration to optimizing structured data processing: they process the whole dataset by brute-force scanning, which is ill-suited to the common workflow of structured data processing, namely range query and analysis. To address this problem, this paper proposes building a global B-tree-like index on top of the Hadoop Distributed File System (HDFS) for structured data, and using the global index to eliminate unnecessary map tasks during range queries. This reduces the overhead of data I/O and task scheduling, which not only shortens query response time but also greatly improves system resource utilization.
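The pruning idea can be illustrated with a minimal sketch. This is not the paper's implementation: it assumes the data is range-partitioned on the indexed key so that each input split covers a disjoint key interval, and it uses a sorted array with binary search as a stand-in for the B-tree-like global index. The names `SplitEntry`, `GlobalRangeIndex`, and `splits_for_range` are hypothetical and chosen only for illustration.

```python
from bisect import bisect_left, bisect_right
from dataclasses import dataclass


@dataclass(frozen=True)
class SplitEntry:
    """Key interval [min_key, max_key] covered by one input split (hypothetical)."""
    min_key: int
    max_key: int
    split_id: str


class GlobalRangeIndex:
    """Sorted-array stand-in for a global B-tree-like index over splits.

    Assumption: key intervals of different splits do not overlap, so sorting
    by min_key also leaves max_key in ascending order.
    """

    def __init__(self, entries):
        self.entries = sorted(entries, key=lambda e: e.min_key)
        self.min_keys = [e.min_key for e in self.entries]
        self.max_keys = [e.max_key for e in self.entries]

    def splits_for_range(self, low, high):
        """Return ids of the splits whose interval intersects [low, high]."""
        if low > high:
            return []
        # First split that ends at or after `low`.
        start = bisect_left(self.max_keys, low)
        # One past the last split that begins at or before `high`.
        end = bisect_right(self.min_keys, high)
        return [e.split_id for e in self.entries[start:end]]
```

In this sketch, a job submitter would call `splits_for_range(low, high)` and schedule map tasks only for the returned splits, instead of launching one map task per split of the whole dataset. With four splits covering keys 0-99, 100-199, 200-299, and 300-399, a query for keys 150 to 250 would select only the second and third splits.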