Efficient and Flexible Index Access in MapReduce.

Zhao Cao, Shimin Chen, Dongzhe Ma, Jianhua Feng, Min Wang
DOI: https://doi.org/10.5441/002/edbt.2014.07
2014-01-01
Abstract: A popular programming paradigm in the cloud, MapReduce is extensively considered and used for "big data" analysis. Unfortunately, a great many "big data" applications require capabilities beyond those originally intended by MapReduce, often burdening developers to write unnatural non-obvious MapReduce programs so as to twist the underlying system to meet the requirements. In this paper, we focus on a class of "big data" applications that in addition to MapReduce's main data source, require selective access to one or many data sources, e.g., various kinds of indices, knowledge bases, external cloud services. We propose to extend MapReduce with EFind, an Efficient and Flexible index access solution, to better support this class of applications. EFind introduces a standard index access interface to MapReduce so that (i) developers can easily and flexibly express index access operations without unnatural code, and (ii) the EFind enhanced MapReduce system can automatically optimize the index access operations. We propose and analyze a number of index access strategies that utilize caching, re-partitioning, and index locality to reduce redundant index accesses. EFind collects index statistics and performs cost-based adaptive optimization to improve index access efficiency. Our experimental results, using both real-world and synthetic data sets, show that EFind chooses execution plans that are optimal or close to optimal, and achieves a factor of 2x-8x improvements compared to an approach that accesses indices without optimization.
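To make the caching strategy mentioned in the abstract concrete, here is a minimal sketch in Python of a map-side lookup cache that avoids repeating an index access for a key it has already seen. It is an illustration under stated assumptions, not EFind's actual interface: the names `CachedIndexLookup`, `map_with_index`, and `toy_index`, the LRU eviction policy, and the in-memory dictionary standing in for a remote index are all hypothetical.

```python
from collections import OrderedDict


class CachedIndexLookup:
    """Hypothetical wrapper (not EFind's API): an LRU cache in front of an index lookup."""

    def __init__(self, lookup_fn, capacity=1024):
        self.lookup_fn = lookup_fn      # function that performs the real index access
        self.capacity = capacity        # maximum number of cached keys
        self.cache = OrderedDict()      # key -> value, ordered by recency of use
        self.remote_calls = 0           # how many times the real index was accessed

    def get(self, key):
        if key in self.cache:
            # Cache hit: refresh recency and skip the index access.
            self.cache.move_to_end(key)
            return self.cache[key]
        # Cache miss: access the index once and remember the result.
        self.remote_calls += 1
        value = self.lookup_fn(key)
        self.cache[key] = value
        if len(self.cache) > self.capacity:
            self.cache.popitem(last=False)  # evict the least recently used key
        return value


def map_with_index(records, index):
    """Toy map function: attach the index value for each record's key."""
    for key, payload in records:
        yield key, (payload, index.get(key))


# Assumed toy data: a dictionary stands in for an external index or service.
toy_index = {"u1": "Beijing", "u2": "Shanghai"}
lookup = CachedIndexLookup(toy_index.get, capacity=2)
records = [("u1", "click"), ("u2", "view"), ("u1", "buy"), ("u1", "view")]
results = list(map_with_index(records, lookup))
```

In this toy run, four input records trigger only two real index accesses, because the repeated key is served from the cache. The abstract also names re-partitioning and index locality as strategies; those are not covered by this sketch.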