Super Rack: Reusing the Results of Queries in MapReduce Systems

Zhanye Wang, Tao Xu, Dongsheng Wang
DOI: https://doi.org/10.1109/uic-atc-scalcom-cbdcom-iop.2015.51
2015-01-01
Abstract: Over the last few years, Apache MapReduce has become a prevailing framework for large-scale data processing. Rather than writing MapReduce programs directly, which are cumbersome for expressing complex logic, many developers use high-level query languages such as Hive or Pig Latin for their complex queries. These languages automatically compile each query into a workflow of MapReduce jobs, which greatly facilitates querying and managing large datasets residing in a distributed environment. One way to speed up the execution of workflows is to save previously produced results and reuse them when they are needed again. In this paper we present Super Rack. With the assistance of shared storage devices, each workflow can store its results in Super Rack, so that incoming queries can reuse them to avoid redundant computation and speed up execution. We propose several novel techniques to improve the access and storage efficiency of the previous results. We also evaluate Super Rack to demonstrate its feasibility and effectiveness. The experimental results show that our solution significantly outperforms Hive on the TPC-H benchmark and on real-life workloads.
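The general idea of result reuse described in the abstract can be illustrated with a minimal sketch. This is not Super Rack's actual design, which the abstract does not detail. It only shows the basic pattern of storing a query's result under a fingerprint and returning it when the same query arrives again. The names `ResultStore`, `fingerprint`, and `run` are hypothetical, the in-memory dictionary stands in for shared storage, and the whitespace-only normalization is a deliberately naive assumption.

```python
import hashlib


class ResultStore:
    """Hypothetical in-memory stand-in for a shared result store."""

    def __init__(self):
        self._results = {}  # fingerprint -> stored result
        self.hits = 0
        self.misses = 0

    @staticmethod
    def fingerprint(query: str) -> str:
        # Naive assumption: collapse runs of whitespace so that queries
        # differing only in spacing share a key. A real system would
        # compare query plans rather than query text.
        normalized = " ".join(query.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def run(self, query, execute):
        """Return a stored result if present; otherwise execute and store."""
        key = self.fingerprint(query)
        if key in self._results:
            self.hits += 1
            return self._results[key]
        self.misses += 1
        result = execute(query)  # e.g. launch the compiled MapReduce workflow
        self._results[key] = result
        return result
```

A caller would pass the function that actually executes the workflow as `execute`. Only the first occurrence of a query pays the execution cost, and later occurrences are served from the store.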