TS-Hadoop: Handling Access Skew in MapReduce by Using Tiered Storage Infrastructure

Zhanye Wang, Jing Li, Tao Xu, Yu Gu, Dongsheng Wang
DOI: https://doi.org/10.1109/ictc.2014.6983331
2014-01-01
Abstract: Over the last few years, MapReduce systems have become popular for processing large-scale data sets and are increasingly used in web indexing, data mining, and machine learning. Unlike simple application scenarios such as word count, many MapReduce applications exhibit strongly skewed access patterns in real production environments: data access is non-uniform, and often only a small portion of the data is accessed far more frequently than the rest. Handling this hot data efficiently is therefore critical to the overall performance of the MapReduce computation. In this paper, we present TS-Hadoop, a MapReduce system based on Apache Hadoop. The most significant feature of TS-Hadoop is that it uses a tiered storage infrastructure: besides HDFS, TS-Hadoop also has a shared-disk cluster called HCache, which guarantees that the data in HCache can be processed in a highly parallel way. TS-Hadoop automatically distinguishes hot and cold data based on the current workload and moves them into HCache and HDFS respectively, so that the hot data in HCache can be processed efficiently. Experiments show that the average execution time of MapReduce jobs in TS-Hadoop is much shorter than on a traditional Hadoop platform under access-skewed workloads.
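The abstract does not say how TS-Hadoop decides which data is hot, so the following is only an illustrative sketch of the general idea of workload-driven tiering, not the authors' method. It assumes a simple policy: count accesses per block in a recent access log and treat the most frequently accessed fraction as hot. The function names, the `hot_fraction` parameter, and the block identifiers are all hypothetical.

```python
from collections import Counter


def classify_blocks(access_log, hot_fraction=0.2):
    """Split block IDs into (hot, cold) sets by access frequency.

    Assumption: "hot" means the top `hot_fraction` of distinct blocks
    ranked by access count. The real TS-Hadoop policy may differ.
    """
    counts = Counter(access_log)
    # most_common() returns blocks sorted by descending access count.
    ranked = [block for block, _ in counts.most_common()]
    if not ranked:
        return set(), set()
    # Always keep at least one hot block when any block was accessed.
    n_hot = max(1, int(len(ranked) * hot_fraction))
    return set(ranked[:n_hot]), set(ranked[n_hot:])


def plan_placement(access_log, hot_fraction=0.2):
    """Map each block to a target tier: 'HCache' for hot, 'HDFS' for cold."""
    hot, cold = classify_blocks(access_log, hot_fraction)
    placement = {block: "HCache" for block in hot}
    placement.update({block: "HDFS" for block in cold})
    return placement


if __name__ == "__main__":
    # Skewed workload: block "a" dominates the accesses.
    log = ["a"] * 10 + ["b"] * 2 + ["c", "d", "e"]
    for block, tier in sorted(plan_placement(log).items()):
        print(block, "->", tier)
```

A frequency threshold over a sliding window, or a decayed access score, would be equally plausible policies; the fixed top fraction is chosen here only because it is the shortest to state.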