IC-Data: Improving Compressed Data Processing in Hadoop.

Adnan Haider, Xi Yang, Ning Liu, Xian-He Sun, Shuibing He
DOI: https://doi.org/10.1109/HiPC.2015.28
2015-01-01
Abstract: As dataset sizes for data analytic applications and scientific applications running on Hadoop increase, data compression has become essential for storing this data at a reasonable cost. Although data is often stored compressed, Hadoop currently takes 49% longer to process compressed data than uncompressed data. Processing compressed data reduces task parallelism and creates uneven workload distribution, both of which are fundamental issues that the MapReduce parallel programming paradigm should alleviate. In this paper, we propose the design and implementation of a Network Overlapped Compression scheme, NOC, and a Compression Aware Storage scheme, CAS. NOC reduces data load time and hides compression overhead by interleaving network I/O with compression. CAS increases parallelism by dynamically changing a file's block size based on its compression ratio. Additionally, we develop a MapReduce Module that recognizes the characteristics of compressed data to improve resource allocation and load balance. Collectively, NOC, CAS, and the MapReduce Module decrease job execution time by 66% on average and data load time by 31%.
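The abstract does not state how CAS derives a block size from the compression ratio. As a minimal sketch of one plausible reading, the Python below shrinks the block size in proportion to the ratio so that each block of compressed data expands to roughly one default block of uncompressed data. The 128 MiB default, the 1 MiB floor, and the function names `compression_aware_block_size` and `block_count` are assumptions made for illustration; they are not taken from the paper or from Hadoop's API.

```python
import math

# Assumed default block size (128 MiB); an illustrative value, not from the paper.
DEFAULT_BLOCK_BYTES = 128 * 1024 * 1024


def compression_aware_block_size(compression_ratio,
                                 default_block=DEFAULT_BLOCK_BYTES,
                                 min_block=1024 * 1024):
    """Hypothetical CAS-style rule: scale the block size down by the
    compression ratio (uncompressed size / compressed size) so each
    compressed block holds about `default_block` bytes of original data."""
    if compression_ratio <= 0:
        raise ValueError("compression_ratio must be positive")
    # Never grow the block beyond the default if the data did not shrink.
    ratio = max(compression_ratio, 1.0)
    return max(min_block, int(default_block / ratio))


def block_count(file_bytes, block_bytes):
    """Number of blocks (and so potential parallel tasks) a file occupies."""
    return math.ceil(file_bytes / block_bytes)


if __name__ == "__main__":
    compressed_file = 1024 ** 3   # 1 GiB on disk
    ratio = 4.0                   # i.e. 4 GiB before compression
    fixed = block_count(compressed_file, DEFAULT_BLOCK_BYTES)
    aware = block_count(compressed_file, compression_aware_block_size(ratio))
    print(f"fixed-size blocks: {fixed}, compression-aware blocks: {aware}")
```

Under these assumptions, a file that compresses 4:1 to 1 GiB occupies 8 blocks at the fixed size but 32 blocks with the scaled size, which matches the block count the 4 GiB uncompressed file would have had and so restores the same degree of potential task parallelism.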