Compression-aware I/O performance analysis for big data clustering.

Zhenghua Xue, Geng Shen, Jianhui Li, Qian Xu, Yang Zhang, Jing Shao
DOI: https://doi.org/10.1145/2351316.2351323
2012-01-01
Abstract: As data volumes grow, the I/O bottleneck has become a major challenge for data analysis, and data compression can alleviate it effectively. Taking the K-means algorithm as an example, this paper proposes a compression-aware performance improvement model for big-data clustering. The model quantitatively analyzes the effect of a variety of compression-related factors over the entire computational process. We perform clustering experiments on 10-dimensional data up to 1.114 TB in size on a cluster computer with hundreds of computing cores. The measurements show that compression contributes significantly to improving I/O performance and empirically confirm our theoretical analysis. Furthermore, the proposed model can effectively determine when and how to use compression to improve I/O performance for big-data analysis.
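The abstract does not state the model's equations, so the following is only a minimal sketch of the general "when does compression pay off" reasoning, not the authors' model. It assumes that each pass over the data costs read time plus decompression time, with no overlap between the two. All names and parameters (`read_bw`, `decompress_bw`, `ratio`) are hypothetical and chosen for illustration. Under these assumptions, compression helps when the decompression throughput exceeds `read_bw * ratio / (ratio - 1)`.

```python
def plain_read_time(size_bytes, read_bw):
    """Seconds to read the uncompressed data once at read_bw bytes/s."""
    return size_bytes / read_bw


def compressed_read_time(size_bytes, ratio, read_bw, decompress_bw):
    """Seconds to read and decompress the data once.

    ratio         -- uncompressed size / compressed size (assumed > 1)
    decompress_bw -- decompression throughput in bytes/s of *uncompressed*
                     output (an assumption of this sketch)
    Reading and decompression are assumed not to overlap.
    """
    read = (size_bytes / ratio) / read_bw
    decompress = size_bytes / decompress_bw
    return read + decompress


def io_speedup(size_bytes, ratio, read_bw, decompress_bw):
    """Speedup of compressed over plain I/O; values above 1 favor compression.

    An iterative algorithm such as K-means rereads the data every iteration,
    so the same per-pass ratio applies to the whole run under this sketch.
    """
    plain = plain_read_time(size_bytes, read_bw)
    packed = compressed_read_time(size_bytes, ratio, read_bw, decompress_bw)
    return plain / packed


if __name__ == "__main__":
    # Illustrative numbers only, not measurements from the paper.
    print(io_speedup(1e12, ratio=4.0, read_bw=100e6, decompress_bw=400e6))
    print(io_speedup(1e12, ratio=1.2, read_bw=500e6, decompress_bw=200e6))
```

With slow storage and a good compression ratio the first call reports a speedup above 1, while the second call, with fast storage and a poor ratio, falls below 1, which is the kind of trade-off a compression-aware model has to capture.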