Accelerating Big Data Application by Eliminating Redundancy on Hadoop Cluster
Kelun Lei, Shaokang Du, Xin You, Zhibo Xuan, Haoran Kong, Hailong Yang, Jing Shang, Zhiwen Xiao, Zhihui Wu, Zhongzhi Luan, Depei Qian
DOI: https://doi.org/10.1109/icpads60453.2023.00114
2023-01-01
Abstract: Big data applications, commonly expressed as a series of map-reduce operations, are widely used to mine valuable information from tremendous amounts of industry data. Among the various map-reduce frameworks, Hadoop is the most commonly adopted for large-scale data processing. Although Hadoop eases the development of highly scalable distributed big data applications, inefficient implementations caused by poor coding practices and deep software abstractions can lead to severe performance issues such as unreasonable slowdown, high response latency, and wasted computing resources, which in turn result in unsatisfactory serving delays or significant maintenance costs. In this paper, we first categorize three common types of redundant patterns in big data applications. We then propose a tool-assisted optimization workflow that detects these redundant patterns automatically by profiling the application through sampling of hardware performance monitoring units. Moreover, we present a profiling visualization method that helps pinpoint the redundant code. Based on these approaches, we optimize several big data applications by eliminating redundancies, yielding up to a 14.8% performance improvement.