Improving the Memory Efficiency of In-Memory MapReduce Based HPC Systems.

Cheng Pei,Xuanhua Shi,Hai Jin
DOI: https://doi.org/10.1007/978-3-319-27119-4_12
2015-01-01
Abstract:In-memory cluster computing systems based MapReduce, such as Spark, have made a great impact in addressing all kinds of big data problems. Given the overuse of memory speed, which stems from avoiding the latency caused by disk I/O operations, some process designs may cause resource inefficiency in traditional high performance computing (HPC) systems. Hash-based shuffle, particularly large-scale shuffle, can significantly affect job performance through excessive file operations and unreasonable use of memory. Some intermediate data unnecessarily overflow to the disk when memory usage is unevenly distributed or when memory runs out. Thus, in this study, Write Handle Reusing is proposed to fully utilize memory in shuffle file writing and reading. Load Balancing Optimizer is introduced to ensure the even distribution of data processing across all worker nodes, and Memory-Aware Task Scheduler that coordinates concurrency level and memory usage is also developed to prevent memory spilling. Experimental results on representative workloads demonstrate that the proposed approaches can decrease the overall job execution time and improve memory efficiency.
What problem does this paper attempt to address?