Memory optimization of Spark parallel computing framework

Wang-jian LIAO, Yong-feng HUANG, Cong-kai BAO
DOI: https://doi.org/10.3969/j.issn.1007-130X.2018.04.003
2018-01-01
Abstract: Cluster parallel computing frameworks, represented by Spark, are widely used in big data and cloud computing, and optimizing their performance is key in applications. The paper analyzes the execution process and memory management mechanism of the Spark framework. Combining the characteristics of Spark and JVM memory management, three strategies are proposed: (1) Serialization and compression are used to reduce the size of cached data and the memory it occupies, which in turn reduces GC overhead and improves performance. (2) The runtime memory size is reduced within a certain range, so that recomputation replaces caching, which improves performance. (3) Adjusting the ratio of the JVM's old generation to its new generation, the ratio of Spark's computing space to its cache space, and other memory allocation parameters can improve performance considerably. Experiments show that serialization and compression reduce the cache space by 42%, that performance increases by 21% when the memory allocated at job submission is reduced from 1000 MB to 800 MB, and that optimizing the memory ratios improves performance by 10% to 30%.
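The three strategies correspond to standard Spark and JVM settings, and a minimal sketch of that mapping follows. It is not the authors' configuration: the abstract does not name the Spark version, the properties, or the values used. The helper names `spark_memory_conf` and `unified_regions_mb` are hypothetical, the defaults are illustrative, and the region arithmetic assumes Spark's unified memory manager (Spark 1.6 and later), which reserves 300 MB and splits the remainder by `spark.memory.fraction` and `spark.memory.storageFraction`.

```python
RESERVED_MB = 300  # fixed reservation in Spark's unified memory manager


def spark_memory_conf(executor_mb, memory_fraction=0.6,
                      storage_fraction=0.5, new_ratio=2):
    """Hypothetical helper: Spark properties touching the three strategies."""
    return {
        # (1) Serialization and compression of cached data.
        "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
        "spark.rdd.compress": "true",  # applies to serialized partitions
        # (2) Memory given to each executor at submission time.
        "spark.executor.memory": f"{executor_mb}m",
        # (3) Split between execution and storage, and JVM old:new ratio.
        "spark.memory.fraction": str(memory_fraction),
        "spark.memory.storageFraction": str(storage_fraction),
        "spark.executor.extraJavaOptions": f"-XX:NewRatio={new_ratio}",
    }


def unified_regions_mb(heap_mb, memory_fraction=0.6, storage_fraction=0.5):
    """Approximate region sizes in MB (assumes heap_mb is the usable heap)."""
    unified = (heap_mb - RESERVED_MB) * memory_fraction
    storage = unified * storage_fraction
    return {"unified": unified, "storage": storage,
            "execution": unified - storage}


if __name__ == "__main__":
    for mb in (1000, 800):
        print(mb, spark_memory_conf(mb)["spark.executor.memory"],
              unified_regions_mb(mb))
```

Each key/value pair can be passed to `spark-submit` with `--conf key=value` or set on a `SparkConf` object. In the Scala and Java APIs, serialized caching is requested per RDD with `persist(StorageLevel.MEMORY_ONLY_SER)`. The JVM reports slightly less usable heap than the configured maximum, so the computed regions are upper estimates.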