LCS: an Efficient Data Eviction Strategy for Spark.

Yuanzhen Geng,Xuanhua Shi,Cheng Pei,Hai Jin,Wenbin Jiang
DOI: https://doi.org/10.1007/s10766-016-0470-1
2016-01-01
International Journal of Parallel Programming
Abstract:As an in-memory distributed computing system, Spark is often used to speed up iterative applications. It caches intermediate data generated by previous iterations into memory, so there is no need to repeat the generation when reusing these data later. This sharing mechanism of caching data in memory makes Spark much faster than other systems. When memory used for caching data reaches the capacity limits, data eviction will be performed to supply space for new data, and the evicted data need to be recovered when they are used again. However, classical strategies do not aware of recovery cost, which could cause system performance degradation. This paper shows that the recovery costs have significant difference in Spark, thus a cost aware eviction strategy can obviously reduces the total recovery cost. To this end, a strategy named LCS is proposed, which gets dependencies information between cache data via analyzing application, and calculates the recovery cost during running. By predicting how many times cache data will be reused and using it to weight the recovery cost, LCS always evicts the data which lead to minimum recovery cost in future. Experimental results show that this approach can achieve better performance when memory space is not sufficient, and reduce 30–50% of the total execution time .
What problem does this paper attempt to address?