Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

Chunlin Li, Qianqian Cai, Youlong Luo
DOI: https://doi.org/10.1007/s11227-021-04000-2
IF: 3.3
2021-08-02
The Journal of Supercomputing
Abstract: Data shuffling and cache recovery are both essential parts of the Spark system, and they directly affect Spark's parallel computing performance. Existing dynamic partitioning schemes for the data skew problem in the shuffle phase suffer from poor dynamic adaptability and insufficient granularity. To address these problems, this paper proposes a dynamic balanced partitioning method for the shuffle phase based on reservoir sampling. The method mitigates the impact of data skew on Spark performance by sampling and preprocessing intermediate data, predicting the overall data skew, and deriving the overall partitioning strategy that the application executes. In addition, an inappropriate failure recovery strategy increases recovery overhead and makes data recovery inefficient. To address this issue, the paper also proposes a checkpoint-based fast recovery strategy for the RDD cache. The strategy analyzes the task execution mechanism of the in-memory computing framework and builds a new failure recovery strategy that combines the failure recovery model with weight information, using semantic analysis of the code to obtain detailed information about each task and thereby improve the efficiency of data recovery. Experimental results show that the proposed dynamic balanced partitioning approach effectively reduces the total completion time of the application and improves Spark's parallel computing performance, and that the proposed fast cache recovery strategy speeds up data recovery and raises Spark's computation rate.
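The sampling-then-partitioning idea in the abstract can be illustrated with a minimal Python sketch. It is not the authors' algorithm: it assumes a standard reservoir sampler (Algorithm R) to estimate key frequencies from intermediate data, followed by a simple greedy heuristic that places the heaviest keys on the lightest partition. The function names `reservoir_sample` and `balanced_assignment` and the example records are hypothetical.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, seed=0):
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    rng = random.Random(seed)
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Keep the new item with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def balanced_assignment(key_weights, num_partitions):
    """Greedy heuristic (an assumption, not the paper's method):
    assign the heaviest remaining key to the currently lightest partition."""
    loads = [0] * num_partitions
    assignment = {}
    ordered = sorted(key_weights.items(), key=lambda kv: (-kv[1], str(kv[0])))
    for key, weight in ordered:
        target = min(range(num_partitions), key=lambda p: loads[p])
        assignment[key] = target
        loads[target] += weight
    return assignment, loads


# Hypothetical skewed intermediate data: key "a" dominates.
records = ["a"] * 600 + ["b"] * 200 + ["c"] * 100 + ["d"] * 100
sample = reservoir_sample(records, k=100, seed=42)
estimated = Counter(sample)  # sampled key frequencies approximate the skew
assignment, loads = balanced_assignment(estimated, num_partitions=2)
```

In this sketch the sample stands in for the full shuffle output, so the partition plan can be fixed before all intermediate data has been produced. A real Spark job would wrap such a plan in a custom `Partitioner`, which is beyond what this example shows.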