Abstract: Data shuffling and cache recovery are both essential parts of the Spark system, and both directly affect the performance of Spark parallel computing. Existing dynamic partitioning schemes that address data skew in the shuffle phase suffer from poor dynamic adaptability and insufficiently fine granularity. To address these problems, this paper proposes a dynamic balanced partitioning method for the shuffle phase based on reservoir sampling. The method mitigates the impact of data skew on Spark performance by sampling and preprocessing intermediate data, predicting the overall data skew, and producing the overall partitioning strategy that the application executes. In addition, an inappropriate failure recovery strategy increases recovery overhead and makes the data recovery mechanism inefficient. To address this issue, the paper also proposes a checkpoint-based fast recovery strategy for the RDD cache. The strategy analyzes the task execution mechanism of the in-memory computing framework, obtains detailed task information through semantic analysis of the code, and combines the failure recovery model with weight information to form a new failure recovery strategy that improves the efficiency of data recovery. Experimental results show that the proposed dynamic balanced partitioning method effectively shortens the total completion time of the application and improves Spark parallel computing performance, and that the proposed fast cache recovery strategy effectively speeds up data recovery and improves Spark's computation rate.
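The abstract does not spell out the partitioning algorithm, so the following is only a minimal sketch of the general idea of sampling intermediate keys with a reservoir, estimating skew from the sample, and deriving a balanced key-to-partition assignment. The function names (`reservoir_sample`, `estimate_key_weights`, `balanced_assignment`) and the greedy heaviest-key-first assignment are illustrative assumptions, not the method proposed in the paper, and the code runs in plain Python rather than inside Spark.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: uniform sample of up to k items from a stream of unknown length."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            reservoir.append(item)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def estimate_key_weights(sample):
    """Relative frequency of each key in the sample (a proxy for the real skew)."""
    counts = Counter(sample)
    total = len(sample)
    return {key: count / total for key, count in counts.items()}


def balanced_assignment(weights, num_partitions):
    """Hypothetical greedy strategy: heaviest key first, into the lightest partition."""
    loads = [0.0] * num_partitions
    assignment = {}
    for key, weight in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += weight
    return assignment, loads


# Skewed toy input standing in for intermediate map-output keys.
rng = random.Random(0)
keys = ["hot"] * 6000 + ["warm"] * 3000 + [f"k{i}" for i in range(1000)]
rng.shuffle(keys)

sample = reservoir_sample(iter(keys), 500, rng)
weights = estimate_key_weights(sample)
assignment, loads = balanced_assignment(weights, num_partitions=4)
print("estimated per-partition load:", [round(load, 3) for load in loads])
```

Note that a single key heavier than `1 / num_partitions` still overloads its partition under this sketch; handling that case would require splitting the key's records across partitions, which is outside what the abstract describes.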

LCS: an Efficient Data Eviction Strategy for Spark.

An Improved Memory Cache Management Study Based on Spark

A Request Skew Aware Heterogeneous Distributed Storage System Based on Cassandra

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

MCS: Memory Constraint Strategy for Unified Memory Manager in Spark.

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

Adaptive memory reservation strategy for heavy workloads in the Spark environment

Memory optimization of Spark parallel computing framework

SA-LSM: Optimize Data Layout for LSM-tree Based Storage using Survival Analysis

SAC: Dynamic Caching Upon Sketch for In-Memory Big Data Analytics

SP-Cache: Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition

Achieving Load-Balanced, Redundancy-Free Cluster Caching with Selective Partition

Optimizing Resource Allocation for Data-Parallel Jobs Via GCN-Based Prediction

Data Object Cache in Spark Computing Engine

Accelerating Big Data Applications on Tiered Storage System with Various Eviction Policies.

Efficient SSD Cache for Cloud Block Storage via Leveraging Block Reuse Distances

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

A data affinity based garbage collector for multi-bank flash-memory storage system

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

In-Memory Indexed Caching for Distributed Data Processing