Abstract: Data shuffling and cache recovery are both essential parts of the Spark system, and they directly affect Spark's parallel computing performance. Existing dynamic partitioning schemes that address data skew in the shuffle phase suffer from poor dynamic adaptability and insufficient granularity. To address these problems, this paper proposes a dynamic balanced partitioning method for the shuffle phase based on reservoir sampling. The method mitigates the impact of data skew on Spark performance by sampling and preprocessing intermediate data, predicting the overall data skew, and producing the overall partitioning strategy that the application executes. In addition, an inappropriate failure recovery strategy increases recovery overhead and makes data recovery inefficient. To address this issue, this paper proposes a checkpoint-based fast recovery strategy for the RDD cache. The strategy analyzes the task execution mechanism of the in-memory computing framework and combines the failure recovery model with weight information, derived from semantic analysis of the code to obtain detailed task information, to form a new failure recovery strategy that improves the efficiency of data recovery. Experimental results show that the proposed dynamic balanced partitioning method effectively reduces the total completion time of the application and improves Spark's parallel computing performance, and that the proposed fast cache recovery strategy effectively improves the speed of data recovery and the computing speed of Spark.
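The sampling-then-partitioning idea in the abstract can be illustrated with a minimal sketch. It is not the paper's algorithm: it assumes classic reservoir sampling (Algorithm R) over a stream of shuffle keys, estimates key frequencies from the sample, and then assigns keys to partitions with a simple largest-first greedy rule. The function names `reservoir_sample` and `plan_partitions`, the sample size, and the synthetic skewed key stream are all hypothetical and chosen only for illustration.

```python
import random
from collections import Counter


def reservoir_sample(stream, k, rng):
    """Algorithm R: keep a uniform random sample of k items from a stream
    whose length is not known in advance."""
    reservoir = []
    for i, item in enumerate(stream):
        if i < k:
            # Fill the reservoir with the first k items.
            reservoir.append(item)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = item
    return reservoir


def plan_partitions(sampled_keys, num_partitions):
    """Assumed balancing rule (not from the paper): visit keys from most to
    least frequent in the sample and place each on the least-loaded partition."""
    freq = Counter(sampled_keys)
    loads = [0] * num_partitions
    assignment = {}
    for key, count in freq.most_common():
        target = loads.index(min(loads))
        assignment[key] = target
        loads[target] += count
    return assignment, loads


if __name__ == "__main__":
    rng = random.Random(42)
    # Synthetic skewed key stream: one hot key plus 50 evenly spread keys.
    keys = ["hot"] * 5000 + [f"k{i % 50}" for i in range(5000)]
    rng.shuffle(keys)

    sample = reservoir_sample(iter(keys), 500, rng)
    assignment, loads = plan_partitions(sample, 4)
    print("estimated per-partition load from the sample:", loads)
```

In this sketch a single hot key still lands on one partition, so the greedy rule can only balance the remaining keys around it. Splitting a hot key across partitions would need finer granularity than a key-to-partition map, which is the kind of limitation the abstract attributes to existing schemes.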

SAC: Dynamic Caching Upon Sketch for In-Memory Big Data Analytics

TSCache

SAC: Accelerating and Structuring Self-Attention Via Sparse Adaptive Connection

DAG-aware harmonizing job scheduling and data caching for disaggregated analytics frameworks

An Improved Memory Cache Management Study Based on Spark

Data Object Cache in Spark Computing Engine

Sbac: A Statistics Based Cache Bypassing Method For Asymmetric-Access Caches

Application-centric SSD Cache Allocation for Hadoop Applications

In-memory big data analytics under space constraints using dynamic programming

Sampling-based Caching for Low Latency in Distributed Coded Storage Systems

MCS: Memory Constraint Strategy for Unified Memory Manager in Spark

Overview of Caching Mechanisms to Improve Hadoop Performance

GSC: Greedy Shard Caching Algorithm for Improved I/O Efficiency in GraphChi

“Anti-Caching”-based elastic memory management for Big Data

MiniTasking: improving cache performance for multiple query workloads

An Enhanced Active Caching Strategy for Data-Intensive Computations in Distributed GIS

Hierarchical Sketch: an Efficient, Scalable and Latency-aware Content Caching Design for Content Delivery Networks

Data balancing-based intermediate data partitioning and check point-based cache recovery in Spark environment

Efficient Execution of Multiple Queries on Deep Memory Hierarchy

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

Adaptive memory reservation strategy for heavy workloads in the Spark environment