Abstract: Spark is widely used for its high-performance caching mechanism and high scalability, yet on data-intensive applications it still suffers from uneven workloads and produces useless cached intermediate results. A data placement strategy based on an improved reservoir sampling algorithm is proposed to solve the problem of intermediate data skew in the shuffle stage of Spark. Unlike the traditional sampling algorithm, it accumulates the amount of intermediate data while sampling. A data skew measurement model is used to classify data as skewed or non-skewed, and coarse-grained and fine-grained placement algorithms are designed. To further improve Spark's memory utilization and cache hit rate, an adaptive cache replacement algorithm is proposed to maximize cache gain. We analyze the dependencies between operations and propose a cache gain model. Unlike the traditional method, the cases of known and unknown job arrival rates are considered separately, yielding an online adaptive cache replacement strategy that maximizes cache gain. Experimental results show that our data placement strategy effectively reduces the execution time of Spark applications and improves the load balance of reduce tasks. Meanwhile, the proposed adaptive cache replacement strategy effectively reduces Spark's average completion time and improves memory utilization and cache hit rate.
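The sampling idea in the abstract can be illustrated with a minimal sketch. It is not the paper's improved algorithm, whose details the abstract does not give: it is classic reservoir sampling (Algorithm R) extended with a running record count, so the sample can be scaled up to estimate per-key data volume. The function names, the `tolerance` parameter, and the rule "a key is skewed if its estimated volume exceeds `tolerance` times the mean partition load" are all assumptions made for illustration.

```python
import random
from collections import Counter


def reservoir_sample_with_count(records, k, seed=None):
    """Classic Algorithm R, plus a running count of records seen.

    Returns (sample, total): a uniform sample of at most k records
    and the total number of records scanned in the single pass.
    """
    rng = random.Random(seed)
    reservoir = []
    total = 0
    for record in records:
        total += 1  # accumulate the data amount while sampling
        if len(reservoir) < k:
            reservoir.append(record)
        else:
            j = rng.randrange(total)  # uniform integer in [0, total)
            if j < k:
                reservoir[j] = record
    return reservoir, total


def estimate_skewed_keys(sample, total, num_partitions, tolerance=2.0):
    """Flag keys whose estimated volume exceeds tolerance * mean load.

    Hypothetical threshold rule, not the paper's skew measurement model.
    Each record is assumed to be a (key, value) pair.
    """
    if not sample:
        return {}
    scale = total / len(sample)          # records represented per sampled record
    mean_load = total / num_partitions   # load per partition if perfectly balanced
    counts = Counter(key for key, _ in sample)
    return {
        key: count * scale
        for key, count in counts.items()
        if count * scale > tolerance * mean_load
    }
```

A placement step could then treat the flagged keys separately, for example by splitting them across several reduce tasks, while the remaining keys are assigned in bulk. That split is only one possible reading of the coarse-grained and fine-grained placement the abstract mentions.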

Performance Evaluation And Optimization Of Join Operation In Spark For Big Data Processing

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Performance Evaluation for Distributed Join Based on MapReduce.

Distributed High-Dimension Matrix Operation Optimization on Spark

An Effective High-Performance Multiway Spatial Join Algorithm with Spark

Research on Query Analysis and Optimization Based on Spark

Optimization of collaborative filtering algorithm based on DAG Spark scheduling

Optimization of Spark Storage Solutions

An Efficient Theta-Join Query Processing in Distributed Environment

Performance optimization of Item Based recommendation algorithm based on Spark

The Optimization of Cost-Model for Join Operator on Spark SQL Platform

Optimization of Data Distribution Strategy in Theta-join Process Based on Spark

Benchmarking of Distributed Computing Engines Spark and GraphLab for Big Data Analytics

Improving Spark Performance with Zero-Copy Buffer Management and RDMA

Workload Driven Comparison And Optimization Of Hive And Spark Sql

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

Efficiency Optimization Method for MapReduce Similarity Computing Based on Spark

Efficient shuffle management with SCache for DAG computing frameworks.

Memory or Time: Performance Evaluation for Iterative Operation on Hadoop and Spark

OPTIMIZATION FOR SPARK MISSION PERFORMANCE BASED ON DATA CHARACTERISTICS

Intermediate data placement and cache replacement strategy under Spark platform