Abstract: Spark, a distributed data analytics system, is a common choice for processing massive volumes of heterogeneous data, but tuning its parameters to achieve high performance is challenging. Recent studies employ auto-tuning techniques to solve this problem but suffer from three issues: limited functionality, high overhead, and inefficient search. In this paper, we present a general and efficient Spark tuning framework that addresses all three issues simultaneously. First, we introduce a generalized tuning formulation that conveniently supports multiple tuning goals and constraints, together with a Bayesian optimization (BO) based solution to this generalized optimization problem. Second, to avoid the high overhead of the additional offline evaluations required by existing methods, we propose to tune parameters along with the actual periodic executions of each job (i.e., online evaluations). To ensure safety during online job executions, we design a safe configuration acquisition method that models the safe region. Finally, three techniques further accelerate the search process: adaptive sub-space generation, approximate gradient descent, and a meta-learning method. We have implemented this framework as an independent cloud service and applied it to the data platform at Tencent. The empirical results on both public benchmarks and large-scale production tasks demonstrate its superiority in terms of practicality, generality, and efficiency. Notably, this service saves an average of 57.00% of memory cost and 34.93% of CPU cost on 25K in-production tasks within 20 iterations.
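The idea of tuning during periodic job executions under a safety constraint can be illustrated with a small sketch. This is not the paper's method: the BO surrogate is replaced by plain random local search, the "safe region" is approximated by a fixed box around the best configuration seen so far, and the job is a toy cost model. The names `simulated_job`, `propose_nearby`, `tune_online`, the parameter names, and all numeric values are hypothetical and chosen only for illustration.

```python
import random


def simulated_job(config):
    """Hypothetical stand-in for one periodic job execution.

    Returns (cost, runtime). The formulas are invented for illustration
    and do not model real Spark behaviour.
    """
    mem = config["executor_memory_gb"]
    cores = config["executor_cores"]
    runtime = 600.0 / cores + 400.0 / mem  # more resources -> faster
    cost = 1.0 * mem + 2.0 * cores         # more resources -> more expensive
    return cost, runtime


def propose_nearby(config, bounds, step, rng):
    """Sample a candidate in a small box around the incumbent.

    The box is a crude assumption standing in for a modelled safe region.
    Values are kept continuous for simplicity, including the core count.
    """
    candidate = {}
    for name, (low, high) in bounds.items():
        delta = rng.uniform(-step, step) * (high - low)
        candidate[name] = min(high, max(low, config[name] + delta))
    return candidate


def tune_online(initial, bounds, runtime_limit, iterations=20, step=0.1, seed=0):
    """Minimise cost subject to runtime <= runtime_limit, one run per iteration."""
    rng = random.Random(seed)
    best = dict(initial)
    best_cost, best_runtime = simulated_job(best)
    history = [(best, best_cost, best_runtime)]
    for _ in range(iterations):
        candidate = propose_nearby(best, bounds, step, rng)
        cost, runtime = simulated_job(candidate)
        history.append((candidate, cost, runtime))
        # Accept only candidates that satisfy the constraint and reduce cost.
        if runtime <= runtime_limit and cost < best_cost:
            best, best_cost, best_runtime = candidate, cost, runtime
    return best, best_cost, best_runtime, history


bounds = {"executor_memory_gb": (2.0, 32.0), "executor_cores": (1.0, 16.0)}
initial = {"executor_memory_gb": 16.0, "executor_cores": 8.0}
initial_cost, initial_runtime = simulated_job(initial)
best, best_cost, best_runtime, history = tune_online(
    initial, bounds, runtime_limit=150.0
)
print(f"cost {initial_cost:.1f} -> {best_cost:.1f}, runtime {best_runtime:.1f}s")
```

Each iteration corresponds to one real execution of the job, so no extra offline runs are needed. Because candidates stay close to a configuration already known to meet the runtime limit, large regressions are less likely, although this simple box gives no guarantee.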

Optimization of Spark Storage Solutions

Optimization Factor Analysis Of Large-Scale Join Queries On Different Platforms

Distributed High-Dimension Matrix Operation Optimization on Spark

OPTIMIZATION FOR SPARK MISSION PERFORMANCE BASED ON DATA CHARACTERISTICS

Memory optimization of Spark parallel computing framework

Towards General and Efficient Online Tuning for Spark

An Improved Memory Cache Management Study Based on Spark

Improving Spark Performance with Zero-Copy Buffer Management and RDMA

Towards Optimizing Storage Costs on the Cloud

SparkRDF: Elastic Discreted RDF Graph Processing Engine with Distributed Memory

QHB+: Accelerated Configuration Optimization for Automated Performance Tuning of Spark SQL Applications

The Optimization of Cost-Model for Join Operator on Spark SQL Platform

A Survey on Spark Ecosystem for Big Data Processing

Design and Implementation of Parallel DBSCAN Algorithm Based on Spark

Adaptive memory reservation strategy for heavy workloads in the Spark environment

Research on Optimization of Random Forest Algorithm Based on Spark

TR-Spark

Optimizing data locality by executor allocation in spark computing environment

Rethinking Storage Management for Data Processing Pipelines in Cloud Data Centers

A Real-Time Partition Generation Mechanism for Data Skew Mitigation in Spark Computing Environment

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study