Abstract: Recently, experiment-driven machine-learning (ML) based configuration tuning for in-memory data analytics frameworks such as Apache Spark has become popular because it can achieve high speedups. However, experiment-driven ML-based approaches inherently need a large number of iterations, and each iteration generates a configuration with a probabilistic strategy and executes the program on a real cluster with that configuration. Optimizing the performance of an in-memory data analytics program therefore takes a long time, which hinders these approaches from being widely used in practice. To address this issue, we propose a novel yet simple approach dubbed Terminating-It-Early (TIE) that reduces the time needed to perform the experiment executions while achieving speedups similar to those obtained by experiment-driven ML-based approaches. The key idea is that, while searching for the optimal configuration (the one that produces the shortest execution time for a program), we terminate an experiment execution with a trial configuration as soon as we find that its execution time exceeds a predefined threshold (e.g., the shortest execution time thus far). In contrast, traditional experiment-driven ML-based approaches always run every experiment execution to completion. We employ 19 Apache Spark programs running on a physical cluster as well as a virtual cluster to evaluate TIE. We compare the tuning time needed to find the optimal configuration of a program, and the optimized execution time of that program, obtained by TIE against those obtained by CherryPick and a reinforcement learning (RL) based approach. The experimental results show that on physical machines, TIE reduces the tuning time used by CherryPick and the RL-based approach by factors of 2.39× and 1.68× on average, respectively. On virtual machines, the corresponding factors are 2.79× and 1.71×. Moreover, the average optimized execution time of the 19 programs tuned by TIE is slightly shorter than that of the programs tuned by CherryPick and the RL-based approach.
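The early-termination idea can be illustrated with a small simulation. This is only a minimal sketch under stated assumptions, not the TIE implementation: `simulated_runtime` is a hypothetical cost model standing in for a real Spark run, the parameter names `executor_memory_gb` and `executor_cores` are illustrative, and plain random sampling replaces the probabilistic search strategies (such as those of CherryPick or an RL-based tuner) that the abstract refers to. The threshold is the shortest execution time found so far, which is the example threshold given in the abstract.

```python
import random
from typing import Dict, Optional, Tuple

Config = Dict[str, int]


def simulated_runtime(config: Config) -> float:
    """Hypothetical cost model (seconds) standing in for a real cluster run."""
    return (
        100.0
        + abs(config["executor_memory_gb"] - 8) * 10.0
        + abs(config["executor_cores"] - 4) * 5.0
    )


def run_with_cutoff(config: Config, threshold: Optional[float]) -> Tuple[float, bool]:
    """Return (time spent, completed). A run is cut off once it exceeds the threshold."""
    full = simulated_runtime(config)
    if threshold is not None and full > threshold:
        # The run is terminated early: only `threshold` seconds are spent on it.
        return threshold, False
    return full, True


def tune(num_trials: int, early_termination: bool, seed: int = 0):
    """Random-search tuner; returns (best config, best time, total tuning time)."""
    rng = random.Random(seed)
    best_config: Optional[Config] = None
    best_time: Optional[float] = None
    tuning_time = 0.0
    for _ in range(num_trials):
        # Assumption: uniform random sampling stands in for the ML-based search.
        config = {
            "executor_memory_gb": rng.randint(1, 16),
            "executor_cores": rng.randint(1, 8),
        }
        threshold = best_time if early_termination else None
        elapsed, completed = run_with_cutoff(config, threshold)
        tuning_time += elapsed
        if completed and (best_time is None or elapsed < best_time):
            best_config, best_time = config, elapsed
    return best_config, best_time, tuning_time


if __name__ == "__main__":
    _, base_best, base_total = tune(30, early_termination=False)
    _, tie_best, tie_total = tune(30, early_termination=True)
    print(f"baseline:          best={base_best:.1f}s, tuning time={base_total:.1f}s")
    print(f"early termination: best={tie_best:.1f}s, tuning time={tie_total:.1f}s")
```

In this sketch, a trial is cut off only when it is already slower than the best configuration found so far, so with the same sequence of trial configurations both variants end with the same best execution time, while the early-terminating variant never spends more total tuning time. A real tuner would additionally have to decide what to report back to the ML model for a terminated run, which this simulation does not model.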
