Abstract:
Background: In the data era, big data systems have emerged as pivotal tools, underscoring the importance of performance prediction in enhancing the efficiency of big data clusters. Numerous performance models have been proposed, often grounded in artificial intelligence or simulation methodologies. While the bulk of research focuses on refining prediction precision and minimizing overhead, limited attention has been given to the consignation and standardization of these models.
Objectives: To bridge this gap between model developers and end‐users, this paper introduces AMORA, a novel, versatile framework for predicting the performance of big data systems.
Methods: Leveraging the behavior descriptions‐computation submodels (BD‐CS) pattern, which is prevalent among big data job performance models, AMORA accepts plugins that accommodate different performance models' implementations. The framework also integrates a novel mutable computation graph technique to facilitate backtracking computation. Furthermore, AMORA provides end‐to‐end usability: it accepts original configuration files from diverse big data systems and presents easily interpretable prediction reports.
Results: This work demonstrates AMORA's efficacy in producing an accurate trace of a Hadoop job through the selection of appropriate performance model plugins and parameter adjustments, and showcases the application of the proposed mutable computation graph technique in calculating the starting moment of an early‐start reducer. Additionally, two validation experiments are conducted, implementing various Hadoop and Spark performance models, respectively. The experimental results report the prediction precision and overheads of these performance models.
Conclusion: These experiments exhibit AMORA's role as a benchmark platform for implementing various types of big data job performance models catering to diverse big data systems.

Fixed-point Iteration Approach to Spark Scalable Performance Modeling and Evaluation

Hybrid Performance Modeling and Analyzing of Parallel Systems

Performance Evaluation for SDN Deployment: An Approach Based on Stochastic Network Calculus

Neural-based Modeling for Performance Tuning of Spark Data Analytics

An Analytical Performance Model of MapReduce

A Stack-Centric Processing Model for Iterative Processing

Towards General and Efficient Online Tuning for Spark

Reliable Estimation of Execution Time of MapReduce Program

Optimization for Spark Mission Performance Based on Data Characteristics

The Tiny-Tasks Granularity Trade-Off: Balancing overhead vs. performance in parallel systems

HybridTune: Spatio-temporal Data and Model Driven Performance Diagnosis for Big Data Systems

Architectural Impact on Performance of In-memory Data Analytics: Apache Spark Case Study

A Benchmarking Study to Evaluate Apache Spark on Large-Scale Supercomputers

AMORA: An Advanced Malleable and Operational Framework for Performance Prediction of Big Data Systems

Machine Learning for Performance Prediction of Spark Cloud Applications

NoStop: A Novel Configuration Optimization Scheme for Spark Streaming

Modeling and Simulation of Spark Streaming

Benchmarking and Performance Modelling of MapReduce Communication Pattern

Apache Spark Streaming, Kafka and HarmonicIO: A Performance Benchmark and Architecture Comparison for Enterprise and Scientific Computing

Adaptive memory reservation strategy for heavy workloads in the Spark environment

Model Averaging in Distributed Machine Learning: a Case Study with Apache Spark