Abstract: Predicting the performance of an application running on parallel computing platforms is becoming increasingly important because of its influence on development time and resource management. However, predicting performance with respect to the number of parallel processes is complex for iterative and multi-stage applications. This research proposes FiM, a performance approximation approach that predicts the calculation time (with FiM-Cal) and the communication time (with FiM-Com) of an application running on a distributed framework. FiM-Cal consists of two tightly coupled components: (1) a stochastic Markov model that captures non-deterministic runtime, which often depends on parallel resources such as the number of processes, and (2) a machine-learning model that extrapolates the parameters used to calibrate the Markov model when application parameters, such as the dataset, change. In addition to parallel calculation time, parallel computing platforms spend time transferring data among nodes. FiM-Com uses a simulation-based queuing model to quickly estimate this communication time. Our modeling approach considers design choices along multiple dimensions, namely (i) process-level parallelism, (ii) distribution of cores on a multi-processor platform, (iii) application-related parameters, and (iv) characteristics of datasets. The major contribution of our prediction approach is that FiM can accurately predict parallel processing time for datasets that are much larger than the training datasets. We evaluate our approach with the NAS Parallel Benchmarks and real iterative data processing applications. We compare the predicted results (e.g., end-to-end execution time) with actual experimental measurements on a real distributed platform. We also compare our work with an existing prediction technique based on machine learning.
We rank the candidate numbers of processes according to both the actual results and the results predicted by FiM, and calculate the correlation between the two rankings. Our results show that FiM obtains a high correlation, in the range of 0.80 to 0.99, which indicates considerable accuracy of our technique. Such prediction gives data analysts useful insight into the optimal configuration of parallel resources (e.g., number of processes and number of cores) and also helps system designers investigate the impact of changes in application parameters on system performance.
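
The ranking-based evaluation described above can be illustrated with a minimal sketch. The abstract does not name the correlation coefficient, so Spearman's rank correlation is assumed here; the process counts and timings are hypothetical values chosen only for illustration, not measurements from the paper.

```python
from math import sqrt


def average_ranks(values):
    """Rank values in ascending order (1 = smallest); ties share their mean rank."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    i = 0
    while i < len(order):
        j = i
        # Extend j over a run of equal values so tied entries get the same rank.
        while j + 1 < len(order) and values[order[j + 1]] == values[order[i]]:
            j += 1
        mean_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            ranks[order[k]] = mean_rank
        i = j + 1
    return ranks


def pearson(x, y):
    """Pearson correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    std_x = sqrt(sum((a - mean_x) ** 2 for a in x))
    std_y = sqrt(sum((b - mean_y) ** 2 for b in y))
    return cov / (std_x * std_y)


def rank_correlation(actual, predicted):
    """Spearman correlation: Pearson correlation computed on the ranks."""
    return pearson(average_ranks(actual), average_ranks(predicted))


# Hypothetical execution times (seconds) for each candidate process count.
process_counts = [2, 4, 8, 16, 32]
actual_time = [120.0, 70.0, 45.0, 40.0, 52.0]
predicted_time = [115.0, 74.0, 50.0, 38.0, 47.0]

rho = rank_correlation(actual_time, predicted_time)
print(f"rank correlation over {len(process_counts)} configurations: {rho:.2f}")
```

A rank-based measure rewards a predictor for ordering the configurations correctly even when its absolute time estimates are off, which matches the stated goal of helping analysts pick the best number of processes.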

Using Small-Scale History Data to Predict Large-Scale Performance of HPC Application

Automated Performance Modeling of HPC Applications Using Machine Learning.

HPC Application Performance Prediction with Machine Learning on New Architectures

Machine Learning Based Performance Analysis and Prediction of Jobs on a HPC Cluster

Performance and power modeling and prediction using MuMMI and 10 machine learning methods

Multi-Parameter Performance Modeling Based on Machine Learning with Basic Block Features

Performance Prediction Of HPC Applications On Intel Processors

Automatic Multi-Parameter Performance Modeling of HPC Applications on a New Sunway Supercomputer

Using Hardware Counter-Based Performance Model to Diagnose Scaling Issues of HPC Applications.

Runtime prediction of big data jobs: performance comparison of machine learning algorithms and analytical models

Performance Modeling for MPI Applications with Low Overhead Fine-Grained Profiling.

Performance Modeling and Prediction of Big Data Workflows: an Exploratory Analysis.

Performance and Power Modeling and Prediction Using MuMMI and Ten Machine Learning Methods

Optimizing Job Scheduling by Using Broad Learning to Predict Execution Times on HPC Clusters

FIM: Performance Prediction for Parallel Computation in Iterative Data Processing Applications

A Collaborative Filtering Based Approach To Performance Prediction For Parallel Applications

New Performance Modeling Methods for Parallel Data Processing Applications.

Improving HPC System Performance by Predicting Job Resources via Supervised Machine Learning

Predictive performance and scalability modeling of a large-scale application

Analytics of Longitudinal System Monitoring Data for Performance Prediction

Learning with Analytical Models