Abstract: The growing complexity and size of High Performance Computing systems (HPCs) lead to frequent job failures, which may cause significant performance degradation. To provide high-performance and reliable computing services, an in-depth understanding of the characteristics of HPC job failures is essential. In this paper, we present an empirical study of job failures in 10 public workload data sets collected from 8 large-scale HPCs around the world. Multiple analysis methods are applied to provide a comprehensive understanding of job failures. To facilitate the design, testing and management of HPCs, we study the properties of job failures from four aspects: proportion in workload and resource consumption, submission inter-arrival time, locality, and runtime. Our analysis shows that job failure rates are significant in most HPCs, and that on average a failed job often consumes more computational resources than a successful job. We also observe that the submission inter-arrival time of failed jobs is better fit by Generalized Pareto and Lognormal distributions, and that the probability of a failed job submission follows a “V” shape: it decreases during the first 100 seconds after the submission of the last failed job and increases afterward. The majority of job failures come from a small number of users and applications, and users, rather than applications, are the primary factor associated with job failures. We find evidence that failed jobs’ lifetime accuracy (runtime / request time) always follows the “bathtub curve”. Moreover, job failures exhibit strong locality properties that can support the prediction of failed jobs’ occurrence and runtime. Most of these findings are new contributions to the research community, and some reveal important properties of job failures that were previously misunderstood or poorly understood. The wide range of studies in this paper can directly support fault tolerance, scheduling, workload modeling, etc. in HPCs, and lead to better system utility while reducing costs.
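
As a rough illustration of two quantities named in the abstract, the sketch below computes the submission inter-arrival times of failed jobs and the lifetime accuracy (runtime / request time) of each failed job. The `Job` record, its field names, and the sample values are hypothetical and are not taken from the paper's data sets; real workload traces would need their own parsing and their own definition of what counts as a failed job.

```python
from dataclasses import dataclass


@dataclass
class Job:
    # Hypothetical record layout; real traces use their own field names.
    submit_time: float     # seconds since the start of the trace
    runtime: float         # seconds the job actually ran
    requested_time: float  # wall-clock time requested by the user, in seconds
    failed: bool


def failed_interarrival_times(jobs):
    """Gaps, in seconds, between consecutive failed-job submissions."""
    times = sorted(j.submit_time for j in jobs if j.failed)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def lifetime_accuracy(job):
    """Runtime divided by requested time, as defined in the abstract."""
    return job.runtime / job.requested_time


# Made-up sample jobs, for illustration only.
jobs = [
    Job(submit_time=0, runtime=50, requested_time=100, failed=True),
    Job(submit_time=30, runtime=100, requested_time=100, failed=False),
    Job(submit_time=40, runtime=10, requested_time=200, failed=True),
    Job(submit_time=400, runtime=300, requested_time=400, failed=True),
]

gaps = failed_interarrival_times(jobs)
accuracies = [lifetime_accuracy(j) for j in jobs if j.failed]
```

The resulting `gaps` list is the kind of sample one could then fit against candidate distributions such as Generalized Pareto or Lognormal, and `accuracies` is the kind of sample whose histogram could be checked for a bathtub shape.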

Job Failures in High Performance Computing Systems: A Large-Scale Empirical Study
