Abstract: The growing complexity and size of High Performance Computing systems (HPCs) lead to frequent job failures, which may cause significant performance degradation. In order to provide high-performance and reliable computing services, an in-depth understanding of the characteristics of HPC job failures is essential. In this paper, we present an empirical study on job failures in 10 public workload data sets collected from 8 large-scale HPCs all over the world. Multiple analysis methods are applied to provide a comprehensive and in-depth understanding of job failures. In order to facilitate the design, testing, and management of HPCs, we study properties of job failures from the following four aspects: proportion in workload and resource consumption, submission inter-arrival time, locality, and runtime. Our analysis results show that job failure rates are significant in most HPCs, and on average, a failed job often consumes more computational resources than a successful job. We also observe that the submission inter-arrival time of failed jobs is better fit by Generalized Pareto and Lognormal distributions, and the probability of failed job submission follows a “V” shape: decreasing during the first 100 seconds right after the submission of the last failed job and increasing afterward. The majority of job failures come from a small number of users and applications, and these users are more strongly related to job failures than these applications are. We find evidence that failed jobs’ lifetime accuracy (runtime / requested time) always follows the “bathtub curve”. Moreover, job failures exhibit strong locality properties that can support the prediction of failed jobs’ occurrence and runtime. Most of these findings are new contributions to the research community, and some findings also reveal important properties of job failures that were misunderstood or poorly understood before. The wide range of studies in this paper can directly and thoroughly facilitate fault tolerance, scheduling, and workload modeling in HPCs, and lead to better system utility while reducing costs.
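The quantities named in the abstract (failure proportion, resource share, submission inter-arrival time, and lifetime accuracy) can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the `Job` record, its field names, and the sample values are hypothetical and do not come from the paper's data sets, and processor-seconds is used only as one plausible measure of resource consumption. The Lognormal fit uses the standard closed-form maximum-likelihood estimate; a Generalized Pareto fit would need a numerical fitter and is not shown.

```python
import math
from dataclasses import dataclass
from statistics import fmean, pstdev


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are illustrative only."""
    submit_s: float      # submission time, seconds since trace start
    runtime_s: float     # actual runtime in seconds
    requested_s: float   # user-requested runtime in seconds
    processors: int      # number of allocated processors
    failed: bool         # True if the job ended in failure


# Made-up sample trace, not taken from any real workload.
JOBS = [
    Job(0, 120, 600, 4, False),
    Job(30, 10, 3600, 16, True),
    Job(45, 3500, 3600, 16, True),
    Job(400, 300, 900, 8, False),
    Job(1000, 50, 1800, 32, True),
]


def failure_rate(jobs):
    """Fraction of jobs in the workload that failed."""
    return sum(1 for j in jobs if j.failed) / len(jobs)


def failed_resource_share(jobs):
    """Fraction of processor-seconds consumed by failed jobs."""
    total = sum(j.runtime_s * j.processors for j in jobs)
    failed = sum(j.runtime_s * j.processors for j in jobs if j.failed)
    return failed / total


def failed_interarrival_times(jobs):
    """Gaps between consecutive failed-job submissions, in seconds."""
    times = sorted(j.submit_s for j in jobs if j.failed)
    return [later - earlier for earlier, later in zip(times, times[1:])]


def lifetime_accuracy(jobs):
    """runtime / requested time for each failed job."""
    return [j.runtime_s / j.requested_s for j in jobs if j.failed]


def lognormal_mle(samples):
    """Closed-form MLE (mu, sigma) of a Lognormal for positive samples."""
    logs = [math.log(x) for x in samples]
    return fmean(logs), pstdev(logs)


if __name__ == "__main__":
    gaps = failed_interarrival_times(JOBS)
    print("failure rate:", failure_rate(JOBS))
    print("failed resource share:", failed_resource_share(JOBS))
    print("failed inter-arrival times:", gaps)
    print("lifetime accuracy:", lifetime_accuracy(JOBS))
    print("lognormal (mu, sigma):", lognormal_mle(gaps))
```

On a real trace, the inter-arrival gaps would be compared against several candidate distributions with a goodness-of-fit test before concluding which one fits best; this sketch only shows how the raw quantities are derived.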

Job Failures in High Performance Computing Systems: A Large-Scale Empirical Study