Event-Driven Fault Tolerance for Building Nonstop Active Message Programs

Chao Li, Changhai Zhao, Haihua Yan, Jianlei Zhang
DOI: https://doi.org/10.1109/HPCC.and.EUC.2013.62
2013-01-01
Abstract: With the decreasing Mean Time Between Failures (MTBF) of high performance computing systems, process failure has become a normal phenomenon rather than an exception. Such frequent failures make fault tolerance a key feature of high performance applications. To provide fault tolerance interfaces for active message programs, this paper proposes a novel model called event-driven fault tolerance. The model converts each detected process failure into an event containing detailed failure information about the execution context, and then schedules the event up to the application layer by executing user-directed event handlers that drive the program to recover from the fault. Based on these events, the model can provide applications with dynamic process groups and fault-tolerant communication interfaces. We present an implementation of the model called EDFT (Event Driven Fault Tolerance) and describe its architecture, principles, components, and application programming interfaces (API). To evaluate the model, we incorporate EDFT into a scientific application, PreStack Depth Migration (PSDM). Experiments are conducted by injecting various kinds of faults into PSDM while it is running. Experimental results show that, for active message programs that demand high performance, the event-driven fault tolerance model promises strong robustness, low overhead, and high scalability.
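The event flow described in the abstract (a detected failure becomes an event, the event is passed to user-supplied handlers, and the process group shrinks accordingly) can be illustrated with a minimal single-process sketch. This is not EDFT's actual API, which the abstract does not show: the names `FailureEvent`, `ProcessGroup`, `on_failure`, `report_failure`, and the task label `"shot-17"` are all hypothetical, and real failure detection and message passing are omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Set


@dataclass(frozen=True)
class FailureEvent:
    """Hypothetical record describing one detected process failure."""
    failed_rank: int   # rank of the process that failed
    pending_task: str  # work the failed process was executing (its context)


@dataclass
class ProcessGroup:
    """Hypothetical dynamic process group that shrinks when members fail."""
    alive: Set[int]
    handlers: List[Callable[["ProcessGroup", FailureEvent], None]] = field(
        default_factory=list
    )

    def on_failure(self, handler: Callable[["ProcessGroup", FailureEvent], None]) -> None:
        # Register a user-supplied handler to run for every failure event.
        self.handlers.append(handler)

    def report_failure(self, event: FailureEvent) -> None:
        # Remove the failed rank first, so handlers see the updated group.
        self.alive.discard(event.failed_rank)
        # Deliver the event to the application layer by running its handlers.
        for handler in self.handlers:
            handler(self, event)


# Application-level recovery policy (an assumption for illustration):
# hand the interrupted task to the lowest-ranked surviving process.
reassigned: Dict[str, int] = {}


def reassign_task(group: ProcessGroup, event: FailureEvent) -> None:
    reassigned[event.pending_task] = min(group.alive)


group = ProcessGroup(alive={0, 1, 2, 3})
group.on_failure(reassign_task)

# Simulate the runtime detecting that rank 2 failed while running "shot-17".
group.report_failure(FailureEvent(failed_rank=2, pending_task="shot-17"))
```

In this sketch the recovery policy lives entirely in the handler, so the application decides how to react to a failure while the runtime only detects it and updates group membership.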