Dependency-Aware Rollback and Checkpoint-Restart for Distributed Task-Based Runtimes
Kiril Dichev, Herbert Jordan, Konstantinos Tovletoglou, Thomas Heller, Dimitrios S. Nikolopoulos, Georgios Karakonstantis, Charles Gillan
DOI: https://doi.org/10.48550/arXiv.1705.10208
2017-05-29
Abstract: With the increase in compute nodes in large compute platforms, a proportional increase in node failures will follow. Many application-based checkpoint/restart (C/R) techniques have been proposed for MPI applications to target the reduced mean time between failures. However, rollback as part of the recovery remains a dominant cost even in highly optimised MPI applications employing C/R techniques. Continuing execution past a checkpoint (that is, reducing rollback) is possible in message-passing runtimes, but extremely complex to design and implement. Our work focuses on task-based runtimes, where task dependencies are explicit and message passing is implicit. We see an opportunity for reducing rollback for such runtimes: we explore task dependencies in the rollback, which we call dependency-aware rollback. We also design a new C/R technique, which is influenced by recursive decomposition of tasks, and combine it with dependency-aware rollback. We expect dependency-aware rollback to cancel and recompute fewer tasks in the presence of node failures. We describe, implement and validate the proposed protocol in a simulator, which confirms these expectations. In addition, we consistently observe faster overall execution time for dependency-aware rollback in the presence of faults, even though reduced task cancellation does not guarantee reduced overall execution time.
Distributed, Parallel, and Cluster Computing
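
Illustrative sketch (not the paper's protocol): the abstract contrasts a rollback that discards all work since the last checkpoint with a dependency-aware rollback that cancels fewer tasks. The Python below is one possible reading of that idea under simplifying assumptions: the task graph is a DAG, each task runs on exactly one node, and a task must be recomputed only if it ran on the failed node after the last checkpoint or transitively depends on such a task. The names `global_rollback`, `dependency_aware_rollback`, `dependents` and `placement` are hypothetical and chosen only for this example.

```python
from collections import deque


def global_rollback(post_checkpoint_tasks):
    """Baseline: cancel every task started since the last checkpoint."""
    return set(post_checkpoint_tasks)


def dependency_aware_rollback(dependents, placement, post_checkpoint_tasks, failed_node):
    """Cancel only tasks lost on the failed node and their transitive dependents.

    dependents: dict mapping a task to the tasks that consume its output (assumed DAG).
    placement:  dict mapping a task to the node it ran on (assumed one node per task).
    """
    post = set(post_checkpoint_tasks)
    # Tasks whose results vanished with the failed node.
    lost = {t for t in post if placement[t] == failed_node}
    affected = set(lost)
    queue = deque(lost)
    # Breadth-first walk along dependency edges to find everything downstream.
    while queue:
        task = queue.popleft()
        for child in dependents.get(task, ()):
            if child in post and child not in affected:
                affected.add(child)
                queue.append(child)
    return affected


# Hypothetical five-task graph: a -> c -> e and b -> d -> e.
dependents = {"a": ["c"], "b": ["d"], "c": ["e"], "d": ["e"]}
placement = {"a": "n0", "b": "n1", "c": "n0", "d": "n1", "e": "n0"}
tasks = ["a", "b", "c", "d", "e"]

cancelled_all = global_rollback(tasks)
cancelled_dep = dependency_aware_rollback(dependents, placement, tasks, "n1")
print(sorted(cancelled_all))
print(sorted(cancelled_dep))
```

In this toy graph, a failure of node `n1` leaves tasks `a` and `c` untouched under the dependency-aware rule, whereas the baseline cancels all five. As the abstract notes, cancelling fewer tasks does not by itself guarantee a shorter overall execution time.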