FT-Offload: A Scalable Fault-Tolerance Programing Model on MIC Cluster

Cheng Chen,Yunfei Du,Zhen Xu,Canqun Yang
DOI: https://doi.org/10.1007/978-3-319-27140-8_1
2015-01-01
Abstract:Massively heterogeneous architectures are popular for modern petascale and future exascale systems. Fault-tolerance is key to the increased number of components and the complexity of these heterogeneous systems. However, standard offload programming models have traditionally been developed for supporting high performance rather than reliability. Naive fault-tolerance protocols are incapable of serving distributed MPI applications that tuned for CPU-MIC heterogeneous clusters. To address these problems, we design and implement a framework of fault tolerance programming model (FT-Offload). This enhances the reliability of heterogeneous supercomputers and retains the convenient of popular Intel Offload programming model. The effectiveness of the framework is demonstrated via numerical analysis and by porting both benchmarks and real-world applications to large-scale CPU-MIC nodes on the Tianhe-2 supercomputer. Our experimental results show that the current solution, which involves checkpoints, can efficiently strength the long running and reduce checkpointing overhead.
What problem does this paper attempt to address?