Parallel Compression Checkpointing for Socket-Level Heterogeneous Systems

Yongpeng Liu, Hong Zhu, Yongyan Liu, Feng Wang, Baohua Fan
DOI: https://doi.org/10.1109/hpcc.2011.68
2011-01-01
Abstract: Checkpointing is an effective fault-tolerance technique for improving the reliability of large-scale parallel computing systems. However, checkpointing causes a large number of compute nodes to write a huge amount of data to the file system simultaneously. This not only requires huge storage space to hold the system state, but also puts tremendous pressure on the communication network and I/O subsystem, because a massive number of accesses is concentrated in a short period of time. Data compression can reduce the size of the checkpoint data that must be saved in the file system and sent through the communication network. However, compression induces a large time overhead, especially in large-scale parallel systems, which is the main technical barrier to its practical use. In this paper, we propose a parallel compression checkpointing technique to reduce the time overhead in socket-level heterogeneous architectures. It integrates a number of parallel processing techniques, including transmitting checkpoint data between CPU, GPU, and file system in double-buffered pipelines, aggregating file write operations, and running a SIMD parallel compression algorithm on the GPU. The paper also reports an implementation of the technique on the Tianhe-1 supercomputer and evaluation experiments on that system. The experimental data show that the technique is efficient and practical.
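To make the pipeline idea concrete, the following is a minimal CPU-only sketch, not the authors' implementation. It assumes that `zlib` on a host thread can stand in for the GPU compressor, that a bounded queue of depth 2 can stand in for the double buffers, and that joining several compressed blocks into one write can stand in for write aggregation. The function name `checkpoint_pipeline` and the parameters `depth` and `batch` are hypothetical.

```python
import io
import queue
import threading
import zlib


def checkpoint_pipeline(chunks, sink, depth=2, batch=4):
    """Compress chunks in one thread while another writes batched output.

    Illustrative only: zlib stands in for the GPU SIMD compressor
    described in the abstract, and all names here are hypothetical.
    """
    # A bounded queue of depth 2 behaves like a pair of buffers: the
    # compressor can fill one slot while the writer drains the other.
    buffers = queue.Queue(maxsize=depth)

    def writer():
        pending = []
        while True:
            block = buffers.get()
            if block is None:  # sentinel: no more compressed data
                break
            pending.append(block)
            if len(pending) == batch:
                # Aggregate several small blocks into a single write call.
                sink.write(b"".join(pending))
                pending.clear()
        if pending:
            sink.write(b"".join(pending))

    thread = threading.Thread(target=writer)
    thread.start()

    compressor = zlib.compressobj()
    for chunk in chunks:
        block = compressor.compress(chunk)
        if block:  # zlib may buffer input and emit nothing yet
            buffers.put(block)
    buffers.put(compressor.flush())
    buffers.put(None)
    thread.join()


if __name__ == "__main__":
    state = [bytes([i % 7]) * 4096 for i in range(32)]
    out = io.BytesIO()
    checkpoint_pipeline(state, out)
    assert zlib.decompress(out.getvalue()) == b"".join(state)
```

The bounded queue is the key design choice in this sketch: it lets compression and file output overlap while capping the memory held in flight, which is the role the double buffers play in the pipelines the abstract describes.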