CRState: In-Kernel Checkpoint/Restart of OpenCL Program Execution on GPU

Genlang Chen, Jiajian Zhang, Qiuru Lin, Hai Jiang, Chaoyi Pang
DOI: https://doi.org/10.1109/ICPADS47876.2019.00054
2019-01-01
Abstract: Checkpoint/restart is an important mechanism for achieving fault tolerance, load balancing, and resource sharing in a preemptive system. As Graphics Processing Units (GPUs) have become popular in high-performance computing and OpenCL programs are portable across various CPUs and GPUs, checkpoint/restart of OpenCL programs on GPUs is in demand. However, due to the intricacy of computation states inside GPUs, there is currently no effective checkpoint/restart scheme for heterogeneous devices. This paper proposes a feasible system, CRState, to achieve checkpoint/restart within GPU kernels. With the assistance of a pre-compiler, checkpoint/restart primitives are inserted into programs. At run time, the computation state residing in the underlying hardware is concretized and reconstructed at the application level, and can be ported to heterogeneous devices. Comprehensive experiments demonstrate CRState's feasibility and effectiveness. The experimental results also indicate that CRState has the potential to reschedule resources and balance workload across heterogeneous devices.
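The idea of concretizing in-kernel state at the application level can be illustrated with a small host-side simulation. This is only a sketch under stated assumptions, not CRState's actual primitives or pre-compiler output: the names `WorkItemState`, `kernel`, `run`, and `checkpoint_at` are hypothetical, and plain Python functions stand in for GPU work-items. Each simulated work-item copies its live values (loop index and accumulator) into an explicit record when a checkpoint is requested, and a later launch, possibly on a different device, resumes from that record.

```python
from dataclasses import dataclass


@dataclass
class WorkItemState:
    """Hypothetical application-level snapshot of one work-item."""
    next_i: int = 0     # loop index to resume from
    acc: int = 0        # live local variable (running sum)
    done: bool = False  # True once the work-item has finished


def kernel(gid, data, state, checkpoint_at=None):
    """Sum data[gid] for one simulated work-item, resuming from `state`.

    `checkpoint_at` mimics a checkpoint request observed at an inserted
    check point: the work-item saves its live values and exits early.
    """
    i, acc = state.next_i, state.acc  # restart: restore live values
    row = data[gid]
    while i < len(row):
        if checkpoint_at is not None and i == checkpoint_at:
            state.next_i, state.acc = i, acc  # checkpoint: save live values
            return state
        acc += row[i]
        i += 1
    state.next_i, state.acc, state.done = i, acc, True
    return state


def run(data, states, checkpoint_at=None):
    """Launch one simulated work-item per row of `data`."""
    return [kernel(g, data, states[g], checkpoint_at) for g in range(len(data))]


data = [[1, 2, 3, 4], [10, 20, 30, 40]]
states = [WorkItemState() for _ in data]

# First launch is interrupted at loop index 2 (e.g. on "device A").
states = run(data, states, checkpoint_at=2)
partial = [(s.next_i, s.acc, s.done) for s in states]

# Second launch resumes from the saved state (e.g. on "device B").
states = run(data, states)
results = [s.acc for s in states]
```

Because the saved record contains only ordinary values rather than hardware registers or program counters, the same record could in principle be handed to a kernel compiled for a different device, which is the portability property the abstract describes.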