Heterogeneous 1-out-of-n Warm Standby Systems with Online Checkpointing

Gregory Levitin, Liudong Xing, Yuanshun Dai
DOI: https://doi.org/10.1016/j.ress.2017.08.011
IF: 7.247
2017-01-01
Reliability Engineering & System Safety
Abstract: As a common practice in computing-related applications, checkpointing is used to facilitate effective system recovery when failures occur. Checkpoints are performed to save data associated with the completed portion of a mission task. In the case of a failure, through rollback and data retrieval, the system can resume the mission task from the last successful checkpoint instead of from the very beginning of the mission, saving time and cost. This paper models and optimizes 1-out-of-N: G warm standby systems subject to uneven online checkpointing, where checkpoints can be performed in parallel with execution of the primary mission task to improve the efficiency of computing elements. Both data checkpointing and retrieval take a variable amount of time, depending on the amount of work completed. System elements can be heterogeneous in their time-to-failure distribution, performance, and level of readiness to take over the mission task during the warm standby mode. A numerical method is first suggested to evaluate mission performance indices, including mission success probability, expected mission completion time, and expected mission operation cost. Examples are provided to demonstrate the influence of the mission deadline and the element resource sharing parameter (i.e., the CPU time distribution between the checkpointing procedure and the primary mission task) on the mission performance metrics. The optimal checkpoint distribution and optimal element activation sequencing problems are considered for different combinations of optimization objectives and constraints. A co-optimization problem is further addressed, which aims to find the optimal combination of checkpoint distribution and element activation sequence. Example optimization solutions illustrate the tradeoff among the three mission requirements (reliability, completion time, operation cost) for warm standby systems with online checkpoints.
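The rollback idea in the abstract can be illustrated with a minimal sketch. It is not the paper's numerical method: it ignores checkpoint and retrieval times, standby readiness, and failure distributions, and only shows how unevenly spaced checkpoints determine the work retained after a failure. The function name `resume_point` and the sample numbers are hypothetical.

```python
from bisect import bisect_right


def resume_point(checkpoints, progress_at_failure):
    """Return the amount of work retained after a failure.

    checkpoints: sorted amounts of completed work at which a checkpoint
    finished successfully (hypothetical units, unevenly spaced).
    progress_at_failure: work completed when the element failed.
    """
    # Index of the first checkpoint strictly beyond the failure point.
    i = bisect_right(checkpoints, progress_at_failure)
    # With no checkpoint reached yet, the task restarts from the beginning.
    return checkpoints[i - 1] if i > 0 else 0.0


checkpoints = [20.0, 50.0, 70.0]   # assumed uneven checkpoint positions
failure_at = 65.0                  # assumed progress at the time of failure
saved = resume_point(checkpoints, failure_at)
redo = failure_at - saved          # work the next element must repeat
print(saved, redo)
```

In this sketch the replacement element resumes from the checkpoint at 50 units and repeats 15 units of work, rather than all 65.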