A Simulation Algorithm of Cluster System's Availability within Repair Time Constraints

高文,祝明发,徐志伟
DOI: https://doi.org/10.3321/j.issn:0254-4164.2001.08.016
2001-01-01
Chinese Journal of Computers
Abstract: To achieve a high probability of run success, a cluster system requires low failure rates for critical nodes, whose failures must be corrected before a run abort or run degradation occurs. While the cluster system is running, failures and repairs can occur at any time and even simultaneously, so in-run repair or replacement of nodes subject to random times to failure must be completed before the maximum node downtime is exceeded. In-run maintenance must also take into account factors such as time to repair, failure rates, repair rates, and the distributions of these factors. General theoretical analyses of cluster system availability usually assume that the system's failure rates and repair rates are constant. Although such assumptions are convenient for theoretical analysis, the deterministic approach to obtaining cluster system availability is often too difficult to apply within budget constraints, and its results are inaccurate under some circumstances. Moreover, it is complicated to describe all combinations of node states and to determine the distributions of failure time and repair time. It is therefore necessary to study cluster system availability through computer simulation. In this paper, we first define the cluster system's running state as a function within repair time constraints. We then quantitatively describe all combinations of node states within the cluster system. Since cluster system availability depends mainly on the probability distributions of the nodes' Time Between Failure and Time To Repair, we give the probability distributions of these random variables and a sampling method for the nodes in the cluster system. By identifying the failure state and repair state of each node, we can determine the working state of the cluster system.
Based on failure and repair experience, this paper gives simulation results for the case where Time Between Failure is assumed to be exponentially distributed (a single-parameter distribution) and Time To Repair is assumed to be lognormally distributed (a two-parameter distribution). Other distributions can easily be substituted. The objectives, methodology, logic, and significant features of the simulation are described in detail, and an algorithm and computer program are developed to estimate availability through simulation. The approach provides a quantitative foundation for the analysis and design of cluster system availability.
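
The kind of simulation the abstract describes can be illustrated with a minimal Monte Carlo sketch. This is not the authors' algorithm, which the abstract does not spell out. It assumes that nodes fail independently, that every node is critical, and that a run succeeds only if no repair takes longer than the maximum node downtime. The function names and all parameters (`failure_rate`, `mu`, `sigma`, `max_downtime`, `mission_time`) are hypothetical and chosen only for illustration.

```python
import random


def simulate_run(rng, n_nodes, failure_rate, mu, sigma, max_downtime, mission_time):
    """Simulate one run; return True if no repair exceeds max_downtime.

    Assumption (not from the paper): nodes are independent and all critical.
    """
    for _ in range(n_nodes):
        t = 0.0
        while True:
            # Time Between Failure: exponential, single parameter (rate).
            t += rng.expovariate(failure_rate)
            if t >= mission_time:
                break  # this node survives to the end of the run
            # Time To Repair: lognormal, two parameters (mu, sigma).
            repair = rng.lognormvariate(mu, sigma)
            if repair > max_downtime:
                return False  # repair time constraint violated
            t += repair  # node is back up after the repair
    return True


def estimate_run_success(n_runs, seed=0, **params):
    """Estimate the probability of run success as the fraction of good runs."""
    rng = random.Random(seed)
    successes = sum(simulate_run(rng, **params) for _ in range(n_runs))
    return successes / n_runs


if __name__ == "__main__":
    # Illustrative values only; they are not taken from the paper.
    p = estimate_run_success(
        10_000,
        n_nodes=8,
        failure_rate=0.001,   # failures per hour
        mu=0.0, sigma=0.5,    # lognormal parameters of repair time (hours)
        max_downtime=3.0,     # hours
        mission_time=500.0,   # hours
    )
    print(f"estimated probability of run success: {p:.4f}")
```

Because the sampling calls are isolated in two lines, another distribution can be substituted by replacing `expovariate` or `lognormvariate` with a different sampler, which mirrors the abstract's remark that other distributions can easily be substituted.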