Speculation-Based Distributed Simulation for Dependability and Performance Analysis

Ravishankar K. Iyer, Yiqing Huang
1999-01-01
Abstract: This research focuses on developing new methods for fast simulation of large, complex systems for dependability and performance analysis. A key contribution of the thesis is the idea of using speculation to speed up the overall system simulation while still conducting the detailed simulation of subsystems that is essential to obtaining an accurate overall result. A second contribution is a set of error recovery mechanisms for ServerNet™ network interface software that improve its dependability. A third contribution is the study of two examples based on real systems. The proposed simulation methods are demonstrated in these two examples, which analyze ServerNet-based systems. Dependability simulation of a reliable cluster system using the speculation-based method shows that detailed simulation is important for obtaining accurate measures such as error detection/correction latency. Performance simulation of the same system provides results such as cache access time. Simulation time is greatly reduced, with speedups ranging from 40% to 200% compared with simulation without speculation. Error detection and recovery schemes are important for software design. Software-based error recovery schemes are proposed for the ServerNet system area network (ServerNet SAN) to improve the dependability of the network interface software. The detection and recovery focus on memory faults that corrupt network interface software. The proposed techniques are by no means limited to the ServerNet architecture; they are also useful for recovery in the new-generation, high-performance Virtual Interface Architecture (VIA). Dependability analysis of such a complicated architecture requires an efficient technique for assessing the schemes. The proposed speculation-based method is adopted to simulate the architecture's behavior, including recovery in the presence of faults. The technique allows efficient simulation, obtaining reliability results without incurring much additional run-time overhead.
Experimental results include error coverage, error latency distribution, and effective round-trip time.