Towards Fault-tolerant HLA-based Distributed Simulations

Dan Chen, Stephen J. Turner, Wentong Cai
DOI: https://doi.org/10.1177/0037549708095518
2008-01-01
Abstract: Large-scale High Level Architecture (HLA)-based simulations are built to study complex problems, and they often involve a large number of federates and vast computing resources. Simulation federates running at different locations are subject to failure, and the failure of a single federate can crash the entire simulation execution. This risk increases with the scale of a distributed simulation. Hence, fault tolerance is required to support runtime robustness. This paper introduces a framework for robust HLA-based distributed simulations using a 'Decoupled Federate Architecture'. The framework provides a generic fault-tolerant model that handles failure through a dynamic substitution approach. A sender-based method is designed to ensure reliable delivery of in-transit messages, and it is coupled with a novel algorithm for effective fossil collection. The fault-tolerant model also avoids unnecessary repeated computation when handling failure. Because it uses a middleware approach, the framework supports reuse of legacy federate code, is platform-neutral, and is independent of federate modeling approaches. Experiments have been carried out to validate and benchmark the fault-tolerant federates using a supply-chain simulation as an example. The experimental results show that the framework provides correct failure recovery. They also indicate that the framework incurs only minimal overhead for fault tolerance and scales promisingly.