Fault Tolerant Placement of Stateful VNFs and Dynamic Fault Recovery in Cloud Networks

Guochang Yuan, Zichuan Xu, Binxu Yang, Weifa Liang, Wei Koong Chai, Daphne Tuncer, Alex Galis, George Pavlou, Guowei Wu
DOI: https://doi.org/10.1016/j.comnet.2019.106953
IF: 5.493
2020-01-01
Computer Networks
Abstract: Traditional network functions such as firewalls and Intrusion Detection Systems (IDS) are implemented in costly dedicated hardware, making networks expensive to manage and inflexible to change. Network function virtualization enables flexible and inexpensive operation of network functions by implementing virtual network functions (VNFs) as software in virtual machines (VMs) that run on commodity servers. However, VNFs are vulnerable to various faults, such as software and hardware failures. Without efficient and effective fault-tolerance mechanisms, the benefits of deploying VNFs in networks can be offset. In this paper, we investigate the problem of fault-tolerant VNF placement in cloud networks, proactively deploying VNFs in stand-by VM instances when necessary. The problem is challenging because VNFs are usually stateful: stand-by instances require continuous state updates from active instances during operation, and fault-tolerance methods must handle such state carefully. Specifically, the placement of active and stand-by VNF instances, the request routing paths to active instances, and the state transfer paths to stand-by instances need to be considered jointly. To tackle this challenge, we devise an efficient heuristic algorithm for fault-tolerant VNF placement. We also propose two bicriteria approximation algorithms with provable approximation ratios for the problem without compute or bandwidth constraints. We then consider the dynamic fault recovery problem, in which some placed active VNF instances may fail, and propose an approximation algorithm that dynamically switches traffic processing from faulty VNFs to stand-by instances. Simulations with realistic settings show that our algorithms can significantly improve the request admission rate compared to conventional approaches.
Finally, we evaluate the proposed algorithm for the dynamic fault recovery problem in a real test-bed consisting of both physical and virtual switches; the results demonstrate that our algorithms have the potential to be applied in real scenarios.