FRAVaR: A Fast Failure Recovery Framework for Inter-DC Network
Haoqiang Huang, Yuchao Zhang, Ran Wang, Qiao Xiang, Wendong Wang, Xirong Que, Ke Xu
DOI: https://doi.org/10.1109/wcnc55385.2023.10119088
2023-01-01
Abstract: With the development of 5G and IoT technologies in recent years, Inter-Data Center (Inter-DC) networks face explosive growth in geographically distributed user data, which must be replicated among DCs in real time. Transmission-based applications require availability beyond 99.99%. However, as Inter-DC networks scale up, link failures also become more frequent and seriously degrade data transmission efficiency, so fast link failure recovery is urgently needed. Many previous works pursue fast failure recovery, but most ignore two key points: 1) the cost of deploying recovery strategies, and 2) the side effect of re-transmission on network availability. These two factors make existing failure recovery processes too slow to be practical in real-time online industrial environments. To achieve realistic fast recovery from Inter-DC network failures, we propose FRAVaR, a failure recovery framework that achieves high network availability with very little deployment overhead. Specifically, FRAVaR reduces the deployment overhead through a novel incremental routing strategy that isolates link failures; in other words, it only needs to shuffle a tiny amount of traffic within a small failure isolation domain. On this basis, FRAVaR further adopts a risk assessment theory named Value-at-Risk (VaR) to control flow re-transmission. We implement a prototype of FRAVaR and conduct a series of experiments on 4 real Inter-DC network topologies (ATT North America, IBM, GlobalCenter, AGIS). Experimental results show that FRAVaR outperforms state-of-the-art solutions in recovery speed by 70.2%.
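The abstract names Value-at-Risk but does not say how FRAVaR applies it to flow re-transmission. As background only, the sketch below computes the generic textbook VaR of a discrete loss distribution: the smallest loss x such that the probability of losing at most x reaches a confidence level beta. The function name `value_at_risk`, the scenario probabilities, and the loss values are all hypothetical and chosen for illustration; they are not taken from the paper.

```python
from typing import List, Tuple


def value_at_risk(scenarios: List[Tuple[float, float]], beta: float) -> float:
    """Return the smallest loss x with P(loss <= x) >= beta.

    scenarios: (probability, loss) pairs; probabilities must sum to 1.
    beta: confidence level in (0, 1].
    This is the generic definition of VaR, not FRAVaR's formulation.
    """
    if not 0.0 < beta <= 1.0:
        raise ValueError("beta must be in (0, 1]")
    if not scenarios:
        raise ValueError("at least one scenario is required")
    total = sum(prob for prob, _ in scenarios)
    if abs(total - 1.0) > 1e-9:
        raise ValueError("scenario probabilities must sum to 1")

    ordered = sorted(scenarios, key=lambda s: s[1])  # ascending loss
    cumulative = 0.0
    for prob, loss in ordered:
        cumulative += prob
        # Small tolerance guards against floating-point rounding.
        if cumulative >= beta - 1e-12:
            return loss
    return ordered[-1][1]


# Hypothetical failure scenarios: (probability, traffic loss in Gbps).
example = [(0.90, 0.0), (0.07, 10.0), (0.03, 40.0)]
print(value_at_risk(example, 0.95))
print(value_at_risk(example, 0.99))
```

With these made-up numbers, a 95% confidence level tolerates the rare 40 Gbps scenario and reports the 10 Gbps loss, while a 99% level must account for the worst case. A higher beta therefore yields a more conservative loss bound.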