Rewiring 2 Links is Enough: Accelerating Failure Recovery in Production Data Center Networks.

Guo Chen, Youjian Zhao, Dan Pei, Dan Li
DOI: https://doi.org/10.1109/icdcs.2015.64
2015-01-01
Abstract: Failures are not uncommon in production data center networks (DCNs) nowadays, and it takes a long time for the network to recover from a failure and find new forwarding paths, which significantly impacts real-time and interactive applications at the upper layer. The slow failure recovery has two primary causes. First, downward links in DCNs with a multi-rooted tree topology lack immediate backup paths. Second, distributed routing protocols in DCNs take time to converge after failures. In this paper, we present a fault-tolerant DCN solution, called F2Tree, that can significantly improve the failure recovery time in current DCNs through only a small amount of link rewiring and switch configuration changes. Because F2Tree does not change any existing software or hardware, it is readily deployable in production DCNs, which other existing proposals fail to achieve. Through testbed and emulation experiments, we show that F2Tree can reduce the failure recovery time by 78%. Our experimental results also show that, for partition-aggregate applications (popular in DCNs) under various failure conditions, F2Tree reduces the ratio of deadline-missing requests by more than 96% compared to current DCNs.