DFAR: Dynamic-threshold Fault-tolerant Adaptive Routing for Fat Tree Networks

Binyan Lan, Fei Lei, Dezun Dong, Ke Wu, Xiaoyun Zhang
DOI: https://doi.org/10.1109/icpads60453.2023.00110
2023-01-01
Abstract: The routing algorithm is important to the design of high-performance interconnection networks. As high-performance computing (HPC) systems grow in size, the likelihood of network component failures grows with them. Fault tolerance therefore becomes a more critical consideration for routing algorithms, because failed network devices, mostly network links, break the regularity of the topology. However, existing routing algorithms focus on load balancing and ignore the case of a faulty network. Conversely, current fault-tolerant routing algorithms ensure correct functionality in the presence of failures without considering the more challenging problem of post-failure load balancing. In this paper, we co-design load balancing and fault tolerance for adaptive routing algorithms in fat tree networks and propose DFAR, Dynamic-threshold Fault-tolerant Adaptive Routing. DFAR prioritizes the D-Mod-K deterministic output ports by adding thresholds to the other available candidates. We adopt a simulated gradient descent algorithm to dynamically update the thresholds as the network state changes. State information beyond local queue occupancies is used in the threshold optimization. Experiments with synthetic load show that DFAR improves throughput by up to 25% for networks with considerable failures. For realistic MPI workloads, DFAR also improves performance by up to 22%.
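The threshold idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not give the exact comparison rule or the threshold update, so the sketch assumes a k-ary fat tree in which the D-Mod-K upward port at switch level `level` is `(dest // k**level) % k`, and it assumes a candidate's cost is its local queue occupancy plus a fixed per-port threshold. The names `dmodk_up_port`, `select_port`, `occupancy`, `threshold`, and `failed` are hypothetical, and the dynamic (simulated gradient descent) threshold update is deliberately left out.

```python
def dmodk_up_port(dest: int, level: int, k: int) -> int:
    """Deterministic D-Mod-K upward port (assumed form; level 0 = leaf switch)."""
    return (dest // k ** level) % k


def select_port(dest, level, k, occupancy, threshold, failed=frozenset()):
    """Pick an upward port, preferring the D-Mod-K port.

    occupancy[p]: local queue occupancy of port p (assumed metric)
    threshold[p]: penalty added to every non-deterministic candidate p
    failed:       set of ports whose links are down
    """
    preferred = dmodk_up_port(dest, level, k)
    candidates = [p for p in range(k) if p not in failed]
    if not candidates:
        raise RuntimeError("no healthy upward port available")

    def cost(p):
        # The deterministic port pays no penalty; every other port pays its threshold.
        return occupancy[p] + (0 if p == preferred else threshold[p])

    # On equal cost, the deterministic port wins; otherwise the lowest port index.
    return min(candidates, key=lambda p: (cost(p), p != preferred))
```

With a large threshold the deterministic port is kept even when it is moderately congested; with a small threshold traffic spills to other ports sooner; and when the deterministic link has failed, the choice falls to the cheapest remaining candidate. Tuning that trade-off per network state is what the dynamic thresholds in the abstract are for.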