MUSE: A Runtime Incrementally Reconfigurable Network Adapting to HPC Real-Time Traffic

Zijian Li, Zixuan Chen, Yiying Tang, Xin Ai, Yuanyi Zhu, Zhigao Zhao, Jiang Shao, Guowei Liu, Sen Liu, Bin Liu, Yang Xu
DOI: https://doi.org/10.1109/ipdps57955.2024.00073
2024-01-01
Abstract: The interconnection network in HPC is becoming a bottleneck due to increasing traffic load. We model adaptive routing mechanisms and prove that even with advanced adaptive routing, static networks like Dragonfly cannot handle non-uniform traffic efficiently, let alone frequently changing non-uniform traffic. Network-wide improvements therefore require architectural changes, e.g., reconfigurable networks. Existing reconfigurable networks can hardly react to traffic changes in an agile manner while keeping the impact on the network small. We therefore propose MUSE, a Dragonfly-based runtime incrementally reconfigurable network that, with an optical circuit switch (OCS), adjusts only a small number of links in every reconfiguration, providing agility with little impact on in-flight flows. Simulations with both synthetic traffic and real-world workloads show that MUSE can prevent saturation under typical traffic patterns that cause congestion in static Dragonfly. MUSE outperforms static Dragonfly and Flexfly by 30-55% on commonly used performance metrics such as flow completion time (FCT). We also build a MUSE prototype and demonstrate that MUSE reduces application finish time (AFT) by 20-30%.