Enabling Load Balancing for Lossless Datacenters
Jinbin Hu, Chaoliang Zeng, Zilong Wang, Junxue Zhang, Kun Guo, Hong Xu, Jiawei Huang, Kai Chen
DOI: https://doi.org/10.1109/icnp59255.2023.10355615
2023-01-01
Abstract: Various datacenter network (DCN) load balancing schemes have been proposed in the past decade. Unfortunately, most of these solutions were designed for lossy DCNs and do not work well in Priority Flow Control (PFC) enabled lossless DCNs, primarily because the individual congestion signals they rely on, e.g., link load, queue length, Round Trip Time (RTT) and Explicit Congestion Notification (ECN), may not reflect hop-by-hop PFC pausing correctly or in a timely manner. This paper first reveals these problems through extensive experiments and then, based on the insights learned, presents Proteus, a PFC-aware load balancing scheme that is resilient to PFC pausing by exploring a combination of multi-level congestion signals. At its heart, Proteus leverages RTT-level signals (i.e., RTT and link utilization) to detect path status for the initial routing decision, and exploits a sub-RTT-level signal (i.e., cumulative sojourn time) to reflect instantaneous PFC pausing and make timely rerouting choices based on the idea of better-late-than-never. We have implemented Proteus on a hardware programmable switch. Our testbed experiments and large-scale simulations show that Proteus effectively handles PFC pausing under realistic workloads, improving average flow completion time (FCT) by up to 35%, 31%, 28% and 22%, and 99th percentile FCT by up to 46%, 42%, 34% and 29%, over CONGA, DRILL, Hermes and MP-RDMA, respectively.
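The two-level decision described in the abstract can be illustrated with a minimal sketch. This is not the authors' switch implementation: the `PathState` type, the scoring weights `w_rtt` and `w_util`, and the `pause_threshold_us` value are all hypothetical and chosen only to show how an RTT-level choice and a sub-RTT-level rerouting check might fit together.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass
class PathState:
    """Per-path congestion signals (field names are illustrative assumptions)."""
    rtt_us: float            # RTT-level signal: measured round-trip time
    link_util: float         # RTT-level signal: link utilization in [0, 1]
    sojourn_us: float = 0.0  # sub-RTT-level signal: cumulative sojourn time


def initial_path(paths: Dict[str, PathState],
                 w_rtt: float = 1.0, w_util: float = 100.0) -> str:
    """Choose the path with the lowest combined RTT-level score.

    The linear weighting is an assumption made for this sketch.
    """
    return min(paths,
               key=lambda p: w_rtt * paths[p].rtt_us + w_util * paths[p].link_util)


def maybe_reroute(current: str, paths: Dict[str, PathState],
                  pause_threshold_us: float = 50.0) -> str:
    """Reroute when the current path's sojourn time suggests PFC pausing.

    Late rerouting is still preferred over staying on a paused path
    ("better late than never"); the threshold value is hypothetical.
    """
    if paths[current].sojourn_us <= pause_threshold_us:
        return current  # no sign of pausing, keep the current path
    # Consider only alternative paths that do not look paused themselves.
    candidates = {p: s for p, s in paths.items()
                  if p != current and s.sojourn_us <= pause_threshold_us}
    if not candidates:
        return current  # every path looks paused, so moving would not help
    return initial_path(candidates)


paths = {
    "A": PathState(rtt_us=10.0, link_util=0.2),
    "B": PathState(rtt_us=25.0, link_util=0.1),
    "C": PathState(rtt_us=40.0, link_util=0.5),
}
chosen = initial_path(paths)             # RTT-level decision
paths["A"].sojourn_us = 120.0            # simulate PFC pausing on path A
rerouted = maybe_reroute(chosen, paths)  # sub-RTT-level decision
```

In this sketch the slow, RTT-level signals only rank paths, while the fast sojourn-time signal acts purely as a trigger to leave a paused path, which mirrors the division of roles the abstract describes.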