CACC: A Congestion-Aware Control Mechanism to Reduce INT Overhead and PFC Pause Delay
Xiwen Jie,Jiangping Han,Guanglei Chen,Hang Wang,Peilin Hong,Kaiping Xue
DOI: https://doi.org/10.1109/tnsm.2024.3449699
2024-01-01
IEEE Transactions on Network and Service Management
Abstract:Nowadays, Remote Direct Memory Access (RDMA) is gaining popularity in data centers for low CPU overhead, high throughput, and ultra-low latency. As one of the state-ofthe-art RDMA Congestion Control (CC) mechanisms, HPCC leverages the In-band Network Telemetry (INT) features to achieve accurate control and significantly shortens the Flow Completion Time (FCT) for short flows. However, there exists redundant INT information increasing the processing latency at switches and affecting flows throughput. Besides, its end-to-end feedback mechanism is not timely enough to help senders cope well with bursty traffic, and there still exists a high probability of triggering Priority-based Flow Control (PFC) pauses under large-scale incast. In this paper, we propose a Congestion-Aware (CA) control mechanism called CACC, which attempts to push CC to the theoretical low INT overhead and PFC pause delay. CACC introduces two CA algorithms to quantize switch buffer and egress port congestion, separately, along with a fine-grained window size adjustment algorithm at the sender. Specifically, the buffer CA algorithm perceives large-scale congestion that may trigger PFC pauses and provides early feedback, significantly reducing the PFC pause delay. The egress port CA algorithm perceives the link state and selectively inserts useful INT data, achieving lower queue sizes and reducing the average overhead per packet from 42 bytes to 2 bits. In our evaluation, compared with HPCC, PINT, and Bolt, CACC shortens the average and tail FCT by up to 27% and 60.1%, respectively.