Enabling ECN for Datacenter Networks with RTT Variations

Junxue Zhang, Wei Bai, Kai Chen
DOI: https://doi.org/10.1109/tcc.2022.3204988
IF: 5.697
2022-01-01
IEEE Transactions on Cloud Computing
Abstract: ECN has been widely employed in production datacenters to deliver high-throughput, low-latency communication. Despite their success, prior ECN-based transports have an important drawback: they adopt a fixed RTT value when calculating the instantaneous ECN marking threshold, overlooking the RTT variations that occur in practice. In this article, we reveal that the current practice of using a fixed high-percentile RTT for ECN threshold calculation can lead to persistent queue buildups, significantly increasing packet latency. On the other hand, directly adopting lower-percentile RTTs results in throughput degradation. To handle the problem, we introduce ECN♯, a simple yet effective solution to enable ECN for RTT variations. At its heart, ECN♯ inherits the current instantaneous ECN marking (based on a high-percentile RTT) to achieve high throughput and burst tolerance, while further marking packets (conservatively) upon detecting long-term queue buildups to eliminate unnecessary queueing delay without degrading throughput. We implement ECN♯ on a Barefoot Tofino switch and evaluate it through extensive testbed experiments and large-scale simulations. Our evaluation confirms that ECN♯ can effectively reduce latency without hurting throughput. For example, compared to the current practice, ECN♯ achieves up to 23.4% (31.2%) lower average (99th percentile) flow completion time (FCT) for short flows while delivering similar FCT for large flows under production workloads.
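The two marking rules described in the abstract can be illustrated with a rough sketch. This is not the authors' implementation: the class name `TwoLevelEcnMarker`, the bandwidth-delay-product form of the instantaneous threshold, and the `persistent_target_bytes` / `persistent_interval_s` parameters are all assumptions made for illustration, since the abstract gives neither the exact threshold formula nor the way long-term queue buildups are detected.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class TwoLevelEcnMarker:
    """Illustrative only: instantaneous marking plus a persistent-queue check."""

    link_rate_bps: float          # egress link capacity in bits per second
    high_rtt_s: float             # fixed high-percentile RTT in seconds
    persistent_target_bytes: int  # hypothetical: queue level treated as a standing queue
    persistent_interval_s: float  # hypothetical: time the queue must stay above that level
    above_since: Optional[float] = None  # when the queue first exceeded the target

    @property
    def instant_threshold_bytes(self) -> float:
        # Assumed bandwidth-delay-product rule: capacity x RTT, converted to bytes.
        return self.link_rate_bps * self.high_rtt_s / 8

    def should_mark(self, queue_bytes: int, now_s: float) -> bool:
        # Track how long the queue has stayed above the persistent target.
        if queue_bytes > self.persistent_target_bytes:
            if self.above_since is None:
                self.above_since = now_s
        else:
            self.above_since = None

        # Rule 1: instantaneous marking against the high-percentile-RTT threshold.
        instantaneous = queue_bytes > self.instant_threshold_bytes
        # Rule 2: mark once the queue has stayed above the target for a full interval.
        persistent = (
            self.above_since is not None
            and now_s - self.above_since >= self.persistent_interval_s
        )
        return instantaneous or persistent


# Hypothetical numbers: 10 Gbps link, 100 us high-percentile RTT.
marker = TwoLevelEcnMarker(10e9, 100e-6, 30_000, 200e-6)
first = marker.should_mark(50_000, 0.0)      # below the instantaneous threshold, buildup just started
later = marker.should_mark(50_000, 300e-6)   # same queue level, sustained past the interval
```

In this sketch a short burst that stays under the instantaneous threshold is not marked, while the same queue level sustained for longer than the interval is, which mirrors the abstract's goal of keeping burst tolerance while draining standing queues.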
computer science, information systems, theory & methods