Toward Full-Coverage and Low-Overhead Profiling of Network-Stack Latency

Xiang Chen,Hongyan Liu,Wenbin Zhang,Qun Huang,Dong Zhang,Haifeng Zhou,Xuan Liu,Chunming Wu
DOI: https://doi.org/10.1109/tnet.2024.3421327
2024-01-01
Abstract:In modern data center networks (DCNs), network-stack processing denotes a large portion of the end-to-end latency of TCP flows. So profiling network-stack latency anomalies has been considered as a crucial part in DCN performance diagnosis and troubleshooting. In particular, such profiling requires full coverage (i.e., profiling every TCP packet) and low overhead (i.e., profiling should avoid high CPU consumption in end-hosts). However, existing solutions rely on system calls or tracepoints in end-hosts to implement network-stack latency profiling, leading to either low coverage or high overhead. We propose Torp, a framework that offers full-coverage and low-overhead profiling of network-stack latency. Our key idea is to offload as much of the profiling from costly system calls or tracepoints to the Torp agent built on eBPF modules, and further to include a Torp handler on the ToR switch to accelerate the remaining profiling operations. Torp efficiently coordinates the ToR switch and the Torp agent on end-hosts to jointly execute the entire latency profiling task. We have implemented Torp on $32\times 100$ Gbps Tofino switches. Testbed experiments indicate that Torp achieves full coverage and orders of magnitude lower host-side overhead compared to other solutions.
What problem does this paper attempt to address?