Distributed Network Telemetry with Resource Efficiency and Full Accuracy

Haifeng Sun, Qun Huang, Patrick P. C. Lee, Wei Bai, Feng Zhu, Yungang Bao
DOI: https://doi.org/10.1109/tnet.2023.3327345
2024-01-01
Abstract: Network telemetry is essential for administrators to monitor massive data traffic in a network-wide manner. Existing telemetry solutions often face a dilemma between resource efficiency (i.e., low CPU, memory, and bandwidth overhead) and full accuracy (i.e., error-free and holistic measurement). We break this dilemma via a network-wide architectural design that simultaneously achieves resource efficiency and full accuracy in flow-level telemetry for large-scale data centers. Our design carefully coordinates the collaboration among different types of entities in the whole network to execute telemetry operations, such that the resource constraints of each entity are satisfied without compromising full accuracy. It further addresses consistency in network-wide epoch synchronization and accountability in error-free packet loss inference. We prototype our design in DPDK and P4. Testbed experiments on commodity servers and Tofino switches demonstrate its effectiveness over state-of-the-art solutions.