RD-Probe: Scalable Monitoring with Sufficient Coverage in Complex Datacenter Networks

Rui Ding, Xunpeng Liu, Shibo Yang, Qun Huang, Baoshu Xie, Ronghua Sun, Zhi Zhang, Bolong Cui
DOI: https://doi.org/10.1145/3651890.3672256
2024-01-01
Abstract: Ensuring service availability in large-scale datacenters hinges on network monitoring. For monitoring quality, it is essential to attain sufficient coverage of all physical components. However, given the ever-evolving complexity of industrial environments, even measuring coverage metrics becomes challenging, let alone attaining sufficient coverage. In fact, insufficient coverage widely existed in our production datacenters and caused many missed failures. To address this, we design RD-Probe, an industrial monitoring system with coverage and scalability guarantees. Specifically, it first constructs a network topology encoding the industrial complexity. Then, it combines Randomized and Deterministic methods to efficiently generate probe tasks that meet the coverage requirement. We have deployed RD-Probe in three large production regions in Huawei Cloud. Within the first month, RD-Probe improved coverage from 80.9% to 99.5% and unearthed several previously unnoticed issues while tolerating numerous faults. Large-scale simulation of four industry solutions shows that RD-Probe is the only one achieving both sufficient coverage and scalability in complex datacenter networks. We plan to expand RD-Probe to other regions soon.