Zero+: Monitoring Large-Scale Cloud-Native Infrastructure Using One-Sided RDMA

Zhuo Song, Jiejian Wu, Teng Ma, Zhe Wang, Linghe Kong, Zhenzao Wen, Jingxuan Li, Yang Lu, Yong Yang, Tao Ma, Zheng Liu, Guihai Chen
DOI: https://doi.org/10.1109/tnet.2024.3394514
2024-08-25
IEEE/ACM Transactions on Networking
Abstract: Cloud services have shifted from monolithic designs to microservices running on cloud-native infrastructure with monitoring systems to ensure service level agreements (SLAs). However, traditional monitoring systems no longer meet the demands of cloud-native monitoring. In Alibaba's "double eleven" shopping festival, it is observed that the monitor occupies resources of the monitored infrastructure and even disrupts services. In this paper, we propose a novel monitoring system named Zero+ for cloud-native monitoring. Zero+ achieves zero overhead in collecting raw metrics using one-sided remote direct memory access (RDMA) and remedies network congestion by adopting a receiver-driven flow control scheme. Zero+ also features a priority queue mechanism to meet different quality of service requirements and an efficient batch processing design to relieve CPU occupation. Zero+ has been deployed and evaluated in four different clusters with heterogeneous RDMA NIC devices and architectures in Alibaba Cloud. Results show that Zero+ achieves no CPU occupation at the monitored host and supports hosts with sampling interval using a single thread for network I/O. Zero+ significantly relieves the incast issue and maintains of bandwidth utilization in several clusters when monitoring hosts. Zero+ also ensures services with high priority accomplish collecting metrics earlier than low priority ones by at least when monitoring hosts.
telecommunications; computer science, theory & methods; engineering, electrical & electronic; hardware & architecture