Flow Event Telemetry on Programmable Data Plane.
Yu Zhou,Chen Sun,Hongqiang Harry Liu,Rui Miao,Shi Bai,Bo Li,Zhilong Zheng,Lingjun Zhu,Zhen Shen,Yongqing Xi,Pengcheng Zhang,Dennis Cai,Ming Zhang,Mingwei Xu
DOI: https://doi.org/10.1145/3387514.3406214
2020-01-01
Abstract:Network performance anomalies (NPAs), e.g. long-tailed latency, bandwidth decline, etc., are increasingly crucial to cloud providers as applications are getting more sensitive to performance. The fundamental difficulty to quickly mitigate NPAs lies in the limitations of state-of-the-art network monitoring solutions - coarse-grained counters, active probing, or packet telemetry either cannot provide enough insights on flows or incur too much overhead. This paper presents NetSeer, a flow event telemetry (FET) monitor which aims to discover and record all performance-critical data plane events, e.g. packet drops, congestion, path change, and packet pause. NetSeer is efficiently realized on the programmable data plane. It has a high coverage on flow events including inter-switch packet drop/corruption which is critical but also challenging to retrieve the original flow information, with novel intra- and inter-switch event detection algorithms running on data plane; NetSeer also achieves high scalability and accuracy with innovative designs of event aggregation, information compression, and message batching that mainly run on data plane, using switch CPU as complement. NetSeer has been implemented on commodity programmable switches and NICs. With real case studies and extensive experiments, we show NetSeer can reduce NPA mitigation time by 61%-99% with only 0.01% overhead of monitoring traffic.