TraceMesh: Scalable and Streaming Sampling for Distributed Traces

Zhuangbin Chen, Zhihan Jiang, Yuxin Su, Michael R. Lyu, Zibin Zheng
2024-06-11
Abstract: Distributed tracing serves as a fundamental element in the monitoring of cloud-based and datacenter systems. It provides visibility into the full lifecycle of a request or operation across multiple services, which is essential for understanding system dependencies and performance bottlenecks. To mitigate computational and storage overheads, most tracing frameworks adopt a uniform sampling strategy, which inevitably captures overlapping and redundant information. More advanced methods employ learning-based approaches to bias the sampling toward more informative traces. However, existing methods fall short of considering the high-dimensional and dynamic nature of trace data, which is essential for the production deployment of trace sampling. To address these practical challenges, in this paper we present TraceMesh, a scalable and streaming sampler for distributed traces. TraceMesh employs Locality-Sensitive Hashing (LSH) to improve sampling efficiency by projecting traces into a low-dimensional space while preserving their similarity. In this process, TraceMesh accommodates previously unseen trace features in a unified and streamlined way. Subsequently, TraceMesh samples traces through evolving clustering, which dynamically adjusts the sampling decision to avoid over-sampling of recurring traces. The proposed method is evaluated with trace data collected from both open-source microservice benchmarks and production service systems. Experimental results demonstrate that TraceMesh outperforms state-of-the-art methods by a significant margin in both sampling accuracy and efficiency.
Distributed, Parallel, and Cluster Computing; Software Engineering
What problem does this paper attempt to address?
The paper addresses the sampling problem in distributed tracing systems: how to capture abnormal or rare traces in a scalable, streaming manner while staying within a low storage budget. It proposes a new method, TraceMesh, which uses Locality-Sensitive Hashing (LSH) to project high-dimensional trace data into a low-dimensional space, improving sampling efficiency while preserving the similarity between traces. TraceMesh identifies and samples abnormal traces through dynamic (evolving) clustering, and it adapts to newly emerging trace features without changing the input dimensions or the model structure. Experimental results show that TraceMesh outperforms existing methods in both sampling accuracy and efficiency.
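
To make the two ideas above concrete, here is a minimal Python sketch. It is not the paper's algorithm; it is an illustration under stated assumptions. It assumes a trace is represented as a dictionary mapping feature strings (for example, hypothetical call edges such as "frontend->cart") to counts, it uses a SimHash-style signature as one possible LSH scheme, and it replaces the paper's evolving clustering with a simple nearest-signature rule. The names `simhash`, `EvolvingSampler`, `NUM_BITS`, and `radius` are all hypothetical.

```python
import hashlib

NUM_BITS = 64  # signature length; an assumed value, not taken from the paper


def feature_signs(feature):
    """Map a feature string to NUM_BITS pseudo-random bits, stable across runs."""
    digest = hashlib.blake2b(feature.encode("utf-8"),
                             digest_size=NUM_BITS // 8).digest()
    return int.from_bytes(digest, "big")


def simhash(trace):
    """SimHash-style LSH signature of a trace given as {feature: count}.

    Every feature is hashed on the fly, so a feature never seen before
    needs no vocabulary update and the signature length stays fixed.
    """
    totals = [0] * NUM_BITS
    for feature, count in trace.items():
        bits = feature_signs(feature)
        for i in range(NUM_BITS):
            # Each feature votes +count or -count on every signature bit.
            totals[i] += count if (bits >> i) & 1 else -count
    signature = 0
    for i, total in enumerate(totals):
        if total >= 0:
            signature |= 1 << i
    return signature


def hamming(a, b):
    """Number of differing bits between two signatures."""
    return bin(a ^ b).count("1")


class EvolvingSampler:
    """Toy stand-in for evolving clustering (an assumption, not the paper's method).

    A trace whose signature is far from every known cluster opens a new
    cluster and gets probability 1.0; a recurring trace gets 1 / hits.
    """

    def __init__(self, radius=8):
        self.radius = radius  # maximum Hamming distance to join a cluster
        self.clusters = []    # each entry is [signature, hit_count]

    def sampling_probability(self, trace):
        sig = simhash(trace)
        best = min(self.clusters, key=lambda c: hamming(c[0], sig), default=None)
        if best is None or hamming(best[0], sig) > self.radius:
            self.clusters.append([sig, 1])
            return 1.0
        best[1] += 1
        return 1.0 / best[1]


sampler = EvolvingSampler()
checkout = {"frontend->cart": 3, "cart->db": 1}
for _ in range(3):
    # The probability shrinks as the same trace keeps recurring.
    print(sampler.sampling_probability(checkout))
```

A caller would compare the returned probability against a random draw to decide whether to keep the trace. The 1 / hits decay is only one simple way to avoid over-sampling recurring traces; a production sampler would also need to age out stale clusters and respect an overall storage budget.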