Abstract: As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches allow for accurate modeling of structural features (i.e., call paths) and latency features (i.e., call response time), which together can determine whether a particular trace sample is anomalous. However, these methods operate in a point-wise manner, which incurs substantial detection overhead and makes them impractical given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes traces into distinct groups based on their shared sub-structure, such as the entire tree or a sub-tree structure. A group-wise Variational AutoEncoder (VAE) is then employed to obtain structural representations. Moreover, the innovative "predicting latency with structure" learning paradigm facilitates the association between the grouped structure and the latency distribution within each group. With the group-wise design, representation caching and batched inference strategies can be implemented, which significantly reduce the detection burden on the system. Our comprehensive evaluation reveals that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC metrics and 2.31% to 40.92% improvement in best F-Score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay's microservices cluster, and our code is available at https://github.com/NetManAIOps/GTrace.git.
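
The grouping and caching ideas described in the abstract can be illustrated with a minimal sketch. This is not the GTrace implementation: the trace format (a nested tuple of operation name and children), the names `structure_key`, `group_by_structure`, and `RepresentationCache`, and the choice to ignore child order are all assumptions made for illustration, and the placeholder encoder merely stands in for the group-wise VAE.

```python
from collections import defaultdict
from typing import Callable, Dict, List, Tuple

# Hypothetical trace format: (operation_name, [child trees]).
Tree = Tuple[str, list]


def structure_key(node: Tree) -> str:
    """Build a canonical string for a call tree.

    Assumption: sibling order does not matter, so child keys are sorted.
    """
    name, children = node
    child_keys = sorted(structure_key(child) for child in children)
    return name + "(" + ",".join(child_keys) + ")"


def group_by_structure(traces: Dict[str, Tree]) -> Dict[str, List[str]]:
    """Map each distinct tree structure to the IDs of traces sharing it."""
    groups: Dict[str, List[str]] = defaultdict(list)
    for trace_id, root in traces.items():
        groups[structure_key(root)].append(trace_id)
    return dict(groups)


class RepresentationCache:
    """Encode each structure once and reuse the result for the whole group."""

    def __init__(self, encoder: Callable[[str], object]):
        self._encoder = encoder
        self._cache: Dict[str, object] = {}
        self.encoder_calls = 0  # counts how often the encoder actually runs

    def get(self, key: str) -> object:
        if key not in self._cache:
            self.encoder_calls += 1
            self._cache[key] = self._encoder(key)
        return self._cache[key]


# Three toy traces; t1 and t2 share a structure up to child order.
traces = {
    "t1": ("gateway", [("cart", []), ("auth", [])]),
    "t2": ("gateway", [("auth", []), ("cart", [])]),
    "t3": ("gateway", [("auth", [])]),
}

groups = group_by_structure(traces)
cache = RepresentationCache(encoder=len)  # placeholder for a learned encoder
reps = {tid: cache.get(key) for key, ids in groups.items() for tid in ids}
```

In this toy setup the three traces fall into two groups, so the encoder runs twice rather than three times; the saving grows with the number of traces that share a structure.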

From Point-wise to Group-wise: A Fast and Accurate Microservice Trace Anomaly Detection Approach
