Abstract: As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches allow for accurate modeling of structural features (i.e., call paths) and latency features (i.e., call response time), which together determine whether a particular trace sample is anomalous. However, these methods operate in a point-wise manner, which incurs substantial detection overhead and makes them impractical given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes traces into distinct groups based on their shared sub-structure, such as the entire tree or a sub-tree. A group-wise Variational AutoEncoder (VAE) is then employed to obtain structural representations. Moreover, the innovative "predicting latency with structure" learning paradigm associates the grouped structure with the latency distribution within each group. With the group-wise design, representation caching and batched inference strategies can be implemented, which significantly reduce the detection burden on the system. Our comprehensive evaluation reveals that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC metrics and 2.31% to 40.92% improvement in best F-Score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay's microservices cluster, and our code is available at https://github.com/NetManAIOps/GTrace.git.
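
To make the group-wise idea concrete, the following is a minimal Python sketch of grouping traces by shared tree structure and caching one representation per group. It is not GTrace's implementation: `Span`, `structure_key`, `group_by_structure`, `CachedEncoder`, and `toy_encode` are hypothetical names introduced here, and `toy_encode` is a trivial stand-in for the group-wise VAE encoder described in the abstract.

```python
from collections import defaultdict
from dataclasses import dataclass, field
from typing import Callable, Dict, List, Tuple


@dataclass
class Span:
    """One invocation in a trace (hypothetical, simplified schema)."""
    operation: str
    latency_ms: float
    children: List["Span"] = field(default_factory=list)


def structure_key(span: Span) -> Tuple:
    """Canonical, hashable key for the call tree rooted at `span`.

    Children are sorted so sibling order does not change the key.
    Latency is excluded on purpose: only the call path defines the group.
    """
    return (span.operation,
            tuple(sorted(structure_key(c) for c in span.children)))


def group_by_structure(traces: List[Span]) -> Dict[Tuple, List[Span]]:
    """Bucket traces whose whole call tree is structurally identical."""
    groups = defaultdict(list)
    for root in traces:
        groups[structure_key(root)].append(root)
    return dict(groups)


class CachedEncoder:
    """Wraps an encoder so each distinct structure is encoded only once."""

    def __init__(self, encode: Callable[[Tuple], Tuple[int, int]]):
        self._encode = encode
        self._cache: Dict[Tuple, Tuple[int, int]] = {}
        self.misses = 0  # number of times the real encoder actually ran

    def __call__(self, key: Tuple) -> Tuple[int, int]:
        if key not in self._cache:
            self.misses += 1
            self._cache[key] = self._encode(key)
        return self._cache[key]


def toy_encode(key: Tuple) -> Tuple[int, int]:
    """Stand-in for a learned encoder: returns (node count, tree depth)."""
    _, children = key
    child_reprs = [toy_encode(c) for c in children]
    nodes = 1 + sum(n for n, _ in child_reprs)
    depth = 1 + max((d for _, d in child_reprs), default=0)
    return (nodes, depth)


# Three traces; the first two share a structure and differ only in latency.
traces = [
    Span("gateway", 12.0, [Span("auth", 3.0), Span("cart", 5.0)]),
    Span("gateway", 90.0, [Span("cart", 70.0), Span("auth", 4.0)]),
    Span("gateway", 8.0, [Span("auth", 2.0)]),
]

groups = group_by_structure(traces)
encoder = CachedEncoder(toy_encode)
representations = {key: encoder(key) for key in groups}
```

In this sketch the encoder runs once per group rather than once per trace, which is the intuition behind representation caching. Latencies within a group can then be scored together as one batch against whatever latency model is associated with that structure.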

Unsupervised Detection of Microservice Trace Anomalies Through Service-Level Deep Bayesian Networks

BertHTLG: Graph-Based Microservice Anomaly Detection Through Sentence-Bert Enhancement

DeepTraLog: Trace-Log Combined Microservice Anomaly Detection through Graph-based Deep Learning

Unsupervised Anomaly Detection on Microservice Traces through Graph VAE

Efficient and Robust Trace Anomaly Detection for Large-Scale Microservice Systems

From Point-wise to Group-wise: A Fast and Accurate Microservice Trace Anomaly Detection Approach

TraceGra: A Trace-Based Anomaly Detection for Microservice Using Graph Deep Learning

ServiceAnomaly: An anomaly detection approach in microservices using distributed traces and profiling metrics

Microservice Anomaly Detection Based on Tracing Data Using Semi-supervised Learning

Anomaly detection in microservice environments using distributed tracing data analysis and NLP

PUTraceAD: Trace Anomaly Detection with Partial Labels based on GNN and PU Learning

Deep Attentive Anomaly Detection for Microservice Systems with Multimodal Time-Series Data

TraceStream: Anomalous Service Localization Based on Trace Stream Clustering with Online Feedback

An Intelligent Anomaly Detection Scheme for Micro-Services Architectures with Temporal and Spatial Data Analysis

Few-Shot Cross-System Anomaly Trace Classification for Microservice-based systems

AutoMAP: Diagnose Your Microservice-based Web Applications Automatically

Graph neural network based robust anomaly detection at service level in SDN driven microservice system

Twin Graph-based Anomaly Detection via Attentive Multi-Modal Learning for Microservice System

Approach to Anomaly Detection in Microservice System with Multi-Source Data Streams

Locating Anomaly Clues for Atypical Anomalous Services: An Industrial Exploration

Practical Anomaly Detection over Multivariate Monitoring Metrics for Online Services