Abstract: As Internet applications continue to scale up, microservice architecture has become increasingly popular due to its flexibility and logical structure. Anomaly detection in traces that record inter-microservice invocations is essential for diagnosing system failures. Deep learning-based approaches can accurately model structural features (i.e., call paths) and latency features (i.e., call response times), which together determine whether a particular trace sample is anomalous. However, these methods operate point-wise, processing each trace individually, which incurs substantial detection overhead and makes them impractical given the massive volume of traces (billion-level). Furthermore, the point-wise approach lacks high-level information, as identical sub-structures across multiple traces may be encoded differently. In this paper, we introduce the first Group-wise Trace anomaly detection algorithm, named GTrace. This method categorizes traces into distinct groups based on their shared sub-structure, such as the entire tree or a sub-tree. A group-wise Variational AutoEncoder (VAE) is then employed to obtain structural representations. Moreover, the novel "predicting latency with structure" learning paradigm associates the grouped structure with the latency distribution within each group. The group-wise design also enables representation caching and batched inference strategies, which significantly reduce the detection burden on the system. Our comprehensive evaluation shows that GTrace outperforms state-of-the-art methods in both performance (2.64% to 195.45% improvement in AUC metrics and 2.31% to 40.92% improvement in best F-Score) and efficiency (21.9x to 28.2x speedup). We have deployed and assessed the proposed algorithm on eBay's microservices cluster, and our code is available at https://github.com/NetManAIOps/GTrace.git.
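To make the group-wise idea concrete, the following is a minimal sketch of grouping traces by shared tree structure and caching one representation per group. It is not the GTrace implementation: the trace layout `(name, latency, children)`, the helpers `structure_key` and `group_by_structure`, the `CachedEncoder` class, and the `toy_encode` placeholder (standing in for the group-wise VAE) are all hypothetical names chosen for illustration. Treating child order as irrelevant is likewise an assumption.

```python
from collections import defaultdict


def structure_key(node):
    """Canonical string for a call tree, ignoring latency and child order.

    Each node is assumed to be a tuple: (operation name, latency, children).
    """
    name, _latency, children = node
    child_keys = sorted(structure_key(child) for child in children)
    return name + "(" + ",".join(child_keys) + ")"


def group_by_structure(traces):
    """Map each distinct structure key to the traces that share it."""
    groups = defaultdict(list)
    for trace in traces:
        groups[structure_key(trace)].append(trace)
    return dict(groups)


class CachedEncoder:
    """Computes a structural representation once per group and reuses it."""

    def __init__(self, encode_fn):
        self.encode_fn = encode_fn
        self.cache = {}
        self.calls = 0  # number of times the underlying encoder actually ran

    def encode(self, key):
        if key not in self.cache:
            self.cache[key] = self.encode_fn(key)
            self.calls += 1
        return self.cache[key]


def toy_encode(key):
    # Placeholder for a learned encoder (assumption, not the paper's VAE).
    return (len(key), key.count("("))


traces = [
    ("gateway", 12.0, [("auth", 3.0, []), ("cart", 7.5, [("db", 4.0, [])])]),
    ("gateway", 15.0, [("cart", 9.0, [("db", 6.0, [])]), ("auth", 2.5, [])]),
    ("gateway", 5.0, [("auth", 4.0, [])]),
]

groups = group_by_structure(traces)
encoder = CachedEncoder(toy_encode)
reps = [encoder.encode(structure_key(trace)) for trace in traces]
```

In this sketch the first two traces fall into the same group although their latencies and child order differ, so the encoder runs once for both. The latencies collected within each group could then be compared against a per-group latency distribution, in the spirit of the "predicting latency with structure" paradigm described above.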
