Multivariate Log-based Anomaly Detection for Distributed Database

Lingzhe Zhang, Tong Jia, Mengxi Jia, Ying Li, Yong Yang, Zhonghai Wu
DOI: https://doi.org/10.1145/3637528.3671725
2024-06-12
Abstract: Distributed databases are fundamental infrastructures of today's large-scale software systems such as cloud systems. Detecting anomalies in distributed databases is essential for maintaining software availability. Existing approaches, predominantly developed using Loghub (a comprehensive collection of log datasets from various systems), lack datasets specifically tailored to distributed databases, which exhibit unique anomalies. Additionally, there is a notable absence of datasets encompassing multi-anomaly, multi-node logs. Consequently, models built upon these datasets, primarily designed for standalone systems, are inadequate for distributed databases, and the prevalent method of deeming an entire cluster anomalous based on irregularities in a single node leads to a high false-positive rate. This paper addresses the unique anomalies and multivariate nature of logs in distributed databases. We expose the first open-sourced, comprehensive dataset with multivariate logs from distributed databases. Utilizing this dataset, we conduct an extensive study to identify multiple database anomalies and to assess the effectiveness of state-of-the-art anomaly detection using multivariate log data. Our findings reveal that relying solely on logs from a single node is insufficient for accurate anomaly detection in distributed databases. Leveraging these insights, we propose MultiLog, an innovative multivariate log-based anomaly detection approach tailored for distributed databases. Our experiments, based on this novel dataset, demonstrate MultiLog's superiority, outperforming existing state-of-the-art methods by approximately 12%.
Software Engineering
What problem does this paper attempt to address?
The paper addresses anomaly detection in distributed databases, focusing on the shortcomings of existing methods when they are applied to distributed database logs. Specifically:

1. **Lack of log anomaly datasets specifically for distributed databases**: Existing log anomaly detection datasets (such as Loghub) mainly come from standalone systems or distributed file systems and are not specifically designed for distributed databases. Therefore, they cannot fully reflect the unique anomalies of distributed databases.
2. **Lack of datasets containing multi-type, multi-node logs**: Most existing datasets fail to cover multiple types of anomaly injections and usually only contain logs from a single source, which cannot reflect the interconnected nature of multi-node distributed databases.
3. **Limitations of existing models in applying to distributed databases**: Current models are mainly designed for standalone systems. When applied to distributed databases, they typically determine whether the entire cluster is abnormal through single-point classification, which can lead to a high false positive rate.

To address these issues, the authors constructed a new large-scale dataset and proposed a multivariate log anomaly detection method called MultiLog. MultiLog collects sequential information, quantitative information, and semantic information from each node, encodes them using an LSTM enhanced with a self-attention mechanism, and finally determines the state of the entire cluster through a cluster classifier that combines an AutoEncoder and a meta-classifier.

Experimental results show that MultiLog improves performance by approximately 12% in multi-node classification tasks and by over 16% in single-node anomaly detection compared to existing methods.
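As a rough illustration of the per-node feature collection described above, the sketch below builds two of the three views (sequential and quantitative) for each node in one time window. It is not the authors' implementation: the event vocabulary `EVENT_VOCAB`, the node names, and the helper functions are hypothetical, and the semantic view, the LSTM encoder, and the cluster classifier are left out because the summary does not specify them.

```python
from collections import Counter
from typing import Dict, List

# Hypothetical event-template vocabulary; real IDs would come from a log parser.
EVENT_VOCAB = ["E1", "E2", "E3", "E4"]


def sequential_features(events: List[str]) -> List[int]:
    """Order-preserving view: each event ID mapped to its vocabulary index."""
    index = {event: i for i, event in enumerate(EVENT_VOCAB)}
    return [index[event] for event in events]


def quantitative_features(events: List[str]) -> List[int]:
    """Order-free view: how often each event template occurs in the window."""
    counts = Counter(events)
    return [counts.get(event, 0) for event in EVENT_VOCAB]


def collect_cluster_features(
    window: Dict[str, List[str]]
) -> Dict[str, Dict[str, List[int]]]:
    """Build both views for every node in one time window."""
    return {
        node: {
            "sequential": sequential_features(events),
            "quantitative": quantitative_features(events),
        }
        for node, events in window.items()
    }


# One time window of parsed logs from a hypothetical three-node cluster.
window = {
    "node-1": ["E1", "E2", "E1", "E3"],
    "node-2": ["E1", "E2", "E2", "E2"],
    "node-3": ["E4", "E4", "E1"],
}
features = collect_cluster_features(window)
print(features["node-1"]["quantitative"])
```

Keeping the features keyed by node, rather than merging all logs into one stream, is what would let a downstream cluster-level classifier weigh evidence from every node instead of flagging the whole cluster on a single node's irregularity.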