Can Tree Based Approaches Surpass Deep Learning in Anomaly Detection? A Benchmarking Study

Santonu Sarkar, Shanay Mehta, Nicole Fernandes, Jyotirmoy Sarkar, Snehanshu Saha
2024-02-26
Abstract: Detection of anomalous situations for complex mission-critical systems holds paramount importance when their service continuity needs to be ensured. A major challenge in detecting anomalies from the operational data arises due to the imbalanced class distribution problem since the anomalies are supposed to be rare events. This paper evaluates a diverse array of machine learning-based anomaly detection algorithms through a comprehensive benchmark study. The paper contributes significantly by conducting an unbiased comparison of various anomaly detection algorithms, spanning classical machine learning including various tree-based approaches to deep learning and outlier detection methods. The inclusion of 104 publicly available and a few proprietary industrial systems datasets enhances the diversity of the study, allowing for a more realistic evaluation of algorithm performance and emphasizing the importance of adaptability to real-world scenarios. The paper dispels the deep learning myth, demonstrating that though powerful, deep learning is not a universal solution in this case. We observed that recently proposed tree-based evolutionary algorithms outperform in many scenarios. We noticed that tree-based approaches catch a singleton anomaly in a dataset where deep learning methods fail. On the other hand, classical SVM performs the best on datasets with more than 10% anomalies, implying that such scenarios can be best modeled as a classification problem rather than anomaly detection. To our knowledge, such a study on a large number of state-of-the-art algorithms using diverse data sets, with the objective of guiding researchers and practitioners in making informed algorithmic choices, has not been attempted earlier.
Computer Science
What problem does this paper attempt to address?
The core problem this paper addresses is how to select the most effective algorithm for anomaly detection, especially across datasets whose anomaly distributions differ widely. Specifically, the paper focuses on the following aspects:

1. **Imbalanced Class Distribution Problem**: In practical applications, anomalies are usually rare events, which produces severe class imbalance in the dataset. This makes building machine learning models harder, especially classification models, because detecting the minority class (the anomalies) is what matters most.
2. **Comparison between Deep Learning and Traditional Methods**: Although deep learning has achieved remarkable success in many fields, whether it is always superior to classical machine learning methods (such as tree-based methods) in anomaly detection remains an open question. Through an extensive benchmark, the paper evaluates and compares different types of anomaly detection algorithms, including deep learning and traditional machine learning methods.
3. **Handling of High-Proportion Anomaly Data**: When the proportion of anomalies in the dataset is high, are traditional anomaly detection methods still applicable? The paper explores recasting such datasets as a classification problem and evaluates how different algorithms perform in this scenario.

### Main Contributions of the Paper

1. **Performance Evaluation**: The paper conducts an empirical study of a series of state-of-the-art (SOTA) machine learning algorithms, including Local Outlier Factor (LOF), Isolation Forest, One-Class SVM, Autoencoders, Deep Autoencoding Gaussian Mixture Model (DAGMM), Long Short-Term Memory network (LSTM), Quantized LSTM (q-LSTM), Deep Quantile Regression, Elliptic Envelope, Deviation Network (DevNet), Generative Adversarial Network (GAN), Graph Neural Network (GNN), and binary-tree-based algorithms (MGBTAI and d-BTAI). The performance of these algorithms is compared on multiple public and proprietary industrial system datasets.
2. **Generalization Ability of Tree-Based Unsupervised Methods**: The research finds that tree-based evolutionary algorithms adapt well both when anomalies are rare and when they are numerous. The authors improve the knee/elbow method for locating the inflection point in the anomaly score distribution, which is then used to separate anomalies from normal points.
3. **Breaking the Deep Learning Myth**: By evaluating a variety of recent deep learning algorithms, the research shows that although deep learning is powerful, its performance does not always exceed that of other mature methods. In particular, in some scenarios tree-based methods catch a singleton anomaly that deep learning methods fail to identify.
4. **Optimal Anomaly Detection Settings**: Experiments show that anomaly detection methods perform better when the proportion of anomalies is low (less than 10%). As the proportion of anomalies increases, precision, recall, and F1-score become more representative of a model's true anomaly detection ability.

### Summary

Through a large-scale benchmark, the paper provides a detailed performance comparison of different anomaly detection algorithms on a variety of datasets. The research shows that in some cases tree-based methods may be superior to deep learning, especially on datasets with a small number of anomalies or a high proportion of anomalies. The paper also emphasizes that datasets with a high proportion of anomalies (more than 10%) are better modeled as classification problems than as anomaly detection, which offers practical guidance for future research and practice.
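
To make the knee/elbow idea and the role of precision, recall, and F1-score under class imbalance more concrete, here is a minimal sketch in plain Python. It is not the authors' improved knee/elbow procedure, which this summary does not describe in detail. It uses a generic heuristic instead: sort the anomaly scores in descending order and pick the point that sags farthest below the straight line joining the first and last scores. The function names (`knee_index`, `detect`, `precision_recall_f1`, `accuracy`) and the toy scores are all illustrative assumptions.

```python
from typing import List, Tuple


def knee_index(ranked: List[float]) -> int:
    """Return the index where a descending score curve sags farthest
    below the chord joining its first and last points (generic heuristic,
    assumed here for illustration)."""
    n = len(ranked)
    if n < 3:
        return 0
    first, last = ranked[0], ranked[-1]
    best_idx, best_gap = 0, 0.0
    for i, score in enumerate(ranked):
        chord = first + (last - first) * i / (n - 1)
        gap = chord - score  # positive when the curve lies below the chord
        if gap > best_gap:
            best_idx, best_gap = i, gap
    return best_idx


def detect(scores: List[float]) -> List[bool]:
    """Flag every score strictly above the score found at the knee."""
    ranked = sorted(scores, reverse=True)
    threshold = ranked[knee_index(ranked)]
    return [s > threshold for s in scores]


def precision_recall_f1(y_true: List[bool], y_pred: List[bool]) -> Tuple[float, float, float]:
    """Precision, recall, and F1 for the anomaly (True) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1


def accuracy(y_true: List[bool], y_pred: List[bool]) -> float:
    """Fraction of points labeled correctly, anomalies and normals alike."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data (hypothetical): 19 normal points with low scores, 1 singleton anomaly.
scores = [0.10 + 0.01 * i for i in range(19)] + [5.0]
labels = [False] * 19 + [True]

predicted = detect(scores)
always_normal = [False] * len(scores)  # baseline that never raises an alarm

print("knee rule  :", accuracy(labels, predicted), precision_recall_f1(labels, predicted))
print("all-normal :", accuracy(labels, always_normal), precision_recall_f1(labels, always_normal))
```

On this toy data the knee rule flags only the single injected score. The all-normal baseline still reaches 95% accuracy while its recall and F1 are zero, which is why accuracy alone says little when anomalies are rare and why the minority-class metrics are the ones to watch.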