Unsupervised Incremental Learning with Dual Concept Drift Detection for Identifying Anomalous Sequences

Jin Li, Kleanthis Malialis, Christos G. Panayiotou, Marios M. Polycarpou
2024-03-12
Abstract: In the contemporary digital landscape, the continuous generation of extensive streaming data across diverse domains has become pervasive. Yet, a significant portion of this data remains unlabeled, posing a challenge in identifying infrequent events such as anomalies. This challenge is further amplified in non-stationary environments, where the performance of models can degrade over time due to concept drift. To address these challenges, this paper introduces a new method referred to as VAE4AS (Variational Autoencoder for Anomalous Sequences). VAE4AS integrates incremental learning with dual drift detection mechanisms, employing both a statistical test and a distance-based test. The anomaly detection is facilitated by a Variational Autoencoder. To gauge the effectiveness of VAE4AS, a comprehensive experimental study is conducted using real-world and synthetic datasets characterized by anomaly rates below 10% and recurrent drift. The results show that the proposed method surpasses both robust baselines and state-of-the-art techniques, providing compelling evidence of its efficacy in addressing some of the challenges associated with anomalous sequence detection in non-stationary streaming data.
Computational Engineering, Finance, and Science
What problem does this paper attempt to address?
The paper addresses the detection of anomalous sequences in non-stationary environments. The researchers face three challenges:

1. **A large amount of unlabeled streaming data**: Streaming data is generated continuously across many domains, but much of it is unlabeled, which makes rare events such as anomalies hard to identify.
2. **Concept drift**: In non-stationary environments, a model's performance may decline over time because the data distribution changes. This phenomenon, called concept drift, gradually makes the model ineffective.
3. **Distinguishing between anomalous sequences and concept drift**: Even in a supervised setting, correctly telling anomalous sequences apart from concept drift remains a key research challenge. The authors focus on this problem in an unsupervised setting.

To address these challenges, the paper proposes VAE4AS (Variational Autoencoder for Anomalous Sequences), which combines incremental learning with a dual concept drift detection mechanism. VAE4AS has the following components:

- **Variational Autoencoder (VAE)**: Used for anomaly detection.
- **Incremental learning**: Adapts to changes in the data over time without retraining the entire model.
- **Dual concept drift detection mechanism**:
  - **Statistical test**: Based on the Kolmogorov-Smirnov (KS) test, used to detect changes in the distribution of the latent layer.
  - **Distance test**: Based on the Euclidean distance, used to compare reference anomalous instances with instances classified as anomalous.

With these techniques, VAE4AS can detect anomalous sequences in non-stationary environments without relying on labeled data.
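
The two drift checks and the loss-based anomaly threshold can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the authors' implementation. The function names, `alpha`, and `dist_limit` are hypothetical. The latent window is treated as a 1-D sample. The distance check compares the centroids of the two anomaly sets. The standard deviation is the population one. The p-value uses the widely used asymptotic series for the two-sample KS test. Because the summary does not say how the two checks are combined, the sketch reports each result separately.

```python
import bisect
import math
import statistics


def anomaly_threshold(losses):
    """Threshold = mean + 2 * std over a window of reconstruction losses.

    Assumption: population standard deviation (statistics.pstdev).
    """
    return statistics.mean(losses) + 2 * statistics.pstdev(losses)


def ks_statistic(ref, mov):
    """Largest absolute gap between the empirical CDFs of two 1-D samples."""
    ref_sorted, mov_sorted = sorted(ref), sorted(mov)

    def ecdf(sample, x):
        # Fraction of sample values <= x.
        return bisect.bisect_right(sample, x) / len(sample)

    points = ref_sorted + mov_sorted
    return max(abs(ecdf(ref_sorted, x) - ecdf(mov_sorted, x)) for x in points)


def ks_p_value(d, n_ref, n_mov, max_terms=100):
    """Asymptotic two-sample KS p-value (alternating exponential series)."""
    n_eff = n_ref * n_mov / (n_ref + n_mov)
    gamma = (math.sqrt(n_eff) + 0.12 + 0.11 / math.sqrt(n_eff)) * d
    total, sign, prev = 0.0, 1.0, 0.0
    for i in range(1, max_terms + 1):
        term = 2.0 * math.exp(-2.0 * i * i * gamma * gamma)
        total += sign * term
        # Stop once the current term is negligible.
        if term <= 1e-3 * prev or term <= 1e-8 * total:
            return min(max(total, 0.0), 1.0)
        sign, prev = -sign, term
    # Series did not converge: gamma is tiny, so there is no evidence of drift.
    return 1.0


def mean_vector(rows):
    """Element-wise mean of equal-length vectors."""
    return [statistics.mean(col) for col in zip(*rows)]


def dual_drift_check(ref_latent, mov_latent, ref_anomalies, mov_anomalies,
                     alpha=0.05, dist_limit=1.0):
    """Run both checks; alpha and dist_limit are hypothetical settings."""
    d = ks_statistic(ref_latent, mov_latent)
    p = ks_p_value(d, len(ref_latent), len(mov_latent))
    # Assumption: compare the centroids of the two anomaly sets.
    dist = math.dist(mean_vector(ref_anomalies), mean_vector(mov_anomalies))
    return {"ks_drift": p < alpha, "distance_drift": dist > dist_limit,
            "p_value": p, "distance": dist}


if __name__ == "__main__":
    ref = [i / 50 for i in range(50)]
    shifted = [x + 2.0 for x in ref]
    print(dual_drift_check(ref, ref, [[0.0, 0.0]], [[0.1, 0.0]]))
    print(dual_drift_check(ref, shifted, [[0.0, 0.0]], [[3.0, 4.0]]))
```

Returning the two flags separately keeps the sketch neutral about the decision rule. A caller could require both to fire, or either one, depending on how conservative the drift response should be.
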
Experimental results show that the method outperforms existing baselines and state-of-the-art methods on both real-world and synthetic datasets.

### Formula Summary

1. **KL Divergence**:
   \[ l_{\text{KL}}(x)=\text{KL}(q(z|x)\|N(0, I_k)) = \frac{1}{2}\sum_{i = 1}^k\left(\mu_i^2+\sigma_i^2-\log(\sigma_i^2)-1\right) \]
2. **Total Loss Function**:
   \[ l_{\text{VAE}}(x,\hat{x})=l_{\text{AE}}(x,\hat{x})+\beta\cdot l_{\text{KL}}(x) \]
3. **Anomaly Threshold**:
   \[ \theta_t=\text{mean}(L_t)+2\cdot\text{std}(L_t) \]
4. **Calculation of the p-value in the KS Test**:
   \[ p\text{-value} = 2\sum_{i = 1}^{\infty}(-1)^{i - 1}e^{-2i^2\gamma^2} \]
   where
   \[ \gamma=\left(\sqrt{N_{\text{eff}}}+0.12+\frac{0.11}{\sqrt{N_{\text{eff}}}}\right)\cdot KS_{\text{dis}} \]
   \[ KS_{\text{dis}}=\max|F(\text{ref latent}_i)-F(\text{mov latent}_i)| \]
   \[ N_{\text{eff}}=\frac{W_{\text{drift}}^2}{2W_{\text{drift}}}=\frac{W_{\text{drift}}}{2} \]
5. **Euclidean Distance**: \[ DIS(\text{refdisx},\text{mov AN})=\sqrt{}