Abstract: Real-world applications have been dealing with large amounts of data that arrive over time and generally present changes in their underlying joint probability distribution, i.e., concept drift. Concept drift can be subdivided into two types: virtual drift, which affects the unconditional probability distribution p(x), and real drift, which affects the conditional probability distribution p(y|x). Existing works focus on real drift. However, strategies to cope with real drift may not be the best suited for dealing with virtual drift, since the real class boundaries remain unchanged. We provide the first in-depth analysis of the differences between the impact of virtual and real drifts on classifiers' suitability. We propose an approach to handle both drifts called On-line Gaussian Mixture Model With Noise Filter For Handling Virtual and Real Concept Drifts (OGMMF-VRD). Experiments with 7 synthetic and 3 real-world datasets show that OGMMF-VRD obtained the best results in terms of average accuracy, G-mean and runtime compared to existing approaches. Moreover, its accuracy over time suffered less performance degradation in the presence of drifts.
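The two drift types can be read off the factorization of the joint distribution. The following is a sketch of that reading, not a formula taken from the paper; the time index \(t\) is notation assumed here for illustration.

\[
\begin{aligned}
p_t(x, y) &= p_t(y \mid x)\, p_t(x) \\
\text{virtual drift:}\quad & p_{t+1}(x) \neq p_t(x) \ \text{ while } \ p_{t+1}(y \mid x) = p_t(y \mid x) \\
\text{real drift:}\quad & p_{t+1}(y \mid x) \neq p_t(y \mid x)
\end{aligned}
\]

Under virtual drift only the second factor changes, so the true class boundaries stay where they were; under real drift the first factor changes, so the boundaries themselves move.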
What problem does this paper attempt to address?
This paper addresses how virtual and real concept drift in data streams affect classifier performance, and how a single method can handle both. Specifically:
1. **Types of Concept Drift and Their Influences**:
- Concept drift can be divided into two types: virtual drift and real drift.
- **Virtual Drift**: It refers to the change in the unconditional probability distribution \(p(x)\), that is, the change in the input data distribution, while the class boundaries remain unchanged.
- **Real Drift**: It refers to the change in the conditional probability distribution \(p(y|x)\), that is, the change in class boundaries.
- Existing research mainly focuses on real drift because it directly changes the true decision boundaries of the problem and so reduces classifier performance. Virtual drift leaves the true boundaries unchanged, but it can still hurt performance: when the input distribution shifts, the decision boundaries the classifier has learned may no longer be suitable for the data it now receives.
2. **Limitations of Existing Methods**:
- Existing data stream learning methods usually apply strategies designed for real drift to virtual drift as well, which may waste resources and degrade performance. For example, creating a new classifier to learn a new concept may discard previously learned knowledge that is still valid, and the new classifier is easily affected by noise.
3. **Research Objectives**:
- This paper aims to analyze in depth how virtual drift and real drift differ in their effect on classifier performance, and to propose a new method, the On-line Gaussian Mixture Model With Noise Filter For Handling Virtual and Real Concept Drifts (OGMMF-VRD), that handles both types of drift and improves classifier performance and robustness when drift occurs.
4. **Proposed Solutions**:
- The **OGMMF-VRD** method deals with virtual and real drift in the following ways:
- For virtual drift, when the correlation between new observations and existing Gaussian distributions is lower than a threshold, new Gaussian distributions are created to adapt to the data in new regions.
- For non-severe real drift, the parameters of existing Gaussian distributions are adjusted to adapt to decision-boundary changes within a small range.
- A noise filtering mechanism is introduced to avoid misjudgments and unnecessary model updates caused by noise.
- The pool of past Gaussian mixture models is utilized to accelerate the adaptation to recurring or similar concepts.
5. **Experimental Verification**:
- Experiments on 7 synthetic datasets and 3 real-world datasets show that OGMMF-VRD outperforms existing methods in terms of average accuracy, G-mean, and running time, and that its accuracy degrades less in the presence of drift.
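The first two mechanisms under "Proposed Solutions" can be illustrated with a minimal sketch. It is not the paper's OGMMF-VRD implementation: the class name `OnlineGaussianPool`, the parameters `dist_threshold` and `init_var`, and the use of squared Mahalanobis distance as a stand-in for the similarity measure between an observation and a Gaussian are all assumptions made for illustration. The noise filter and the pool of past mixture models are omitted.

```python
import numpy as np


class OnlineGaussianPool:
    """Toy pool of labelled Gaussians (illustrative only, not OGMMF-VRD)."""

    def __init__(self, dist_threshold=9.0, init_var=1.0):
        # Hypothetical cutoff on squared Mahalanobis distance.
        self.dist_threshold = dist_threshold
        # Hypothetical initial variance given to each new Gaussian.
        self.init_var = init_var
        self.means, self.covs, self.labels, self.counts = [], [], [], []

    def _sq_mahalanobis(self, x, k):
        diff = x - self.means[k]
        return float(diff @ np.linalg.solve(self.covs[k], diff))

    def predict(self, x):
        """Return the label of the closest Gaussian."""
        if not self.means:
            raise ValueError("the pool is empty; call learn_one first")
        x = np.asarray(x, dtype=float)
        dists = [self._sq_mahalanobis(x, k) for k in range(len(self.means))]
        return self.labels[int(np.argmin(dists))]

    def learn_one(self, x, y):
        """Update the closest same-class Gaussian, or create a new one."""
        x = np.asarray(x, dtype=float)
        same = [k for k, lab in enumerate(self.labels) if lab == y]
        if same:
            k = min(same, key=lambda j: self._sq_mahalanobis(x, j))
            if self._sq_mahalanobis(x, k) <= self.dist_threshold:
                # Close enough: adjust the existing Gaussian incrementally
                # (running mean and covariance), as for small boundary shifts.
                self.counts[k] += 1
                n = self.counts[k]
                delta = x - self.means[k]
                self.means[k] = self.means[k] + delta / n
                outer = np.outer(delta, x - self.means[k])
                self.covs[k] = self.covs[k] + (outer - self.covs[k]) / n
                return "updated"
        # Too far from every same-class Gaussian (or none exists yet):
        # cover the new region with a fresh Gaussian centred on x.
        self.means.append(x.copy())
        self.covs.append(np.eye(x.size) * self.init_var)
        self.labels.append(y)
        self.counts.append(1)
        return "created"


pool = OnlineGaussianPool(dist_threshold=9.0)
print(pool.learn_one([0.0, 0.0], "a"))    # created
print(pool.learn_one([0.5, 0.0], "a"))    # updated
print(pool.learn_one([10.0, 10.0], "a"))  # created
```

In this sketch, an observation of class "a" arriving far from the existing "a" Gaussian adds a component instead of replacing the model, so what was learned about the old region is kept. That is the behaviour the summary describes for virtual drift.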
In conclusion, this paper addresses the problem of how to distinguish and handle virtual and real concept drift, and proposes a method that its experiments show to be more robust and efficient than existing approaches.