Christofer Fellicious, Lorenz Wendlinger, Mario Gancarski, Jelena Mitrovic, Michael Granitzer
Abstract: Supervised machine learning often encounters concept drift, where the data distribution changes over time, degrading model performance. Existing drift detection methods focus on identifying these shifts but often overlook the challenge of acquiring labeled data for model retraining after a shift occurs. We present the Strategy for Drift Sampling (SUDS), a novel method that selects homogeneous samples for retraining using existing drift detection algorithms, thereby enhancing model adaptability to evolving data. SUDS seamlessly integrates with current drift detection techniques. We also introduce the Harmonized Annotated Data Accuracy Metric (HADAM), a metric that evaluates classifier performance in relation to the quantity of annotated data required to achieve the stated performance, thereby taking into account the difficulty of acquiring labeled data. Our contributions are twofold: SUDS combines drift detection with strategic sampling to improve the retraining process, and HADAM provides a metric that balances classifier performance with the amount of labeled data, ensuring efficient resource utilization. Empirical results demonstrate the efficacy of SUDS in optimizing labeled data use in dynamic environments, significantly improving the performance of machine learning applications in real-world scenarios. Our code is open source and available at https://github.com/cfellicious/SUDS/.
What problem does this paper attempt to address?
This paper addresses a problem in supervised machine learning: when the data distribution changes over time (concept drift), model performance declines. Existing drift-detection methods focus on identifying these changes but often overlook the difficulty of obtaining labeled data to retrain the model after drift occurs. The authors therefore propose the "Strategy for Drift Sampling (SUDS)", a method that uses existing drift-detection algorithms to select homogeneous samples for retraining, improving the model's adaptability to changing data.
The authors also introduce the "Harmonized Annotated Data Accuracy Metric (HADAM)", a new metric for evaluating classifiers. It considers both the performance of the classifier and the amount of labeled data required to reach that performance, so it reflects labeling cost as well as predictive quality.
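The text above does not give the HADAM formula, so the following is only a minimal sketch of the general idea: a single score that rewards high accuracy and low labeling effort at the same time. It assumes, purely for illustration, a harmonic mean of accuracy and the fraction of data left unlabeled; the function name `harmonized_score` and both parameters are hypothetical and should not be read as the authors' definition.

```python
def harmonized_score(accuracy: float, labeled_fraction: float) -> float:
    """Illustrative score combining accuracy with labeling savings.

    ASSUMPTION: harmonic mean of accuracy and (1 - labeled_fraction).
    This is a stand-in for the idea behind HADAM, not the paper's formula.
    """
    if not (0.0 <= accuracy <= 1.0 and 0.0 <= labeled_fraction <= 1.0):
        raise ValueError("both inputs must lie in [0, 1]")
    savings = 1.0 - labeled_fraction  # share of the data that needed no labels
    if accuracy + savings == 0.0:
        return 0.0  # avoid division by zero when both terms vanish
    return 2.0 * accuracy * savings / (accuracy + savings)


# Hypothetical comparison: slightly lower accuracy with far fewer labels
# can score higher than top accuracy that needs every sample labeled.
fully_labeled = harmonized_score(accuracy=0.90, labeled_fraction=1.0)
sparsely_labeled = harmonized_score(accuracy=0.85, labeled_fraction=0.2)
```

Under this assumed form, a classifier that needs labels for the entire stream scores zero regardless of accuracy, which illustrates why such a metric favors label-efficient retraining strategies.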
### Specific Problem Description
1. **The Impact of Concept Drift**:
   - In many fields, such as credit card fraud detection, energy consumption prediction, and production prediction, the data distribution changes over time. This change is called concept drift.
   - Concept drift degrades model performance, so the model must be updated to fit the new data distribution.
   - Obtaining labeled data for retraining is usually expensive, especially when labeling must be done manually by experts.
2. **Limitations of Existing Methods**:
   - Existing drift-detection methods can identify changes in the data distribution, but obtaining enough labeled data for retraining after drift occurs remains a major challenge.
   - Some methods rely on semi-supervised or unsupervised techniques, but their effectiveness in practical applications is limited.
3. **The Proposed New Methods**:
   - **SUDS**: Selects homogeneous samples for retraining, reducing the negative impact of heterogeneous data on model performance.
   - **HADAM**: Provides a metric that jointly evaluates model performance and the amount of labeled data required, encouraging efficient use of labeling resources.
### Solutions
- **SUDS**:
  - Uses existing drift-detection algorithms to select homogeneous samples for retraining once drift is detected.
  - Analyzes the data at the point where drift is detected and uses the most recent data to build a more homogeneous sample set, improving the model's adaptability and performance.
- **HADAM**:
  - Proposes a new evaluation metric that combines classifier performance with the amount of labeled data required.
  - Ensures that the cost and difficulty of obtaining labeled data are taken into account when model performance is evaluated.
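The SUDS steps listed above (detect drift without labels, then pick recent samples for labeling and retraining) can be sketched roughly as follows. This is not the authors' implementation: the detector is a simple mean-shift check swapped in for whichever drift-detection algorithm is actually used, the stream is one-dimensional for brevity, and the names `detect_drift`, `select_retraining_indices`, `threshold`, `window`, and `budget` are all hypothetical.

```python
from statistics import mean, pstdev
from typing import List, Optional, Sequence


def detect_drift(reference: Sequence[float], recent: Sequence[float],
                 threshold: float = 3.0) -> bool:
    """Stand-in unsupervised detector (ASSUMPTION, not the paper's choice):
    flag drift when the recent mean lies more than `threshold` reference
    standard deviations away from the reference mean."""
    spread = pstdev(reference) or 1e-12  # guard against a constant reference
    return abs(mean(recent) - mean(reference)) / spread > threshold


def select_retraining_indices(stream: Sequence[float], reference_size: int,
                              window: int, budget: int) -> Optional[List[int]]:
    """Slide a window over the stream; once drift is flagged, return the
    indices of the newest `budget` samples to send for labeling.
    Returns None if no drift is flagged."""
    reference = stream[:reference_size]
    for end in range(reference_size + window, len(stream) + 1):
        recent = stream[end - window:end]
        if detect_drift(reference, recent):
            # The newest samples are the most likely to share the new
            # distribution, so a small budget keeps the set homogeneous.
            start = max(end - budget, reference_size)
            return list(range(start, end))
    return None


# Hypothetical stream: values alternate 0/1 for 30 steps, then jump to 5/6.
stream = [float(i % 2) for i in range(30)] + [5.0 + (i % 2) for i in range(20)]
selected = select_retraining_indices(stream, reference_size=20, window=4, budget=2)
```

In this toy stream the detector fires shortly after the change at index 30, and the two selected indices both fall after the change point. A larger `budget` would reach back into pre-drift samples, which is the heterogeneity the sampling strategy is meant to avoid.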
### Experimental Results
The experimental results indicate that SUDS makes efficient use of labeled data and significantly improves the performance of machine-learning applications in dynamic environments. Measured by HADAM, SUDS outperforms the compared methods on most datasets, with particularly strong results on real-world datasets.
In summary, by proposing SUDS and HADAM, the paper addresses the problem of retraining after concept drift is detected and offers a more resource-efficient way to do so.