Abstract: A key aspect of automating predictive machine learning entails the capability of properly triggering the update of the trained model. To this aim, suitable automatic solutions to self-assess the prediction quality and the data distribution drift between the original training set and the new data have to be devised. In this paper, we propose a novel methodology to automatically detect prediction-quality degradation of machine learning models due to class-based concept drift, i.e., when new data contains samples that do not fit the set of class labels known by the currently-trained predictive model. Experiments on synthetic and real-world public datasets show the effectiveness of the proposed methodology in automatically detecting and describing concept drift caused by changes in the class-label data distributions.

What problem does this paper attempt to address?

This paper addresses how to automatically detect a decline in the prediction quality of machine-learning models caused by class-based concept drift. Specifically, it asks how model updates can be triggered automatically when new data contains samples that do not fit any of the class labels known to the currently trained model.

### Problem Background

In many application scenarios, the nature of the collected data changes over time, for example because of device maintenance, changes in road topology, configuration updates, or environmental factors. Collecting a historical training set that contains all possible class labels can be very difficult, expensive, or even infeasible. As a result, data with unseen class labels may appear at some point after deployment and lead to incorrect predictions.

Frequently updating the prediction model to extend the training set with new data can be computationally intensive. It may also require domain experts and data scientists to translate changes in the underlying phenomenon into an appropriate choice of prediction task. Simply retraining the model at short intervals is therefore usually infeasible, or at least sub-optimal.

### Paper Solution

To this end, the authors propose a new method to automatically detect the decline in prediction quality due to class-based concept drift. The core steps of this method are:

1. **Model Degradation Self-evaluation**: Evaluate the degradation of the prediction model over time with a new unsupervised method.
2. **Semi-supervised Data Labeling**: Assign labels to newly discovered data classes, with domain experts manually checking a small number of representative samples.
3. **Automated KDD (Knowledge Discovery in Databases) process**: Build a new prediction model that correctly fits the new data distribution and class labels. This step can be triggered automatically from the results of the previous two steps, for example when the model degradation exceeds a given threshold.

### Specific Methods

- **Baseline Calculation**: Calculate unsupervised quality indicators on the training set.
- **Self-evaluation**: Periodically recalculate the same indicators on new data and compare them with the baseline.
- **Silhouette Index**: Quantifies how similar each sample is to its predicted class (cohesion) and how different it is from the other classes (separation). The Silhouette value ranges from -1 to +1; a higher value indicates a better match with the assigned class and a worse match with the other classes.
- **Model Degradation Estimation**: Quantify the change by comparing the two quality index values (baseline and current), using MAAPE (Mean Arctangent Absolute Percentage Error) to measure the shift of the Silhouette curve.

### Experimental Results

The experiments show that the method correctly evaluates model degradation when concept drift is introduced. On two datasets (a synthetic dataset D1 and a real-world dataset D2 containing Wikipedia articles), the method detects the arrival of new classes, and the overall degradation increases significantly when they appear, which supports its effectiveness.

### Summary

This paper proposes a novel strategy to self-evaluate the degradation of prediction models through unsupervised indicators such as the Silhouette index, thereby detecting concept drift caused by new samples that do not conform to the data distribution seen at training time. The method shows promising experimental results on two datasets.

### Future Work

Future directions include comparing the method's efficiency with existing techniques, introducing other unsupervised indicators, improving the self-evaluation trigger mechanism, and running further experiments to evaluate the generality and performance of the method on different real-world datasets.
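
The self-evaluation loop summarized above (a baseline Silhouette on the training set, recomputation once new data arrives, and a MAAPE comparison of the two curves) can be illustrated with a minimal pure-Python sketch. It is not the authors' implementation. The function names, the Euclidean distance, the pooling of all classes into a single curve, the resampling of each sorted curve to `n_points` values, and the toy data are all assumptions made for illustration.

```python
import math

def silhouette_values(points, labels):
    """Per-sample Silhouette s = (b - a) / max(a, b), Euclidean distance (assumed)."""
    scores = []
    for i, (p, own) in enumerate(zip(points, labels)):
        # Distances from sample i to every other sample, grouped by class label.
        dists = {}
        for j, (q, lab) in enumerate(zip(points, labels)):
            if i != j:
                dists.setdefault(lab, []).append(math.dist(p, q))
        if own not in dists or len(dists) < 2:
            scores.append(0.0)  # convention for singleton classes or one-class data
            continue
        a = sum(dists[own]) / len(dists[own])  # cohesion: mean intra-class distance
        b = min(sum(d) / len(d) for lab, d in dists.items() if lab != own)  # separation
        denom = max(a, b)
        scores.append((b - a) / denom if denom else 0.0)
    return scores

def curve(scores, n_points=20):
    """Sorted Silhouette values resampled to a fixed length (illustrative choice)."""
    s = sorted(scores)
    return [s[round(k * (len(s) - 1) / (n_points - 1))] for k in range(n_points)]

def maape(actual, forecast):
    """Mean Arctangent Absolute Percentage Error; a zero 'actual' maps to pi/2."""
    total = 0.0
    for a, f in zip(actual, forecast):
        if a == f:
            continue  # identical values contribute no error
        total += math.atan(abs((a - f) / a)) if a != 0 else math.pi / 2
    return total / len(actual)

def degradation(train_X, train_y, new_X, new_pred, n_points=20):
    """Shift of the Silhouette curve after adding new samples with predicted labels."""
    baseline = curve(silhouette_values(train_X, train_y), n_points)
    current = curve(silhouette_values(train_X + new_X, train_y + new_pred), n_points)
    return maape(baseline, current)

# Toy training set with two well-separated classes (made-up data).
train_X = [(0, 0), (1, 0), (0, 1), (1, 1), (10, 0), (11, 0), (10, 1), (11, 1)]
train_y = [0, 0, 0, 0, 1, 1, 1, 1]

# New samples drawn from the known classes: the curve should barely move.
no_drift = degradation(train_X, train_y, [(0.5, 0.5), (10.5, 0.5)], [0, 1])

# New samples from an unseen class; a model that only knows classes 0 and 1
# is forced to assign them to one of those two labels.
drift = degradation(train_X, train_y, [(5, 8), (6, 8), (5, 9), (6, 9)], [0, 1, 0, 1])

print(f"degradation without drift: {no_drift:.3f}")
print(f"degradation with drift:    {drift:.3f}")
```

In this toy setting the degradation value stays close to zero when the new samples belong to the known classes and grows markedly when samples from an unseen class are forced into them. A retraining trigger could then compare the value against a threshold; how that threshold is chosen is not specified in the summary above.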

Automating concept-drift detection by self-evaluating predictive model degradation
