Automating concept-drift detection by self-evaluating predictive model degradation

Tania Cerquitelli, Stefano Proto, Francesco Ventura, Daniele Apiletti, Elena Baralis
DOI: https://doi.org/10.48550/arXiv.1907.08120
2019-07-18
Abstract: A key aspect of automating predictive machine learning entails the capability of properly triggering the update of the trained model. To this aim, suitable automatic solutions to self-assess the prediction quality and the data distribution drift between the original training set and the new data have to be devised. In this paper, we propose a novel methodology to automatically detect prediction-quality degradation of machine learning models due to class-based concept drift, i.e., when new data contains samples that do not fit the set of class labels known by the currently-trained predictive model. Experiments on synthetic and real-world public datasets show the effectiveness of the proposed methodology in automatically detecting and describing concept drift caused by changes in the class-label data distributions.
Machine Learning
What problem does this paper attempt to address?
The problem this paper attempts to solve is how to automatically detect the decline in the prediction quality of machine-learning models caused by class-based concept drift. Specifically, when new data contains samples that do not fit the class labels known to the currently trained model, how can a model update be triggered automatically to adapt to these changes?

### Problem Background

In many application scenarios, the nature of the collected data changes over time, for example because of device maintenance, road topology changes, configuration updates, or environmental factors. Collecting a historical training set that contains all possible class labels can be very difficult, expensive, or even infeasible. As a result, new data with unseen class labels may appear at some point after training, and the model will predict them incorrectly.

Frequently updating the prediction model so that the training set covers the new data can be computationally intensive, and it may require domain experts and data scientists to translate changes in the phenomenon into appropriate choices for the prediction task. Simply retraining the model often is therefore usually infeasible, or at least sub-optimal.

### Paper Solution

To this end, the authors propose a new method to automatically detect the decline in prediction quality due to class-based concept drift. The core steps of this method are:

1. **Model Degradation Self-evaluation**: Evaluate the degradation of the prediction model over time with a new unsupervised method.
2. **Semi-supervised Data Labeling**: Assign labels to newly discovered data classes, with domain experts manually checking a small number of representative samples.
3. **Automated KDD (knowledge discovery process)**: Build a new prediction model that correctly fits the new data distribution and class labels. This step can be triggered automatically based on the results of the previous two steps, for example when the model degradation exceeds a given threshold.

### Specific Methods

- **Baseline Calculation**: Calculate unsupervised quality indicators on the training set.
- **Self-evaluation**: Periodically recalculate the same indicators on new data and compare them with the baseline.
- **Silhouette Index**: Quantifies how similar each sample is to its predicted class (cohesion) and how different it is from the other classes (separation). The Silhouette value ranges from -1 to +1; a higher value indicates a better match with the assigned class and a worse match with the other classes.
- **Model Degradation Estimation**: Quantify the change by comparing the two quality index values (baseline and current), using MAAPE (Mean Arctangent Absolute Percentage Error) to measure the shift of the Silhouette curve.

### Experimental Results

The experimental results show that the method correctly evaluates model degradation when concept drift is introduced. On two datasets (a synthetic dataset D1 and a real-world dataset D2 containing Wikipedia articles), the method detects the arrival of new classes: the overall degradation increases significantly, which supports its effectiveness.

### Summary

This paper proposes a novel strategy to self-evaluate the degradation of prediction models through unsupervised indicators such as the Silhouette index, thereby detecting concept drift caused by new samples that do not conform to the data distribution seen at training time. The method shows promising experimental results on two datasets.

### Future Work

Future directions include comparing the method's efficiency with existing techniques, introducing other unsupervised indicators, improving the self-evaluation trigger mechanism, and running further experiments to evaluate the generality and performance of the method on different real-world datasets.
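
### Illustrative Sketch

The baseline, self-evaluation, and degradation-estimation steps listed under Specific Methods can be sketched in a few lines of Python. This is a minimal illustration and not the authors' implementation. Several details are assumptions made here: the Silhouette uses Euclidean distance, the current Silhouette is computed on the training data merged with the newly predicted samples, the sorted per-class Silhouette curves are resampled to a fixed length so that they can be compared point by point, and the names `silhouette_curve`, `class_degradation`, and `predict` are hypothetical.

```python
import numpy as np


def silhouette_samples(X, labels):
    """Per-sample Silhouette s = (b - a) / max(a, b), Euclidean distance.

    Assumes at least two distinct labels are present.
    """
    X = np.asarray(X, dtype=float)
    labels = np.asarray(labels)
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=2)
    classes = np.unique(labels)
    scores = np.zeros(len(X))
    for i in range(len(X)):
        same = labels == labels[i]
        if same.sum() == 1:
            continue  # common convention: a singleton class scores 0
        a = dist[i, same].sum() / (same.sum() - 1)  # cohesion, self excluded
        b = min(dist[i, labels == c].mean() for c in classes if c != labels[i])
        denom = max(a, b)
        scores[i] = (b - a) / denom if denom > 0 else 0.0
    return scores


def silhouette_curve(scores, n_points=100):
    """Sorted scores resampled to a fixed length (assumption of this sketch)."""
    s = np.sort(scores)
    return np.interp(np.linspace(0, 1, n_points), np.linspace(0, 1, len(s)), s)


def maape(baseline, current):
    """Mean Arctangent Absolute Percentage Error between two curves."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)
    with np.errstate(divide="ignore", invalid="ignore"):
        ratio = np.abs((baseline - current) / baseline)
    ratio = np.where(np.isnan(ratio), 0.0, ratio)  # 0/0 counts as no error
    return float(np.mean(np.arctan(ratio)))


def class_degradation(X_train, y_train, X_new, y_pred, n_points=100):
    """Per-class MAAPE between baseline and current Silhouette curves."""
    y_train = np.asarray(y_train)
    base = silhouette_samples(X_train, y_train)
    # Assumption: the current index is computed on training + new data.
    X_all = np.vstack([X_train, X_new])
    y_all = np.concatenate([y_train, np.asarray(y_pred)])
    curr = silhouette_samples(X_all, y_all)
    return {
        c.item(): maape(silhouette_curve(base[y_train == c], n_points),
                        silhouette_curve(curr[y_all == c], n_points))
        for c in np.unique(y_train)
    }


# Toy data: two known classes, then a batch drawn from an unseen third class.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [10.0, 0.0]])
X_train = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
y_train = np.repeat([0, 1], 50)


def predict(X):
    """Hypothetical stand-in classifier: nearest known class centre."""
    return np.argmin(np.linalg.norm(X[:, None, :] - centers[None], axis=2), axis=1)


X_same = np.vstack([rng.normal(c, 1.0, size=(50, 2)) for c in centers])
X_drift = rng.normal([5.0, 8.0], 1.0, size=(100, 2))

no_drift = class_degradation(X_train, y_train, X_same, predict(X_same))
drift = class_degradation(X_train, y_train, X_drift, predict(X_drift))
print("no drift:", no_drift)
print("drift:   ", drift)
```

In this toy setup the samples from the unseen class are forced into the two known classes, which lowers cohesion and so shifts the Silhouette curves away from the baseline. A retraining trigger could then compare each per-class value with a chosen threshold. The threshold is application-specific and is not taken from the paper. For larger datasets, a library routine such as scikit-learn's `sklearn.metrics.silhouette_samples` could replace the quadratic-memory helper used here.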