A PREDICTING METHOD OF PUPYLATION SITES FROM IMBALANCED TRAINING DATA

Xiaofen Tang, Libo Liu, Qian Liu
2020-01-01
Journal of Nonlinear and Convex Analysis
Abstract: The accurate identification of protein pupylation sites facilitates understanding of the mechanism of protein pupylation. The number of pupylation sites determined by experimental methods and the number of deduced nonpupylation sites are imbalanced, and classifying imbalanced data remains a challenging problem for the machine learning methods proposed to date. Because most current prediction models for protein pupylation sites are trained on samples that have been balanced by sampling, there is research value in building training sets from the original sample sets and in developing high-performance prediction models for practical applications in bioinformatics and medical diagnosis. The purpose of this study is to propose a prediction model for pupylation sites based on unevenly distributed natural samples. To this end, an ensemble weighted extreme learning machine classifier is proposed. This classifier not only measures classification performance by the classification errors of the two types of samples, but also accounts for the noise contained in the samples. Based on this classifier, a prediction model for pupylation sites is established. Furthermore, pupylated proteins without annotation are used to enlarge the training sample set; the prediction model is retrained on the enlarged training set and then tested on an independent test set. The experimental results demonstrate that the proposed pupylation site model performs significantly better than the two previously proposed pupylation site prediction models.
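To make the idea of an ensemble weighted extreme learning machine concrete, the following is a minimal sketch and not the authors' implementation. It assumes the standard weighted ELM formulation, in which each sample is weighted by the inverse of its class size and the output weights are obtained from a regularized weighted least-squares solve. The ensemble step (several members with different random hidden layers, combined by majority vote) is likewise an assumption, and the noise handling mentioned in the abstract is not modeled. All function names and parameters (`train_weighted_elm`, `n_hidden`, `C`, `n_members`) are hypothetical.

```python
import numpy as np


def _hidden(X, W_in, b):
    # Sigmoid hidden-layer output of a single-hidden-layer network.
    return 1.0 / (1.0 + np.exp(-(X @ W_in + b)))


def train_weighted_elm(X, y, n_hidden=30, C=100.0, seed=None):
    """Train one weighted ELM; y holds labels +1 (minority) and -1 (majority)."""
    rng = np.random.default_rng(seed)
    W_in = rng.normal(size=(X.shape[1], n_hidden))  # random, never trained
    b = rng.normal(size=n_hidden)
    H = _hidden(X, W_in, b)

    # Assumed weighting scheme: each sample weighs 1 / (size of its class),
    # so both classes contribute equally to the loss.
    n_pos = np.sum(y == 1)
    n_neg = np.sum(y == -1)
    w = np.where(y == 1, 1.0 / n_pos, 1.0 / n_neg)

    # beta = (I/C + H^T W H)^{-1} H^T W y, with W = diag(w).
    HtW = H.T * w
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtW @ H, HtW @ y)
    return W_in, b, beta


def predict_weighted_elm(model, X):
    W_in, b, beta = model
    return np.where(_hidden(X, W_in, b) @ beta >= 0, 1, -1)


def train_ensemble(X, y, n_members=5, **kwargs):
    # Hypothetical ensemble: members differ only in their random hidden layer.
    return [train_weighted_elm(X, y, seed=s, **kwargs) for s in range(n_members)]


def predict_ensemble(models, X):
    # Majority vote; use an odd number of members to avoid ties.
    votes = sum(predict_weighted_elm(m, X) for m in models)
    return np.where(votes >= 0, 1, -1)
```

In this sketch the training set keeps its natural class ratio; the imbalance is handled by the per-sample weights rather than by resampling, which matches the motivation stated in the abstract.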