An Efficient Computational Model for Class Imbalance Problem in Self-Interaction Proteins Prediction

Ji-Yong An, Yong Zhou, Zi-Ji Yan, Yu-Jun Zhao
DOI: https://doi.org/10.21203/rs.3.rs-36603/v1
2020-01-01
Abstract:
Background: Self-interaction proteins (SIPs) play a key role in a variety of biological activities of organisms. High-throughput experimental methods for identifying SIPs are time-consuming and expensive, and the numbers of positive and negative samples in SIP datasets are highly imbalanced. Developing accurate and efficient computational approaches to assist and accelerate the identification of SIPs is therefore a challenging task.
Results: In this work, we propose a new computational method called WELM-SURF for predicting SIPs. Specifically, to exploit protein sequence features, a Position Specific Scoring Matrix (PSSM) is used to capture protein evolutionary information, and Speeded-Up Robust Features (SURF) is employed to extract key features of the protein sequence from the PSSM. The Weighted Extreme Learning Machine (WELM) has a short training time, good generalization ability and, most importantly, the ability to classify imbalanced samples efficiently by optimizing a loss function that includes a sample weight matrix. The WELM classifier is therefore used to predict SIPs from the extracted features. Extensive experiments show that the average accuracy of WELM-SURF is 95.25% and 98.79% on the yeast and human datasets, respectively. We also compared its performance with that of the Extreme Learning Machine (ELM), the state-of-the-art Support Vector Machine (SVM), and other existing methods. In these comparisons, WELM-SURF outperformed ELM, SVM and the other previous methods.
Conclusion: These experimental results show that the proposed WELM-SURF model can predict SIPs with high accuracy and robustness. We anticipate that WELM-SURF will be a useful computational tool for bioinformatics studies related to SIP prediction.
To further encourage future proteomics research, we developed a freely available web server called WELM-SURF-SIPs. It is available at http://219.219.62.123:8888/WELMSURF/ and includes the SIPs datasets and source code.
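The abstract attributes WELM's handling of class imbalance to a weight matrix in the loss function. The following is a minimal sketch of that idea, not the authors' implementation: it assumes the commonly used weighted ELM formulation in which each sample is weighted by the inverse of its class size and the output weights are obtained by regularized weighted least squares. The function names (`train_welm`, `predict_welm`), the sigmoid activation, and the parameters `n_hidden` and `C` are illustrative assumptions, and the input `X` stands in for any fixed-length feature vectors (for example, SURF descriptors extracted from a PSSM).

```python
import numpy as np


def _hidden(X, W_in, b):
    """Sigmoid activations of the random hidden layer."""
    return 1.0 / (1.0 + np.exp(-(X @ W_in + b)))


def train_welm(X, y, n_hidden=50, C=1e3, seed=0):
    """Fit a weighted ELM sketch for binary integer labels y in {0, 1}.

    Assumption: per-sample weight = 1 / (number of samples in its class),
    so both classes contribute equally to the loss.
    """
    rng = np.random.default_rng(seed)
    # Input weights and biases are drawn at random and never trained.
    W_in = rng.normal(size=(X.shape[1], n_hidden))
    b = rng.normal(size=n_hidden)
    H = _hidden(X, W_in, b)

    counts = np.bincount(y, minlength=2)
    w = 1.0 / counts[y]                 # diagonal of the weight matrix
    t = np.where(y == 1, 1.0, -1.0)     # targets in {-1, +1}

    # Solve (I/C + H^T W H) beta = H^T W t for the output weights.
    HtW = H.T * w                       # scales column i of H^T by w[i]
    beta = np.linalg.solve(np.eye(n_hidden) / C + HtW @ H, HtW @ t)
    return W_in, b, beta


def predict_welm(model, X):
    """Return 0/1 predictions from the sign of the network output."""
    W_in, b, beta = model
    return (_hidden(X, W_in, b) @ beta > 0).astype(int)
```

Calling `train_welm(X, y)` on a feature matrix and an integer label vector returns a model tuple that `predict_welm` accepts. Setting all sample weights to 1 reduces the sketch to an ordinary regularized ELM, which is the baseline the abstract compares against.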