Abstract:

Background: The recognition of protein interaction sites is of great significance for understanding many biological processes and signaling pathways and for drug design. However, most sites on protein sequences cannot be labeled as interface or non-interface sites because only a small fraction of protein interactions have been identified, which limits the prediction accuracy and generalization ability of predictors of protein interaction sites. It is therefore necessary to improve the prediction of protein interaction sites by using large amounts of unlabeled data together with small amounts of labeled data and background knowledge.

Results: In this work, three semi-supervised support vector machine-based methods are proposed to improve the prediction of protein interaction sites by incorporating information from unlabeled protein sites. Five features related to the evolutionary conservation of amino acids are extracted from the HSSP database and the ConSurf Server to represent the residues within a protein sequence: residue spatial sequence spectrum, residue sequence information entropy, relative entropy, residue sequence conserved weight, and residue evolution rate. Three predictors are then built to identify interface residues on the protein surface using three types of semi-supervised support vector machine algorithms.

Conclusion: The experimental results demonstrate that the semi-supervised approaches can effectively improve the prediction of protein interaction sites when unlabeled information is incorporated into the predictors. The best of the three predictors achieves an accuracy of 70.7%, a sensitivity of 62.67%, and a specificity of 78.72%. Compared with existing studies, the semi-supervised models show improved prediction performance.
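Two of the conservation features named in the abstract, sequence information entropy and relative entropy, can be illustrated for a single column of a multiple sequence alignment. The abstract does not give the formulas used, so the following is only a minimal sketch under stated assumptions: Shannon entropy in bits, Kullback-Leibler divergence against a uniform background over the 20 standard amino acids, and hypothetical function names (`column_frequencies`, `shannon_entropy`, `relative_entropy`) that do not come from the paper.

```python
import math
from collections import Counter

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"

# Assumption: a uniform background distribution; the paper may use
# database-derived background frequencies instead.
UNIFORM_BACKGROUND = {aa: 1.0 / len(AMINO_ACIDS) for aa in AMINO_ACIDS}


def column_frequencies(column):
    """Relative frequency of each standard amino acid in one alignment column.

    Gaps and non-standard symbols are ignored.
    """
    counts = Counter(aa for aa in column if aa in AMINO_ACIDS)
    total = sum(counts.values())
    if total == 0:
        raise ValueError("column contains no standard amino acids")
    return {aa: counts.get(aa, 0) / total for aa in AMINO_ACIDS}


def shannon_entropy(freqs):
    """Shannon entropy in bits; low values indicate a conserved position."""
    return -sum(p * math.log2(p) for p in freqs.values() if p > 0)


def relative_entropy(freqs, background=UNIFORM_BACKGROUND):
    """Kullback-Leibler divergence (bits) of the column from the background."""
    return sum(
        p * math.log2(p / background[aa]) for aa, p in freqs.items() if p > 0
    )


if __name__ == "__main__":
    for column in ("AAAA", "AACC", "ACDE-"):
        freqs = column_frequencies(column)
        print(column, shannon_entropy(freqs), relative_entropy(freqs))
```

Under the uniform-background assumption the two quantities are complementary (relative entropy equals log2(20) minus the entropy), so a real feature set would presumably rely on a non-uniform background to make them carry different information.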

Stochastic Sensitivity Tree Boosting for Imbalanced Prediction Problems of Protein-Ligand Interaction Sites

Imbalanced Data Sets Classification Method Based on Over-Sampling Technique

Improving Protein-ATP Binding Residues Prediction by Boosting SVMs with Random Under-Sampling

A New Supervised Over-Sampling Algorithm with Application to Protein-Nucleotide Binding Residue Prediction

Developing Computational Model to Predict Protein-Protein Interaction Sites Based on the XGBoost Algorithm

SXGBsite: Prediction of Protein-Ligand Binding Sites Using Sequence Information and Extreme Gradient Boosting

Imbalance Data Processing Strategy for Protein Interaction Sites Prediction

Boosting Granular Support Vector Machines for the Accurate Prediction of Protein-Nucleotide Binding Sites

Prediction of protein-protein interaction sites through eXtreme gradient boosting with kernel principal component analysis

Protein-Protein Interaction Sites Prediction Based on an Under-Sampling Strategy and Random Forest Algorithm

Boosting Prediction Performance of Protein-Protein Interaction Hot Spots by Using Structural Neighborhood Properties

Boosting: An Ensemble Learning Tool for Compound Classification and QSAR Modeling

Protein-protein Interaction Sites Prediction by Ensembling SVM and Sample-Weighted Random Forests

Semi-supervised prediction of protein interaction sites from unlabeled sample information

Using Ensemble Methods to Deal with Imbalanced Data in Predicting Protein-Protein Interactions

Prediction of Protein-Protein Interaction Sites Using an Ensemble Method

Prediction of drug-target interaction based on protein features using undersampling and feature selection techniques with boosting

Enhancing Protein-ATP and Protein-ADP Binding Sites Prediction Using Supervised Instance-Transfer Learning

Sequence-based Prediction of Protein-Protein Interaction Sites by Simplified Long Short-Term Memory Network

A Novel Sequence-Based Prediction Method for ATP-Binding Sites Using Fusion of SMOTE Algorithm and Random Forests Classifier

A semi-supervised boosting SVM for predicting hot spots at protein-protein interfaces