Protein-Protein Interaction Sites Prediction Based on an Under-Sampling Strategy and Random Forest Algorithm.

Minjie Li, Ziheng Wu, Wenyan Wang, Kun Lu, Jun Zhang, Yuming Zhou, Zhaoquan Chen, Dan Li, Shicheng Zheng, Peng Chen, Bing Wang
DOI: https://doi.org/10.1109/tcbb.2021.3123269
2021-01-01
IEEE/ACM Transactions on Computational Biology and Bioinformatics
Abstract: Computational methods for predicting protein-protein interaction sites avoid the high cost and long turnaround of traditional experimental approaches. However, the severe class imbalance between interface and non-interface residues in protein sequences limits the prediction performance of these methods. This work therefore proposes a new strategy, NearMiss-based under-sampling for imbalanced datasets combined with Random Forest classification (NM-RF), to predict protein interaction sites. Residues in protein sequences are represented by PSSM-derived features, the hydropathy index (HI), and relative solvent accessibility (RSA). To resolve the class imbalance problem, an under-sampling method based on the NearMiss algorithm removes some non-interface residues, and the random forest algorithm then performs binary classification on the balanced feature datasets. Experiments show that the accuracy of the NM-RF model reaches 87.6% and 84.3% on Dtestset72 and PDBtestset164, respectively, which demonstrates the effectiveness of the proposed NM-RF method in distinguishing interface from non-interface residues.
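The under-sampling step described in the abstract can be illustrated with a minimal sketch. It assumes the NearMiss-1 variant (keep the majority-class samples with the smallest mean distance to their k nearest minority-class samples); the abstract does not say which NearMiss version the authors used. The function name `nearmiss1_undersample`, the parameter `k`, and the toy two-dimensional feature vectors are hypothetical and stand in for the PSSM, HI, and RSA features. The random forest step is omitted here; the balanced output would be passed to a classifier afterwards.

```python
import math


def nearmiss1_undersample(X, y, k=3):
    """NearMiss-1 style under-sampling (assumed variant, not confirmed by the paper).

    Keeps every minority sample (label 1, e.g. interface residues) and an equal
    number of majority samples (label 0, e.g. non-interface residues), choosing
    the majority samples whose mean distance to their k nearest minority
    samples is smallest.
    """
    minority = [x for x, label in zip(X, y) if label == 1]
    majority = [x for x, label in zip(X, y) if label == 0]
    # Nothing to do if there is no minority class or no imbalance to correct.
    if not minority or len(majority) <= len(minority):
        return list(X), list(y)

    def mean_knn_distance(x):
        # Euclidean distances from x to all minority samples, nearest first.
        nearest = sorted(math.dist(x, m) for m in minority)[:k]
        return sum(nearest) / len(nearest)

    # Rank majority samples by closeness to the minority class, keep the closest.
    kept = sorted(majority, key=mean_knn_distance)[: len(minority)]
    X_bal = minority + kept
    y_bal = [1] * len(minority) + [0] * len(kept)
    return X_bal, y_bal


# Toy data: 2 "interface" samples versus 6 "non-interface" samples.
X = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0],
     [5.0, 5.0], [6.0, 6.0], [0.2, 0.1], [9.0, 9.0]]
y = [1, 1, 0, 0, 0, 0, 0, 0]

X_bal, y_bal = nearmiss1_undersample(X, y, k=2)
print(len(X_bal), y_bal.count(1), y_bal.count(0))
```

On this toy input the balanced set holds two samples per class, and the retained majority samples are the ones lying closest to the minority cluster. In practice, a library implementation such as the `NearMiss` sampler in imbalanced-learn followed by a random forest classifier would likely be used instead of hand-written code.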