Stochastic Sensitivity Tree Boosting for Imbalanced Prediction Problems of Protein-Ligand Interaction Sites

Wing W. Y. Ng, Yuda Zhang, Jianjun Zhang, Debby D. Wang, Fu Lee Wang
DOI: https://doi.org/10.1109/tetci.2019.2922340
2021-01-01
IEEE Transactions on Emerging Topics in Computational Intelligence
Abstract: Prediction of protein–protein interaction sites plays an important role in understanding protein interactions and functions. However, in the protein–protein interaction site prediction problem, the number of binding-site residues is usually much smaller than that of other amino acid residues in a protein chain, which degrades the performance of standard machine learning methods on the minority class, i.e., the binding-site residues. Therefore, to improve the prediction performance on binding-site residues, we propose in this paper a new boosting algorithm (SSTBoost) that combines a stochastic sensitivity measure-based undersampling method with the AdaBoost algorithm. The stochastic sensitivity measure-based undersampling method aims to re-balance the dataset by selecting those samples with the highest probability of being incorrectly labeled, and the AdaBoost algorithm aims to improve the performance of base hypotheses by making them complementary to one another and combining them. Twenty UCI datasets are first used to evaluate the robustness and effectiveness of SSTBoost. After that, SSTBoost is tested on twenty-two practical protein–protein interaction site prediction problems. Experimental results show that SSTBoost significantly outperforms state-of-the-art methods in 57.3%, 88.2%, and 78.2% of 110 cases in terms of Recall, F-score, and G-mean, respectively. This shows its potential to handle other bioinformatics applications in the near future.
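The abstract reports results in terms of Recall, F-score, and G-mean, which are the usual metrics when the minority class (here, binding-site residues) is the class of interest. The following is a minimal sketch of how these three metrics are commonly computed for a binary problem. It is not taken from the paper: the function name `imbalance_metrics` is hypothetical, the F-score is assumed to be the balanced F1, and the G-mean is assumed to be the geometric mean of recall (sensitivity) and specificity.

```python
import math


def imbalance_metrics(y_true, y_pred, positive=1):
    """Recall, F-score, and G-mean for a binary problem (illustrative sketch).

    Assumptions not stated in the abstract: F-score is F1, and G-mean is
    sqrt(recall * specificity). `positive` marks the minority class.
    """
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    tn = sum(1 for t, p in pairs if t != positive and p != positive)

    # Guard every ratio against an empty denominator.
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0

    f_score = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    g_mean = math.sqrt(recall * specificity)
    return {"recall": recall, "f_score": f_score, "g_mean": g_mean}


# Toy data: 2 minority (binding-site) samples out of 10.
y_true = [1, 1, 0, 0, 0, 0, 0, 0, 0, 0]
y_pred = [1, 0, 0, 0, 0, 0, 0, 0, 0, 1]
metrics = imbalance_metrics(y_true, y_pred)

# A classifier that always predicts the majority class.
majority_only = imbalance_metrics(y_true, [0] * len(y_true))
```

On the toy data, the always-majority predictor is right on 8 of 10 samples yet scores zero on all three metrics, which is why accuracy alone is a poor yardstick for imbalanced problems of this kind.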