Improving Human Essential Protein Prediction Using Only Protein Sequences Via Ensemble Learning.

Min Zeng, Nian Wang, Yifan Wu, Yiming Li, Fang-Xiang Wu, Min Li
DOI: https://doi.org/10.1109/bibm52615.2021.9669606
2021-01-01
Abstract: Accurate prediction of essential proteins using computational methods can effectively reduce the cost of wet-lab experiments. Existing computational methods usually rely on protein-protein interaction (PPI) networks constructed from different kinds of biological data. However, high-quality PPI networks and other biological data are not available for all proteins. It is therefore valuable to develop accurate methods that predict essential proteins quickly and effectively using only protein sequences. We propose EP-GBDT, a machine learning ensemble model, to improve the performance of essential protein prediction using only protein sequences. EP-GBDT has an ensemble structure that combines multiple Gradient Boosting Decision Tree (GBDT) base classifiers. In addition, EP-GBDT uses a sampling technique to reduce the effects of the imbalanced dataset. The results show that EP-GBDT outperforms state-of-the-art sequence-based methods and network-based centrality measures. The source code and datasets can be downloaded from https://github.com/CSUBioGroup/EP-GBDT.
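The ensemble-plus-sampling idea in the abstract can be illustrated with a minimal sketch. The abstract does not say which sampling technique is used or how the base classifiers are combined, so this sketch assumes random undersampling of the majority class for each base classifier and averaging of predicted probabilities. A toy gradient-boosted stump learner stands in for a real GBDT library, the feature vectors are synthetic placeholders rather than sequence-derived features, and all function names are hypothetical. It is not the authors' implementation.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def fit_stump(X, residuals):
    """Find the one-feature threshold split with the lowest squared error."""
    best = None
    for j in range(len(X[0])):
        for t in sorted({row[j] for row in X}):
            left = [r for row, r in zip(X, residuals) if row[j] <= t]
            right = [r for row, r in zip(X, residuals) if row[j] > t]
            if not left or not right:
                continue
            lm, rm = sum(left) / len(left), sum(right) / len(right)
            err = sum((r - lm) ** 2 for r in left) + sum((r - rm) ** 2 for r in right)
            if best is None or err < best[0]:
                best = (err, j, t, lm, rm)
    if best is None:  # every feature is constant: fall back to a single leaf
        mean = sum(residuals) / len(residuals)
        return 0, float("inf"), mean, mean
    return best[1:]


def fit_gbdt(X, y, n_rounds=20, lr=0.3):
    """Toy gradient boosting on log-loss with depth-1 trees (a GBDT stand-in)."""
    scores = [0.0] * len(X)
    stumps = []
    for _ in range(n_rounds):
        # The negative gradient of log-loss w.r.t. the score is (label - probability).
        residuals = [yi - sigmoid(s) for yi, s in zip(y, scores)]
        j, t, lv, rv = fit_stump(X, residuals)
        stumps.append((j, t, lr * lv, lr * rv))  # learning rate folded into leaves
        scores = [s + (lr * lv if row[j] <= t else lr * rv)
                  for s, row in zip(scores, X)]
    return stumps


def predict_proba(stumps, row):
    return sigmoid(sum(lv if row[j] <= t else rv for j, t, lv, rv in stumps))


def fit_ensemble(X, y, n_models=5, seed=0):
    """Train each base model on all minority samples plus an equally sized
    random draw from the majority class (assumed sampling scheme)."""
    rng = random.Random(seed)
    pos = [i for i, yi in enumerate(y) if yi == 1]
    neg = [i for i, yi in enumerate(y) if yi == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    models = []
    for _ in range(n_models):
        idx = minority + rng.sample(majority, len(minority))
        models.append(fit_gbdt([X[i] for i in idx], [y[i] for i in idx]))
    return models


def ensemble_proba(models, row):
    """Average the base classifiers' probabilities (assumed combination rule)."""
    return sum(predict_proba(m, row) for m in models) / len(models)


# Synthetic, imbalanced toy data: 30 "essential" vs 270 "non-essential" samples.
# Feature 0 separates the classes; feature 1 is pure noise.
data_rng = random.Random(42)
X = [[data_rng.uniform(2.0, 3.0), data_rng.gauss(0, 1)] for _ in range(30)]
X += [[data_rng.uniform(0.0, 1.0), data_rng.gauss(0, 1)] for _ in range(270)]
y = [1] * 30 + [0] * 270

models = fit_ensemble(X, y)
print(ensemble_proba(models, [2.5, 0.0]), ensemble_proba(models, [-1.0, 0.0]))
```

Because every base model sees a balanced subset, no single model is dominated by the majority class, while drawing a different subset per model lets the ensemble use more of the majority data than one undersampled model would. In practice the toy booster would be replaced by an established GBDT implementation.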