Predicting Golgi-Resident Protein Types Using Conditional Covariance Minimization with XGBoost Based on Multiple Features Fusion

Hongyan Zhou, Cheng Chen, Minghui Wang, Qin Ma, Bin Yu
DOI: https://doi.org/10.1109/access.2019.2938081
IF: 3.9
2019-01-01
IEEE Access
Abstract: The Golgi apparatus is a key organelle for protein synthesis in eukaryotic cells. Dysfunction of Golgi-resident proteins can lead to a range of diseases, especially neurodegenerative and inherited diseases such as diabetes, cancer, and cystic fibrosis. Accurate classification of Golgi-resident proteins may therefore contribute to drug development and, in turn, to drug therapy. This paper presents a novel method for predicting Golgi-resident protein types, called Golgi-XGBoost. First, feature vectors are extracted from the protein sequence by fusing pseudo-amino acid composition (PseAAC), dipeptide composition (DC), pseudo-position specific scoring matrix (PsePSSM), and encoding based on grouped weight (EBGW). Second, conditional covariance minimization (CCM) is used to reduce the dimension of the feature vectors. Third, the synthetic minority oversampling technique (SMOTE) is applied to balance the samples. Finally, the optimal feature vectors are input into the extreme gradient boosting (XGBoost) classifier to predict the type of Golgi-resident protein. The overall prediction accuracy on the training set is 92.1% under the jackknife test, which is better than that of other state-of-the-art methods. The accuracy on the independent testing dataset is 86.5%. These results show that the paper provides a new method for predicting the type of Golgi-resident protein. The source code and all datasets are available at https://github.com/QUST-AIBBDRC/Golgi-XGBoost/.
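To make one of the fused feature types concrete, the following is a minimal sketch of dipeptide composition (DC), which encodes a sequence as the relative frequency of each of the 400 ordered amino acid pairs. It is an illustration under stated assumptions and not the authors' implementation: the function name `dipeptide_composition`, the choice to skip pairs containing non-standard residues, and the toy sequence are all hypothetical.

```python
from itertools import product

# The 20 standard amino acids, in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 400 ordered pairs, in a fixed order so every sequence maps to the same layout.
DIPEPTIDES = ["".join(pair) for pair in product(AMINO_ACIDS, repeat=2)]


def dipeptide_composition(sequence):
    """Return a 400-dimensional list of dipeptide frequencies.

    Assumption: pairs containing a non-standard residue (e.g. 'X') are
    skipped, and frequencies are normalized by the number of counted pairs.
    """
    seq = sequence.upper()
    counts = dict.fromkeys(DIPEPTIDES, 0)
    total = 0
    for i in range(len(seq) - 1):
        pair = seq[i:i + 2]
        if pair in counts:
            counts[pair] += 1
            total += 1
    if total == 0:
        # Sequence too short or without any valid pair: return all zeros.
        return [0.0] * len(DIPEPTIDES)
    return [counts[d] / total for d in DIPEPTIDES]


# Toy sequence for illustration only (not taken from the paper's datasets).
features = dipeptide_composition("MKVLAAGIVGLLLAQ")
print(len(features))
```

In a fusion setup like the one the abstract describes, a vector of this kind would be concatenated with the other descriptors (PseAAC, PsePSSM, EBGW) before dimensionality reduction; those steps are not shown here.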