Prediction of druggable proteins using machine learning and functional enrichment analysis: a focus on cancer-related proteins and RNA-binding proteins
Andrés López-Cortés, Alejandro Cabrera-Andrade, Carlos M. Cruz-Segundo, Julian Dorado, Alejandro Pazos, Humberto Gonzáles-Díaz, César Paz-y-Miño, Yunierkis Pérez-Castillo, Eduardo Tejera, Cristian R. Munteanu
DOI: https://doi.org/10.1101/825513
2019-10-31
Abstract:
Background: Druggable proteins are a trending topic in drug design. The druggable proteome can be defined as the percentage of proteins that have the capacity to bind an antibody or small molecule with adequate chemical properties and affinity. Screening and in silico modeling are critical for reducing experimental costs.
Methods: The current work proposes a unique prediction model for druggable proteins using amino acid composition descriptors of protein sequences and 13 linear and non-linear machine learning classifiers. After feature selection, the best classifier was obtained using the support vector machine method and 200 tri-amino acid composition descriptors.
Results: The model achieved an area under the receiver operating characteristic curve (AUROC) of 0.975 ± 0.003 and an accuracy of 0.929 ± 0.006 (3-fold cross-validation). When this model was applied to cancer-associated proteins, the top-ranked predicted druggable proteins in the breast cancer protein set were CDK4, AP1S1, POLE, HMMR, RPL5, PALB2, TIMP1, RPL22, NFKB1 and TOP2A; in the cancer-driving protein set, TLL2, FAM47C, SAGE1, HTR1E, MACC1, ZFR2, VMA21, DUSP9, CTNNA3 and GABRG1; and in the RNA-binding protein set, PLA2G1B, CPEB2, NOL6, LRRC47, CTTN, CORO1A, SCAF11, KCTD12, DDX43 and TMPO.
Conclusions: The model predicts several druggable proteins that merit in-depth study as candidate therapeutic targets, which could in turn improve clinical trials. The scripts are freely available on GitHub: https://github.com/muntisa/machine-learning-for-druggable-proteins
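To make the descriptor type concrete, the following is a minimal sketch of how a tri-amino acid (tripeptide) composition vector could be computed from a protein sequence. It assumes the common definition (the count of each of the 8000 possible tripeptides divided by the number of valid sliding windows); the abstract does not state the exact formula or tool the authors used. The function name `tripeptide_composition` and the toy sequence are hypothetical.

```python
from itertools import product

# The 20 standard amino acids in one-letter code.
AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"
# All 20^3 = 8000 possible tripeptides, in a fixed order.
TRIPEPTIDES = ["".join(p) for p in product(AMINO_ACIDS, repeat=3)]


def tripeptide_composition(sequence):
    """Map each of the 8000 tripeptides to its relative frequency in `sequence`."""
    seq = sequence.upper()
    counts = dict.fromkeys(TRIPEPTIDES, 0)
    windows = 0
    # Slide a window of length 3 over the sequence, one residue at a time.
    for i in range(len(seq) - 2):
        tri = seq[i:i + 3]
        # Assumption: windows containing non-standard residues are skipped.
        if tri in counts:
            counts[tri] += 1
            windows += 1
    if windows == 0:
        return {k: 0.0 for k in TRIPEPTIDES}
    return {k: v / windows for k, v in counts.items()}


# Hypothetical toy sequence, not taken from the paper.
example = tripeptide_composition("MKVLAAGMKV")
```

Vectors of this kind could then be reduced by feature selection and passed to a support vector machine, for instance with scikit-learn's `SelectKBest` and `SVC`; whether the authors used these specific classes is not stated in the abstract.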