BERT-Kcr: prediction of lysine crotonylation sites by a transfer learning method with pre-trained BERT models

Yanhua Qiao,Xiaolei Zhu,Haipeng Gong
DOI: https://doi.org/10.1093/bioinformatics/btab712
IF: 5.8
2021-10-13
Bioinformatics
Abstract:Abstract Motivation As one of the most important post-translational modifications (PTMs), protein lysine crotonylation (Kcr) has attracted wide attention, which involves in important physiological activities, such as cell differentiation and metabolism. However, experimental methods are expensive and time-consuming for Kcr identification. Instead, computational methods can predict Kcr sites in silico with high efficiency and low cost. Results In this study, we proposed a novel predictor, BERT-Kcr, for protein Kcr sites prediction, which was developed by using a transfer learning method with pre-trained bidirectional encoder representations from transformers (BERT) models. These models were originally used for natural language processing (NLP) tasks, such as sentence classification. Here, we transferred each amino acid into a word as the input information to the pre-trained BERT model. The features encoded by BERT were extracted and then fed to a BiLSTM network to build our final model. Compared with the models built by other machine learning and deep learning classifiers, BERT-Kcr achieved the best performance with AUROC of 0.983 for 10-fold cross validation. Further evaluation on the independent test set indicates that BERT-Kcr outperforms the state-of-the-art model Deep-Kcr with an improvement of about 5% for AUROC. The results of our experiment indicate that the direct use of sequence information and advanced pre-trained models of NLP could be an effective way for identifying PTM sites of proteins. Availability and implementation The BERT-Kcr model is publicly available on http://zhulab.org.cn/BERT-Kcr_models/. Supplementary information Supplementary data are available at Bioinformatics online.
biochemical research methods,biotechnology & applied microbiology,mathematical & computational biology
What problem does this paper attempt to address?