BERT-TFBS: a novel BERT-based model for predicting transcription factor binding sites by transfer learning

Kai Wang, Xuan Zeng, Jingwen Zhou, Fei Liu, Xiaoli Luan, Xinglong Wang
DOI: https://doi.org/10.1093/bib/bbae195
IF: 9.5
2024-05-05
Briefings in Bioinformatics
Abstract: Transcription factors (TFs) are proteins essential for regulating genetic transcriptions by binding to transcription factor binding sites (TFBSs) in DNA sequences. Accurate predictions of TFBSs can contribute to the design and construction of metabolic regulatory systems based on TFs. Although various deep-learning algorithms have been developed for predicting TFBSs, the prediction performance needs to be improved. This paper proposes a bidirectional encoder representations from transformers (BERT)-based model, called BERT-TFBS, to predict TFBSs solely based on DNA sequences. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. The BERT-TFBS model utilizes the pre-trained DNABERT-2 module to acquire the complex long-term dependencies in DNA sequences through a transfer learning approach, and applies the CNN module and the CBAM to extract high-order local features. The proposed model is trained and tested based on 165 ENCODE ChIP-seq datasets. We conducted experiments with model variants, cross-cell-line validations and comparisons with other models. The experimental results demonstrate the effectiveness and generalization capability of BERT-TFBS in predicting TFBSs, and they show that the proposed model outperforms other deep-learning models. The source code for BERT-TFBS is available at https://github.com/ZX1998-12/BERT-TFBS.
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper aims to improve the accuracy of transcription factor binding site (TFBS) prediction. Transcription factors (TFs) regulate gene transcription by binding to specific regions of DNA sequences, the TFBSs. Accurate TFBS prediction is important for designing and constructing TF-based metabolic regulatory systems. Although a variety of deep-learning algorithms have been developed for this task, their prediction performance still leaves room for improvement. The paper therefore proposes BERT-TFBS, a new model based on BERT (Bidirectional Encoder Representations from Transformers) that predicts TFBSs from DNA sequences alone. The model consists of a pre-trained BERT module (DNABERT-2), a convolutional neural network (CNN) module, a convolutional block attention module (CBAM) and an output module. Through transfer learning, the pre-trained DNABERT-2 module captures complex long-distance dependencies in DNA sequences, while the CNN module and the CBAM extract high-order local features. The main contributions of the paper are:
1. Proposing a new deep-learning model (BERT-TFBS) that combines a pre-trained BERT model, a CNN module, a CBAM and an output module. The summary describes this as the first study to use a pre-trained model for TFBS prediction.
2. Demonstrating the contributions of the CNN module and the CBAM to BERT-TFBS through comparative experiments with two variant models.
3. Conducting cross-cell-line validation experiments to evaluate the generalization ability and robustness of BERT-TFBS in predicting TFBSs.
4. Showing experimentally that the proposed model outperforms existing models in predicting TFBSs.
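To make the role of the CBAM more concrete, the following is a minimal sketch of the general CBAM idea (channel attention followed by spatial attention) in plain Python. It is not the authors' implementation: it assumes a simplified 1D feature map of shape channels x length, random untrained weights, and hypothetical function and parameter names (`channel_attention`, `spatial_attention`, `cbam`, `w1`, `w2`, `kernel`). The actual BERT-TFBS code in the repository linked above may differ in dimensionality, layer sizes and framework.

```python
import math
import random


def sigmoid(x):
    # Numerically stable logistic function.
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    e = math.exp(x)
    return e / (1.0 + e)


def channel_attention(fmap, w1, w2):
    """Return one weight in (0, 1) per channel.

    fmap: list of C channels, each a list of L floats.
    w1:   hidden x C weights, w2: C x hidden weights (shared MLP, no bias).
    """
    avg = [sum(ch) / len(ch) for ch in fmap]  # average pooling over positions
    mx = [max(ch) for ch in fmap]             # max pooling over positions

    def mlp(v):
        hidden = [max(0.0, sum(w * x for w, x in zip(row, v))) for row in w1]  # ReLU
        return [sum(w * h for w, h in zip(row, hidden)) for row in w2]

    return [sigmoid(a + m) for a, m in zip(mlp(avg), mlp(mx))]


def spatial_attention(fmap, kernel):
    """Return one weight in (0, 1) per position.

    kernel: 2 x K convolution weights (row 0 for the channel-average map,
    row 1 for the channel-max map), K odd, zero padding at the borders.
    """
    n_ch, length = len(fmap), len(fmap[0])
    avg = [sum(fmap[c][i] for c in range(n_ch)) / n_ch for i in range(length)]
    mx = [max(fmap[c][i] for c in range(n_ch)) for i in range(length)]
    k_size = len(kernel[0])
    pad = k_size // 2
    weights = []
    for i in range(length):
        s = 0.0
        for k in range(k_size):
            j = i + k - pad
            if 0 <= j < length:
                s += kernel[0][k] * avg[j] + kernel[1][k] * mx[j]
        weights.append(sigmoid(s))
    return weights


def cbam(fmap, w1, w2, kernel):
    # Step 1: rescale each channel by its channel-attention weight.
    ca = channel_attention(fmap, w1, w2)
    refined = [[ca[c] * x for x in ch] for c, ch in enumerate(fmap)]
    # Step 2: rescale each position by its spatial-attention weight.
    sa = spatial_attention(refined, kernel)
    return [[sa[i] * x for i, x in enumerate(ch)] for ch in refined]


# Toy usage with random, untrained weights (illustrative sizes only).
random.seed(0)
C, L, H, K = 4, 10, 2, 3
fmap = [[random.uniform(-1, 1) for _ in range(L)] for _ in range(C)]
w1 = [[random.uniform(-1, 1) for _ in range(C)] for _ in range(H)]
w2 = [[random.uniform(-1, 1) for _ in range(H)] for _ in range(C)]
kernel = [[random.uniform(-1, 1) for _ in range(K)] for _ in range(2)]
out = cbam(fmap, w1, w2, kernel)
```

Because both attention maps pass through a sigmoid, the block can only scale features down, never amplify or flip them; in a trained model the weights would be learned so that informative channels and positions are suppressed the least.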