Improving language model of human genome for DNA–protein binding prediction based on task-specific pre-training

DOI: https://doi.org/10.1007/s12539-022-00537-9
2022-09-23
Abstract: DNA–protein binding plays a pivotal role in regulating gene expression and evolution, and the computational identification of DNA–protein binding has drawn increasing attention in bioinformatics. Recently, variants of BERT have been used to capture the semantic information of DNA sequences for predicting DNA–protein binding. In this study, we apply a task-specific pre-training strategy to BERT using large-scale multi-source DNA–protein binding data and present TFBert. TFBert treats DNA sequences as natural sentences and k-mer nucleotides as words. It effectively extracts upstream and downstream nucleotide context information by pre-training on 690 unlabeled ChIP-seq datasets. Experiments show that, after simple fine-tuning, the pre-trained model achieves promising performance on each of the 690 ChIP-seq datasets, especially the small ones. The average AUC is 94.7%, outperforming existing popular methods. In conclusion, this study provides a BERT variant based on task-specific pre-training that achieves state-of-the-art results in predicting DNA–protein binding. We believe that TFBert can provide insights into other biological sequence classification problems.
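The idea of treating a DNA sequence as a sentence of k-mer words can be illustrated with a minimal sketch. This is not the authors' code: the function names `kmer_tokenize` and `to_sentence`, the default `k=6`, and the overlapping stride of 1 are all assumptions made for illustration, since the abstract does not state which k or stride TFBert uses.

```python
def kmer_tokenize(sequence, k=6, stride=1):
    """Split a DNA sequence into k-mer "words".

    Hypothetical helper: k and stride are illustrative assumptions,
    not values taken from the paper.
    """
    if k <= 0 or stride <= 0:
        raise ValueError("k and stride must be positive integers")
    seq = sequence.upper()
    # Slide a window of width k over the sequence; sequences shorter
    # than k yield no tokens.
    return [seq[i:i + k] for i in range(0, len(seq) - k + 1, stride)]


def to_sentence(sequence, k=6):
    """Join k-mers with spaces so the result resembles a natural-language
    sentence that a whitespace-based tokenizer could consume."""
    return " ".join(kmer_tokenize(sequence, k=k))


if __name__ == "__main__":
    print(kmer_tokenize("ACGTAC", k=3))
    print(to_sentence("ACGTAC", k=3))
```

With overlapping windows, each nucleotide appears in up to k adjacent words, which is one way a model could see both upstream and downstream context for a given position.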