Pre-training with pseudo-labeling compares favorably with large language models for regulatory sequence prediction

Raphaël Mourad
DOI: https://doi.org/10.1101/2023.12.21.572780
2024-05-05
Abstract: Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding SNPs identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and whose amount is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models, and in most cases it shows strong predictive performance improvements compared to standard supervised learning. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.
Bioinformatics
What problem does this paper attempt to address?
This paper asks how semi-supervised learning (SSL) can improve deep learning models that predict molecular processes from DNA, in particular regulatory sequence prediction. Most current deep learning methods are supervised and need large numbers of DNA sequences with functional annotations, and the supply of such sequences is capped by the size of the human genome. Meanwhile, the amount of mammalian DNA sequence data keeps growing, but most of it lacks functional annotations.
The authors propose a novel cross-species pseudo-labeling method that enlarges the labeled set by mapping regulatory sequences from a labeled species (such as human) to related species. The resulting pseudo-labeled sequences are used for pre-training, and the model is then fine-tuned on the original labeled data. With this approach, predictive performance improved, particularly for specific transcription factors (TFs).
The authors apply the method to several deep learning models, including shallow and deep convolutional neural networks (CNNs) and the Transformer-based large language model DNABERT2, and report performance improvements on different datasets. In some cases, small models trained with SSL perform as well as or better than DNABERT2. They also evaluate how well SSL predicts the functional impact of single nucleotide polymorphisms (SNPs) and find that it significantly improves prediction, especially for TFs with smaller datasets such as CTCF and ANDR. In conclusion, the paper addresses the shortage of labeled data for deep learning in bioinformatics by combining semi-supervised learning with cross-species pseudo-labeling.
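The two-phase workflow summarized above (pre-train on pseudo-labeled sequences from other species, then fine-tune on the original labeled data) can be illustrated with a toy sketch. This is not the authors' implementation: the dinucleotide features and logistic model stand in for the CNNs used in the paper, the `orthologs` dictionary is a hypothetical placeholder for whatever cross-species mapping is actually used, and all function names, sequences, and hyperparameters are invented for illustration.

```python
import math
import random
from itertools import product

# All 16 dinucleotides; a toy stand-in for features a CNN would learn.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]

def featurize(seq):
    """Return the dinucleotide frequency vector of a DNA sequence."""
    counts = dict.fromkeys(KMERS, 0)
    for i in range(len(seq) - 1):
        kmer = seq[i:i + 2]
        if kmer in counts:
            counts[kmer] += 1
    total = max(len(seq) - 1, 1)
    return [counts[k] / total for k in KMERS]

def pseudo_label(labeled, orthologs):
    """Copy each labeled sequence's label onto its mapped sequences
    in other genomes (hypothetical mapping, assumed to be given)."""
    out = []
    for seq_id, (_, label) in labeled.items():
        for other_seq in orthologs.get(seq_id, []):
            out.append((other_seq, label))
    return out

def predict(weights, seq):
    """Logistic model; the last weight is the bias term."""
    x = featurize(seq)
    z = weights[-1] + sum(w * xi for w, xi in zip(weights, x))
    return 1.0 / (1.0 + math.exp(-z))

def train(weights, examples, epochs, lr, seed=0):
    """Plain SGD on the logistic loss; returns an updated copy of weights."""
    rng = random.Random(seed)
    weights = list(weights)
    examples = list(examples)
    for _ in range(epochs):
        rng.shuffle(examples)
        for seq, label in examples:
            err = predict(weights, seq) - label
            for j, xj in enumerate(featurize(seq)):
                weights[j] -= lr * err * xj
            weights[-1] -= lr * err
    return weights

# Toy labeled data for the reference species (1 = bound, 0 = background).
human = {
    "peak1": ("GCGCGCGCGCGC", 1),
    "peak2": ("CGCGGCGCCGCG", 1),
    "bg1": ("ATATATTATAAT", 0),
    "bg2": ("TTATAATATATA", 0),
}
# Hypothetical counterparts of each human sequence in other genomes.
orthologs = {
    "peak1": ["GCGCGCGAGCGC", "GCGCCCGCGCGC"],
    "peak2": ["CGCGGCGCCGTG"],
    "bg1": ["ATATATTTTAAT"],
    "bg2": ["TTATAATAAATA", "TTTTAATATATA"],
}

pseudo = pseudo_label(human, orthologs)

# Phase 1: pre-train on pseudo-labeled sequences from other species.
w0 = [0.0] * (len(KMERS) + 1)
w_pre = train(w0, pseudo, epochs=200, lr=0.5)

# Phase 2: fine-tune on the original labeled data with a smaller step size.
w_final = train(w_pre, list(human.values()), epochs=50, lr=0.1)
```

Starting the fine-tuning phase from `w_pre` rather than from `w0` is the key design choice: the weights learned from the larger pseudo-labeled set serve as the initialization, and the smaller learning rate in the second phase is one common way to avoid overwriting them.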