Pre-training with pseudo-labeling compares favorably with large language models for regulatory sequence prediction

Raphaël Mourad

DOI: https://doi.org/10.1101/2023.12.21.572780

2024-05-05

Abstract: Predicting molecular processes using deep learning is a promising approach to provide biological insights for non-coding SNPs identified in genome-wide association studies. However, most deep learning methods rely on supervised learning, which requires DNA sequences associated with functional data, and the amount of such data is severely limited by the finite size of the human genome. Conversely, the amount of mammalian DNA sequences is growing exponentially due to ongoing large-scale sequencing projects, but in most cases without functional data. To alleviate the limitations of supervised learning, we propose a novel semi-supervised learning (SSL) approach based on pseudo-labeling, which makes it possible to exploit unlabeled DNA sequences from numerous genomes during model pre-training. The approach is very flexible and can be used to train any neural architecture, including state-of-the-art models. In most cases it improves predictive performance over standard supervised learning, and in certain situations the improvement is strong. Moreover, small models trained by SSL showed similar or better performance than the large language model DNABERT2.

Bioinformatics

What problem does this paper attempt to address?

This paper asks how semi-supervised learning (SSL) can improve deep learning models that predict molecular processes, in particular regulatory sequence prediction. Most current deep learning methods rely on supervised learning and therefore need large numbers of DNA sequences with functional annotations, a supply that is limited by the size of the human genome. In contrast, the number of sequenced mammalian DNA sequences keeps growing, but most of them lack functional annotations.

The authors propose a novel cross-species pseudo-labeling method that enlarges the labeled dataset by mapping regulatory sequences from a species with known labels (such as human) to other related species. This makes a large amount of otherwise unlabeled data usable in the pre-training phase, which is followed by fine-tuning on the original labeled data. With this approach, the authors found that predictive performance could be improved, especially for specific transcription factors (TFs).

The authors use various deep learning models, including shallow and deep convolutional neural networks (CNNs) as well as DNABERT2, a Transformer-based large language model, and demonstrate performance improvements on different datasets. In some cases, the small SSL models perform similarly to or better than DNABERT2. Furthermore, they evaluate how well SSL predicts the functional impact of single nucleotide polymorphisms (SNPs) and find that SSL can significantly improve prediction performance, especially for TFs with smaller datasets such as CTCF and ANDR. In conclusion, the paper addresses the problem of limited labeled data in deep learning for bioinformatics and proposes a method that combines semi-supervised learning with cross-species pseudo-labeling to improve model performance.
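The workflow summarized above (transfer labels across species, pre-train on the pseudo-labeled set, then fine-tune on the original labels) can be illustrated with a minimal sketch. It is not the authors' implementation: the `orthologs` table, the toy sequences, the hyperparameters, and the logistic model on dinucleotide frequencies are all assumptions made for illustration, standing in for the paper's homology mapping and CNN architectures.

```python
import math
from itertools import product

# All 16 dinucleotides. Normalized 2-mer counts are a toy stand-in
# for the features a CNN would learn from the raw sequence.
KMERS = ["".join(p) for p in product("ACGT", repeat=2)]
KMER_INDEX = {kmer: i for i, kmer in enumerate(KMERS)}


def featurize(seq):
    """Return normalized 2-mer counts plus a constant 1.0 for the bias term."""
    counts = [0.0] * len(KMERS)
    for i in range(len(seq) - 1):
        j = KMER_INDEX.get(seq[i:i + 2])
        if j is not None:
            counts[j] += 1.0
    total = sum(counts) or 1.0
    return [c / total for c in counts] + [1.0]


def predict(weights, seq):
    """Logistic model: probability that seq is a regulatory (positive) sequence."""
    z = sum(w * x for w, x in zip(weights, featurize(seq)))
    return 1.0 / (1.0 + math.exp(-z))


def train(weights, data, epochs, lr):
    """Plain SGD on the logistic loss; returns a new weight list."""
    w = list(weights)
    for _ in range(epochs):
        for seq, label in data:
            x = featurize(seq)
            error = predict(w, seq) - label
            w = [wi - lr * error * xi for wi, xi in zip(w, x)]
    return w


def pseudo_label(labeled, orthologs):
    """Copy each labeled sequence's label onto its orthologs in other species."""
    pseudo = []
    for seq_id, (_, label) in labeled.items():
        for ortholog_seq in orthologs.get(seq_id, {}).values():
            pseudo.append((ortholog_seq, label))
    return pseudo


# Hypothetical toy data: one bound region and one background region in human.
human_labeled = {
    "peak1": ("GCGCGGCGCC", 1),
    "bg1": ("ATATTATAAT", 0),
}
# Hypothetical orthologous sequences; real ones would come from genome alignments.
orthologs = {
    "peak1": {"mouse": "GCGCGGCGTC", "dog": "GCGGGGCGCC"},
    "bg1": {"mouse": "ATATTTTAAT"},
}

pseudo_labeled = pseudo_label(human_labeled, orthologs)

w = [0.0] * (len(KMERS) + 1)
# Stage 1: pre-train on pseudo-labeled sequences from other species.
w = train(w, pseudo_labeled, epochs=50, lr=0.5)
# Stage 2: fine-tune on the original labeled human sequences, with a smaller step.
w = train(w, list(human_labeled.values()), epochs=20, lr=0.1)
```

The two calls to `train` share the same weights, so the second stage starts from what was learned on the pseudo-labeled data. Any model with a comparable training routine could take the place of the logistic model, which matches the summary's point that the approach is not tied to one architecture.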

Semi-supervised learning with pseudo-labeling compares favorably with large language models for regulatory sequence prediction

Semi-supervised learning improves regulatory sequence prediction with unlabeled sequences

Semi-supervised deep learning with graph neural network for cross-species regulatory sequence prediction

Predicting the sequence specificities of DNA-binding proteins by DNA Fine-tuned Language Model with decaying learning rates

Evaluating the representational power of pre-trained DNA language models for regulatory genomics

DNAHLM -- DNA sequence and Human Language mixed large language Model

Improving the performance of supervised deep learning for regulatory genomics using phylogenetic augmentation

Graphylo: A deep learning approach for predicting regulatory DNA and RNA sites from whole-genome multiple alignments

Sequential Labelling and DNABERT For Splice Site Prediction in Homo Sapiens DNA

A semi-supervised deep learning approach for predicting the functional effects of genomic non-coding variations

Self-Distillation Improves DNA Sequence Inference

DNAGPT: A Generalized Pre-trained Tool for Versatile DNA Sequence Analysis Tasks

DNAGPT: A Generalized Pre-trained Tool for Multiple DNA Sequence Analysis Tasks

Toward Understanding BERT-Like Pre-Training for DNA Foundation Models

Deciphering RNA regulation with a foundation language model

Idna-Abf: Multi-Scale Deep Biological Language Learning Model for the Interpretable Prediction of DNA Methylations

DART-Eval: A Comprehensive DNA Language Model Evaluation Benchmark on Regulatory DNA

Distinguishing word identity and sequence context in DNA language models

Self-supervised learning on millions of primary RNA sequences from 72 vertebrates improves sequence-based RNA splicing prediction

Enhancing recognition and interpretation of functional phenotypic sequences through fine-tuning pre-trained genomic models