Investigating the performance of foundation models on human 3’UTR sequences

Sergey Vilov, Matthias Heinig
DOI: https://doi.org/10.1101/2024.02.09.579631
2024-02-12
Abstract: Foundation models, such as DNABERT and Nucleotide Transformer, have recently shaped a new direction in DNA research. Trained in an unsupervised manner on a vast quantity of genomic data, they can be used for a variety of downstream tasks, such as promoter prediction, DNA methylation prediction, gene network prediction, or functional variant prioritization. However, these models are often trained and evaluated on entire genomes, neglecting genome partitioning into different functional regions. In our study, we investigate the efficacy of various unsupervised approaches, including genome-wide and 3’UTR-specific foundation models, on human 3’UTR regions. Our evaluation includes downstream tasks specific to RNA biology, such as recognition of binding motifs of RNA-binding proteins, detection of functional genetic variants, prediction of expression levels in massively parallel reporter assays, and estimation of mRNA half-life. Remarkably, models specifically trained on 3’UTR sequences demonstrate superior performance when compared to the established genome-wide foundation models in three out of four downstream tasks. Our results underscore the importance of considering genome partitioning into functional regions when training and evaluating foundation models.
Bioinformatics
What problem does this paper attempt to address?
The paper examines how well foundation models (such as DNABERT and Nucleotide Transformer) perform on human 3' untranslated region (3'UTR) sequences. Although these models are trained in an unsupervised manner on large amounts of genomic data and can be applied to a variety of downstream tasks (such as promoter prediction and DNA methylation prediction), they are typically trained and evaluated on entire genomes, without regard to the genome's partitioning into functional regions. The paper therefore compares several unsupervised approaches, including genome-wide foundation models and models trained specifically on 3'UTR sequences, on four downstream tasks related to RNA biology: recognizing binding motifs of RNA-binding proteins, detecting functional genetic variants, predicting expression levels in massively parallel reporter assays, and estimating mRNA half-life. In three of the four tasks, the models trained specifically on 3'UTR sequences outperform the established genome-wide foundation models. This finding underscores the importance of considering the genome's partitioning into functional regions when training and evaluating foundation models.