A long context RNA foundation model for predicting transcriptome architecture

Ali Saberi, Benedict Choi, Sean Wang, Aldo Hernandez-Corchado, Mohsen Naghipourfar, Arsham Namini, Vijay Ramani, Amin Emad, Hamed S Najafabadi, Hani Goodarzi
DOI: https://doi.org/10.1101/2024.08.26.609813
2024-10-21
Abstract: Linking DNA sequence to genomic function remains one of the grand challenges in genetics and genomics. Here, we combine large-scale single-molecule transcriptome sequencing of diverse cancer cell lines with cutting-edge machine learning to build LoRNASH, an RNA foundation model that learns how the nucleotide sequence of unspliced pre-mRNA dictates transcriptome architecture: the relative abundances and molecular structures of mRNA isoforms. Owing to its use of the StripedHyena architecture, LoRNASH handles extremely long sequence inputs (~65 kilobase pairs), allowing for quantitative, zero-shot prediction of all aspects of transcriptome architecture, including isoform abundance, isoform structure, and the impact of DNA sequence variants on transcript structure and abundance. We anticipate that our public data release and proof-of-concept model will accelerate various aspects of RNA biotechnology. More broadly, we envision the use of LoRNASH as a foundation for fine-tuning on any transcriptome-related downstream prediction task, including cell-type-specific gene expression, splicing, and general RNA processing.
Bioinformatics
What problem does this paper attempt to address?