OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models

Heng Yang, Ke Li
2024-09-19
Abstract: The alignment between RNA sequences and structures in foundation models (FMs) has yet to be thoroughly investigated. Existing FMs have struggled to establish sequence-structure alignment, hindering the free flow of genomic information between RNA sequences and structures. In this study, we introduce OmniGenome, an RNA FM trained to align RNA sequences with respect to secondary structures based on structure-contextualised modelling. The alignment enables free and bidirectional mappings between sequences and structures by utilising the flexible RNA modelling paradigm that supports versatile input and output modalities, i.e., sequence and/or structure as input/output. We implement RNA design and zero-shot secondary structure prediction as case studies to evaluate the Seq2Str and Str2Seq mapping capacity of OmniGenome. Results on the EternaV2 benchmark show that OmniGenome solved 74% of puzzles, whereas existing FMs only solved up to 3% of the puzzles due to the oversight of sequence-structure alignment. We leverage four comprehensive in-silico genome modelling benchmarks to evaluate performance across a diverse set of genome downstream tasks, where the results show that OmniGenome achieves state-of-the-art performance on RNA and DNA benchmarks, even without any training on DNA genomes.
Genomics, Computation and Language
What problem does this paper attempt to address?
### Problems Addressed by the Paper

The paper primarily addresses the following two key issues:

1. **Alignment between RNA sequences and secondary structures**:
   - Existing foundation models (FMs) struggle to establish bidirectional information flow between RNA sequences and their secondary structures, which hinders the free flow of genomic information between the two. Because the function and stability of RNA are closely tied to its secondary structure, this gap is particularly significant.
2. **Sequence-structure alignment in RNA design**:
   - Because existing FMs neglect sequence-structure alignment, they perform poorly on RNA design tasks. For example, existing RNA FMs such as RNA-FM and RNA-MSM solve at most 3% of the puzzles in the EternaV2 RNA design benchmark.

To address these issues, the authors propose a new model, OmniGenome, which enables bidirectional mapping between RNA sequences and their secondary structures (i.e., Seq2Str and Str2Seq), allowing genomic information to flow freely between sequences and structures. OmniGenome uses a flexible RNA modeling paradigm that supports multiple input and output modalities (sequence and/or structure as input or output). It solves 74% of the puzzles in the EternaV2 benchmark, far more than existing FMs, and achieves state-of-the-art performance on RNA and DNA benchmarks across a diverse set of genomic downstream tasks, even without any training on DNA genomes.
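To make the Str2Seq (RNA design) setting more concrete, here is a minimal Python sketch of how a design benchmark could be scored. It rests on several assumptions that do not come from the paper: secondary structures are written in dot-bracket notation without pseudoknots, a puzzle counts as solved only when the structure predicted for the designed sequence matches the target exactly, and the helper names `parse_pairs`, `is_compatible`, and `solved_rate` are hypothetical. It is not the EternaV2 evaluation protocol or OmniGenome's implementation.

```python
# Illustrative sketch only: dot-bracket parsing and a simple "solved rate".
# The scoring rule (exact structure match) is an assumption, not the
# benchmark's confirmed protocol.

# Watson-Crick pairs plus the G-U wobble pair.
CANONICAL_PAIRS = {
    ("A", "U"), ("U", "A"),
    ("G", "C"), ("C", "G"),
    ("G", "U"), ("U", "G"),
}


def parse_pairs(structure: str) -> set[tuple[int, int]]:
    """Return the set of (i, j) base-pair indices in a dot-bracket string."""
    stack: list[int] = []
    pairs: set[tuple[int, int]] = set()
    for i, ch in enumerate(structure):
        if ch == "(":
            stack.append(i)
        elif ch == ")":
            if not stack:
                raise ValueError(f"unmatched ')' at position {i}")
            pairs.add((stack.pop(), i))
        elif ch != ".":
            raise ValueError(f"unexpected character {ch!r} at position {i}")
    if stack:
        raise ValueError(f"unmatched '(' at position {stack[-1]}")
    return pairs


def is_compatible(sequence: str, structure: str) -> bool:
    """Check that every paired position holds a canonical or wobble pair."""
    if len(sequence) != len(structure):
        return False
    return all(
        (sequence[i], sequence[j]) in CANONICAL_PAIRS
        for i, j in parse_pairs(structure)
    )


def solved_rate(targets: list[str], predictions: list[str]) -> float:
    """Fraction of puzzles whose predicted structure equals the target."""
    if len(targets) != len(predictions):
        raise ValueError("targets and predictions must have equal length")
    if not targets:
        raise ValueError("at least one puzzle is required")
    solved = sum(
        parse_pairs(t) == parse_pairs(p) for t, p in zip(targets, predictions)
    )
    return solved / len(targets)
```

In this sketch, `is_compatible` is only a necessary condition for a design to fold into its target. Deciding whether a puzzle is solved still requires a structure predictor (the Seq2Str direction) to produce each entry of `predictions`, which is why the two mappings are evaluated together.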