DGRNA: a long-context RNA foundation model with bidirectional attention Mamba2

Ye Yuan Sr., Quhshuo Chen Sr., Xiaoyong Pan Sr.
DOI: https://doi.org/10.1101/2024.10.31.621427
2024-11-03
Abstract: Ribonucleic acid (RNA) is an important biomolecule with diverse functions, e.g. genetic information transfer, regulation of gene expression, and cellular functions. In recent years, the rapid development of sequencing technology has significantly enhanced our understanding of RNA biology and advanced RNA-based therapies, resulting in a huge volume of RNA data. Data-driven methods, particularly unsupervised large language models, have been used to automatically extract hidden semantic information from these RNA data. Current RNA large language models are primarily based on the Transformer architecture, which cannot efficiently process long RNA sequences, while the Mamba architecture can effectively alleviate the quadratic complexity associated with Transformers. In this study, we propose a large foundation model, DGRNA, based on bidirectional Mamba and trained on 100 million RNA sequences, which demonstrates exceptional performance across six RNA downstream tasks compared to existing RNA language models.
Bioinformatics
What problem does this paper attempt to address?
The main problem this paper addresses is the efficiency and performance of existing RNA language models on long RNA sequences. Specifically:

1. **Long RNA Sequence Processing**: Current RNA language models built on the Transformer architecture incur a computational cost that grows quadratically with sequence length, so long RNA sequences consume large resources and are difficult to process efficiently. To address this, the paper proposes a new RNA language model, DGRNA, based on the Mamba2 architecture.
2. **Improving Model Performance**: By using the bidirectional mechanism of its Mamba2 architecture, DGRNA can capture important information in long RNA sequences more effectively while avoiding the quadratic cost of traditional Transformer models. This allows DGRNA to perform well across multiple RNA downstream tasks and to improve on existing RNA language models.

### Specific Problems and Solutions

- **Problem**: When processing long RNA sequences, existing RNA language models have high computational complexity due to the limitations of the Transformer architecture, resulting in low processing efficiency.
- **Solution**: Introduce the Mamba2 architecture. It captures important information in long sequences through global receptive fields and dynamic weighting strategies while avoiding quadratic computational complexity. Mamba2 also builds on the State Space Duality (SSD) framework, which further improves training efficiency and model performance.

### Performance in Downstream Tasks

In the paper, DGRNA was evaluated on the following six RNA downstream tasks, where it matched or outperformed existing models on most benchmarks:

1. **Non-coding RNA Classification**: DGRNA reaches an F1 score of 0.98 on the non-coding_s1 dataset and the best performance among the compared models on the nRC dataset.
2. **5' UTR Regression Task**: On the Random7600 and Human7600 datasets, DGRNA obtains an R² score of 0.93 on both, which is comparable to RiNALMo but with fewer parameters.
3. **RNA-RNA Interaction Prediction**: On the DeepMirTar dataset, the F1 score of DGRNA is 1% higher than that of existing methods, and it also performs well in terms of accuracy, precision, and AUC.
4. **RNA-Protein Binding Site Identification**: Across 17 RBP datasets, DGRNA performs best on 14, with an average AUPRC of 0.889.
5. **Translation Efficiency Prediction**: In 10-fold cross-validation, DGRNA achieves a Spearman correlation coefficient of 0.78, which is reported as significantly better than other models.
6. **Splice Site Prediction**: On four independent datasets, DGRNA performs best on three and is slightly below SpliceBERT on the Arabidopsis Acceptor dataset.

### Conclusion

By adopting the Mamba2 architecture, DGRNA addresses the efficiency problem that existing RNA language models face on long RNA sequences, and it shows strong performance across multiple downstream tasks. In the future, with more RNA sequences for pre-training and a larger parameter count, DGRNA is expected to improve further and become a powerful tool for RNA research.
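
The complexity argument made above can be illustrated with a toy example. The sketch below is not the DGRNA or Mamba2 implementation: it assumes a fixed scalar `decay` instead of Mamba2's input-dependent parameters, and it merges the two directions by summation, which is an assumption because the summary does not say how DGRNA combines them. All function names are hypothetical. It only shows that a forward plus backward recurrence touches each position a constant number of times, while full self-attention scores every pair of positions.

```python
from typing import List


def linear_scan(xs: List[float], decay: float) -> List[float]:
    """Run the recurrence h[t] = decay * h[t-1] + x[t] in a single pass."""
    h = 0.0
    out = []
    for x in xs:
        h = decay * h + x  # one state update per position
        out.append(h)
    return out


def bidirectional_scan(xs: List[float], decay: float = 0.5) -> List[float]:
    """Sum a left-to-right scan and a right-to-left scan (assumed merge rule)."""
    fwd = linear_scan(xs, decay)
    bwd = linear_scan(xs[::-1], decay)[::-1]  # scan the reversed input, then restore order
    return [f + b for f, b in zip(fwd, bwd)]


def pairwise_interactions(length: int) -> int:
    """Number of position pairs that full self-attention scores."""
    return length * length


def scan_steps(length: int) -> int:
    """Number of state updates for one forward and one backward scan."""
    return 2 * length


if __name__ == "__main__":
    print(bidirectional_scan([1.0, 0.0, 0.0], decay=0.5))
    for n in (512, 4096, 32768):
        print(n, pairwise_interactions(n), scan_steps(n))
```

For a sequence of 4096 positions, the attention count is 4096 × 4096 pairs while the two scans need 2 × 4096 updates, and the gap widens as the sequence grows. In this toy merge rule each position's own input is counted once per direction; a real model would handle that with learned projections.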