Character-level Tokenizations as Powerful Inductive Biases for RNA Foundational Models

Adrián Morales-Pastor, Raquel Vázquez-Reza, Miłosz Wieczór, Clàudia Valverde, Manel Gil-Sorribes, Bertran Miquel-Oliver, Álvaro Ciudad, Alexis Molina
2024-11-06
Abstract: RNA is a vital biomolecule with numerous roles and functions within cells, and interest in targeting it for therapeutic purposes has grown significantly in recent years. However, fully understanding and predicting RNA behavior, particularly for applications in drug discovery, remains a challenge due to the complexity of RNA structures and interactions. While foundational models in biology have demonstrated success in modeling several biomolecules, especially proteins, achieving similar breakthroughs for RNA has proven more difficult. Current RNA models have yet to match the performance observed in the protein domain, leaving an important gap in computational biology. In this work, we present ChaRNABERT, a suite of sample- and parameter-efficient RNA foundational models that, through a learnable tokenization process, reach state-of-the-art performance on several tasks in established benchmarks. We extend its testing to relevant downstream tasks such as RNA-protein and aptamer-protein interaction prediction. Weights and inference code for ChaRNABERT-8M will be provided for academic research use. The other models will be available upon request.
Quantitative Methods, Artificial Intelligence, Machine Learning, Biomolecules
What problem does this paper attempt to address?
The main problem this paper addresses is the gap between RNA models and protein models in performance and generality. Protein language models such as ESM have achieved remarkable success in drug discovery and protein engineering, but RNA modeling still struggles to predict the behavior, structure, and function of RNA. These difficulties stem mainly from the complexity and diversity of RNA and from the limitations of existing RNA models when handling different tasks.

### Paper Goals

1. **Establish a powerful RNA foundational model**: The authors propose ChaRNABERT, an RNA foundational model capable of performing multiple downstream tasks.
2. **Utilize a learnable tokenization strategy**: A character-level tokenization method avoids the bias introduced by manually selected motifs, enabling the model to better capture the details of RNA sequences.
3. **Improve the generalization ability of the model**: Training on multiple types of RNA is intended to help the model perform well across different tasks.

### Main Contributions

- **Innovative tokenization method**: Gradient-Based Subsequence Tokenization (GBST), applied on top of character-level tokens, dynamically selects the most suitable subsequence blocks, enhancing the model's ability to represent RNA sequences.
- **Bidirectional BERT encoder**: An improved BERT architecture that includes the SwiGLU non-linear activation function, Rotary Position Embedding (RoPE), Query-Key Normalization (QKNorm), and FlashAttention-2 to improve performance and training stability.
- **Multi-task pre-training**: The UL2 (Unifying Language Learning Paradigms) framework is adopted with multiple masking strategies (short-span masking, long-span masking, and retrieval-enhanced masking) to improve the model's context understanding and generalization ability.
- **Large-scale data set**: Training uses a large number of non-coding and coding RNA sequences from the RNAcentral and RefSeq databases.

### Experimental Design

- **Model size**: Models with different parameter counts (8M, 33M, 50M, 100M, 150M, 650M) are trained to evaluate the impact of model size on performance.
- **Data set**: Two data sets are used for training. One contains 31 million non-coding RNA sequences; the other combines those 31 million non-coding RNA sequences with 31 million coding RNA sequences.
- **Evaluation metrics**: Models are evaluated by MLM/UL2 loss, downstream task performance, and generalization ability.

### Conclusion

With these methods, ChaRNABERT achieves state-of-the-art performance on multiple benchmarks and performs strongly on key downstream tasks such as RNA-protein interaction prediction. This indicates that learnable tokenization combined with an improved model architecture can significantly improve the performance and generalization ability of RNA models, supporting RNA research and drug development.
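
The learnable tokenization listed under Main Contributions can be illustrated with a small sketch. The code below is not the authors' implementation. It is a simplified, pure-Python illustration loosely following the general GBST recipe: embed each nucleotide, mean-pool candidate blocks of several sizes, score each candidate, and blend them with a softmax. All names (`char_tokenize`, `soft_block_mix`, `downsample`), the block sizes, the embedding dimension, and the downsampling step are assumptions made for illustration, and the scoring weights are fixed random values here, whereas in a real model they would be learned by gradient descent.

```python
import math
import random

# Hypothetical 4-letter vocabulary; a real model would add special tokens.
VOCAB = {"A": 0, "C": 1, "G": 2, "U": 3}


def char_tokenize(seq):
    """Character-level tokenization: one integer id per nucleotide."""
    return [VOCAB[ch] for ch in seq.upper().replace("T", "U")]


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def mean_vec(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    n = len(vectors)
    return [sum(col) / n for col in zip(*vectors)]


def soft_block_mix(embs, score_w, max_block=3):
    """For each position, blend mean-pooled blocks of size 1..max_block.

    Each position belongs to one non-overlapping block per block size.
    A linear scorer (score_w) rates every candidate block, and a softmax
    over block sizes gives the mixing weights for that position.
    """
    mixed, weights = [], []
    for i in range(len(embs)):
        candidates = []
        for b in range(1, max_block + 1):
            start = (i // b) * b  # start of the size-b block containing i
            candidates.append(mean_vec(embs[start:start + b]))
        scores = [sum(w * x for w, x in zip(score_w, c)) for c in candidates]
        probs = softmax(scores)
        mixed.append([
            sum(p * c[d] for p, c in zip(probs, candidates))
            for d in range(len(score_w))
        ])
        weights.append(probs)
    return mixed, weights


def downsample(vectors, rate=2):
    """Mean-pool consecutive positions to shorten the sequence."""
    return [mean_vec(vectors[i:i + rate]) for i in range(0, len(vectors), rate)]


# Fixed random parameters stand in for learned ones (assumption).
rng = random.Random(0)
dim = 4
emb_table = [[rng.uniform(-1, 1) for _ in range(dim)] for _ in VOCAB]
score_w = [rng.uniform(-1, 1) for _ in range(dim)]

ids = char_tokenize("GGCAUCGA")
embs = [emb_table[i] for i in ids]
mixed, weights = soft_block_mix(embs, score_w)
latent = downsample(mixed, rate=2)

print("token ids:", ids)
print("block-size weights at position 0:", [round(w, 3) for w in weights[0]])
print("latent length:", len(latent))
```

Because the mixing weights come from a softmax rather than a hard choice, the block selection stays differentiable, which is what allows a tokenizer of this kind to be trained jointly with the encoder instead of being fixed in advance.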