A Comparative Review of Deep Learning Methods for RNA Tertiary Structure Prediction

Ivona Martinovic, Tin Vlasic, Yang Li, Bryan Hooi, Yang Zhang, Mile Sikic
DOI: https://doi.org/10.1101/2024.11.27.625779
2024-12-03
Abstract: Several deep learning-based tools for RNA 3D structure prediction have recently emerged, including DRfold, DeepFoldRNA, RhoFold, RoseTTAFoldNA, trRosettaRNA, and AlphaFold3. In this study, we systematically evaluate these six models on three datasets: RNA Puzzles, CASP15 RNA targets, and a newly generated large dataset of sequentially distinct RNAs, which serves as a benchmark for generalization capabilities. To ensure a robust evaluation, we also introduce a fourth, more stringent dataset that contains both sequentially and structurally distinct RNAs. We observed that each model predicts the best structure for certain RNAs, and evaluated whether commonly used scoring functions, Rosetta score and ARES, can reliably identify the most accurate structure from the predictions. Finally, since many RNA chains in the Protein Data Bank are part of complexes, we compare the performance of RoseTTAFoldNA and AlphaFold3 in predicting RNA structures within complexes versus isolated RNA chains extracted from these complexes. This comprehensive evaluation highlights the strengths and limitations of current deep learning-based tools and provides valuable insights for advancing RNA 3D structure prediction.
Bioinformatics
What problem does this paper attempt to address?
The paper evaluates how well current deep learning-based RNA tertiary structure prediction tools perform and how well they generalize across datasets. It systematically compares six deep learning models (DRfold, DeepFoldRNA, RhoFold, RoseTTAFoldNA, trRosettaRNA, and AlphaFold3) on three main datasets:

1. **RNA Puzzles**: A widely used benchmark dataset that contains 37 RNA puzzles. Puzzles 35 and 36, which are also CASP15 targets, are excluded to prevent overlap with the second dataset.
2. **CASP15 RNA targets**: CASP15 introduced RNA targets for the first time, 12 in total, including natural and synthetic RNAs.
3. **Newly generated large-scale dataset**: 190 sequences selected from RNA structures published in the PDB after April 2022, so that these RNAs do not appear in the training datasets and can be used to evaluate how well the models generalize.

To test generalization more strictly, the paper also creates a fourth dataset whose RNAs differ from the training data not only in sequence but also in structure. This dataset is filtered through the RNA3DB pipeline and contains 140 RNA sequences.

### Main research questions:

1. **Performance evaluation**: Compare the prediction performance of the six deep learning models on the different datasets, using multiple evaluation metrics (RMSD, TM-score, INF, clash score, and lDDT) to assess accuracy.
2. **Generalization ability**: Evaluate how these models perform on unseen and structurally novel RNA sequences, especially on the fourth dataset.
3. **Effectiveness of scoring functions**: Evaluate whether commonly used scoring functions (ARES and Rosetta score) can reliably identify the most accurate predicted structure.
4. **Complex structure prediction**: For RNA chains that are part of complexes, compare the performance of RoseTTAFoldNA and AlphaFold3 when predicting the entire complex with their performance when predicting only the isolated RNA chain.

### Research contributions:

- **Comprehensive comparison**: A systematic comparison of six models across multiple datasets.
- **Generalization ability evaluation**: The fourth dataset provides a strict test of how well the models generalize.
- **Scoring function evaluation**: The paper evaluates the effectiveness of commonly used scoring functions and explores how to combine them to improve prediction performance.
- **Context-dependent performance comparison**: For RNA chains from complexes, the paper compares predicting the entire complex with predicting the isolated RNA chain.

Through these studies, the paper aims to provide an in-depth evaluation of current deep learning-based RNA tertiary structure prediction tools and a useful reference for future research and development.
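
The scoring-function question above can be made concrete with a small sketch. It is not the paper's evaluation code. It only illustrates one plausible way to check whether a scoring function picks the most accurate prediction for a target. The `Prediction` class, the function names `selection_gap` and `top1_hit_rate`, and the convention that a lower score is better are all assumptions made for this example.

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    model: str    # name of the predictor that produced the structure
    score: float  # scoring-function value; ASSUMED here that lower is better
    rmsd: float   # RMSD to the experimental structure in angstroms; lower is better


def selection_gap(predictions):
    """Return (picked, best, gap) for one target.

    picked: the prediction the scoring function would choose (lowest score)
    best:   the prediction actually closest to the reference (lowest RMSD)
    gap:    picked.rmsd - best.rmsd; zero when the scoring function picks the best
    """
    if not predictions:
        raise ValueError("need at least one prediction")
    picked = min(predictions, key=lambda p: p.score)
    best = min(predictions, key=lambda p: p.rmsd)
    return picked, best, picked.rmsd - best.rmsd


def top1_hit_rate(targets):
    """Fraction of targets where the score-selected prediction is also the most accurate."""
    if not targets:
        raise ValueError("need at least one target")
    hits = sum(1 for preds in targets.values() if selection_gap(preds)[2] == 0.0)
    return hits / len(targets)


if __name__ == "__main__":
    # Made-up numbers for illustration only; they are NOT results from the paper.
    targets = {
        "target_A": [Prediction("model_1", -120.0, 4.0), Prediction("model_2", -95.0, 7.5)],
        "target_B": [Prediction("model_1", -80.0, 12.0), Prediction("model_2", -60.0, 6.0)],
    }
    for name, preds in targets.items():
        picked, best, gap = selection_gap(preds)
        print(f"{name}: picked={picked.model} best={best.model} gap={gap:.1f} A")
    print(f"top-1 hit rate: {top1_hit_rate(targets):.2f}")
```

Reporting the RMSD gap alongside the hit rate is a deliberate choice in this sketch: a scoring function that misses the best structure by a fraction of an angstrom is far more useful than one that misses by several, and a plain hit rate would hide that difference.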