Knowledge from Large-Scale Protein Contact Prediction Models Can Be Transferred to the Data-Scarce RNA Contact Prediction Task

Yiren Jian, Chongyang Gao, Chen Zeng, Yunjie Zhao, Soroush Vosoughi
2024-01-19
Abstract: RNA, whose functionality is largely determined by its structure, plays an important role in many biological activities. The prediction of pairwise structural proximity between each nucleotide of an RNA sequence can characterize the structural information of the RNA. Historically, this problem has been tackled by machine learning models using expert-engineered features and trained on scarce labeled datasets. Here, we find that the knowledge learned by a protein-coevolution Transformer-based deep neural network can be transferred to the RNA contact prediction task. As protein datasets are orders of magnitude larger than those for RNA contact prediction, our findings and the subsequent framework greatly reduce the data scarcity bottleneck. Experiments confirm that RNA contact prediction through transfer learning using a publicly available protein model is greatly improved. Our findings indicate that the learned structural patterns of proteins can be transferred to RNAs, opening up potential new avenues for research.
Quantitative Methods, Machine Learning
What problem does this paper attempt to address?
### Problems Addressed by the Paper

This paper aims to address the issue of data scarcity in RNA contact prediction. Specifically:

1. **Background and Challenges**:
   - The function of RNA largely depends on its structure, so predicting the pairwise structural proximity between nucleotides in an RNA sequence (i.e., the contact map) is crucial for understanding RNA structure.
   - Existing RNA contact prediction methods typically rely on machine learning models that are based on expert-designed features and trained on small labeled datasets.
   - In contrast, protein datasets are much larger than those available for RNA contact prediction. This has enabled significant progress in protein contact prediction, while progress in RNA contact prediction has been slow due to insufficient data.

2. **Research Objectives**:
   - To improve RNA contact prediction by leveraging the knowledge in pre-trained protein co-evolution Transformer models (such as CoT).
   - To apply the knowledge learned by protein contact prediction models to RNA contact prediction through transfer learning, thereby alleviating the bottleneck caused by RNA data scarcity.
   - To demonstrate experimentally that transfer learning from a publicly available protein language model can significantly enhance the performance of RNA contact prediction.

3. **Method Overview**:
   - Use the pre-trained protein language model CoT as the base model and adapt it to the RNA dataset.
   - Extract features through a multi-layer attention mechanism and use convolutional networks for classification.
   - Compare the effects of different transfer learning strategies and ultimately propose a model design that combines multi-layer attention features.

4. **Experimental Results**:
   - The proposed transfer learning model significantly outperforms existing methods in RNA contact prediction tasks, especially in terms of the Top-L, Top-0.5L, and Top-0.3L accuracy metrics.
   - Experiments show that by integrating multi-layer attention features and making appropriate parameter adjustments, good generalization can be achieved on small-scale training datasets.
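
The Top-L, Top-0.5L, and Top-0.3L metrics mentioned above are commonly computed as the fraction of true contacts among the highest-scoring predicted pairs, where L is the sequence length. The sketch below illustrates that general idea in plain Python. The function name `top_k_precision`, the `min_separation` parameter, and the rounding of `fraction * L` are assumptions made for illustration; the paper's exact evaluation protocol may differ.

```python
from typing import Sequence


def top_k_precision(
    scores: Sequence[Sequence[float]],
    contacts: Sequence[Sequence[int]],
    fraction: float = 1.0,
    min_separation: int = 1,
) -> float:
    """Fraction of true contacts among the top (fraction * L) scored pairs.

    `scores` is an L x L matrix of predicted contact scores and `contacts`
    is the L x L binary ground-truth contact map. `min_separation` is an
    assumed filter on sequence distance |i - j|, not taken from the paper.
    """
    length = len(scores)
    # Consider each unordered pair (i, j) once, with j - i >= min_separation.
    pairs = [
        (scores[i][j], i, j)
        for i in range(length)
        for j in range(i + min_separation, length)
    ]
    # Number of pairs to evaluate: at least one, at most all candidate pairs.
    k = min(max(1, int(fraction * length)), len(pairs))
    if k == 0:
        return 0.0
    # Rank pairs by predicted score, highest first, and keep the top k.
    top = sorted(pairs, key=lambda p: p[0], reverse=True)[:k]
    hits = sum(1 for _, i, j in top if contacts[i][j])
    return hits / k


# Toy example with L = 4 (values are made up for illustration).
scores = [
    [0.0, 0.1, 0.9, 0.6],
    [0.1, 0.0, 0.2, 0.7],
    [0.9, 0.2, 0.0, 0.3],
    [0.6, 0.7, 0.3, 0.0],
]
contacts = [
    [0, 0, 1, 0],
    [0, 0, 0, 1],
    [1, 0, 0, 0],
    [0, 1, 0, 0],
]

top_l = top_k_precision(scores, contacts, fraction=1.0)
top_half_l = top_k_precision(scores, contacts, fraction=0.5)
top_third_l = top_k_precision(scores, contacts, fraction=0.3)
```

In this toy case the two highest-scoring pairs are both true contacts, so the stricter cutoffs (0.5L and 0.3L) score higher than Top-L, which also counts lower-ranked pairs. Published benchmarks often exclude pairs that are close in sequence; raising `min_separation` would mimic that.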