Abstract:

Motivation: Recent advances in high-throughput sequencing technology have significantly increased the focus on non-coding RNA (ncRNA) research within the life sciences. Despite this, the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, which makes family prediction important for understanding their roles. There are two primary approaches to predicting ncRNA families: biological and computational. Traditional biological methods are not suitable for large-scale prediction because of their significant human and resource requirements. Meanwhile, most existing computational methods rely either solely on ncRNA sequence data or exclusively on the secondary structure of ncRNA molecules. These methods fail to fully utilize the rich multimodal information available from ncRNAs, which prevents them from learning more comprehensive and in-depth feature representations.

Results: To tackle these problems, we propose MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first use a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. We then adopt a contrastive learning framework with an attention mechanism to fuse in the secondary structure information obtained by graph neural networks, so that MM-ncRNAFP effectively combines the two modalities. Experimental comparisons with several competitive baselines demonstrate that MM-ncRNAFP achieves more comprehensive representations of ncRNA features by integrating both sequence and structural information, and that this integration significantly enhances the performance of ncRNA family prediction. Ablation experiments and qualitative analyses were performed to verify the effectiveness of each component of the model. Moreover, since the model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks.

Availability and implementation: MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.
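
As a rough illustration of the contrastive idea described in the abstract, the sketch below pulls together the sequence and structure embeddings of the same ncRNA and pushes apart embeddings of different ncRNAs. It is not taken from the MM-ncRNAFP code: the abstract does not state which loss is used, so a symmetric InfoNCE objective is assumed here, and the names `info_nce`, `l2_normalize`, `cross_entropy_diag`, and the `temperature` value are hypothetical. The embeddings are plain lists standing in for the outputs of a language model and a graph neural network.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (left unchanged if it is all zeros)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_sum = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_sum - row[i]
    return total / len(logits)


def info_nce(seq_emb, struct_emb, temperature=0.1):
    """Symmetric InfoNCE loss between two aligned batches of embeddings.

    seq_emb[i] and struct_emb[i] are assumed to describe the same ncRNA
    (a positive pair); all other pairings in the batch act as negatives.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    # Cosine similarity of every sequence embedding with every structure embedding.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        for s in seq
    ]
    logits_t = [list(col) for col in zip(*logits)]
    # Average the sequence-to-structure and structure-to-sequence directions.
    return 0.5 * (cross_entropy_diag(logits) + cross_entropy_diag(logits_t))


if __name__ == "__main__":
    seq_batch = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[1.0, 0.0], [0.0, 1.0]]     # structure matches sequence
    mismatched = [[0.0, 1.0], [1.0, 0.0]]  # structure paired with the wrong sequence
    print(info_nce(seq_batch, aligned) < info_nce(seq_batch, mismatched))
```

In this toy setup the loss is small when each sequence embedding is most similar to its own structure embedding and large when the pairs are mismatched, which is the signal a contrastive framework would use to align the two modalities before any attention-based fusion.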

Multiple sequence alignment-based RNA language model and its application to structural inference

Accurate RNA 3D Structure Prediction Using a Language Model-Based Deep Learning Approach

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

ERNIE-RNA: An RNA Language Model with Structure-enhanced Representations

Language models enable zero-shot prediction of RNA secondary structure including pseudoknots

Deciphering RNA regulation with a foundation language model

RiNALMo: General-Purpose RNA Language Models Can Generalize Well on Structure Prediction Tasks

Comprehensive benchmarking of large language models for RNA secondary structure prediction

Prediction of the RNA Tertiary Structure Based on a Random Sampling Strategy and Parallel Mechanism

RNA-TorsionBERT: leveraging language models for RNA 3D torsion angles prediction

An algorithm for rapid noncoding RNA sequence-structure alignment

Rm-LR: A long-range-based deep learning model for predicting multiple types of RNA modifications

Predicting RNA sequence-structure likelihood via structure-aware deep learning

Predicting Distance matrix with large language models

Interpretable Multi-Scale Deep Learning for RNA Methylation Analysis across Multiple Species

Diverse Database and Machine Learning Model to Narrow the Generalization Gap in RNA Structure Prediction

Machine learning in RNA structure prediction: Advances and challenges

ProtRNA: A Protein-derived RNA Language Model by Cross-Modality Transfer Learning

Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure

RNAformer: A Simple yet Effective Model for Homology-Aware RNA Secondary Structure Prediction