Abstract:

Motivation: Advances in high-throughput sequencing have sharply increased interest in non-coding RNA (ncRNA) research within the life sciences, yet the functions of many ncRNAs remain poorly understood. Research suggests that ncRNAs within the same family typically share similar functions, which makes family prediction a practical route to understanding their roles. There are two primary approaches to predicting ncRNA families: biological and computational. Traditional biological methods are unsuitable for large-scale prediction because of their significant human and resource requirements. Most existing computational methods, in turn, rely either solely on ncRNA sequence data or exclusively on the secondary structure of ncRNA molecules. As a result, they do not fully exploit the multi-modal information available for ncRNAs and cannot learn more comprehensive and in-depth feature representations.

Results: To address these problems, we propose MM-ncRNAFP, a multi-modal contrastive learning framework for ncRNA family prediction. We first use a pre-trained language model to encode the primary sequences of a large mammalian ncRNA dataset. We then adopt a contrastive learning framework with an attention mechanism to fuse this sequence representation with secondary structure information obtained by graph neural networks. Experimental comparisons with several competitive baselines demonstrate that, by integrating both sequence and structural information, MM-ncRNAFP learns more comprehensive representations of ncRNA features and significantly improves the performance of ncRNA family prediction. Ablation experiments and qualitative analyses verify the effectiveness of each component of the model. Moreover, because the model is pre-trained on a large amount of ncRNA data, it has the potential to bring significant improvements to other ncRNA-related tasks.

Availability and implementation: MM-ncRNAFP and the datasets are available at https://github.com/xuruiting2/MM-ncRNAFP.
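
As an illustration of the kind of objective a sequence-structure contrastive framework can use, the sketch below computes a symmetric InfoNCE-style loss over paired embeddings. It is a minimal sketch under stated assumptions, not the MM-ncRNAFP implementation: the abstract does not specify the loss, the temperature, or the embedding sizes, so the InfoNCE form, the value `temperature=0.1`, and the function names are all hypothetical. The input vectors stand in for the outputs of the pre-trained language model and the graph neural network, and the attention-based fusion is omitted.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length (a zero vector is returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]


def cross_entropy_diagonal(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_sum_exp = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_sum_exp - row[i]
    return total / len(logits)


def symmetric_info_nce(seq_emb, struct_emb, temperature=0.1):
    """Assumed objective: symmetric InfoNCE over paired embeddings.

    seq_emb[i] and struct_emb[i] are hypothetical sequence and structure
    embeddings of the same ncRNA; every other pairing in the batch is
    treated as a negative.
    """
    seq = [l2_normalize(v) for v in seq_emb]
    struct = [l2_normalize(v) for v in struct_emb]
    # Cosine similarity between every sequence and every structure embedding.
    logits = [
        [sum(a * b for a, b in zip(s, t)) / temperature for t in struct]
        for s in seq
    ]
    logits_t = [list(col) for col in zip(*logits)]  # structure-to-sequence view
    return 0.5 * (cross_entropy_diagonal(logits) + cross_entropy_diagonal(logits_t))


if __name__ == "__main__":
    seq_emb = [[1.0, 0.0], [0.0, 1.0]]
    aligned = [[2.0, 0.0], [0.0, 3.0]]     # same directions as seq_emb
    mismatched = [[0.0, 3.0], [2.0, 0.0]]  # pairs swapped
    print(symmetric_info_nce(seq_emb, aligned))
    print(symmetric_info_nce(seq_emb, mismatched))
```

Minimizing a loss of this form pulls the two views of the same ncRNA together and pushes apart views of different ncRNAs, which is why the aligned batch in the usage example scores lower than the mismatched one.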

OmniNA: A foundation model for nucleotide sequences

OmniGenome: Aligning RNA Sequences with Secondary Structures in Genomic Foundation Models

SegmentNT: annotating the genome at single-nucleotide resolution with DNA foundation models

LucaOne: Generalized Biological Foundation Model with Unified Nucleic Acid and Protein Language

Benchmarking DNA Foundation Models for Genomic Sequence Classification

Uni-RNA: Universal Pre-Trained Models Revolutionize RNA Research

The Nucleotide Transformer: Building and Evaluating Robust Foundation Models for Human Genomics

DNABERT-2: Efficient Foundation Model and Benchmark For Multi-Species Genome

GENA-LM: A Family of Open-Source Foundational DNA Language Models for Long Sequences

Interpretable RNA Foundation Model from Unannotated Data for Highly Accurate RNA Structure and Function Predictions

A long context RNA foundation model for predicting transcriptome architecture

Orthrus: Towards Evolutionary and Functional RNA Foundation Models

HyenaDNA: Long-Range Genomic Sequence Modeling at Single Nucleotide Resolution

DeepGene: An Efficient Foundation Model for Genomics based on Pan-genome Graph Transformer

Direct high-throughput deconvolution of unnatural bases via nanopore sequencing and bootstrapped learning

Accurate and General DNA Representations Emerge from Genome Foundation Models at Scale

dnaGrinder: a lightweight and high-capacity genomic foundation model

OMAnnotator: a novel approach to building an annotated consensus genome sequence

Large-Scale Multi-omic Biosequence Transformers for Modeling Peptide-Nucleotide Interactions

Improving ncRNA family prediction using multi-modal contrastive learning of sequence and structure