Abstract: Protein function prediction is crucial for understanding species evolution, including viral mutations. Gene Ontology (GO) is a standardized framework for describing protein functions with annotated terms. Each ontology is a specific functional category containing multiple child ontologies, and the parent-child relationships form a directed acyclic graph. GO divides protein functions into three main groups: cellular component ontology, molecular function ontology, and biological process ontology. The GO annotation of proteins is therefore a hierarchical multilabel classification problem. This hierarchical relationship introduces complexities such as the mixed ontology problem, which causes performance bottlenecks in existing computational methods because of label dependency and data sparsity. To overcome the bottlenecks caused by the mixed ontology problem, we propose ProFun-SOM, an innovative multilabel classifier that uses multiple sequence alignments (MSAs) to annotate gene ontologies accurately. ProFun-SOM enhances the initial MSAs through a reconstruction process and integrates them into a deep learning architecture. It then predicts annotations within the cellular component, molecular function, biological process, and mixed ontologies. Our evaluation on three datasets (CAFA3, SwissProt, and NetGO2) demonstrates that ProFun-SOM surpasses state-of-the-art methods. This study confirms that using the MSAs of proteins can effectively overcome the two main bottlenecks, label dependency and data sparsity, thereby alleviating the root problem of mixed ontology. A freely accessible web server is available at http://bliulab.net/ProFun-SOM/.
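To make the phrase "hierarchical multilabel classification" concrete, the following minimal sketch shows how a protein's annotations can be expanded along a directed acyclic graph into a multi-hot label vector. It is an illustration only, not the ProFun-SOM method: the term names, the `PARENTS` mapping, and the helper functions are hypothetical and do not correspond to real GO identifiers or to any code from the paper. The sketch assumes the common convention that a protein annotated with a term is also annotated with all of that term's ancestors.

```python
from typing import Dict, List, Set

# Hypothetical toy ontology (not real GO terms): each term maps to its parents.
# "metalloenzyme" has two parents, so the structure is a DAG rather than a tree.
PARENTS: Dict[str, List[str]] = {
    "root": [],
    "binding": ["root"],
    "catalysis": ["root"],
    "ion_binding": ["binding"],
    "metalloenzyme": ["ion_binding", "catalysis"],
}


def ancestors(term: str, parents: Dict[str, List[str]]) -> Set[str]:
    """Collect every term reachable by following parent links from `term`."""
    seen: Set[str] = set()
    stack = list(parents[term])
    while stack:
        current = stack.pop()
        if current not in seen:
            seen.add(current)
            stack.extend(parents[current])
    return seen


def propagate(annotations: Set[str], parents: Dict[str, List[str]]) -> Set[str]:
    """Close a set of annotations under the ancestor relation."""
    closed = set(annotations)
    for term in annotations:
        closed |= ancestors(term, parents)
    return closed


def to_multi_hot(labels: Set[str], vocabulary: List[str]) -> List[int]:
    """Encode a label set as a 0/1 vector over a fixed term vocabulary."""
    return [1 if term in labels else 0 for term in vocabulary]


VOCAB = sorted(PARENTS)  # fixed term order for the label vector
labels = propagate({"ion_binding"}, PARENTS)
vector = to_multi_hot(labels, VOCAB)
print(VOCAB)
print(vector)
```

Because one annotation switches on several correlated labels at once, the targets are not independent (label dependency), and deep, specific terms tend to have few positive examples (data sparsity). These are the two bottlenecks named in the abstract.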

ProFun-SOM: Protein Function Prediction for Specific Ontology Based on Multiple Sequence Alignment Reconstruction
