PROTGOAT: Improved automated protein function predictions using Protein Language Models

Zong Ming Chua, Adarsh Rajesh, Sanju Sinha, Peter D. Adams
DOI: https://doi.org/10.1101/2024.04.01.587572
2024-04-02
Abstract: Accurate prediction of protein function is crucial for understanding biological processes and various disease mechanisms. Current methods for protein function prediction rely primarily on sequence similarity and often miss important aspects of protein function. Recent methods have shown exciting progress by using large transformer-based Protein Language Models (PLMs), which capture nuanced relationships between amino acids in protein sequences that are crucial for understanding their function. This has enabled an unprecedented level of accuracy in predicting the functions of previously little-understood proteins. Here we developed an ensemble method called PROTGOAT, based on embeddings extracted from multiple, diverse pre-trained PLMs and on existing text information about the protein in published literature. PROTGOAT outperforms most current state-of-the-art methods, ranking fourth among 1600 methods tested in the Critical Assessment of Functional Annotation (CAFA 5), a global competition benchmarking such developments. The high performance of our method demonstrates how protein function prediction can be improved through the use of an ensemble of diverse PLMs. PROTGOAT is publicly available for academic use and can be accessed here:
Bioinformatics
What problem does this paper attempt to address?
The paper addresses the problem of improving protein function prediction. Current methods rely mainly on sequence similarity, which can overlook crucial aspects of function. The paper presents an ensemble method called PROTGOAT, which combines the outputs of multiple pre-trained protein language models (PLMs) with protein literature and classification information to predict protein function. In the CAFA 5 (Critical Assessment of Functional Annotation) global competition, PROTGOAT ranked fourth out of 1600 tested methods, demonstrating high accuracy. The paper also checks the biological relevance of its predictions by comparing them with RNA-seq data for genes related to cellular senescence. An ablation study investigates how different training data affect model performance and shows that some GO domains are more sensitive than others to changes in training data. Possible future directions include deep learning architectures that better capture the hierarchy and graph structure of protein function, and improved GO annotation methods that reflect the complexity of protein function. Overall, the paper introduces a protein function prediction tool that draws on multiple sources of information to improve accuracy and provides evidence that its predictions are relevant to biological processes.
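The idea of combining several embedding sources and then ablating one of them can be illustrated with a minimal sketch. This is not the PROTGOAT implementation: the feature blocks are random stand-ins for PLM and text embeddings, the source names (`plm_a`, `plm_b`, `text`) are hypothetical, and a simple multi-label logistic regression is swapped in for whatever model the authors actually trained.

```python
import numpy as np

def concat_features(sources, names):
    """Concatenate per-protein feature blocks from the named sources."""
    return np.concatenate([sources[n] for n in names], axis=1)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train_multilabel_logreg(X, Y, lr=0.5, epochs=300):
    """Fit one independent logistic regression per label (GO term)
    with plain batch gradient descent. Illustrative stand-in model only."""
    n, d = X.shape
    W = np.zeros((d, Y.shape[1]))
    b = np.zeros(Y.shape[1])
    for _ in range(epochs):
        P = sigmoid(X @ W + b)
        G = (P - Y) / n          # gradient of the mean log-loss w.r.t. logits
        W -= lr * (X.T @ G)
        b -= lr * G.sum(axis=0)
    return W, b

rng = np.random.default_rng(0)
n_proteins, n_terms = 200, 5

# Hypothetical stand-ins: random vectors in place of real embeddings.
sources = {
    "plm_a": rng.normal(size=(n_proteins, 16)),
    "plm_b": rng.normal(size=(n_proteins, 8)),
    "text": rng.normal(size=(n_proteins, 4)),
}

# Full feature matrix: all sources concatenated per protein.
X_full = concat_features(sources, ["plm_a", "plm_b", "text"])

# Synthetic multi-label targets that depend on every source.
true_W = rng.normal(size=(X_full.shape[1], n_terms))
Y = (X_full @ true_W > 0).astype(float)

W, b = train_multilabel_logreg(X_full, Y)
probs = sigmoid(X_full @ W + b)
acc_full = ((probs > 0.5) == (Y > 0.5)).mean()

# Ablation: retrain without the text-derived features and compare.
X_ablate = concat_features(sources, ["plm_a", "plm_b"])
W_a, b_a = train_multilabel_logreg(X_ablate, Y)
acc_ablate = ((sigmoid(X_ablate @ W_a + b_a) > 0.5) == (Y > 0.5)).mean()

print(f"train accuracy, all sources:  {acc_full:.3f}")
print(f"train accuracy, without text: {acc_ablate:.3f}")
```

With real data, each block would hold embeddings computed for the same proteins in the same row order, and the comparison would be made on held-out proteins rather than on the training set.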