Evaluating large language models for annotating proteins

Rosario Vitale, Leandro A Bugnon, Emilio Luis Fenoy, Diego H Milone, Georgina Stegmayer
DOI: https://doi.org/10.1093/bib/bbae177
IF: 9.5
2024-05-08
Briefings in Bioinformatics
Abstract: To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful for automatically growing the Pfam annotations, but at a low rate in comparison to protein discovery. Just a few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge with poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on big unannotated datasets in order to obtain sequence embeddings. Then, the embeddings can be used with supervised learning on a small, annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam
biochemical research methods, mathematical & computational biology
What problem does this paper attempt to address?
This paper addresses a key challenge in protein function annotation: in large-scale protein databases, only a very small fraction of proteins are annotated with a family domain. Specifically, the UniProtKB database contains more than 251 million protein entries, but only 0.25% of them are annotated with one of the more than 15,000 possible Pfam family domains. Current annotation protocols integrate knowledge from manually curated family domains through sequence alignment and hidden Markov models (HMMs). Although this method has been successful in automatically growing Pfam annotations, its growth rate is slow relative to the speed of protein discovery. In addition, for some family domains, it is difficult to estimate the probabilities of HMMs due to the scarcity of available examples.

To address this challenge, the authors propose and evaluate a new protocol based on transfer learning (TL). It uses protein large language models (LLMs), trained with self-supervision on large unannotated datasets, to obtain sequence embeddings. These embeddings can then be used with a small, annotated dataset for supervised learning on a specific task. The method aims to transfer what LLMs learn during pre-training on large-scale unannotated data to the prediction of protein functional domains, especially for families with few examples, thereby improving the accuracy of protein domain annotation.

The main contribution of the paper lies in demonstrating how transfer learning and protein LLMs can significantly improve the performance of protein family classification. Compared with standard methods, the prediction error is reduced by 60%. This not only improves protein annotation but also provides an effective solution for protein families with scarce samples.
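The two-stage idea described above (a frozen embedding step followed by supervised learning on a small labeled set) can be illustrated with a minimal sketch. This is not the authors' pipeline, which is available in their repository. Here the `embed` function is a deliberately simple stand-in (amino-acid composition) for a real protein LLM, the nearest-centroid classifier is an assumed placeholder for the machine learning architectures the paper evaluates, and the sequences and the labels `family_A` and `family_B` are made-up toy data.

```python
import math
from collections import defaultdict

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"  # the 20 standard residues


def embed(sequence):
    """Stand-in for a frozen, pre-trained protein LLM (assumption).

    Returns a fixed-length vector: the normalized amino-acid composition.
    A real protocol would call a protein language model here instead.
    """
    counts = [sequence.count(residue) for residue in AMINO_ACIDS]
    total = sum(counts) or 1  # avoid division by zero for empty input
    return [count / total for count in counts]


def fit_centroids(sequences, labels):
    """Supervised step: average the embeddings of each labeled family."""
    sums = {}
    sizes = defaultdict(int)
    for sequence, label in zip(sequences, labels):
        vector = embed(sequence)
        previous = sums.get(label, [0.0] * len(vector))
        sums[label] = [p + v for p, v in zip(previous, vector)]
        sizes[label] += 1
    return {
        label: [value / sizes[label] for value in total]
        for label, total in sums.items()
    }


def predict(centroids, sequence):
    """Assign the family whose centroid is closest in embedding space."""
    vector = embed(sequence)
    return min(centroids, key=lambda label: math.dist(vector, centroids[label]))


# Toy annotated dataset (hypothetical sequences and family names).
train_sequences = ["KKLLKKLLAK", "LKKLLKKALK", "GSGSGGSSGA", "SGGSGSGSAG"]
train_labels = ["family_A", "family_A", "family_B", "family_B"]

centroids = fit_centroids(train_sequences, train_labels)
print(predict(centroids, "KLKLKKLLKA"))
print(predict(centroids, "GGSGSSGSGA"))
```

The point of the design is that only the small second stage sees labels: the embedding function is fixed and could be swapped for any pre-trained model without changing `fit_centroids` or `predict`, which is what makes the approach attractive for families with few annotated examples.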