Abstract: To date, more than 251 million proteins have been deposited in UniProtKB. However, only 0.25% have been annotated with one of the more than 15,000 possible Pfam family domains. The current annotation protocol integrates knowledge from manually curated family domains, obtained using sequence alignments and hidden Markov models. This approach has been successful in automatically growing the Pfam annotations, but at a low rate in comparison to protein discovery. A few years ago, deep learning models were proposed for automatic Pfam annotation. However, these models demand a considerable amount of training data, which can be a challenge for poorly populated families. To address this issue, we propose and evaluate here a novel protocol based on transfer learning. This requires the use of protein large language models (LLMs), trained with self-supervision on large unannotated datasets in order to obtain sequence embeddings. The embeddings can then be used with supervised learning on a small annotated dataset for a specialized task. In this protocol we have evaluated several cutting-edge protein LLMs together with machine learning architectures to improve the current prediction of protein domain annotations. Results are significantly better than the state of the art for protein family classification, reducing the prediction error by an impressive 60% compared to standard methods. We explain how LLM embeddings can be used for protein annotation in a concrete and easy way, and provide the pipeline in a GitHub repository. Full source code and data are available at https://github.com/sinc-lab/llm4pfam

What problem does this paper attempt to address?

This paper addresses a key challenge in protein function annotation: in large-scale protein databases, only a very small fraction of proteins have annotated functional domains. Specifically, the UniProtKB database contains more than 251 million protein entries, but only 0.25% of them are annotated with one of the more than 15,000 possible Pfam family domains. Current annotation protocols integrate knowledge from manually curated family domains through sequence alignments and hidden Markov models (HMMs). Although this method has been successful in automatically growing Pfam annotations, its growth rate is slow relative to the speed of protein discovery. In addition, for some family domains it is difficult to estimate the HMM probabilities because few examples are available.

To address this challenge, the authors propose and evaluate a new protocol based on transfer learning (TL). It uses protein large language models (LLMs), trained with self-supervision on large unannotated datasets, to obtain sequence embeddings. These embeddings are then used for supervised learning on a small annotated dataset to perform a specific task. The method aims to transfer what the LLMs learn during pre-training on large-scale unannotated data to the prediction of protein functional domains, especially for families with few samples, thereby improving the accuracy of protein domain annotation.

The main contribution of the paper is to demonstrate how transfer learning and protein LLMs can significantly improve the performance of protein family classification: compared with standard methods, the prediction error is reduced by 60%. This improves the efficiency of protein annotation and also provides an effective solution for protein families with scarce samples.
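The two-stage idea described above (fixed sequence embeddings first, then a supervised model trained on a small labeled set) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration: `embed` is a stand-in that uses amino-acid frequencies in place of a real protein LLM, the nearest-centroid classifier is a simplification rather than any of the architectures evaluated in the paper, and the sequences and family labels are made up rather than taken from Pfam.

```python
from collections import defaultdict
from math import sqrt

AMINO_ACIDS = "ACDEFGHIKLMNPQRSTVWY"


def embed(sequence):
    """Placeholder for a frozen, pretrained protein LLM.

    Assumption: a real pipeline would call a pretrained model here and
    pool its per-residue embeddings. Amino-acid frequencies are used only
    so that the sketch runs without external dependencies.
    """
    counts = [sequence.count(aa) for aa in AMINO_ACIDS]
    total = sum(counts) or 1
    return [c / total for c in counts]


def fit_centroids(labeled_sequences):
    """Supervised step: average the embeddings of each labeled family."""
    grouped = defaultdict(list)
    for sequence, family in labeled_sequences:
        grouped[family].append(embed(sequence))
    return {
        family: [sum(column) / len(vectors) for column in zip(*vectors)]
        for family, vectors in grouped.items()
    }


def predict(sequence, centroids):
    """Assign the family whose centroid is closest in embedding space."""
    vector = embed(sequence)

    def distance(centroid):
        return sqrt(sum((a - b) ** 2 for a, b in zip(vector, centroid)))

    return min(centroids, key=lambda family: distance(centroids[family]))


# Toy, made-up sequences and hypothetical family labels (not real Pfam data).
train = [
    ("AAAAGAAALA", "family_A"),
    ("AAGAAAAAVA", "family_A"),
    ("KKKRKKEKKK", "family_B"),
    ("KRKKKKKDKK", "family_B"),
]
centroids = fit_centroids(train)
prediction = predict("AAAGAAAAAA", centroids)
print(prediction)
```

The point of the design is that only the small classifier is fitted on labeled data, while the embedding function stays fixed. In a real setting that is what would let a sparsely populated family benefit from a model pretrained on unannotated sequences.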

Evaluating large language models for annotating proteins
